18 April 2022
How Retool upgraded our 4 TB main application PostgreSQL database
What: Eng blog post
Written for: Retool blog
Links: Blog post (editor byline here), Twitter thread, Hacker News post — this post hit the front page of Hacker News.
My role: Edited and significantly refactored the original draft, after consulting with the author. Turned a play-by-play, chronological story into the three parts you see here: high-level design, migration strategy, and then technical details.
Retool’s cloud-hosted product is backed by a single beefy 4 TB Postgres database running in Microsoft’s Azure cloud. Last fall, we migrated this database from Postgres version 9.6 to version 13 with minimal downtime.
How did we do it? To be frank, it wasn’t a completely straight path from point A to point B. In this post, we’ll tell the story and share tips to help you with a similar upgrade.
Motivation
For those of you new to Retool, we’re a platform for building internal tools fast. You can use a drag-and-drop editor to build UIs, and easily hook them up to your own data sources, including databases, APIs, and third-party tools. You can use Retool as a cloud-hosted product (which is backed by the database we’re talking about in this post), or you can host it yourself. As the 4 TB database size suggests, many Retool customers are building many apps in the cloud.
Last fall, we decided to upgrade our main Postgres database for a compelling reason: Postgres 9.6 was reaching end-of-life on November 11, 2021, which meant it would no longer receive bug fixes or security updates. We didn’t want to take any chances with our customers’ data, so we couldn’t stay on that version. It was that simple.
Technical design
This upgrade involved a few high-level decisions:
What version of Postgres should we upgrade to?
What strategy do we use to do the upgrade?
How do we test the upgrade?
Before we dive in, let’s review our constraints and goals. There were just a few.
Complete the upgrade before November 11, 2021.
Minimize downtime, especially during Monday-Friday business hours worldwide. This was the most important consideration after the hard deadline, because Retool is critical to many of our customers.
Downtime is an especially big factor when you’re operating on 4 TB of data. At this scale, easy things become harder.
We wanted our maintenance window to be about one hour max.
Maximize the amount of time this upgrade buys us before we have to upgrade again.
PostgreSQL version 13
We decided to upgrade to Postgres 13 because it fit all of the above criteria, particularly the last one: it bought us the most time before the next upgrade.
Postgres 13 was the highest released version of Postgres when we began preparing for the upgrade, with a support window through November 2025. We anticipate we’ll have sharded our database by the end of that support window, and be performing our next substantial version upgrades incrementally.
Postgres 13 also comes with several features not available in Postgres 9.6. Here is the full list, and here are a few we were most excited about:
Major performance improvements, including in parallel query execution.
The ability to add columns with non-null defaults safely, which eliminates a common footgun. On Postgres 9.6, adding a column with a non-null default causes Postgres to rewrite the entire table while blocking concurrent reads and writes, which can lead to downtime. (See the example after this list.)
Parallelized vacuuming of indices. (Retool has several tables with high write traffic, and we care a lot about vacuuming.)
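To make the second item concrete, here’s a minimal, hypothetical example (the table and column names are made up). On 9.6, this one statement rewrites the whole table under a lock; on newer versions (the optimization shipped in Postgres 11), it completes almost instantly because the default is stored as table metadata:

```sql
-- Hypothetical example: add a column with a non-null default to a large table.
-- On Postgres 9.6 this rewrites every row while blocking concurrent reads and
-- writes; on Postgres 11 and later the default is stored as metadata and the
-- statement returns almost immediately.
ALTER TABLE audit_events
    ADD COLUMN is_archived boolean NOT NULL DEFAULT false;
```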
Upgrade strategy
Great, we’d picked a target version. Now, how were we going to get there?
In general, the easiest way to upgrade Postgres database versions is to do a pg_dump and pg_restore. You take down your app, wait for all connections to terminate, then take down the database. With the database in a frozen state, you dump its contents to disk, then restore the contents to a fresh database server running at the target Postgres version. Once the restore is complete, you point your app at the new database and bring the app back.
This option was appealing because it was simple, and it guarantees that data cannot get out of sync between the old and new databases. But we eliminated it right away because we wanted to minimize downtime: doing a dump and restore on 4 TB would mean downtime measured in days, not hours or minutes.
We instead settled on a strategy based on logical replication. With this approach, you run two copies of your database in parallel: the primary database you’re upgrading, and a secondary “follower” database running at the target Postgres version. The primary streams changes to its persistent storage (decoded from its write-ahead log) to the secondary, which applies them to closely track the primary’s state. This effectively moves the long restore out of the cutover: by the time you’re ready to switch over, the target database is already up to date.
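For context, Postgres 10 and later ship logical replication natively as publications and subscriptions; a bare-bones version of that setup looks roughly like this (names and connection details are illustrative). Our primary was on 9.6, which predates the built-in support, so we needed external tooling driven by the same underlying mechanism, logical decoding, instead.

```sql
-- On the source database (Postgres 10+ only): publish changes to all tables.
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the target database, running the new Postgres version: subscribe to the
-- publisher. Postgres copies the initial table contents, then continuously
-- applies the decoded write-ahead-log stream to stay up to date.
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-primary.example.com dbname=retool user=replicator'
    PUBLICATION upgrade_pub;
```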
Notably, this approach requires much less downtime than the “dump and restore” strategy. Instead of having to rebuild the entire database, we simply needed to stop the app, wait for all transactions at the old v9.6 primary to complete, wait for the v13 secondary to catch up, and then point the app at the secondary. Instead of days, this could take place within a few minutes.
Testing strategy
We maintain a staging environment of our cloud Retool instance. Our testing strategy was to do multiple test runs on this staging environment, and create and iterate on a detailed runbook through that process.
The test runs and runbook served us well. As you’ll see in the section below, we performed many manual steps during the maintenance window. During the final cutover, these steps went off largely without a hitch because of the multiple dress rehearsals we’d had in the prior weeks, which helped us build a very detailed runbook.
Our main initial oversight was not testing with a representative workload in staging. The staging database was smaller than the production one, and even though the logical replication strategy should have enabled us to handle the larger production workload, we missed details that led to an outage for Retool’s cloud service. We’ll outline those details in the section below, but this is the biggest lesson we hope to convey: the importance of testing with a representative workload.
Plan in practice: technical details
Implementing logical replication
We ended up using Warp. Our research suggested that the pglogical Postgres extension is the best-supported option for logical replication on Postgres versions before 10, but Azure’s Single Server Postgres product does not support it, which took pglogical off the table.
One early detour we took was trying out Azure’s Database Migration Service (DMS). Under the hood, DMS first takes a snapshot of the source database and then restores it into the target database server. Once the initial dump and restore completes, DMS turns on logical decoding, a Postgres feature that streams persistent database changes to external subscribers.
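Logical decoding has been available since Postgres 9.4, which is what makes this class of tool possible on a 9.6 primary at all. As a rough illustration of the mechanism (the slot name is arbitrary, and test_decoding is the built-in demo output plugin; real tools use their own plugins and consume the stream over the replication protocol):

```sql
-- Create a logical replication slot on the source database. The slot tracks
-- how far the consumer has read and retains write-ahead log until then.
SELECT pg_create_logical_replication_slot('upgrade_slot', 'test_decoding');

-- Peek at the decoded change stream without consuming it. Each row describes
-- an INSERT, UPDATE, or DELETE that a subscriber could apply elsewhere.
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('upgrade_slot', NULL, 10);
```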
However, on our 4 TB production database, the initial dump and restore never completed: DMS encountered an error but failed to report the error to us. Meanwhile, despite making no forward progress, DMS held transactions open at our 9.6 primary. These long-running transactions in turn blocked Postgres’s autovacuum function, as the vacuum processes cannot clean up dead tuples created after a long-running transaction begins. As dead tuples piled up, the 9.6 primary’s performance began to suffer. This led to the outage we referenced above. (We have since added monitoring to keep track of Postgres’s unvacuumed tuple count, allowing us to proactively detect dangerous scenarios.)
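That monitoring boils down to queries along these lines (a sketch of the idea, not our exact alerting): track dead tuple counts per table, and watch for transactions that have been open long enough to hold up vacuum.

```sql
-- Tables with the most dead (unvacuumed) tuples, plus the last autovacuum
-- time. A count that climbs and never resets is an early warning sign.
SELECT relname, n_dead_tup, n_live_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

-- Long-running transactions. While these stay open, vacuum cannot reclaim
-- tuples that died after they began.
SELECT pid, now() - xact_start AS xact_age, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 20;
```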
Warp functions similarly to DMS but offers far more configuration options. In particular, Warp supports parallel processing to accelerate the initial dump and restore.
We had to do a bit of finagling to coax Warp into processing our database: Warp expects every table to have a single-column primary key, so we had to convert compound primary keys into unique constraints and add scalar primary keys. Otherwise, Warp was very straightforward to use.
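The conversion looked roughly like the following, shown here for a hypothetical table keyed on (org_id, user_id). On tables this large, each of these statements also deserves its own scrutiny for locking and rewrite behavior.

```sql
-- Drop the compound primary key...
ALTER TABLE org_memberships
    DROP CONSTRAINT org_memberships_pkey;

-- ...but keep its uniqueness guarantee as a unique constraint.
ALTER TABLE org_memberships
    ADD CONSTRAINT org_memberships_org_id_user_id_key UNIQUE (org_id, user_id);

-- Add a single-column surrogate primary key for Warp to key on.
ALTER TABLE org_memberships
    ADD COLUMN id BIGSERIAL PRIMARY KEY;
```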
Skipping replication of large tables
We further optimized our approach by having Warp skip two particularly massive tables that dominated the dump and restore runtime. We did this because pg_dump can’t parallelize work within a single table, so the largest table sets a floor on the total migration time no matter how many parallel workers you use.
To handle the two massive tables we skipped in Warp, we wrote a Python script to bulk transfer data from the old database server to the new. The larger 2 TB table, an append-only table of audit events in the app, was easy to transfer: we waited until after the cutover to migrate the contents, as the Retool product functions just fine even if that table is empty. We also chose to move very old audit events to a backup storage solution, to cut down on the table size.
The other table, a hundreds-of-gigabytes append-only log of all edits to all Retool apps called page_saves, was trickier. This table serves as the source of truth for all Retool apps, so it needed to be up to date the moment we came back from maintenance. To solve this, we migrated most of its contents in the days leading up to our maintenance window, and migrated the remainder during the window itself. This worked, but it did add risk: we now had more work to complete during the limited maintenance window.
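We won’t reproduce the copy script here, but its core was ordinary keyset pagination: fetch a batch of rows from the old server starting just past the last id already copied, write them to the new server, record the new high-water mark, and repeat. The read side of each iteration looked something like this (assuming an integer id primary key; the literal id is a stand-in the script fills in):

```sql
-- Run against the old 9.6 server. 42000000 stands in for the highest
-- page_saves id copied so far; the script substitutes the real value on each
-- iteration and stops once the query returns no rows.
SELECT *
FROM page_saves
WHERE id > 42000000
ORDER BY id
LIMIT 10000;
```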
Creating a runbook
These were, at a high level, the steps in our runbook during the maintenance window:
Stop the Retool service and let all outstanding database transactions commit.
Wait for the follower Postgres 13 database to catch up on logical decoding.
In parallel, copy over the remaining page_saves rows.
Once all data is in the Postgres 13 server, enable foreign key constraint enforcement. (Warp requires these constraints to be disabled.)
Enable triggers. (Warp requires triggers to be disabled.)
Reset all sequence values, so that sequential integer primary key allocation would work once the app came back online. (This step and the trigger step are sketched after this runbook.)
Slowly bring the Retool service back up, pointed at the new database instead of the old one, performing health checks along the way.
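To give a concrete flavor of the trigger and sequence steps, here’s what they look like for a single table, using page_saves as the example and assuming (as elsewhere in this post) an integer id primary key; the runbook ran the equivalent for every table with triggers or a serial primary key.

```sql
-- Re-enable triggers that were disabled for Warp.
ALTER TABLE page_saves ENABLE TRIGGER ALL;

-- Point the table's id sequence at the current maximum id, so the app's next
-- insert doesn't collide with rows copied over from the old server.
SELECT setval(
    pg_get_serial_sequence('page_saves', 'id'),
    COALESCE((SELECT max(id) FROM page_saves), 1)
);
```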
Enabling foreign key constraint enforcement
As you can see from the runbook above, one of our steps was to turn off and then re-enable foreign key constraint enforcement. The complication is that, by default, Postgres runs a full table scan when a foreign key constraint is enabled, to verify that every existing row satisfies the constraint. For our large database, this was a problem: Postgres simply couldn’t scan terabytes of data within our one-hour maintenance window.
To resolve this, we ended up choosing to leave foreign key constraints unenforced on a few large tables. We reasoned this was likely safe, as Retool’s product logic performs its own consistency checks, and also doesn’t delete from the referenced tables, meaning it was unlikely we’d be left with a dangling reference. Nonetheless, this was a risk; if our reasoning was incorrect, we’d end up with a pile of invalid data to clean up.
During post-maintenance cleanup, as we restored the missing foreign key constraints, we discovered that Postgres offers a tidy solution to our problem: the NOT VALID option to ALTER TABLE. Adding a constraint with NOT VALID tells Postgres to enforce the constraint against new data but not existing data, bypassing the costly full table scan. Afterward, you simply run ALTER TABLE … VALIDATE CONSTRAINT, which performs the full table scan and removes the “not valid” flag from the constraint. When we did, we found no invalid data at all, which was a great relief! We wish we had known about this option before the maintenance window.
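In concrete terms, the two-step approach looks like this (table, column, and constraint names are illustrative):

```sql
-- During the maintenance window: add the foreign key without validating
-- existing rows. Postgres enforces it for new writes immediately but skips
-- the full table scan.
ALTER TABLE page_saves
    ADD CONSTRAINT page_saves_page_id_fkey
    FOREIGN KEY (page_id) REFERENCES pages (id)
    NOT VALID;

-- Later, at leisure: scan the existing rows and clear the "not valid" flag.
-- This scan doesn't block normal reads and writes on the table.
ALTER TABLE page_saves
    VALIDATE CONSTRAINT page_saves_page_id_fkey;
```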
Results
We scheduled the maintenance window late on Saturday, October 23rd, at the lowest period of Retool cloud traffic. With the configuration described above, we were able to bring up a new database server at version 13 in around 15 minutes, subscribed to changes at our 9.6 primary with logical decoding.
To conclude: a logical replication strategy, aided by Warp and backed by dress rehearsals in a staging environment that produced a robust runbook, enabled us to migrate our 4 TB database from Postgres 9.6 to 13. Along the way we learned the importance of testing on representative workloads, got creative about skipping large, less-critical tables, and discovered (a bit late) that Postgres can enforce a foreign key constraint on new data immediately while validating existing data later. We hope you learned something from our experience too.