How OpenAI Scaled PostgreSQL to Power 800 Million ChatGPT Users

February 7, 2026


OpenAI recently published "Scaling PostgreSQL to power 800 million ChatGPT users," an article detailing how PostgreSQL—the core database system behind ChatGPT—powers their platform at massive scale.

This article explores how OpenAI extended PostgreSQL to handle over 800 million users, a remarkable achievement. We'll examine ChatGPT's relationship with PostgreSQL, the scaling challenges they faced as traffic exploded, and the innovative solutions OpenAI's engineering team implemented. Finally, we'll dive into the key insights the team shared about overcoming these obstacles.

OpenAI and Postgres

First, what's the relationship between ChatGPT and Postgres? While most users interact with ChatGPT for its core generative AI capabilities, the product itself needs to store vast amounts of user data beyond just powering the AI model. This is why PostgreSQL became essential to ChatGPT's infrastructure.

Consider what happens when a user registers for ChatGPT. The system stores user account data so that when they return, the platform can retrieve their information. This data includes user preferences—for instance, the custom system prompts users configure. Nobody wants to reconfigure their system prompt every time they use ChatGPT, so the platform needs a database to persist this information.

Similarly, users expect their conversation history to be preserved. This requires a database to store chat records. While OpenAI uses various databases including Redis and NoSQL solutions, PostgreSQL remains their primary data store.

ChatGPT's user base exploded: by 2026, when the article was published, the platform had grown to over 800 million users in just three years, and OpenAI's demand for PostgreSQL capacity grew more than tenfold. With this surge came the classic scaling challenge: the existing infrastructure couldn't keep up.

Typically, two approaches address this problem. The first is vertical scaling—upgrading to more powerful hardware. However, there's always a ceiling to how much you can upgrade a single machine. The second approach is horizontal scaling—adding more machines to distribute the load.

OpenAI pursued both strategies. While not detailed in this article, the same author previously mentioned that OpenAI uses Azure's highest-tier machines, having pushed vertical scaling to its limit. Beyond that, they implemented horizontal scaling through two key techniques: sharding and replication.

Sharding partitions the database into separate shards, each running on a different machine instance. For example, write requests A and C might go to the first shard, while requests B and D go to the second. If a single machine can handle 1,000 writes per second, two shards can handle roughly 2,000, and total write capacity keeps growing as shards are added.

Replication is the other horizontal scaling technique: one primary database handles all writes while multiple replica databases continuously synchronize from it. Read requests go to the replicas rather than the primary, spreading the read load across many machines. The sketch below illustrates how an application layer might combine both techniques.
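As a minimal, illustrative Python sketch (not OpenAI's code), the routing logic could look like this: writes go to the owning shard's primary, reads are spread across that shard's replicas. The shard count, hostnames, table names, and psycopg2 wiring are all hypothetical.

```python
import random
import psycopg2  # common PostgreSQL driver; any driver would work the same way

# Hypothetical topology: two shards, each with one primary and two read replicas.
SHARDS = [
    {"primary": "dbname=app host=shard0-primary",
     "replicas": ["dbname=app host=shard0-replica1", "dbname=app host=shard0-replica2"]},
    {"primary": "dbname=app host=shard1-primary",
     "replicas": ["dbname=app host=shard1-replica1", "dbname=app host=shard1-replica2"]},
]

def shard_for(user_id: int) -> dict:
    # Sharding: a stable function of the key decides which shard owns this user's rows.
    return SHARDS[user_id % len(SHARDS)]

def write_preference(user_id: int, system_prompt: str) -> None:
    # Writes always go to the owning shard's primary.
    shard = shard_for(user_id)
    with psycopg2.connect(shard["primary"]) as conn, conn.cursor() as cur:
        cur.execute(
            "UPDATE user_preferences SET system_prompt = %s WHERE user_id = %s",
            (system_prompt, user_id),
        )

def read_preference(user_id: int):
    # Replication: reads are served by any replica, keeping load off the primary.
    shard = shard_for(user_id)
    with psycopg2.connect(random.choice(shard["replicas"])) as conn, conn.cursor() as cur:
        cur.execute("SELECT system_prompt FROM user_preferences WHERE user_id = %s", (user_id,))
        row = cur.fetchone()
        return row[0] if row else None
```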

With these foundational concepts in place, let's examine the scaling challenges OpenAI faced as ChatGPT's traffic exploded.

Scaling Challenges as ChatGPT Traffic Exploded

OpenAI's architecture uses a single primary database for all write requests, paired with approximately 50 replica databases handling reads. This design makes sense because ChatGPT and OpenAI's APIs have read-heavy workloads—most operations are reads rather than writes. A single primary with regional replicas efficiently handles massive read traffic.

However, a critical problem emerged. Even though reads outnumber writes, ChatGPT's 800 million global users still generate an enormous absolute volume of writes, and that volume strains PostgreSQL, whose architecture was never optimized for such high write throughput.

The article references another collaboration between OpenAI and CMU database professor Andy Pavlo, discussing how PostgreSQL's Multiversion Concurrency Control (MVCC) design limits write efficiency. Rather than dive deeper here, readers interested in the technical details should consult that paper.

When PostgreSQL struggles with high write volume and OpenAI has only a single primary database handling writes, primary database overload becomes catastrophic. The article includes a diagram showing several scenarios triggering massive write spikes: cache misses, expensive queries, or new features attracting sudden user surges.

When the primary database becomes overloaded, requests slow down or time out. Users retry, sending even more writes. This creates a vicious cycle: the primary is already overwhelmed by the original spike, and the retry traffic pushes it over the edge, causing a complete outage.
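The article describes this failure mode rather than a specific client-side fix, but a common way to keep retries from amplifying a spike like this is exponential backoff with jitter. A minimal sketch, where the function name, exception type, and limits are assumptions for illustration only:

```python
import random
import time

def call_with_backoff(do_write, max_attempts: int = 5):
    """Retry a write with exponential backoff plus jitter so a stressed
    primary is not hammered by synchronized retry storms."""
    for attempt in range(max_attempts):
        try:
            return do_write()
        except TimeoutError:  # in practice, whatever timeout error the driver raises
            if attempt == max_attempts - 1:
                raise
            # Wait up to 0.1s, 0.2s, 0.4s, ... with random jitter to spread retries out.
            delay = 0.1 * (2 ** attempt)
            time.sleep(random.uniform(0, delay))
```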

At this point, a reasonable question arises: OpenAI uses sharding for horizontal scaling, and sharding effectively distributes write load. Why not shard the primary database itself?

Two factors explain this decision. First, OpenAI does shard some writes, just not on the PostgreSQL primary; we'll discuss that approach shortly. The more fundamental reason is ChatGPT's explosive growth: at this scale, sharding the primary would be extraordinarily time-consuming, since it might require modifying the hundreds of applications that consume this database. A safe, comprehensive migration could take months or even years.

The article notes that OpenAI has considered sharding the primary, but not as an immediate step, since it could not be completed quickly. With short-term sharding infeasible, the challenge becomes: how do you keep absorbing growing traffic without sharding the primary? How do you scale effectively within that constraint?

Reducing Load on the Primary Database

Now let's explore how OpenAI addressed these challenges. We'll start with the first obstacle: managing an ever-growing write load on a single unsharded primary database.

Recall that OpenAI's architecture relies on a single primary database for writes. Sharding this primary would require extensive time, so they pursued other strategies instead. This single primary represents a fundamental bottleneck—no matter how much hardware capacity increases, CPU and other resources have physical limits. When massive write spikes occur, the primary becomes overloaded, degrading ChatGPT and OpenAI's other API services.

How would you solve this in practice?

OpenAI's team shared three key approaches. First, reduce the request volume hitting the primary. As mentioned earlier, the primary handles writes while replicas handle reads through replication. By directing all read traffic to replicas rather than the primary, they dramatically reduced the primary's load.

Second, some new write workloads can be sharded elsewhere. While sharding the PostgreSQL primary is too time-consuming, OpenAI routes write workloads that do lend themselves to sharding to Microsoft's Cosmos DB. The distinction matters: the PostgreSQL primary stays unsharded, but shardable writes move to Cosmos DB.
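A minimal sketch of what such a split might look like at the application layer, assuming (hypothetically) that account updates stay on the PostgreSQL primary while high-volume conversation-message writes go to Cosmos DB. The azure-cosmos calls show the general shape of that SDK but the database, container, and field names are invented, not OpenAI's actual schema:

```python
import psycopg2
from azure.cosmos import CosmosClient  # azure-cosmos SDK; usage here is illustrative

PG_PRIMARY_DSN = "dbname=app host=pg-primary"  # hypothetical connection string
cosmos = CosmosClient("https://example.documents.azure.com", credential="<key>")
messages = cosmos.get_database_client("chat").get_container_client("messages")

def save_account_update(user_id: int, email: str) -> None:
    # Low-volume account data stays on the (unsharded) PostgreSQL primary.
    with psycopg2.connect(PG_PRIMARY_DSN) as conn, conn.cursor() as cur:
        cur.execute("UPDATE accounts SET email = %s WHERE user_id = %s", (email, user_id))

def save_message(conversation_id: str, seq: int, text: str) -> None:
    # High-volume, shardable writes go to Cosmos DB, which distributes items
    # across nodes by a partition key (here, the conversation id).
    messages.upsert_item({
        "id": f"{conversation_id}:{seq}",
        "conversationId": conversation_id,
        "text": text,
    })
```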

Third, OpenAI implemented lazy writes—avoiding the temptation to process all writes immediately during traffic spikes. Analysis showed that some writes don't require immediate processing. By deferring non-urgent writes, handling critical writes first, and processing deferred writes once the traffic spike subsides, they smoothed out the load on the primary. This prevents the database from becoming overwhelmed and keeps request latency manageable.
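A minimal sketch of the lazy-write idea, assuming a simple in-process queue; in a production system this would more likely be a durable message queue, and every name below is hypothetical:

```python
import queue

# Non-urgent writes (e.g., usage counters) are parked here instead of hitting
# the primary during a spike; urgent writes still go straight to the database.
deferred_writes: "queue.Queue[tuple[str, tuple]]" = queue.Queue()

def record_write(sql: str, params: tuple, urgent: bool, execute) -> None:
    if urgent:
        execute(sql, params)                 # critical writes are applied immediately
    else:
        deferred_writes.put((sql, params))   # everything else waits for quieter times

def flush_deferred(execute, batch_size: int = 100) -> None:
    # Called once the spike subsides (or on a timer) to drain the backlog gradually.
    for _ in range(batch_size):
        try:
            sql, params = deferred_writes.get_nowait()
        except queue.Empty:
            return
        execute(sql, params)
```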

Query Optimization

The second operational challenge involves expensive queries. When traffic surges, costly queries consume excessive CPU, slowing down ChatGPT and API responses. How should you handle this?

The obvious answer: optimize queries. OpenAI focused on three optimization areas: reducing slow queries, carefully reviewing query statements, and cleaning up idle sessions.

First, preventing slow queries requires avoiding operations unsuitable for an OLTP system. The database industry typically distinguishes two system types: OLTP and OLAP. OLTP systems handle transactional workloads—small, focused writes and queries. OLAP systems handle analytical queries—large-scale data aggregations, filtering, and complex operations.

In practice, when a data system lacks this separation, problems emerge. Backend engineers might need to ask data analysts: "Can you pause your analysis work? Your queries are slowing down production." This conflict occurs naturally without deliberate separation. While small-scale systems rarely face this problem, growing data systems inevitably do.

When OpenAI examined their PostgreSQL queries, they discovered some joined 12 tables simultaneously—clearly unsuitable for an OLTP system designed for transactional work. Identifying and eliminating these queries improved performance significantly.

Second, avoid over-relying on ORMs. Object-Relational Mapping tools let developers write application logic that the ORM translates to SQL. However, ORMs often generate suboptimal SQL. Classic problems like N+1 queries plague ORM usage. The article emphasizes inspecting the actual SQL generated by your ORM rather than blindly trusting it. Generated queries often benefit from further optimization.
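The classic N+1 pattern looks like this: the code fetches a list of parent rows, then the ORM silently issues one extra query per row. A minimal sketch of the problem and the set-based fix, written with raw SQL for clarity (the table and column names are hypothetical):

```python
import psycopg2

conn = psycopg2.connect("dbname=app host=replica1")  # hypothetical DSN
cur = conn.cursor()

# N+1 pattern: 1 query for the conversations, then N more queries for their messages.
cur.execute("SELECT id FROM conversations WHERE user_id = %s", (42,))
conversation_ids = [row[0] for row in cur.fetchall()]
for cid in conversation_ids:
    cur.execute("SELECT body FROM messages WHERE conversation_id = %s", (cid,))
    _ = cur.fetchall()

# Fix: fetch all messages in a single round trip with one set-based query.
cur.execute(
    "SELECT conversation_id, body FROM messages WHERE conversation_id = ANY(%s)",
    (conversation_ids,),
)
rows = cur.fetchall()
```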

Finally, OpenAI's team cleaned up idle sessions using PostgreSQL's idle_in_transaction_session_timeout setting, which automatically terminates sessions that sit idle inside an open transaction and frees the locks and connections they hold. Through these three approaches (eliminating inappropriate queries, optimizing ORM-generated SQL, and removing stuck sessions), they significantly improved overall query performance and reduced database load.
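For reference, this is a standard PostgreSQL parameter. A minimal sketch of applying it per session (the connection string and the 60- and 30-second values are arbitrary; the settings can also be applied cluster-wide in postgresql.conf or per role):

```python
import psycopg2

conn = psycopg2.connect("dbname=app host=pg-primary")  # hypothetical DSN
with conn.cursor() as cur:
    # Any session that stays idle inside an open transaction for more than 60s
    # is terminated by the server, releasing its locks and connection slot.
    cur.execute("SET idle_in_transaction_session_timeout = '60s'")
    # The related statement_timeout caps how long any single query may run.
    cur.execute("SET statement_timeout = '30s'")
conn.commit()
```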

Avoiding Single Points of Failure

The third operational problem stems from their architecture: a single primary database handles all writes while approximately 50 replica databases handle reads. If a replica fails, traffic routes to another replica. But if the primary fails, the entire system collapses.

How would you solve this?

OpenAI's solution uses a hot standby. Alongside the primary database, a secondary database sits ready, continuously syncing all data from the primary. Should the primary fail, the hot standby immediately takes over, maintaining service availability.

This seems straightforward in concept but presents significant operational challenges. First, reliably detecting primary failure is technically non-trivial. Second, when the primary truly fails and the hot standby promotes, all replica databases must reconnect to what was previously the standby. This process involves countless failure points.

OpenAI credits Microsoft's Azure team with executing this flawlessly. Azure invested substantial effort into handling the complexities. However, this comes with a cost: operating a hot standby requires an additional machine matching the primary's specifications. Given OpenAI uses Azure's most powerful available machines, maintaining a second equally powerful hot standby represents significant expense. It's operationally necessary but financially non-trivial.

Workload Isolation

The fourth operational challenge involves resource contention. Some requests consume disproportionate PostgreSQL resources, degrading other services. For example, inefficient new features might pass testing because staging lacks production traffic volume. When launched with millions of users, these features suddenly consume heavy CPU, slowing critical functions.

How do you prevent resource contention from degrading important services?

OpenAI implemented workload isolation. They partitioned replica databases into separate instances, dedicating certain replicas to high-priority requests. This containment prevents slow or inefficient queries from affecting critical operations.

High-priority requests get dedicated database instances. When inefficient queries from new features slow down services, only lower-priority services are affected. Through this isolation strategy, OpenAI ensures their most critical systems remain unaffected even when other system components encounter problems.
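A minimal sketch of this kind of priority-based routing at the application layer; the tier names, replica hosts, and example query are hypothetical:

```python
import random
import psycopg2

# Replicas are partitioned into pools; critical traffic never shares an
# instance with best-effort traffic, so a slow new feature cannot starve it.
REPLICA_POOLS = {
    "critical":    ["dbname=app host=replica-crit-1", "dbname=app host=replica-crit-2"],
    "best_effort": ["dbname=app host=replica-be-1", "dbname=app host=replica-be-2"],
}

def run_read(sql: str, params: tuple, priority: str = "best_effort"):
    dsn = random.choice(REPLICA_POOLS[priority])
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchall()

# Example: a billing check uses the dedicated pool; an experimental feature would not.
run_read("SELECT plan FROM subscriptions WHERE user_id = %s", (42,), priority="critical")
```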
