What is the N + 1 Query Problem? How to Avoid It Effectively?

When working with database queries in backend development (especially with ORMs), developers often inadvertently fall into the N + 1 problem. This article explores what the N + 1 problem is, its performance impact, and how to refactor code to avoid it.

What is the N + 1 Problem?

The N + 1 problem is a common query performance issue. If you notice that queries become significantly slower with larger datasets, the N + 1 problem is a worthwhile direction to investigate first (Note: Database slowness can have other causes. The N + 1 problem is just one aspect worth checking. It's recommended to investigate multiple angles to isolate the issue).

Let's understand the N + 1 problem through a concrete example. The following uses Drizzle ORM to fetch all posts from a community and retrieve the author information for each post so both can be displayed together.

To retrieve all posts along with their corresponding authors, an intuitive approach is to first query all posts into allPosts, then iterate through each post with for (const post of allPosts), and fetch the complete author information based on the post's author id.

const allPosts = await db.select().from(posts);

for (const post of allPosts) {
  const author = await db
    .select()
    .from(authors)
    .where(eq(authors.id, post.authorId))
    .limit(1);
}

In this query pattern, the 1 represents the initial query fetching all posts, while N represents the subsequent queries for each post to retrieve its author's details. This typical N + 1 problem has an obvious drawback: as post count increases, queries for authors scale linearly.

The SQL queries executed behind the ORM roughly look like this (with 5 posts as an example). With few posts, this might not incur significant performance costs. However, with 10 posts, you'd need 10 additional queries to fetch author information (n = 10). With 1,000 posts, you'd need 1,000 extra queries (n = 1,000). In such cases, query performance naturally degrades significantly as posts accumulate.

Post Load (0.5ms)  SELECT "posts".* FROM "posts"
Author Load (0.3ms)  SELECT "authors".* FROM "authors" WHERE "authors"."id" = 1
Author Load (0.3ms)  SELECT "authors".* FROM "authors" WHERE "authors"."id" = 2
Author Load (0.3ms)  SELECT "authors".* FROM "authors" WHERE "authors"."id" = 5
Author Load (0.3ms)  SELECT "authors".* FROM "authors" WHERE "authors"."id" = 5
Author Load (0.3ms)  SELECT "authors".* FROM "authors" WHERE "authors"."id" = 3

How to Refactor and Avoid the N + 1 Problem?

The N + 1 pattern emerges naturally because it's intuitive to think of and technically correct. Because it's both simple and correct, choosing this approach feels unsurprising. However, as a backend engineer, simply writing correct queries isn't enough. You must ensure queries are efficient and data retrieval is fast, preventing queries from becoming a performance bottleneck.

How to Identify the N + 1 Problem?

To effectively avoid the N + 1 problem, you first need to identify it. As shown above, N + 1 issues typically appear within loops. If you find code with a loop where each iteration executes a query, that warrants closer inspection.

Another indicator is examining the actual SQL queries executed. As demonstrated earlier, N + 1 queries characteristically have many SELECT statements with WHERE clauses, but the WHERE conditions are largely repetitive, differing only in ID. This pattern often signals a refactoring opportunity.

Beyond code inspection, performance monitoring shows another telltale sign. If performance is acceptable with small datasets but degrades linearly as data volume increases, it likely indicates an N + 1 problem.

While these patterns don't necessarily confirm an N + 1 problem, if you notice similar issues, it's worth investigating. Once confirmed, refactoring is recommended. Let's explore several refactoring approaches.

Refactoring with `JOIN`

From an SQL perspective, to fetch both posts and author information, a single JOIN query suffices, like this:

SELECT
    posts.id,
    posts.title,
    posts.content,
    authors.id as author_id,
    authors.name as author_name,
    authors.email as author_email
FROM posts
INNER JOIN authors ON posts.author_id = authors.id;

Using JOIN reduces query count from N + 1 to 1. However, this approach has a potential drawback. Observe the following result set. Can you spot what could be improved if we retrieve data via JOIN like this?

post_id | title           | content      | author_id | author_name | author_email
--------|-----------------|--------------|-----------|-------------|------------------
1       | "First Post"    | "..."        | 5         | "Alice"     | "alice@..."
2       | "Second Post"   | "..."        | 5         | "Alice"     | "alice@..."
3       | "Third Post"    | "..."        | 12        | "Bob"       | "bob@..."
4       | "Fourth Post"   | "..."        | 7         | "Charlie"   | "charlie@..."

Careful observers might notice that since the first and second posts were both authored by the user with author_id 5, the database returns that author's information twice. If posts tend to be authored by a small group of users, the database would return substantial duplicate data, consuming bandwidth and adding transmission overhead (Note: From SQL and database perspectives, JOIN is a reasonable and common approach. The concern here primarily addresses data transfer volume and application-layer processing costs).

Refactoring with `IN` Clause

To avoid returning duplicate data with JOIN, you can refactor using an IN clause to query while avoiding the N + 1 problem and retrieving unique, non-duplicate data.

This approach works like this:

-- First query to fetch posts
SELECT id, title, content, author_id
FROM posts;

-- Fetch author information based on post author IDs
SELECT id, name, email
FROM authors
WHERE id IN (5, 12, 7, 23, 34);

-- Note: After fetching data, the application layer needs additional logic in code to combine the results from both queries

While the IN approach requires two queries compared to JOIN, it fundamentally avoids the N + 1 problem since regardless of post or author count increases, this method always requires just two queries.

Both approaches solve the N + 1 problem, but choosing between them depends on your context. For instance, with small datasets, the duplicate data from JOIN is typically acceptable overhead, and JOIN query performance often outperforms the two-query approach.

Conclusion

Through this article, we've gained concrete understanding of the N + 1 problem. As mentioned, when you discover unusually slow queries during development, checking for the N + 1 problem is worthwhile. Simple refactoring can yield significant performance improvements.