Back to all posts

Sometimes, four queries are faster than one.

By Maverick ChungMay 13, 2024
CI Analytics

We use a lot of SQL at Trunk. A lot! No surprise, it forms the backbone of our data analytics pipeline. This means that performance is always top of mind, so I periodically review our most used queries to see if there’s something I can improve.

The other day I checked the RDS performance insights on our AWS Performance Insights tab and I could see that the highest entry under Top SQL was a query on the CheckRun table. Hmm. Interesting.

Investigating the Query

After identifying a slow query I checked the RDS logs in Datadog. Sorting by @query_time_s descending, I saw that the offending SQL was routinely taking 15+ seconds (and even up to 100s sometimes). Here’s the query:

1SELECT `main`.`CheckRun`.`id`,
2    `main`.`CheckRun`.`repoId`,
3    `main`.`CheckRun`.`commitId`,
4    `main`.`CheckRun`.`series`,
5    `main`.`CheckRun`.`status`,
6    `main`.`CheckRun`.`createdAt`,
7    `main`.`CheckRun`.`updatedAt`,
8    `main`.`CheckRun`.`clientVersion`,
9    `main`.`CheckRun`.`clientLatestVersion`,
10    `main`.`CheckRun`.`statusMessage`,
11    `aggr_selection_0_LinterFailure`.`_aggr_count_linterFailures`,
12    `aggr_selection_1_CheckRunFailure`.`_aggr_count_checkRunFailures`,
13    `aggr_selection_2_CheckIssue`.`_aggr_count_Issues`
14FROM `main`.`CheckRun`
15    LEFT JOIN (SELECT `main`.`LinterFailure`.`checkRunId`,
16                      COUNT(*) AS `_aggr_count_linterFailures`
17               FROM   `main`.`LinterFailure`
18               WHERE  1 = 1
19               GROUP  BY `main`.`LinterFailure`.`checkRunId`) AS
20                                       `aggr_selection_0_LinterFailure`
21           ON ( `main`.`CheckRun`.`id` =
22                `aggr_selection_0_LinterFailure`.`checkRunId` )
23    LEFT JOIN (SELECT `main`.`CheckRunFailure`.`checkRunId`,
24                      COUNT(*) AS `_aggr_count_checkRunFailures`
25               FROM   `main`.`CheckRunFailure`
26               WHERE  1 = 1
27               GROUP  BY `main`.`CheckRunFailure`.`checkRunId`) AS
28           `aggr_selection_1_CheckRunFailure`
29           ON ( `main`.`CheckRun`.`id` =
30    `aggr_selection_1_CheckRunFailure`.`checkRunId` )
31    LEFT JOIN (SELECT `main`.`CheckIssue`.`checkRunId`,
32                      COUNT(*) AS `_aggr_count_Issues`
33               FROM   `main`.`CheckIssue`
34               WHERE  1 = 1
35               GROUP  BY `main`.`CheckIssue`.`checkRunId`) AS
36           `aggr_selection_2_CheckIssue`
37           ON ( `main`.`CheckRun`.`id` =
38    `aggr_selection_2_CheckIssue`.`checkRunId` )
39WHERE  ( `main`.`CheckRun`.`id` = '70e1c6c5-bce5-4be5-a0dd-d2ad6672afb2'
40      AND 1 = 1 )
41LIMIT  1 OFFSET 0; 

Oof. That’s a doozy. To debug this, I used EXPLAIN to examine the query, which produced this table:

Looking at the table and the MySQL EXPLAIN docs, we can see that this query is examining the entirety of the CheckIssue, CheckRunFailure, and LinterFailure tables. That’s Bad™

Tracking Down the Source

Let me explain. This query examines all three tables because it is trying to count the number of check issues, failures, and linter failures for a given check run. However, it does this by counting the number of check issues on every check run, and only then using the relevant count at the end of the query. We use Prisma as our ORM. It turns out this generated query is affected by a Prisma bug where counting relations loads the entire table.

To find the code that makes the evil query, I did a global code search for prisma.checkRun.find in our monorepo. There's probably a better way to figure this out, but fortunately there were only a handful of hits.

Fixing the Query

The getCheckRun function is called whenever the user tries to delete a check run. I fixed the query by separating out the counts; instead of making one query for all the data, I made 4 parallel queries for the check run and the 3 counts we wanted. Four queries isn’t necessarily faster than one, but it is in this case. The evil query roughly does this:

  • Count the number of linter failures on every check run in the database

  • Count the number of check run failures on every check run in the database

  • Count the number of issues on every check run in the database

  • Return the check run with the requested ID, along with counts of the above.

This is obviously a bad query because it has to scan the entire table of linter failures even though it only needs the ones associated with the particular check run the user is deleting! My solution was to split it into separate queries so the database only needs to scan tables for records that match the requested check run id. So now the process looks like this:

  • Return the check run with the requested ID

  • Count and return the number of linter failures on the check run with the requested ID

  • Count and return the number of check run failures on the check run with the requested ID

  • Count and return the number of issues on the check run with the requested ID

This set of queries is faster because it’s only looking at the relevant rows in each table instead of scanning the entire thing.

Alternatives

I could have written this as a single custom query, and it would be roughly the same speed but avoid making three extra network requests, resulting in a slightly faster overall experience, however we try to avoid writing queries by hand. We want to keep all database requests going through Prisma, and since Prisma currently has a bug, the four request solution was a better workaround. When the Prisma bug is fixed I’ll probably revert to a single request again.

After these PRs were deployed to prod, I tested deleting a check-run, which took 400 ms (previously 16s), confirming that the changes worked.  

Lessons Learned

The lesson here isn’t that you shouldn’t use an ORM like Prisma. We love Prisma. It’s saved us tons of hours of work.  The lesson is that you should always track your performance and investigate when things suddenly become slow. Sometimes your libraries have bugs.  Sometimes you introduce a bug on your own. The only way you’ll find it is by continually measuring performance and spending some time investigating.



Try it yourself or
request a demo

Get started for free

Try it yourself or
Request a Demo

Free for first 5 users