We use a lot of SQL at Trunk. A lot! No surprise, it forms the backbone of our data analytics pipeline. This means that performance is always top of mind, so I periodically review our most used queries to see if there’s something I can improve.
The other day I checked the RDS Performance Insights tab in AWS and saw that the top entry under Top SQL was a query on the CheckRun table. Hmm. Interesting.
Investigating the Query
After identifying the slow query, I checked the RDS logs in Datadog. Sorting by @query_time_s descending, I saw that the offending SQL was routinely taking 15+ seconds (and sometimes upwards of 100 seconds). Here's the query:
SELECT `main`.`CheckRun`.`id`,
       `main`.`CheckRun`.`repoId`,
       `main`.`CheckRun`.`commitId`,
       `main`.`CheckRun`.`series`,
       `main`.`CheckRun`.`status`,
       `main`.`CheckRun`.`createdAt`,
       `main`.`CheckRun`.`updatedAt`,
       `main`.`CheckRun`.`clientVersion`,
       `main`.`CheckRun`.`clientLatestVersion`,
       `main`.`CheckRun`.`statusMessage`,
       `aggr_selection_0_LinterFailure`.`_aggr_count_linterFailures`,
       `aggr_selection_1_CheckRunFailure`.`_aggr_count_checkRunFailures`,
       `aggr_selection_2_CheckIssue`.`_aggr_count_Issues`
FROM   `main`.`CheckRun`
       LEFT JOIN (SELECT `main`.`LinterFailure`.`checkRunId`,
                         COUNT(*) AS `_aggr_count_linterFailures`
                  FROM   `main`.`LinterFailure`
                  WHERE  1 = 1
                  GROUP  BY `main`.`LinterFailure`.`checkRunId`) AS
                 `aggr_selection_0_LinterFailure`
              ON ( `main`.`CheckRun`.`id` =
                   `aggr_selection_0_LinterFailure`.`checkRunId` )
       LEFT JOIN (SELECT `main`.`CheckRunFailure`.`checkRunId`,
                         COUNT(*) AS `_aggr_count_checkRunFailures`
                  FROM   `main`.`CheckRunFailure`
                  WHERE  1 = 1
                  GROUP  BY `main`.`CheckRunFailure`.`checkRunId`) AS
                 `aggr_selection_1_CheckRunFailure`
              ON ( `main`.`CheckRun`.`id` =
                   `aggr_selection_1_CheckRunFailure`.`checkRunId` )
       LEFT JOIN (SELECT `main`.`CheckIssue`.`checkRunId`,
                         COUNT(*) AS `_aggr_count_Issues`
                  FROM   `main`.`CheckIssue`
                  WHERE  1 = 1
                  GROUP  BY `main`.`CheckIssue`.`checkRunId`) AS
                 `aggr_selection_2_CheckIssue`
              ON ( `main`.`CheckRun`.`id` =
                   `aggr_selection_2_CheckIssue`.`checkRunId` )
WHERE  ( `main`.`CheckRun`.`id` = '70e1c6c5-bce5-4be5-a0dd-d2ad6672afb2'
         AND 1 = 1 )
LIMIT  1 OFFSET 0;
Oof. That’s a doozy. To debug this, I ran the query through EXPLAIN to see its execution plan.
Looking at the EXPLAIN output and the MySQL EXPLAIN docs, we can see that this query is examining the entirety of the CheckIssue, CheckRunFailure, and LinterFailure tables. That’s Bad™.
Tracking Down the Source
Let me explain. This query examines all three tables because it is trying to count the check issues, check run failures, and linter failures for a given check run. However, it does this by computing those counts for every check run in the database, and only then picking out the relevant counts at the end of the query. We use Prisma as our ORM, and it turns out this generated query is affected by a Prisma bug where counting relations loads the entire table.
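To make the shape of the problem concrete, the Prisma call that produces a query like the one above looks roughly like this. This is a sketch only: the relation names (linterFailures, checkRunFailures, issues) and the checkRunId variable are my assumptions based on the generated SQL, not necessarily our exact code.

// Sketch: Prisma's relation-count API. With the bug, each counted relation
// becomes a LEFT JOIN onto a subquery that groups over the *entire* related table.
const checkRun = await prisma.checkRun.findFirst({
  where: { id: checkRunId },
  include: {
    _count: {
      select: {
        linterFailures: true,
        checkRunFailures: true,
        issues: true,
      },
    },
  },
});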
To find the code that makes the evil query, I did a global code search for prisma.checkRun.find in our monorepo. There's probably a better way to figure this out, but fortunately there were only a handful of hits.
Fixing the Query
The getCheckRun function is called whenever the user tries to delete a check run. I fixed the query by separating out the counts: instead of making one query for all the data, I made four parallel queries, one for the check run and one for each of the three counts we wanted. Four queries aren't necessarily faster than one, but they are in this case. The evil query roughly does this:
Count the number of linter failures on every check run in the database
Count the number of check run failures on every check run in the database
Count the number of issues on every check run in the database
Return the check run with the requested ID, along with counts of the above.
This is obviously a bad query because it has to scan the entire linter failure, check run failure, and issue tables even though it only needs the rows associated with the particular check run the user is deleting! My solution was to split it into separate queries so the database only needs to touch records that match the requested check run ID. So now the process looks like this:
Return the check run with the requested ID
Count and return the number of linter failures on the check run with the requested ID
Count and return the number of check run failures on the check run with the requested ID
Count and return the number of issues on the check run with the requested ID
This set of queries is faster because it’s only looking at the relevant rows in each table instead of scanning the entire thing.
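In Prisma terms, the fix looks roughly like the sketch below; the model and relation names are assumptions based on the tables above, not the literal code in getCheckRun.

// Sketch of the split-query fix: one lookup plus three counts filtered by
// checkRunId, all issued in parallel.
const [checkRun, linterFailures, checkRunFailures, issues] = await Promise.all([
  prisma.checkRun.findUnique({ where: { id: checkRunId } }),
  prisma.linterFailure.count({ where: { checkRunId } }),
  prisma.checkRunFailure.count({ where: { checkRunId } }),
  prisma.checkIssue.count({ where: { checkRunId } }),
]);

Each count is scoped to a single checkRunId, so the database only has to look at the rows for that one check run (and, assuming checkRunId is indexed, it can find them without a full scan).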
Alternatives
I could have written this as a single custom query. It would be roughly the same speed while avoiding three extra network requests, resulting in a slightly faster overall experience. However, we try to avoid writing queries by hand: we want to keep all database requests going through Prisma, and since Prisma currently has this bug, the four-query solution was the better workaround. When the Prisma bug is fixed, I'll probably go back to a single request.
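For comparison, a hand-written version (which, again, we chose not to ship) might look something like this sketch using prisma.$queryRaw, with table and column names assumed from the generated SQL above:

// Sketch only: correlated subqueries keep each COUNT scoped to the one check run,
// so nothing scans an entire table.
const rows = await prisma.$queryRaw`
  SELECT cr.*,
         (SELECT COUNT(*) FROM LinterFailure lf WHERE lf.checkRunId = cr.id) AS linterFailures,
         (SELECT COUNT(*) FROM CheckRunFailure crf WHERE crf.checkRunId = cr.id) AS checkRunFailures,
         (SELECT COUNT(*) FROM CheckIssue ci WHERE ci.checkRunId = cr.id) AS issues
  FROM CheckRun cr
  WHERE cr.id = ${checkRunId}
  LIMIT 1
`;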
After these PRs were deployed to prod, I tested deleting a check run, which took about 400 ms (previously around 16 seconds), confirming that the changes worked.
Lessons Learned
The lesson here isn’t that you shouldn’t use an ORM like Prisma. We love Prisma. It’s saved us tons of hours of work. The lesson is that you should always track your performance and investigate when things suddenly become slow. Sometimes your libraries have bugs. Sometimes you introduce a bug yourself. The only way you’ll find it is by continually measuring performance and spending some time investigating.