Andrew Vit · 2020-06-24 · rails, sql

Transaction deadlocks on ActiveRecord associations

Everyone is thrilled with the new feature you’ve just deployed! But as it starts to gain popularity, you wonder if there might be a bug despite all the testing and code review that you and your team have done. The more it gets used, the more you start to notice some frustrating, hard-to-reproduce errors: database deadlocks.

This class of errors is often discovered only after heavy usage in production — seemingly rare and unlikely to start with — but can become crippling under a critical mass of simultaneous web users and parallel background jobs.

I’d like to share a technique for identifying the root cause of a deadlock, how Ruby on Rails can sometimes be a confounding factor, and a solution I’ve contributed that has all but eliminated deadlocks at Aha!

Understanding deadlocks

To understand how deadlocks happen and where to start looking, let’s review the minimum ingredients needed to cause the most basic scenario:

Two sessions each open a new transaction with BEGIN
Two records are updated within transaction A
The same two records are updated within transaction B
One transaction is aborted with a ROLLBACK
One transaction completes with a COMMIT

We’ll examine specifically how a deadlock happens in more detail next, but it helps to keep these points in mind like actors in a play — since it’s easy to forget that there’s always a second transaction, and at least one other record involved.

When troubleshooting a real-world deadlock, the error message we get only shows half the problem, and we actually need to find out what led up to it. But it does give some clues:

ERROR: deadlock detected
DETAIL: Process 65998 waits for ShareLock on transaction 1370564288; blocked by process 82788.
        Process 82788 waits for ShareLock on transaction 1370564343; blocked by process 65998.
HINT: See server log for query details.
CONTEXT: while updating tuple (737645,60) in relation "features" :
         UPDATE "features" SET "updated_at" = '2019-12-17 16:19:59.476689' WHERE "features"."id" = $1

It tells us there are two processes here — each one waiting on the other to release a lock. This also shows us the query that caused the rollback to happen so we can find it in our logs. But we still need to gather 3 critical pieces of information:

What was the other transaction that successfully updated this record at the same time?
What other records were updated in this transaction before the rollback?
What record updates are in common with the other transaction?

Identifying the cause

If you’re able to get full query logs from the database, this will be much easier. You will need to capture any updates to the same record around the time of the error. Here are some tips:

Search for log entries that UPDATE the table name and primary key from the error message. (Make sure to consider UPDATE statements, as well as any explicit locks. These show up as SELECT ... FOR UPDATE.)
Identify the process running the UPDATE statement that appears right before the ROLLBACK.
Identify the other processes running an UPDATE on the same record right around that time.

By copying the log statements for each process side-by-side in a spreadsheet, we can sort them by timestamp to reveal the actual sequence of events. Seeing it like this really helps to reveal what’s happening:

deadlocks-1

The arrow points from the aborted update that caused the error message, back to the same record being updated in a concurrent transaction. Here, process 1 needs to wait for process 2 to commit and release the lock. Normally, it would do that.

But if we peek between these two timestamps, there was another record also being updated by both transactions, also waiting for a lock to be released:

deadlocks-2

This "X" where the arrows cross is what a classic deadlock looks like: process 1 was already waiting on process 2 to commit! The situation will never resolve itself by waiting any longer, and as the two transactions are waiting on each other, the database has no choice but to abort one of them.

The lesson here is to avoid updating records in opposite order. Making sure to consistently apply updates in the same order throughout your code means there would be no deadlock: a well-behaved transaction would wait on the first lock it encounters, and because it hasn’t touched anything else out of order in the meantime, it doesn’t block the other transaction from finishing.

Back in ActiveRecord

Equipped with this information, we can now go back to our Rails application and simply examine the two transaction blocks and arrange them to perform updates in the same order — and we’re done!

But not so fast... what if there is nothing to rearrange, and after eliminating every possible conflict, all we find is something like this:

transaction do
  feature.save!
end

There’s actually much more being done for us automatically within transaction blocks that's not shown here. From the database’s point of view, that simple block looks more like this:

transaction do
  # BEGIN;

  feature.save!
  # UPDATE features SET name = 'test' WHERE id = 1;

  _run_before_commit_callbacks
  # UPDATE epics SET updated_at = '2020-01-01' WHERE id = 2;
  # UPDATE releases SET updated_at = '2020-01-01' WHERE id = 3;

  # COMMIT;
end

This is all because we rely on ActiveRecord’s touch: true association options to automatically update parent records whenever something changes. These can be chained together, too:

class Requirement < ApplicationModel
  belongs_to :feature, touch: true
end

class Feature < ApplicationModel
  belongs_to :epic, touch: true
  belongs_to :release, touch: true
end

class Epic < ApplicationModel
  belongs_to :release, touch: true
end

class Release < ApplicationModel
end

The touch option is a key feature of Rails’ caching mechanism, and most of the time it works very well — but it’s still possible that each model could send these updates in a different order and cause deadlocks. It could also be caused by the different order of foreign keys in the selected data itself — and that would be impossible to predict!

So after eliminating what deadlocks we could trace back to our own code using the visual technique shown above, we decided to patch ActiveRecord ourselves.

What ActiveRecord’s TouchLater does

ActiveRecord’s TouchLater module is designed to defer timestamp updates to the end of the transaction using a before_commit callback. These are performed all at once, instead of being updated with each record as it gets saved.

Prior to TouchLater, this is what after_save callbacks looked like when saving two records in a transaction:

BEGIN;
UPDATE requirements SET name="Requirement 1" WHERE id = 5;
UPDATE features SET updated_at = '2020-01-01' WHERE id = 4;
UPDATE releases SET updated_at = '2020-01-01' WHERE id = 1;

UPDATE requirements SET name="Requirement 2" WHERE id = 3;
UPDATE features SET updated_at = '2020-01-01' WHERE id = 2;
UPDATE releases SET updated_at = '2020-01-01' WHERE id = 1;
COMMIT

Between these duplicate updates touching the same release twice, there was ample opportunity for deadlocks to happen wherever it looped for each record, and I suspect TouchLater was originally added precisely because this was causing deadlocks. TouchLater collapsed any duplicate updates and deferred them to the end of the transaction instead:

UPDATE requirements SET name="Requirement 1" WHERE id = 5
UPDATE requirements SET name="Requirement 2" WHERE id = 3

UPDATE features SET updated_at = '2020-01-01' WHERE id = 4
UPDATE releases SET updated_at = '2020-01-01' WHERE id = 1
UPDATE features SET updated_at = '2020-01-01' WHERE id = 2

The ordering here is much less prone to deadlocks. This is how ActiveRecord works as of today, but it’s still not ideal.

Improving TouchLater

The first problem is that the updates are not grouped by table, and different tables are interleaved within the list. So we first need to sort the transaction callback records by table name.

The second problem is that records within a single table could also appear in any order. So we also need to sort by id.

Our initial solution did just that, and resolved over 95% of deadlocks in our use case. Recently, we revisited this and found that we could improve it further after we realized that our simple solution had undone the one thing that ActiveRecord was already doing well: depth traversal. As it gathers the belongs_to associations, ActiveRecord would follow up the chain and always add parent and grandparent associations to the callback array in their natural order.

To understand why the depth of model associations is relevant to deadlocks, it helps to consider ActiveRecord transactions as two separate phases, with your work phase performed before the TouchLater callbacks that happen separately, later.

Here’s what ActiveRecord would do when we add one level of nested model, which moves the feature out of the work and into the callbacks:

feature.update!(attrs)

work = [feature]
callbacks = [
  epic,       # (sort 1)
  release     # (sort 2)
]

requirement.update!(attrs)

work = [requirement]
callbacks = [
  feature,    # (sort 2)
  epic,       # (sort 1)
  release     # (sort 3)
]

When sorting the callbacks simply by table name, the example on the left would update features before epics naturally, but then epics before features with callback sorting on the right!

Hence, our new solution incorporates sorting by association depth as well. We do this by looking up the longest belongs_to chain that a model can appear in and update the deepest models first. This has solved all the remaining unexplained deadlocks we were seeing.

I was very happy to see the improvement after this new patch was deployed to production. Finding solutions to problems like this and sharing them so the greater developer community can benefit is what excites me about working at Aha!

If you’ve read this far, I can only imagine that you’re like-minded and passionate about improving the tools we work with. We are hiring and would love to discuss what kinds of challenging problems you like to work on!