PostgreSQL transactions wrapping child+parent modifications, deadlocks, and ActiveRecord

From PostgreSQL’s perspective

To demonstrate the issue, let’s set up two tables and create a parent record with id 1:

  create table parents(
    id serial not null primary key,
    children_count integer not null default 0
  );

  create table children(
    id serial not null primary key,
    parent_id integer not null references parents(id)
  );

  insert into parents(id) values(1);

Now, let’s do a transaction that inserts a child under parent 1, and increment that parent’s @children_count@ field:

  begin;
  insert into children(parent_id) values(1);
  update parents set children_count=children_count + 1 where id=1;
  commit;

Everything is fine and peachy. However, what happens if there are two processes doing the above concurrently, and the timing for each step happens as demonstrated below:

	Session 1	Session 2
1	begin;
2	insert into children(parent_id) values(1);
3		begin;
4		insert into children(parent_id) values(1);
5	update parents set children_count=children_count + 1 where id=1;
6		update parents set children_count=children_count + 1 where id=1;
7	commit;
8		commit;

Things go smoothly as expected until step 5, when we receive the first hint that something isn’t right. It’ll block.

Step 6 will also appear to block, for roughly 1 second, then session 1 will un-block while session 2 will receive:

  DETAIL:  Process 20906 waits for ExclusiveLock on tuple (0,1) of relation 26335948 of database 16385; blocked by process 20918.
  Process 20918 waits for ShareLock on transaction 168526; blocked by process 20906.
  HINT:  See server log for query details.

So what happened ?

Deadlocks generally occur if two processes are grabbing locks without following the same order in acquiring and releasing them. For example:

  Process 1 grabs lock A = success
  Process 2 grabs lock B = success
  Process 1 grabs lock B = block
  Process 2 grabs lock A = impossible

PostgreSQL has an internal mechanism for detecting such a scenario, and it (by default) kicks in 1 second after a blocking lock. If a deadlock is detected, one of the parties receives an error and their transaction is rolled back.

In our case though, we’re not acquiring any obvious locks (explicitly with an SQL LOCK command, or select.. FOR UPDATE). So what gives ?

It turns out that, when tables are referencing each other using foreign keys, PostgreSQL will implicitly acquire a row lock on the referenced row in the secondary (parents) table to protect the atomicity of the entire operation. Therefore, in the above example, both step 2 and step 4 acquire a shared lock on the parents row id 1. Step 5 then blocks on the lock held by step 4. Step 6 causes the deadlock, which aborts session 2’s transaction. This releases the shared lock and allows step 5 to proceed.

This has been a long-standing issue in the database and the PostgreSQL hackers refer to that type of thing as “user-hostile”. Recently, I ran into this issue and through searching saw it was recently re-ignited in August by Joel Jacobson. The discussion had some excellent ideas but it went dormant after a while.

From ActiveRecord’s perspective

All clients to PostgreSQL being equal, Rails/ActiveRecord is at a disadvantage with regards to this bug, mainly because:

It’s a web framework - where the usual use case is concurrent access
A popular little feature called “cached counters”
The ease of extending an ActiveRecord model with callbacks, touching the parent for various reasons
The abstraction between the developer and the raw SQL calls, which, next to lowering the barrier of entry, makes isolating and investigating such a problem more difficult

This problem may manifest itself in Rails with an error whose stack trace includes something like:

ActiveRecord::StatementInvalid: PGError: ERROR: deadlock detected DETAIL: Process 4844 waits for ShareLock on transaction 34688860; blocked by process 4846. Process 4846 waits for ExclusiveLock on tuple (1,40) of relation 19104 of database 16388; blocked by process 4844. HINT: See server log for query details. : UPDATE 'parents' SET 'children_count' = COALESCE('children_count', 0) + 1 WHERE ('id' = 925) AND ( ('parents'.'type' = 'MaleParent' ) )

What can be done

None of the obvious solutions are pretty:

Skip cached counters, parent modifications, and anything of that sort
Remove the foreign key constraints
Don’t use transactions
Divide the operation into several transactions, where the parents aren’t modified in the same transaction as the children
Serialize the entire transaction by grabbing a lock (using an external mechanism or a PostgreSQL advisory lock)
Add client code to handle the deadlock error and (safely) re-try the needed operations
Fix the issue in PostgreSQL

The viability of all of the above depends on the application, however fixing the issue in PostgreSQL isn’t out of the question. As a matter of fact, it’s already begun! You can see progress here, here and here

How you can help

With a monetary contribution, small or large.

The above work was commissionned after the August discussion on the pgsql-hackers mailing list, however the funds for it have only been minimally raised and they’re still a few thousand dollars short.

If you or your employer would like to see this fix come to fruition, please visit the donation page at:

Every little bit helps. You’ll be helping improve an already great open source database, as well as all who rely on it including ROR consumers.

Update

28 patch revisions later, it seems that Alvaro Herrera has carried this feature through. You can see the patch here

This patch is candidate for commitfest 2013-01 - which means you may see it in a beta PostgreSQL release near you sometime soon.

Update II

PostgreSQL 9.3 is out! - this should fully address this type of deadlock.

November 22, 2010