From PostgreSQL’s perspective
To demonstrate the issue, let’s set up two tables and create a parent record with id 1:
create table parents(
id serial not null primary key,
children_count integer not null default 0
);
create table children(
id serial not null primary key,
parent_id integer not null references parents(id)
);
insert into parents(id) values(1);
Now, let’s do a transaction that inserts a child under parent 1, and increment that parent’s @children_count@ field:
begin;
insert into children(parent_id) values(1);
update parents set children_count=children_count + 1 where id=1;
commit;
Everything is fine and peachy. However, what happens if there are two processes doing the above concurrently, and the timing for each step happens as demonstrated below:
Session 1 | Session 2 | |
---|---|---|
1 | begin; | |
2 | insert into children(parent_id) values(1); | |
3 | begin; | |
4 | insert into children(parent_id) values(1); | |
5 | update parents set children_count=children_count + 1 where id=1; | |
6 | update parents set children_count=children_count + 1 where id=1; | |
7 | commit; | |
8 | commit; |
Things go smoothly as expected until step 5, when we receive the first hint that something isn’t right. It’ll block.
Step 6 will also appear to block, for roughly 1 second, then session 1 will un-block while session 2 will receive:
DETAIL: Process 20906 waits for ExclusiveLock on tuple (0,1) of relation 26335948 of database 16385; blocked by process 20918.
Process 20918 waits for ShareLock on transaction 168526; blocked by process 20906.
HINT: See server log for query details.
So what happened ?
Deadlocks generally occur if two processes are grabbing locks without following the same order in acquiring and releasing them. For example:
Process 1 grabs lock A = success
Process 2 grabs lock B = success
Process 1 grabs lock B = block
Process 2 grabs lock A = impossible
PostgreSQL has an internal mechanism for detecting such a scenario, and it (by default) kicks in 1 second after a blocking lock. If a deadlock is detected, one of the parties receives an error and their transaction is rolled back.
In our case though, we’re not acquiring any obvious locks (explicitly with an SQL LOCK command, or select.. FOR UPDATE). So what gives ?
It turns out that, when tables are referencing each other using foreign keys, PostgreSQL will implicitly acquire a row lock on the referenced row in the secondary (parents) table to protect the atomicity of the entire operation. Therefore, in the above example, both step 2 and step 4 acquire a shared lock on the parents row id 1. Step 5 then blocks on the lock held by step 4. Step 6 causes the deadlock, which aborts session 2’s transaction. This releases the shared lock and allows step 5 to proceed.
This has been a long-standing issue in the database and the PostgreSQL hackers refer to that type of thing as “user-hostile”. Recently, I ran into this issue and through searching saw it was recently re-ignited in August by Joel Jacobson. The discussion had some excellent ideas but it went dormant after a while.
From ActiveRecord’s perspective
All clients to PostgreSQL being equal, Rails/ActiveRecord is at a disadvantage with regards to this bug, mainly because:
- It’s a web framework - where the usual use case is concurrent access
- A popular little feature called “cached counters”
- The ease of extending an ActiveRecord model with callbacks, touching the parent for various reasons
- The abstraction between the developer and the raw SQL calls, which, next to lowering the barrier of entry, makes isolating and investigating such a problem more difficult
This problem may manifest itself in Rails with an error whose stack trace includes something like:
ActiveRecord::StatementInvalid: PGError: ERROR: deadlock detected DETAIL: Process 4844 waits for ShareLock on transaction 34688860; blocked by process 4846. Process 4846 waits for ExclusiveLock on tuple (1,40) of relation 19104 of database 16388; blocked by process 4844. HINT: See server log for query details. : UPDATE 'parents' SET 'children_count' = COALESCE('children_count', 0) + 1 WHERE ('id' = 925) AND ( ('parents'.'type' = 'MaleParent' ) )
What can be done
None of the obvious solutions are pretty:
- Skip cached counters, parent modifications, and anything of that sort
- Remove the foreign key constraints
- Don’t use transactions
- Divide the operation into several transactions, where the parents aren’t modified in the same transaction as the children
- Serialize the entire transaction by grabbing a lock (using an external mechanism or a PostgreSQL advisory lock)
- Add client code to handle the deadlock error and (safely) re-try the needed operations
- Fix the issue in PostgreSQL
The viability of all of the above depends on the application, however fixing the issue in PostgreSQL isn’t out of the question. As a matter of fact, it’s already begun! You can see progress here, here and here
How you can help
With a monetary contribution, small or large.
The above work was commissionned after the August discussion on the pgsql-hackers mailing list, however the funds for it have only been minimally raised and they’re still a few thousand dollars short.
If you or your employer would like to see this fix come to fruition, please visit the donation page at:
Every little bit helps. You’ll be helping improve an already great open source database, as well as all who rely on it including ROR consumers.
Update
28 patch revisions later, it seems that Alvaro Herrera has carried this feature through. You can see the patch here
This patch is candidate for commitfest 2013-01 - which means you may see it in a beta PostgreSQL release near you sometime soon.
Update II
PostgreSQL 9.3 is out! - this should fully address this type of deadlock.