A few days ago a fellow DBA asked me to review an email he received from a developer. In the email, the developer explained his application was affected by database errors and asked us to check the attached file for details. The error was a database deadlock. Attached to the email was the trace file Oracle generates whenever a deadlock occurs in the database. I don’t see deadlocks regularly so I hardly ever need to dissect one of those trace files. The trace file contained the following information:
Deadlock graph: ---------Blocker(s)-------- ---------Waiter(s)--------- Resource Name process session holds waits process session holds waits TX-002b0006-00000968 23 223 X 25 35 X TX-002c0007-00000b13 25 35 X 23 223 X session 223: DID 0001-0017-000163D0 session 35: DID 0001-0019-00002809 session 35: DID 0001-0019-00002809 session 223: DID 0001-0017-000163D0 Rows waited on: Session 223: obj - rowid = 0001FBA6 - AAAfumAAFAAAWikAAA (dictionary objn - 129958, file - 5, block - 92324, slot - 0) Session 35: obj - rowid = 0001FBA6 - AAAfumAAFAAAWikAAB (dictionary objn - 129958, file - 5, block - 92324, slot - 1)Session 223 holds an exclusive lock on a row and session 35 holds another exclusive lock on a different row. Session 35 wants to lock session 235 row and vice versa. This clearly shows there is a deadlock. Immediately following this section the SQLs involved in the deadlock are shown in the trace file. I was expecting to see two different queries but the current session and the “other” session executed exactly the same SQL:
UPDATE PERSON_TAB SET PERSON_ID=:1, NAME=:2, FIRST_NAME=:3, MIDDLE_NAME=:4, LAST_NAME=:5, DATE_OF_BIRTH=:6, PHONE_NUMBER=:7, ALT_PHONE_NUMBER=:8, FAX_NUMBER=:9, E_MAIL=:10 where DATA_SOURCE_ID=:11The fact that the same UPDATE was executed by both sessions against the same table confused me for a moment. For some reason I wanted to see two different tables but found the same table in both UPDATEs. I started thinking one session updated a row and another session wanted to update the same row. On that scenario the second session would just wait for the first session to either commit or rollback the update. Once that happens the exclusive lock on the row is released and the UPDATE from the second session goes through. How can that cause a deadlock? As you can tell, I didn’t read the trace file close enough. The rowids above are different so both sessions were trying to update different rows. Once again, I rushed to faulty reasoning thinking two sessions updating two different rows should not cause a deadlock. Clearly, Oracle is able to handle two sessions updating two different rows with ease. They are completely independent transactions so there shouldn’t be a deadlock. Remember, I don’t analyze deadlock trace files on a daily basis so that’s my defense for not being able to immediately explain what caused the deadlock. After a few moments trying to imagine what could have caused the deadlock I was able to see the full picture. The first session updates row 1, the second session updates row 2. The first session tries to update row 2 and the second session tries to update row 1. This sequence causes a deadlock. In order to validate my reasoning I opened two SQL*Plus sessions and ran the following:
On session #1: SQL> CREATE TABLE T (N NUMBER); SQL> INSERT INTO T VALUES (1); SQL> INSERT INTO T VALUES (2); SQL> COMMIT; SQL> UPDATE T SET N = 10 WHERE N = 1;On session #2:
SQL> UPDATE T SET N = 20 WHERE N = 2;On session #1:
SQL> UPDATE T SET N = 20 WHERE N = 2; (this one blocks because it’s locked by session #2)On session #2
SQL> UPDATE T SET N = 10 WHERE N = 1; (this one causes a deadlock)After a few seconds the database reported a deadlock on session #1. As a result, the second update on session #1 was lost. After issuing a commit on both sessions I noticed the table didn’t contain two rows with 1 and 2 but 10 and 20. No updates were lost because both sessions tried to update the table with the same values. The same would happen if the UPDATEs on PERSON_TAB contained the same values on all columns. If PHONE_NUMBER was different on both sessions one of them would be lost as a result of the deadlock. With this information on hand my colleague replied the email with a detailed explanation as to what caused the deadlock and provided the small case scenario to help the developer reproduce the issue. We also supplied the SQL showing the table involved in the deadlock.
We acted as good DBAs (we think so) because we took the time to examine the trace file, compose a detailed explanation, and supply steps on how to reproduce the issue.
What would a bad DBA do if faced with the same request? The bad DBA would open the trace file and copy the following section on a reply email:
The following deadlock is not an ORACLE error. It is a deadlock due to user error in the design of an application or from issuing incorrect ad-hoc SQL.The email would close with a simple “fix your code”.
What would a bad DBA do if faced with the same request?
ReplyDeleteAre you sure? What if, a DBA has acted as a "good DBA" while handling deadlocks previously and when the deadlocks don't go away, decides that it is not worth his time to act like a "good DBA" ? :)
It happens
Is it's the DBAs responsibility to train developers? I see the point to support developers at the best possible level, but I feel Narendras pain also quite vivid.
ReplyDeleteAnd the explanation is not helping to much at all, as the developer has no idea how to come out of the deadlock. (if he knew about deadlocks, he would never have written the email?)
what#s about a suggestion like allways update all your sets in ascending order of DATA_SOURCE_ID ?
Or if you want to manipulate many rows, insert the new values into a global temporary table and afterwards do an update based on this GTT? SQL is a language for sets, so don't think sequential.
Interesting that the developer was able to get a trace file. I'd expect anyone who is able to look there would be able to understand a deadlock.
ReplyDeleteNot sure how much the SQL was anonymized, but having an update on PERSON_TAB where the PERSON_ID is updated seems peculiar. I'd expect PERSON_ID to be the PK and so never updated. I'd wonder if they'd missed a predicate from the SQL.
Martin,
ReplyDeleteThe DBA's responsibility is not to train developers but to at least help them solve DB issues. Developers should be motivated to learn about the DB on their own once they understand the critical role the DB plays in any application.
Regarding your suggestion to update all sets in ascending order of the DATA_SOURCE_ID, I'm afraid that's not possible because the deadlock didn't occur as part of a nightly batch job but as part of an online application that supports multiple concurrent users. Also, the DATA_SOURCE_ID column can't be sorted because it is loaded with a dreadful GUID.
I mentioned on my story I don't examine trace files on a regular basis. Maybe that's why I was a "good DBA". If I have the misfortune to deal with the same issue many times I probably would not be as nice :)
Application support/ developer support is not more related to the quality of DBA work than any other facet of DBA work.
ReplyDeleteIf the SLA for the specific DB involved includes application support and/or developer support then this is part of your work. If that is not the case then at best the trace file gets send to the devs. If they want support then they can buy it.
The SLA is everything.