====== SQL Transaction Anomalies — Concurrency Problems in Databases ====== > A comprehensive guide to the 5 anomalies that occur when multiple transactions run concurrently, with real-world examples and solutions. ---- ===== Table of Contents ===== - [[#1_dirty_read|1. Dirty Read]] - [[#2_non-repeatable_read|2. Non-repeatable Read]] - [[#3_phantom_read|3. Phantom Read]] - [[#4_lost_update|4. Lost Update]] - [[#5_write_skew|5. Write Skew]] - [[#6_isolation_levels_summary|6. Isolation Levels Summary]] ---- ===== 1. Dirty Read ===== ==== What is it? ==== Transaction A reads data that Transaction B has **modified but not yet committed**. If B rolls back, A has been working with a value that never officially existed — known as "phantom data". > **Important:** At the default isolation level (`READ COMMITTED`), A still sees the **original** value — B's uncommitted changes are invisible. Dirty reads only occur when using `READ UNCOMMITTED`. ==== Real-world Example ==== Alice's account has `balance = 100`. B is transferring money and changes it to 200 but hasn't committed yet. ^ Step ^ Txn A ^ Txn B ^ DB (balance) ^ | 1 | | Modify 100 → 200 (not committed) | 100 *(B holds: 200)* | | 2 | Read → sees **200** (dirty!) | | 100 | | 3 | | **ROLLBACK** → reverts to 100 | 100 | | 4 | A is using 200 — **wrong!** 200 never officially existed | | 100 | ==== Why is it wrong? ==== B rolled back, so 200 was erased from history. But A already read and used 200 — A is working with invalid data. **If B committed instead of rolling back?** → No problem at all. 200 becomes the official value. ==== How to Fix ==== **Solution: Use `READ COMMITTED` (default in PostgreSQL, SQL Server)** -- PostgreSQL / SQL Server: already READ COMMITTED by default -- MySQL: set manually if needed SET TRANSACTION ISOLATION LEVEL READ COMMITTED; BEGIN; SELECT balance FROM accounts WHERE id = 1; -- Always returns committed values, never dirty COMMIT; At `READ COMMITTED`, A always sees **100** even though B is holding 200 in an uncommitted transaction. ---- ===== 2. Non-repeatable Read ===== ==== What is it? ==== Within the same transaction, A reads **the same row** twice — but gets **different results** because B **updated or deleted** that row and committed between A's two reads. ==== Comparison with Phantom Read ==== ^ ^ Non-repeatable Read ^ Phantom Read ^ | What is read? | Same **single row** | Same **set of rows** by condition | | Cause | B **UPDATE/DELETE** the row | B **INSERT/DELETE** a new row | | Result | Row's value changes | Number of rows changes | ==== Real-world Example ==== A is generating an audit report and needs to read the balance twice to verify. ^ Step ^ Txn A (auditing) ^ Txn B (transfer) ^ DB (balance) ^ | 1 | SELECT balance WHERE id=1 → **1000** | | 1000 | | 2 | | UPDATE balance=500 WHERE id=1, **COMMIT** | 500 | | 3 | SELECT balance WHERE id=1 → **500** 😱 | | 500 | | 4 | Same transaction, same row — **two different results!** | | — | ==== Why is it wrong? ==== A is compiling a financial report — reads 1000 to calculate totals, then reads again to verify and sees 500. The report is completely wrong within the same transaction. ==== How to Fix ==== **Solution: Use `REPEATABLE READ` (default in MySQL)** BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ; SELECT balance FROM accounts WHERE id = 1; -- → 1000 -- B updates and commits here SELECT balance FROM accounts WHERE id = 1; -- → still 1000, snapshot unchanged ✓ COMMIT; `REPEATABLE READ` guarantees: a row read within a transaction will always return the same value as the first read — even if B updates or deletes it in between. ---- ===== 3. Phantom Read ===== ==== What is it? ==== Within the same transaction, A executes **the same query** twice — but gets **a different number of rows** because B **inserted or deleted** a row matching the query's condition and committed in between. ==== Real-world Example ==== A is generating a report of orders with amount greater than 100. ^ Step ^ Txn A (reporting) ^ Txn B ^ DB (orders) ^ | 1 | SELECT WHERE amount > 100 → **3 rows** | | 3 rows | | 2 | | INSERT order amount=500, **COMMIT** | 4 rows | | 3 | SELECT WHERE amount > 100 → **4 rows** 😱 | | 4 rows | | 4 | Same transaction, same query — **two different results!** | | — | ==== Why is it wrong? ==== The 4th row is a "phantom" — it appeared mid-transaction. A is compiling a financial report and reads twice in the same transaction — the figures are inconsistent. ==== How to Fix ==== ===== Option 1: SERIALIZABLE (strongest) ===== SET TRANSACTION ISOLATION LEVEL SERIALIZABLE; BEGIN; SELECT COUNT(*), SUM(amount) FROM orders WHERE amount > 100; -- 1st read: 3 rows, total 1500 -- B inserts and commits here SELECT COUNT(*), SUM(amount) FROM orders WHERE amount > 100; -- 2nd read: still 3 rows, total 1500 — consistent ✓ COMMIT; ===== Option 2: SELECT FOR SHARE (range lock) ===== BEGIN; SELECT COUNT(*), SUM(amount) FROM orders WHERE amount > 100 FOR SHARE; -- Locks the entire range of matching rows -- B cannot INSERT into this range until A commits COMMIT; -- Only after this can B insert ===== Option 3: Snapshot Isolation / MVCC (PostgreSQL) ===== -- PostgreSQL: REPEATABLE READ uses MVCC -- Each transaction sees a snapshot from the moment BEGIN was called BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ; SELECT COUNT(*) FROM orders WHERE amount > 100; -- → 3 rows (snapshot created here) -- B inserts and commits — A cannot see it, not in snapshot SELECT COUNT(*) FROM orders WHERE amount > 100; -- → still 3 rows ✓ COMMIT; ---- ===== 4. Lost Update ===== ==== What is it? ==== Two transactions both read **the same value**, both modify it, and one **overwrites the other's result** — one update is silently lost. Unlike dirty read: B has **committed** properly, but the data is still wrong because A read too early and overwrote B's result. ==== Real-world Example ==== A warehouse has `qty = 10`. A sells 2 items, B sells 3 items — at the same time. ^ Step ^ Txn A ^ Txn B ^ DB (qty) ^ | 1 | Read → **10** | Read → **10** | 10 | | 2 | | 10 − 3 = 7 → write **7**, commit | 7 | | 3 | 10 − 2 = 8 → write **8**, commit | | **8** ❌ | | 4 | Correct result should be **5** (10−3−2) — B's deduction of 3 is **lost** | | — | ==== Why is it wrong? ==== A read 10 before B committed. When A writes, it calculates from 10 (the old value) instead of 7 (the latest value) — completely overwriting B's result. ==== How to Fix ==== ===== Option 1: Atomic Update — simplest ✅ ===== -- WRONG: read value then calculate in application code → race condition SELECT qty FROM products WHERE id = 1; -- → 10 UPDATE products SET qty = 10 - 2 WHERE id = 1; -- CORRECT: combine into one atomic SQL statement UPDATE products SET qty = qty - 2 -- database handles read+calculate+write atomically WHERE id = 1; **When to use:** When you only need simple arithmetic (add/subtract). No need to read the value in application code first. **Limitation:** Cannot be used when you need to inspect the value first (e.g., checking if stock is sufficient before deducting). ---- ===== Option 2: SELECT FOR UPDATE — most common ===== -- Txn A: lock the row on read BEGIN; SELECT qty FROM products WHERE id = 1 FOR UPDATE; -- lock this row, B must wait -- check if qty is sufficient -- if yes, proceed with UPDATE UPDATE products SET qty = qty - 2 WHERE id = 1; COMMIT; -- B is now allowed to proceed -- Txn B also does SELECT FOR UPDATE on the same row -- → B is blocked until A commits -- → B reads qty=8 (already deducted by A) → calculates correctly **When to use:** When you need to read first and check business logic (e.g., verify balance or stock before writing). **Watch out for deadlocks:** If A locks row 1 then needs row 2, and B locks row 2 then needs row 1 → deadlock. Always lock rows in a consistent order. ---- ===== Option 3: Optimistic Lock (version column) ===== -- Schema: add a version column ALTER TABLE products ADD COLUMN version INT DEFAULT 0; -- Step 1: read with version SELECT qty, version FROM products WHERE id = 1; -- → qty=10, version=5 -- Step 2: write with version condition UPDATE products SET qty = qty - 2, version = version + 1 -- increment version WHERE id = 1 AND version = 5; -- only update if version is still 5 -- rowsAffected = 1 → success -- rowsAffected = 0 → someone else updated first, version changed → retry # Retry logic in Python def update_with_optimistic_lock(conn, product_id, qty_change): MAX_RETRY = 3 for attempt in range(MAX_RETRY): with conn.cursor() as cur: # Step 1: read with version cur.execute( "SELECT qty, version FROM products WHERE id = %s", (product_id,) ) row = cur.fetchone() qty, version = row[0], row[1] # Step 2: business logic check if qty < qty_change: raise Exception("Insufficient stock") # Step 3: write with version condition cur.execute( """UPDATE products SET qty = qty - %s, version = version + 1 WHERE id = %s AND version = %s""", (qty_change, product_id, version) ) if cur.rowcount == 1: conn.commit() print(f"Success after {attempt + 1} attempt(s)") return # success else: conn.rollback() print(f"Conflict on attempt {attempt + 1}, retrying...") raise Exception("Failed after 3 retries") // Retry logic in Node.js async function updateWithOptimisticLock(pool, productId, qtyChange) { const MAX_RETRY = 3; for (let attempt = 0; attempt < MAX_RETRY; attempt++) { const client = await pool.connect(); try { // Step 1: read with version const { rows } = await client.query( 'SELECT qty, version FROM products WHERE id = $1', [productId] ); const { qty, version } = rows[0]; // Step 2: business logic check if (qty < qtyChange) throw new Error('Insufficient stock'); // Step 3: write with version condition const result = await client.query( `UPDATE products SET qty = qty - $1, version = version + 1 WHERE id = $2 AND version = $3`, [qtyChange, productId, version] ); if (result.rowCount === 1) { await client.query('COMMIT'); console.log(`Success after ${attempt + 1} attempt(s)`); return; // success } // rowCount = 0 → conflict → loop continues and retries } finally { client.release(); } } throw new Error('Failed after 3 retries'); } **When to use:** When conflicts are rare (few people modifying the same row at once). Great for REST APIs and high-concurrency systems. **Limitation:** You must write retry logic yourself. If conflicts happen frequently, repeated retries become less efficient than `SELECT FOR UPDATE`. ---- ===== Comparison of the 3 options ===== ^ ^ Atomic Update ^ SELECT FOR UPDATE ^ Optimistic Lock ^ | Real lock? | No | Yes | No | | Retry needed? | No | No | Yes (manual) | | Deadlock risk? | No | Possible | No | | Best for | Simple add/subtract | Need to check before write | Low conflict, many users | ---- ===== 5. Write Skew ===== ==== What is it? ==== The most subtle anomaly. A and B both read data, both check **a shared condition** — both see the condition is satisfied — both write to **different rows**. The result violates the condition, but neither transaction knows. **Difference from Lost Update:** Lost update is two transactions writing to the **same row**. Write skew is two transactions writing to **different rows** but corrupting a shared invariant. ==== Example 1 — Hospital: at least 1 doctor must be on call ==== Rule: **total number of on-call doctors must be ≥ 1**. Currently 2 doctors are on call: Alice and Bob. Both request to go off call at the same time. ^ Step ^ Txn A (Alice goes off call) ^ Txn B (Bob goes off call) ^ on_call ^ | 1 | SELECT COUNT(*) WHERE on_call=true → **2** | | Alice✓ Bob✓ | | 2 | | SELECT COUNT(*) WHERE on_call=true → **2** | Alice✓ Bob✓ | | 3 | count=2 ≥ 1 → allowed. UPDATE Alice on_call=false | | Alice✗ Bob✓ | | 4 | | count=2 ≥ 1 → allowed. UPDATE Bob on_call=false | Alice✗ Bob✗ | | 5 | **No doctors on call! Rule violated!** | | **0** ❌ | ==== Example 2 — Meeting room double booking ==== Rule: a room cannot be booked twice at the same time slot. ^ Step ^ Txn A (person 1) ^ Txn B (person 2) ^ bookings ^ | 1 | SELECT COUNT(*) room 101 at 2pm → **0** | | empty | | 2 | | SELECT COUNT(*) room 101 at 2pm → **0** | empty | | 3 | 0 = available → INSERT booking A, commit | | 1 booking | | 4 | | 0 = available → INSERT booking B, commit | **2 bookings** ❌ | | 5 | **Room 101 at 2pm is double-booked! Nobody knows!** | | — | ==== Why is it wrong? ==== Each transaction looks correct in isolation — it only breaks when viewed as a whole. A and B write to different rows (Alice vs Bob, booking A vs booking B), so neither blocks the other. ==== How to Fix ==== ===== Option 1: SERIALIZABLE (handles all write skew cases) ===== BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; SELECT COUNT(*) FROM doctors WHERE on_call = true; -- database tracks this dependency -- if conflict detected → throws serialization failure error → needs retry UPDATE doctors SET on_call = false WHERE id = 1; COMMIT; ===== Option 2: SELECT FOR UPDATE (lock all rows involved in the condition) ===== BEGIN; -- Lock ALL on-call rows — B is blocked here SELECT COUNT(*) FROM doctors WHERE on_call = true FOR UPDATE; -- Re-check after locking -- Only UPDATE if count is still > 1 UPDATE doctors SET on_call = false WHERE id = 1; COMMIT; ===== Option 3: UNIQUE constraint (for booking-style problems) ===== -- Add a constraint to prevent duplicates ALTER TABLE bookings ADD CONSTRAINT unique_room_time UNIQUE (room_id, start_time); -- → Database automatically rejects duplicates and throws an error -- No need to worry about transaction isolation level **When to use each option:** * `SERIALIZABLE` — when the invariant is complex and cannot be expressed as a UNIQUE constraint * `SELECT FOR UPDATE` — when you know exactly which rows to lock * `UNIQUE constraint` — when the problem can be modeled as a unique key (simplest and most efficient) ---- ===== 6. Isolation Levels Summary ===== ==== Which anomalies occur at each level? ==== ^ Anomaly ^ READ UNCOMMITTED ^ READ COMMITTED ^ REPEATABLE READ ^ SERIALIZABLE ^ | Dirty read | ❌ Yes | ✅ No | ✅ No | ✅ No | | Non-repeatable read | ❌ Yes | ❌ Yes | ✅ No | ✅ No | | Phantom read | ❌ Yes | ❌ Yes | ❌ Yes* | ✅ No | | Lost update | ❌ Yes | ❌ Yes | ❌ Yes | ✅ No** | | Write skew | ❌ Yes | ❌ Yes | ❌ Yes | ✅ No | > * MySQL `REPEATABLE READ` prevents phantom reads thanks to MVCC. PostgreSQL does not. > > ** `SERIALIZABLE` prevents lost updates but has significant overhead — in practice, use `FOR UPDATE` or atomic updates instead of raising the isolation level. ==== MySQL vs PostgreSQL defaults ==== ^ Anomaly ^ MySQL (REPEATABLE READ) ^ PostgreSQL (READ COMMITTED) ^ | Dirty read | ✅ No | ✅ No | | Non-repeatable read | ✅ No | ❌ Possible | | Phantom read | ✅ No* | ❌ Possible | | Lost update | ❌ Possible | ❌ Possible | | Write skew | ❌ Possible | ❌ Possible | > Both MySQL and PostgreSQL defaults **do not protect** against Lost Update and Write Skew — these must be handled in your application code. ==== How to set Isolation Level ==== -- For a single transaction SET TRANSACTION ISOLATION LEVEL REPEATABLE READ; BEGIN; -- SQL statements COMMIT; -- For the entire session (PostgreSQL) SET default_transaction_isolation = 'repeatable read'; -- For the entire session (MySQL) SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ; ==== Practical recommendations ==== ^ Situation ^ Recommended approach ^ | Typical web application | `READ COMMITTED` (default) is sufficient | | Reports and analytics | `REPEATABLE READ` | | Transfers, stock deduction | `READ COMMITTED` + `SELECT FOR UPDATE` | | Finance and banking | `SERIALIZABLE` or `SELECT FOR UPDATE` | | Simple increments/decrements | Atomic update (`SET qty = qty - 1`) | | Many users, low conflict rate | Optimistic lock (version column) | > **Rule of thumb:** Don't blindly raise to `SERIALIZABLE` — it significantly impacts performance. Most web applications are fine with `READ COMMITTED` + `SELECT FOR UPDATE` where needed. ---- ''Document compiled from a conversation about SQL Transaction Anomalies.''