MySQL Repeatable-Read 實(shí)現(xiàn)的一些誤解

作者：ba0tiao 2023-12-12 10:54:55

對于 first-committer-wins 的定義, 在si 模式下, 如果在Start-Timestamp -> Commit-Timestamp 這之間如果有其他的trx2 修改了當(dāng)前trx1 修改過的內(nèi)容, 并且在trx1 提交的時(shí)候, trx2 已經(jīng)提交了. 那么trx1 就會(huì)abort, 這個(gè)叫first-committer-wins。

背景

首先1992 年發(fā)表的SQL Standard 對隔離級別進(jìn)行的定義是根據(jù)幾個(gè)異象(Dirty Read, Non-Repeatable Read, Phantom Read) , 當(dāng)然這個(gè)定義非常模糊, 后面Jim Grey 也有文章說這個(gè)不合理, 然而此時(shí)MVCC, snapshot isolation 還沒被發(fā)明. 等有snapshot isolation 以后發(fā)現(xiàn)snapshot isolation 能夠規(guī)避Dirty Read, Non-Repeatable Read, 因此認(rèn)為snapshot isolation 和 Repeatable-read 很像, 所以MySQL, Pg 把他們實(shí)現(xiàn)的snapshot isolation 就稱為了Repeatable-read isolation.

另外snapshot isolation 其實(shí)也沒有準(zhǔn)確的定義, 因此MySQL 和 PG, Oracle 等等的實(shí)現(xiàn)也是有很大的區(qū)別的.

關(guān)于snapshot isolation 的定義:

A transaction running in Snapshot Isolation is never blocked attempting a read as long as the snapshot data from its Start-Timestamp can be maintained.The transaction’s writes (updates, inserts, and deletes) will also be reflected in this snapshot, to be read again if the transaction accesses (i.e., reads or updates) the data a second time.

這里對于snapshot isolation 的定義不論對于讀操作和寫操作都是讀取snapshot 的版本, 這也是pg, oracle 等等版本實(shí)現(xiàn)的, 但是InnoDB 不是這樣的. InnoDB 只有讀操作讀取到的是snapshot 的版本, 但是DML 操作是讀取當(dāng)前已提交的最新版本.

When the transaction T1 is ready to commit, it gets a Commit-Timestamp, which is larger than any existing Start-Timestamp or Commit-Timestamp. The transaction successfully commits only if no other transaction T2 with a Commit-Timestamp in T1’s execution interval [Start- Timestamp, Commit-Timestamp] wrote data that T1 also wrote. Otherwise, T1 will abort. This feature, called First- committer-wins prevents lost updates (phenomenon P4).

但是InnoDB 也不是這樣的. InnoDB 并不遵守這個(gè)規(guī)則, 在repeatable read 模式下, 如果trx1, trx2 都修改了同一行, trx2 是先提交的, 那么trx1 的提交會(huì)直接把trx2 覆蓋. 而在類似PG, Oracle 實(shí)現(xiàn)的snapshot isolation 里面, 則是遵守first-committer-wins 的規(guī)則.

所以InnoDB 的snapshot isolation

僅僅Read 操作讀的是歷史版本

不遵守first-committer-wins 規(guī)則

官方把這種實(shí)現(xiàn)叫做Write committed Repeatable Read.

MySQL 開發(fā)者對于InnoDB repeatable-read 實(shí)現(xiàn)的介紹:

But when InnoDB Repeatable Read transactions modify the database, it is possible to get phantom reads added into the static view of the database, just as the ANSI description allows. Moreover, InnoDB relaxes the ANSI description for Repeatable Read isolation in that it will also allow non-repeatable reads during an UPDATE or DELETE. Specifically, it will write to newly committed records within its read view. And because of gap locking, it will actually wait on other transactions that have pending records that may become committed within its read view. So not only is an UPDATE or DELETE affected by pending or newly committed records that satisfy the predicate, but also ‘SELECT … LOCK IN SHARE MODE’ and ‘SELECT … FOR UPDATE’.

This WRITE COMMITTED implementation of REPEATABLE READ is not typical of any other database that I am aware of. But it has some real advantages over a standard ‘Snapshot’ isolation. When an update conflict would occur in other database engines that implement a snapshot isolation for Repeatable Read, an error message would typically say that you need to restart your transaction in order to see the current data. So the normal activity would be to restart the entire transaction and do the same changes over again. But InnoDB allows you to just keep going with the current transaction by waiting on other records which might join your view of the data and including them on the fly when the UPDATE or DELETE is done. This WRITE COMMITTED implementation combined with implicit record and gap locking actually adds a serializable component to Repeatable Read isolation.

PG 社區(qū)對于repeatable-read 實(shí)現(xiàn)的介紹:

UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the transaction start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the repeatable read transaction will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the repeatable read transaction can proceed with updating the originally found row. But if the first updater commits (and actually updated or deleted the row, not just locked it) then the repeatable read transaction will be rolled back with the message

https://www.postgresql.org/docs/13/transaction-iso.html#XACT-READ-COMMITTED

所以這里我們看一下MySQL repeatable-read 的具體行為, 也了解MySQL社區(qū)為什么要做這樣的實(shí)現(xiàn).

mysql> create table checking (name char(20) key, balance int) engine InnoDB;
Query OK, 0 rows affected (0.03 sec)

mysql> insert into checking values ("Tom", 1000), ("Dick", 2000), ("John", 1500);
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

Client #1                               Client #2
=====================================   =====================================
mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from checking;
+------+---------+
| name | balance |
+------+---------+
| Dick |    2000 |
| John |    1500 |
| Tom  |    1000 |
+------+---------+
3 rows in set (0.00 sec)

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> update checking
   set balance = balance - 250
   where name = "Dick";
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> update checking
   set balance = balance + 250
   where name = "Tom";
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from checking;
+------+---------+
| name | balance |
+------+---------+
| Dick |    1750 |
| John |    1500 |
| Tom  |    1250 |
+------+---------+
3 rows in set (0.02 sec)
                                        mysql> begin;
                                        Query OK, 0 rows affected (0.00 sec)

                                        mysql> select * from checking;
                                        +------+---------+
                                        | name | balance |
                                        +------+---------+
                                        | Dick |    2000 |
                                        | John |    1500 |
                                        | Tom  |    1000 |
                                        +------+---------+
                                        3 rows in set (0.00 sec)

                                        mysql> update checking
                                           set balance = balance - 200
                                           where name = "John";
                                        Query OK, 1 row affected (0.00 sec)
                                        Rows matched: 1  Changed: 1  Warnings: 0

                                        mysql> update checking
                                           set balance = balance + 200
                                           where name = "Tom";

                                        ### Client 2 waits on the locked record
mysql> commit;
Query OK, 0 rows affected (0.00 sec)
                                        Query OK, 1 row affected (19.34 sec)
                                        Rows matched: 1  Changed: 1  Warnings: 0
mysql> select * from checking;
+------+---------+
| name | balance |
+------+---------+
| Dick |    1750 |
| John |    1500 |
| Tom  |    1250 |
+------+---------+
3 rows in set (0.00 sec)
                                        mysql> select * from checking;
                                        +------+---------+
                                        | name | balance |
                                        +------+---------+
                                        | Dick |    2000 |
                                        | John |    1300 | 
                                        | Tom  |    1450 |
                                        +------+---------+
                                        3 rows in set (0.00 sec)

                                      # 這里可以看到Tom = 1450, 而不是從上面 1000 + 200 = 1200, 
                                      # 因?yàn)閡pdate 的時(shí)候, InnoDB 實(shí)現(xiàn)的是write-committed repeatable, 
                                      # 不是基于場景的snapshot isolation的實(shí)現(xiàn), 
                                      # write 操作是直接讀取的已提交的最新版本的數(shù)據(jù)1250, 
                                      # 而不是snapshot 中的數(shù)據(jù)1000.

                                        mysql> commit;
                                        Query OK, 0 rows affected (0.00 sec)

mysql> select * from checking;
+------+---------+
| name | balance |
+------+---------+
| Dick |    1750 |
| John |    1300 |
| Tom  |    1450 |
+------+---------+
3 rows in set (0.02 sec)

這里可以看到Tom = 1450, 而不是從上面 1000 + 200 = 1200, 因?yàn)閡pdate 的時(shí)候, InnoDB 實(shí)現(xiàn)的是write-committed repeatable, 不是基于場景的snapshot isolation的實(shí)現(xiàn), write 操作是直接讀取的已提交的最新版本的數(shù)據(jù)1250, 而不是snapshot 中的數(shù)據(jù)1000.

對比在PG里面, 由于PG是使用常見的 snapshot isolation 實(shí)現(xiàn)repeatable-read, 那么trx2 在修改Tom 的時(shí)候, 同樣必須等待trx1 commit or rollback, 因?yàn)镻G 讀取和修改基于trx 開始時(shí)候的snapshot 的record. 因此如果trx1 rollback, 那么trx2 則會(huì)基于開始snapshot 時(shí)候的值進(jìn)行修改, 也就是Tom = 1200, 如果trx1 commit, 那么trx2 只能rollback, 并且會(huì)返回

ERROR: could not serialize access due to concurrent update

也就是在上面的場景下 trx2 是會(huì)rollback.

那么MySQL 為什么要這么做呢?

MySQL 社區(qū)的觀點(diǎn)是在常見的通過snapshot isolation 來實(shí)現(xiàn)repeatable Read 的方案里面, 經(jīng)常會(huì)出現(xiàn)如果兩個(gè)事務(wù)修改了同一個(gè)record, 那么就需要后提交的事務(wù)重試這個(gè)流程. 這種在小事務(wù)場景是可以接受的, 但是如果后提交的事務(wù)是大事務(wù), 比如trx1 修改了1個(gè)record rec1并先提交了, 但是trx2 修改了100 行, 正好包含了rec1, 那么常見的snapshot isolation 的實(shí)現(xiàn)就需要trx2 返回錯(cuò)誤, 然后重新執(zhí)行這個(gè)事務(wù). 這樣對沖突多的場景是特別不友好的.

但是Innodb 的實(shí)現(xiàn)則在修改rec1 的時(shí)候, 如果trx1 已經(jīng)提交了, 那么直接讀取trx1 committed 的結(jié)果, 這樣就可以避免了讓trx2 重試的過程了. 也可以達(dá)到幾乎一樣的效果.

當(dāng)然這個(gè)僅僅MySQL InnoDB 是這樣的實(shí)現(xiàn), 其他的數(shù)據(jù)庫都不會(huì)這樣.

兩種方案都有優(yōu)缺點(diǎn)吧, 基于常見SI(snapshot isolation) 實(shí)現(xiàn)會(huì)存在更多的事務(wù)回滾, 一旦兩個(gè)事務(wù)修改了同一個(gè)row, 那么必然有一個(gè)事務(wù)需要回滾, 但是InnoDB 的行為可以允許和其他trx 修改同一個(gè)record, 并且可以在其他trx 修改后的結(jié)果上進(jìn)行更新, 不需要進(jìn)行事務(wù)回滾, 效率會(huì)更高一些, 但是基于常見的snapshot isolation 的實(shí)現(xiàn)更符合直觀感受.

責(zé)任編輯：武曉燕來源： MySQL內(nèi)核剖析