我們一起聊聊如何加速無索引表引起的主從延遲數(shù)據(jù)回放
一、場景
由于某些原因,客戶現(xiàn)場存在一張 8千萬 的大表,而且該表上無任何索引(也無主鍵),平時該表上 UPDATE 或 DELETE 只操作幾條數(shù)據(jù)。忽然有一天業(yè)務(wù)進(jìn)行了某種操作,DELETE 2萬 條數(shù)據(jù),悲劇發(fā)生了,當(dāng)在主庫上執(zhí)行了之后,傳到從庫上之后一直回放,當(dāng)時評估了下可能會回放10天,后來在經(jīng)過業(yè)務(wù)同意之后,對表進(jìn)行操作,用于加速回放日志,處理該問題。
二、處理思路
停掉復(fù)制線程,關(guān)閉 Binlog 日志記錄,添加索引,重啟復(fù)制線程。
三、復(fù)現(xiàn)步驟
1.準(zhǔn)備數(shù)據(jù):
創(chuàng)建庫:
greatsql> CREATE DATABASE qj;
Query OK, 1 row affected (0.01 sec)
準(zhǔn)備數(shù)據(jù),準(zhǔn)備一張8千萬數(shù)據(jù)的表:
$ sysbench --db-driver=mysql --mysql-host=192.168.139.230 --mysql-port=3307 --mysql-user=greatsql --mysql-password=GreatSQL@2024 --mysql-db=qj --table_size=80000000 --tables=1 oltp_write_only prepare
sysbench 1.1.0-df89d34 (using bundled LuaJIT 2.1.0-beta3)
Creating table 'sbtest1'...
Inserting 80000000 records into 'sbtest1'
Creating a secondary index on 'sbtest1'...
主節(jié)點(diǎn)修改表結(jié)構(gòu),將表的索引去掉
greatsql> USE qj
Database changed
greatsqll> ALTER TABLE sbtest1 MODIFY id int not null;
Query OK, 80000000 rows affected (36 min 29.13 sec)
Records: 80000000 Duplicates: 0 Warnings: 0
greatsql> ALTER TABLE sbtest1 DROP key `k_1`;
Query OK, 0 rows affected (0.96 sec)
Records: 0 Duplicates: 0 Warnings: 0
greatsql> ALTER TABLE sbtest1 DROP PRIMARY KEY;
Query OK, 80000000 rows affected (14 min 29.31 sec)
Records: 80000000 Duplicates: 0 Warnings: 0
2.模擬延遲:
01.主節(jié)點(diǎn):
模擬刪除數(shù)據(jù)
greatsql> DELETE FROM qj.sbtest1 WHERE k<20000;
Query OK, 19894 rows affected (4 min 20.40 sec)
02.從節(jié)點(diǎn):
等待時間超過DELETE語句的時間,觀察延遲情況,確認(rèn)從節(jié)點(diǎn)可讀可寫
greatsql> SELECT sleep(300);
+------------+
| sleep(300) |
+------------+
| 0 |
+------------+
1 row in set (5 min 0.01 sec)
greatsql> SHOW SLAVE STATUS \G
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 172.17.139.230
Master_User: greatsql
Master_Port: 3307
Connect_Retry: 60
Master_Log_File: binlog.000078
Read_Master_Log_Pos: 953750186
Relay_Log_File: relaylog.000231
Relay_Log_Pos: 949954084
Relay_Master_Log_File: binlog.000078
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 949953874
Relay_Log_Space: 953750683
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 1345
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1000403307
Master_UUID: 1d0963e9-85d9-11ef-80e1-00163e28e06a
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 1d0963e9-85d9-11ef-80e1-00163e28e06a:1-163235
Executed_Gtid_Set: 1d0963e9-85d9-11ef-80e1-00163e28e06a:1-163234
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0.00 sec)
greatsql> SHOW GLOBAL VARIABLES LIKE 'read_only';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only | OFF |
+---------------+-------+
1 row in set (0.01 sec)
03.通過操作從節(jié)點(diǎn)加速日志回放
找到回放線程,KILL掉回放線程,關(guān)閉當(dāng)前會話binlog日志記錄,添加索引,重新啟動復(fù)制回放線程
greatsql> SHOW PROCESSLIST;
+--------+-----------------+----------------------+------+---------+---------+----------------------------------------------------------+------------------+------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Time_ms | Rows_sent | Rows_examined |
+--------+-----------------+----------------------+------+---------+---------+----------------------------------------------------------+------------------+------------+-----------+---------------+
| 5 | event_scheduler | localhost | NULL | Daemon | 2369454 | Waiting on empty queue | NULL | 2369454039 | 0 | 0 |
| 17 | system user | connecting host | NULL | Connect | 2369447 | Waiting for source to send event | NULL | 2369446555 | 0 | 0 |
| 543448 | greatsql | 172.17.136.93:48298 | NULL | Sleep | 110 | | NULL | 109832 | 0 | 0 |
| 543500 | greatsql | 172.17.139.230:35198 | NULL | Sleep | 586 | | NULL | 585885 | 0 | 0 |
| 543588 | greatsql | 172.17.136.93:57948 | NULL | Sleep | 50 | | NULL | 49878 | 0 | 0 |
| 543663 | greatsql | 172.17.139.230:35726 | NULL | Query | 0 | init | show processlist | 0 | 0 | 0 |
| 543666 | greatsql | 172.17.136.93:58908 | NULL | Sleep | 290 | | NULL | 289861 | 0 | 0 |
| 543708 | system user | | NULL | Query | 477 | Replica has read all relay log; waiting for more updates | NULL | 476520 | 0 | 0 |
| 543709 | system user | | qj | Query | 1383 | Applying batch of row changes (delete) | NULL | 26102 | 0 | 0 |
| 543710 | system user | | NULL | Connect | 477 | Waiting for an event from Coordinator | NULL | 476570 | 0 | 0 |
| 543711 | system user | | NULL | Connect | 477 | Waiting for an event from Coordinator | NULL | 476569 | 0 | 0 |
| 543712 | system user | | NULL | Connect | 477 | Waiting for an event from Coordinator | NULL | 476568 | 0 | 0 |
| 543714 | greatsql | 172.17.136.93:33582 | NULL | Sleep | 470 | | NULL | 469798 | 0 | 0 |
+--------+-----------------+----------------------+------+---------+---------+----------------------------------------------------------+------------------+------------+-----------+---------------+
13 rows in set (0.00 sec)
greatsql> KILL 543709;
Query OK, 0 rows affected (0.00 sec)
greatsql> SHOW PROCESSLIST;
+--------+-----------------+----------------------+------+---------+---------+----------------------------------+------------------+------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Time_ms | Rows_sent | Rows_examined |
+--------+-----------------+----------------------+------+---------+---------+----------------------------------+------------------+------------+-----------+---------------+
| 5 | event_scheduler | localhost | NULL | Daemon | 2369473 | Waiting on empty queue | NULL | 2369473241 | 0 | 0 |
| 17 | system user | connecting host | NULL | Connect | 2369466 | Waiting for source to send event | NULL | 2369465757 | 0 | 0 |
| 543448 | greatsql | 172.17.136.93:48298 | NULL | Sleep | 129 | | NULL | 129034 | 0 | 0 |
| 543500 | greatsql | 172.17.139.230:35198 | NULL | Sleep | 605 | | NULL | 605087 | 0 | 0 |
| 543588 | greatsql | 172.17.136.93:57948 | NULL | Sleep | 9 | | NULL | 9132 | 0 | 0 |
| 543663 | greatsql | 172.17.139.230:35726 | NULL | Query | 0 | init | show processlist | 0 | 0 | 0 |
| 543666 | greatsql | 172.17.136.93:58908 | NULL | Sleep | 309 | | NULL | 309064 | 0 | 0 |
| 543714 | greatsql | 172.17.136.93:33582 | NULL | Sleep | 489 | | NULL | 489000 | 0 | 0 |
+--------+-----------------+----------------------+------+---------+---------+----------------------------------+------------------+------------+-----------+---------------+
8 rows in set (0.00 sec)
greatsql> SHOW SLAVE STATUS \G
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 172.17.139.230
Master_User: greatsql
Master_Port: 3307
Connect_Retry: 60
Master_Log_File: binlog.000078
Read_Master_Log_Pos: 953750186
Relay_Log_File: relaylog.000231
Relay_Log_Pos: 949954084
Relay_Master_Log_File: binlog.000078
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1317
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '1d0963e9-85d9-11ef-80e1-00163e28e06a:163235' at master log binlog.000078, end_log_pos 950241265. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Skip_Counter: 0
Exec_Master_Log_Pos: 949953874
Relay_Log_Space: 953750683
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1317
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '1d0963e9-85d9-11ef-80e1-00163e28e06a:163235' at master log binlog.000078, end_log_pos 950241265. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1000403307
Master_UUID: 1d0963e9-85d9-11ef-80e1-00163e28e06a
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 241105 19:06:59
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 1d0963e9-85d9-11ef-80e1-00163e28e06a:1-163235
Executed_Gtid_Set: 1d0963e9-85d9-11ef-80e1-00163e28e06a:1-163234
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0.00 sec)
greatsql> SET sql_log_bin=0;
Query OK, 0 rows affected (0.00 sec)
greatsql> ALTER TABLE qj.sbtest1 ADD index tmp_k(k);
Query OK, 0 rows affected (4 min 51.92 sec)
Records: 0 Duplicates: 0 Warnings: 0
greatsql> START SLAVE;
Query OK, 0 rows affected, 1 warning (0.02 sec)
greatsql> SHOW SLAVE STATUS \G
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 172.17.139.230
Master_User: greatsql
Master_Port: 3307
Connect_Retry: 60
Master_Log_File: binlog.000078
Read_Master_Log_Pos: 953750186
Relay_Log_File: relaylog.000231
Relay_Log_Pos: 953750396
Relay_Master_Log_File: binlog.000078
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 953750186
Relay_Log_Space: 953750683
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1000403307
Master_UUID: 1d0963e9-85d9-11ef-80e1-00163e28e06a
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 1d0963e9-85d9-11ef-80e1-00163e28e06a:1-163235
Executed_Gtid_Set: 1d0963e9-85d9-11ef-80e1-00163e28e06a:1-163235
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0.00 sec)
四、思考
01.如何保持主從一致性
02.正在回放數(shù)據(jù)時,STOP SLAVE 是否會有問題
03.如果主從結(jié)構(gòu)作為數(shù)據(jù)節(jié)點(diǎn),上層還有計算節(jié)點(diǎn)該如何保持元數(shù)據(jù)的一致性即計算節(jié)點(diǎn)記錄的表的結(jié)構(gòu)和數(shù)據(jù)節(jié)點(diǎn)的表結(jié)構(gòu)要一致
關(guān)于以上三個問題我的處理方案如下:
1.如何保持主從一致性
#為保持主從表結(jié)構(gòu)一致,主節(jié)點(diǎn)添加索引
greatsql> SET sql_log_bin=0;
greatsql> ALTER TABLE qj.sbtest1 ADD index tmp_k(k);
2.正在回放數(shù)據(jù)時,STOP SLAVE是否會有問題
greatsql> STOP SLAVE --當(dāng)該GTID回放很長時間時,stop可能會很長時間,我們可以采用 kill 線程方式結(jié)束回放
greatsql> SHOW PROCESSLIST; --找到會話ID
greatsql> KILL 會話號
greatsql> SHOW SLAVE STATUS \G
-- 添加索引即可
3.如果主從結(jié)構(gòu)作為數(shù)據(jù)節(jié)點(diǎn),上層還有計算節(jié)點(diǎn)該如何保持元數(shù)據(jù)的一致性即計算節(jié)點(diǎn)記錄的表的結(jié)構(gòu)和數(shù)據(jù)節(jié)點(diǎn)的表結(jié)構(gòu)要一致
-- 當(dāng)從節(jié)點(diǎn)添加上索引之后,在計算節(jié)點(diǎn)即代理層給表添加索引,當(dāng)然該索引名不要與臨時添加索引名重復(fù)
-- 添加上索引之后,在將從節(jié)點(diǎn)上臨時添加的索引刪除即可
greatsql> STOP SLAVE;
greatsql> SHOW SLAVE STATUS \G
greatsql> SET sql_log_bin=0;
greatsql> ALTER table qj.sbtest1 DROP index tmp_k;
greatsql> START SLAVE;
greatsql> SHOW SLAVE STATUS \G
五、總結(jié)
加速無索引表的回放主要是針對該表上進(jìn)行 DELETE 或 UPDATE 操作時有效,而且我們操作的時候要注意是否能記錄binlog日志,保證主從數(shù)據(jù)一致性,當(dāng)從節(jié)點(diǎn)作為數(shù)據(jù)節(jié)點(diǎn)即存儲節(jié)點(diǎn)的角色時,要關(guān)注計算節(jié)點(diǎn)即代理層元數(shù)據(jù)與存儲節(jié)點(diǎn)元數(shù)據(jù)的一致性或者是符合計算節(jié)點(diǎn)關(guān)于該表的定義規(guī)則。