GreatSQL一個關(guān)于主從復(fù)制的限制描述與規(guī)避
一、背景
分享一個在項目運維中遇到的一個主從復(fù)制限制的一個坑,項目的架構(gòu)為主集群+災(zāi)備集群,每個集群為一主兩從模式。主集群到災(zāi)備集群的同步為主從復(fù)制的方式,根據(jù)業(yè)務(wù)需求災(zāi)備集群需要忽略系統(tǒng)庫跟某些配置表,所以才會觸發(fā)此限制,而這個限制如果我們之前沒有遇到過,那么排查起來也是相對不易的。
二、限制描述
1、主從同步出現(xiàn)報錯
greatsql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.xxx.xxx
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: greatsql-bin.000990
Read_Master_Log_Pos: 92274290
Relay_Log_File: greatsql-relay.002963 -----
Relay_Log_Pos: 701548899
Relay_Master_Log_File: greatsql-bin.000988
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB: mysql,dbscale,dbscale_tmp,information_schema,performance_schema,sys
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table: A.ab,B.bc
Last_Errno: 1146
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '9e668a93-2618-11ee-93ee-bc16954181bb:47508257' at master log greatdb-bin.000988, end_log_pos 701570116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Skip_Counter: 0
Exec_Master_Log_Pos: 701548690
Relay_Log_Space: 2246320360
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1146
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '9e668a93-2618-11ee-93ee-bc16954181bb:47508257' at master log greatdb-bin.000988, end_log_pos 701570116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1943306
Master_UUID: 9e668a93-2618-11ee-93ee-bc16954181bb
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 230822 14:14:18
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 9e668a93-2618-11ee-93ee-bc16954181bb:2-47565802
Executed_Gtid_Set: 30873cfe-8750-11ed-b56f-744aa4073024:1-270,
9e668a93-2618-11ee-93ee-bc16954181bb:1-47508256
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
根據(jù)slave status狀態(tài)信息可以看出
- 報錯的GTID為:'9e668a93-2618-11ee-93ee-bc16954181bb:47508257'
- 應(yīng)用的主集群的binlog為:greatsql-bin.000988
- 災(zāi)備集群的relay log為:greatsql-relay.002963
詳細(xì)信息查看performance_schema.replication_applier_status_by_worker表
2、查看錯誤的詳細(xì)信息
greatsql> select * from performance_schema.replication_applier_status_by_worker\G
*************************** 1. row ***************************
CHANNEL_NAME:
WORKER_ID: 1
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: 9e668a93-2618-11ee-93ee-bc16954181bb:47508257
LAST_ERROR_NUMBER: 1146
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction
'9e668a93-2618-11ee-93ee-bc16954181bb:47508257' at master log greatdb-bin.000988,
end_log_pos 701570116; Error executing row event: 'Table 'abs_xxx.tmp_xxx_info' doesn't exist'
LAST_ERROR_TIMESTAMP: 2023-08-22 14:14:18
上述信息說明根據(jù)performance_schema.replication_applier_status_by_worker表中的詳細(xì)錯誤信息可以發(fā)現(xiàn)為災(zāi)備集群abs_xxx.tmp_xxx_info表不存在,導(dǎo)致同步報錯
3、問題分析
3.1、確認(rèn)災(zāi)備集群中目標(biāo)表是否存在
greatsql> show create table abs_xxx.tmp_xxx_info;
ERROR 1146 (42S02): Table 'abs_xxx.tmp_xxx_info' doesn't exist
greatsql> desc abs_xxx.tmp_xxx_info;
ERROR 1146 (42S02): Table 'abs_xxx.tmp_xxx_info' doesn't exist
結(jié)論:災(zāi)備集群中目標(biāo)表的確不存在
3.2、根據(jù)主從報錯信息解析主集群binlog,報錯的SQL
解析主集群binlog
SET @@SESSION.GTID_NEXT= '9e668a93-2618-11ee-93ee-bc16954181bb:47508257'/*!*/;
……
#230822 14:14:18 server id 1943306 end_log_pos 701570000 Table_map: `abs_xxx`.`tmp_xxx_info` mapped to number 1595
# at 701570000
#230822 14:14:18 server id 1943306 end_log_pos 701570116 Write_rows: table id 1595 flags: STMT_END_F
### INSERT INTO `abs_xxx`.`tmp_xxx_info`
### SET
### @1=2
### @2='自動化'
### @3='2300121212120000'
### @4='90000000'
### @5='1'
### @6='202001290231001'
### @7='2021-01-31 00:00:00'
# at 701570116
#230822 14:14:18 server id 1943306 end_log_pos 701570143 Xid = 800998400
COMMIT/*!*/;
# at 701570143
#230822 14:14:18 server id 1943306 end_log_pos 701570204 GTID last_committed=26491 sequence_number=26521 rbr_only=yes
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
結(jié)論:根據(jù)復(fù)制的報錯信息得知具體的GTID號以及主集群的binlog文件,解析binlog得知此事務(wù)為一條INSERT語句,語句中的目標(biāo)表與performance_schema.replication_applier_status_by_worker表中信息一致
3.3、尋找主集群目標(biāo)表binlog中是否有建表語句
在同一binlog日志中尋找建表語句
SET TIMESTAMP=1692684495/*!*/;
CREATE DATABASE IF NOT EXISTS `abs_xxx` /*!40100 DEFAULT CHARACTER SET utf8 COLLATE utf8_bin */
/*!*/;
……
use `information_schema`/*!*/;
SET TIMESTAMP=1692684495/*!*/;
CREATE TABLE `abs_xxx`.`tmp_xxx_info` (
`ID` int(64) NOT NULL AUTO_INCREMENT,
`CUSTOMER_NAME` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
`CUSTOMER_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
`PRODIST_SKU_NUM` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,
`STATUS` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,
`AGREEMENT_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
`END_DATE` datetime DEFAULT NULL,
PRIMARY KEY (`ID`),
KEY `PK_PRODIST_SKU_NUM_AGREEMENT` (`PRODIST_SKU_NUM`) USING BTREE,
KEY `IDX_AGREEMENT_NUM` (`AGREEMENT_NUM`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=DYNAMIC
/*!*/;
# at 475864451
結(jié)論:在主集群的binlog日志中找到了目標(biāo)表的建表語句,說明主集群執(zhí)行DDL時并沒有關(guān)閉binlog日志,那么繼續(xù)查看在災(zāi)備集群的中繼日志中是否存在DDL語句
3.4、解析災(zāi)備集群的中繼日志,確認(rèn)是否拉取到災(zāi)備集群
#230822 14:08:15 server id 1943306 end_log_pos 475863662 GTID last_committed=16341 sequence_number=16342 rbr_only=no
SET @@SESSION.GTID_NEXT= '9e668a93-2618-11ee-93ee-bc16954181bb:47498079'/*!*/;
……
use `information_schema`/*!*/;
SET TIMESTAMP=1692684495/*!*/;
CREATE TABLE `abs_xxx`.`tmp_xxx_info` (
`ID` int(64) NOT NULL AUTO_INCREMENT,
`CUSTOMER_NAME` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
`CUSTOMER_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
`PRODIST_SKU_NUM` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,
`STATUS` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,
`AGREEMENT_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
`END_DATE` datetime DEFAULT NULL,
PRIMARY KEY (`ID`),
KEY `PK_PRODIST_SKU_NUM_AGREEMENT` (`PRODIST_SKU_NUM`) USING BTREE,
KEY `IDX_AGREEMENT_NUM` (`AGREEMENT_NUM`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=DYNAMIC
/*!*/;
# at 475864660
#230822 14:08:15 server id 1943306 end_log_pos 475864512 GTID last_committed=16342 sequence_number=16343 rbr_only=yes
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
結(jié)論:災(zāi)備集群的中繼日志中存在DDL建表語句,說明并不是IO線程出了問題
3.5、排查復(fù)制配置的忽略庫表
Replicate_Ignore_DB: mysql,dbscale,dbscale_tmp,information_schema,performance_schema,sys
Replicate_Wild_Ignore_Table: A.ab,B.bc
結(jié)論:忽略庫表中并不包含目標(biāo)表,但是根據(jù)以上解析日志發(fā)現(xiàn),在主集群binlog日志中建表語句之前有個use information_schema/!/; 的語句,此庫為同步忽略的系統(tǒng)庫,因此觸發(fā)了GreatSQL的規(guī)范限制,在忽略庫下對未忽略進(jìn)行操作Statement模式下記錄語句默認(rèn)不起作用 (詳情:https://dev.mysql.com/doc/refman/5.7/en/replication-options-replica.html#option_mysqld_replicate-do-db)
4、解決同步報錯
在災(zāi)備集群創(chuàng)建目標(biāo)表
greatsql> CREATE TABLE `abs_xxx`.`tmp_xxx_info` (
`ID` int(64) NOT NULL AUTO_INCREMENT,
`CUSTOMER_NAME` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
`CUSTOMER_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
`PRODIST_SKU_NUM` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,
`STATUS` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,
`AGREEMENT_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
`END_DATE` datetime DEFAULT NULL,
PRIMARY KEY (`ID`),
KEY `PK_PRODIST_SKU_NUM_AGREEMENT` (`PRODIST_SKU_NUM`) USING BTREE,
KEY `IDX_AGREEMENT_NUM` (`AGREEMENT_NUM`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=DYNAMIC;
greatsql> stop slave;
greatsql> start slave;
結(jié)論:在災(zāi)備集群創(chuàng)建目標(biāo)表后重啟復(fù)制恢復(fù)成功
三、限制規(guī)避
1、第一種規(guī)避方式
執(zhí)行DDL時進(jìn)入目標(biāo)庫
greatsql> use abs_cust
greatsql> DDL 語句(CREATE\DROP\ALTER)
說明:在應(yīng)用連接數(shù)據(jù)庫時有可能默認(rèn)就是information_schema庫,而此環(huán)境將系統(tǒng)庫全部忽略,所以為了規(guī)避類似的問題,請在執(zhí)行SQL語句時請先use到目標(biāo)表的目標(biāo)庫。
2、第二種規(guī)避方式
修改主從復(fù)制配置,以下步驟為測試環(huán)境
關(guān)閉災(zāi)備集群在復(fù)制同步
greatsql> stop slave;
Query OK, 0 rows affected, 1 warning (0.03 sec)
修改忽略庫
greatsql> change replication filter Replicate_Ignore_DB=();
修改忽略表
greatsql> change replication filter replicate_wild_ignore_table =('mysql.%','information_schema.%','sys.%','performance_schema.%');
啟動同步
greatsql> start slave;
Query OK, 0 rows affected, 1 warning (0.37 sec)
測試驗證
主集群:
greatsql> use mysql
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
greatsql> create table test111.test111(id int primary key);
Query OK, 0 rows affected (0.06 sec)
greatsql> show tables;
+-------------------+
| Tables_in_test111 |
+-------------------+
| test111 |
+-------------------+
1 row in set (0.00 sec)
災(zāi)備集群:
greatsql> use test111
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
greatsql> show tables;
+-------------------+
| Tables_in_test111 |
+-------------------+
| test111 |
+-------------------+
1 row in set (0.00 sec)
說明:復(fù)制配置中參數(shù)Replicate_Ignore_DB設(shè)置為空,將replicate_wild_ignore_table參數(shù)設(shè)置為shema_name.%的方式也可以規(guī)避類似的問題