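From the implementation's point of view, why is the Clone plugin easier to use than Xtrabackup?
Starting with MySQL 8.0.17, MySQL officially provides the Clone feature. It lets users run a simple SQL command to copy a remote or local database instance to another instance and then quickly bring up a new instance.
The feature was delivered through a series of worklogs (WL):
- Clone local replica (WL#9209): implements local cloning of data.
- Clone remote replica (WL#9210): builds on local clone to implement remote clone. The data is saved to a directory on a remote host, which solves the problem of deploying MySQL across nodes.
- Clone Remote provisioning (WL#11636): copies the data directly into a MySQL instance that needs to be re-initialized. This WL also adds pre-checks.
- Clone Replication Coordinates (WL#9211): obtains and stores the replication position of the clone, so that the cloned instance can join a cluster cleanly.
- Support cloning encrypted database (WL#9682): the last worklog, which handles copying data when the data is encrypted.
This article gives a first introduction to how the Clone plugin works, how it is similar to and different from Xtrabackup, and the overall implementation framework.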
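1. Shortcomings of Xtrabackup backups
During an Xtrabackup backup, the biggest potential problem is that the redo log is copied more slowly than the production workload generates it.
The redo log is reused circularly, so once a checkpoint (CK) has passed, old redo can be overwritten by new redo. If Xtrabackup has not finished copying the old redo by then, the consistency of the data in the backup cannot be guaranteed.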
[Figure: How the redo log works. Image source: https://www.cnblogs.com/linuxk/p/9372990.html]
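The race between redo generation and redo copying can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: `RedoRing`, `still_available`, and `backup_survives` are hypothetical names, not InnoDB or Xtrabackup code, and the model assumes checkpoints always keep up so the server never stalls on a full log.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical toy model of a circular redo log; none of these names come
// from InnoDB or Xtrabackup.
struct RedoRing {
  std::uint64_t capacity;   // bytes the circular log can hold
  std::uint64_t write_lsn;  // total bytes the server has written so far
};

// Redo at copied_lsn is still on disk only if the writer has not advanced
// more than one full ring past it.
bool still_available(const RedoRing &ring, std::uint64_t copied_lsn) {
  return ring.write_lsn - copied_lsn <= ring.capacity;
}

// Simulates `ticks` time steps. In each step the server writes `gen_rate`
// bytes and the backup tool then copies up to `copy_rate` bytes.
// Returns false as soon as redo that was not yet copied has been overwritten.
bool backup_survives(std::uint64_t capacity, std::uint64_t gen_rate,
                     std::uint64_t copy_rate, int ticks) {
  RedoRing ring{capacity, 0};
  std::uint64_t copied_lsn = 0;
  for (int t = 0; t < ticks; ++t) {
    ring.write_lsn += gen_rate;
    if (!still_available(ring, copied_lsn)) return false;
    copied_lsn += std::min(copy_rate, ring.write_lsn - copied_lsn);
  }
  return true;
}
```

In this model, a copier that matches the generation rate never loses redo, while a copier that is slower falls further behind each step and eventually fails once the lag exceeds the ring capacity.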
2. Basic principles of the Clone implementation
How does the Clone plugin solve this problem? WL#9209 shows the overall design. The clone operation moves through 5 states:
- INIT: The clone object is initialized identified by a locator.
- FILE COPY: The state changes from INIT to "FILE COPY" when snapshot_copy interface is called. Before making the state change we start "Page Tracking" at lsn "CLONE START LSN". In this state we copy all database files and send to the caller.
- PAGE COPY: The state changes from "FILE COPY" to "PAGE COPY" after all files are copied and sent. Before making the state change we start "Redo Archiving" at lsn "CLONE FILE END LSN" and stop "Page Tracking". In this state, all modified pages as identified by Page IDs between "CLONE START LSN" and "CLONE FILE END LSN" are read from "buffer pool" and sent. We would sort the pages by space ID, page ID to avoid random read(donor) and random write(recipient) as much as possible.
- REDO COPY: The state changes from "PAGE COPY" to "REDO COPY" after all modified pages are sent. Before making the state change we stop "Redo Archiving" at lsn "CLONE LSN". This is the LSN of the cloned database. We would also need to capture the replication coordinates at this point in future. It should be the replication coordinate of the last committed transaction up to the "CLONE LSN". We send the redo logs from archived files in this state from "CLONE FILE END LSN" to "CLONE LSN" before moving to "Done" state.
- Done: The clone object is kept in this state till destroyed by snapshot_end() call.
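The state sequence described in WL#9209 can be sketched as a small state machine. This is an illustrative sketch only: `CloneState`, `next_state`, and `action_before_leaving` are hypothetical names, and the action strings simply paraphrase the worklog text above rather than reproduce InnoDB code.

```cpp
#include <string>

// Hypothetical enum mirroring the five states listed in WL#9209.
enum class CloneState { Init, FileCopy, PageCopy, RedoCopy, Done };

// Returns the state that follows `s`; Done is terminal.
CloneState next_state(CloneState s) {
  switch (s) {
    case CloneState::Init:
      return CloneState::FileCopy;
    case CloneState::FileCopy:
      return CloneState::PageCopy;
    case CloneState::PageCopy:
      return CloneState::RedoCopy;
    case CloneState::RedoCopy:
      return CloneState::Done;
    case CloneState::Done:
      return CloneState::Done;
  }
  return CloneState::Done;
}

// Describes what the worklog says happens just before leaving state `s`.
std::string action_before_leaving(CloneState s) {
  switch (s) {
    case CloneState::Init:
      return "start Page Tracking at CLONE START LSN";
    case CloneState::FileCopy:
      return "start Redo Archiving at CLONE FILE END LSN, stop Page Tracking";
    case CloneState::PageCopy:
      return "stop Redo Archiving at CLONE LSN";
    case CloneState::RedoCopy:
      return "send archived redo from CLONE FILE END LSN to CLONE LSN";
    case CloneState::Done:
      return "wait for snapshot_end()";
  }
  return "";
}
```

Walking from `CloneState::Init` with `next_state` visits the states in the order given in the list, and each call to `action_before_leaving` names the tracking or archiving switch that the worklog ties to that transition.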
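The most important of these are:
- FILE COPY: as in Xtrabackup, all InnoDB tablespace files are physically copied. At the same time, Page Tracking is started and records every InnoDB page modified from CLONE START LSN onward.
- PAGE COPY: a stage that does not exist in Xtrabackup. It does two things:
  - After the database files have been copied, Redo Archiving is started and Page Tracking is stopped (note: a checkpoint is performed before it starts). Redo Archiving copies the redo log from the specified LSN onward.
  - The dirty pages recorded by Page Tracking are sent to the target. For efficiency they are sorted by space ID and page ID, so that disk reads and writes are as sequential as possible.
- REDO COPY: in this stage a lock is taken to obtain the binlog file name, the current offset, and gtid_executed, and Redo Archiving is stopped. All archived redo log files are then sent to the target.
Figure: The three key stages of Clone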
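The ordering step in PAGE COPY can be sketched as follows. This is a minimal illustration, not the InnoDB implementation: `PageId` and `order_for_copy` are hypothetical names, and removing duplicates is an added assumption (a page may be recorded more than once while tracking is active).

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical page identifier: tablespace ID plus page number within it.
struct PageId {
  std::uint32_t space_id;
  std::uint32_t page_no;
};

// Sorts tracked pages by (space_id, page_no) and drops duplicates, so the
// donor reads and the recipient writes each file in ascending page order.
std::vector<PageId> order_for_copy(std::vector<PageId> pages) {
  std::sort(pages.begin(), pages.end(), [](const PageId &a, const PageId &b) {
    return std::tie(a.space_id, a.page_no) < std::tie(b.space_id, b.page_no);
  });
  auto same = [](const PageId &a, const PageId &b) {
    return a.space_id == b.space_id && a.page_no == b.page_no;
  };
  pages.erase(std::unique(pages.begin(), pages.end(), same), pages.end());
  return pages;
}
```

Sorting first by tablespace and then by page number groups all pages of one file together, which is what turns scattered page accesses into mostly sequential ones.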
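3. Code structure and call logic
The implementation is split into three parts:
SQL/Server layer:
- sql/sql_lex.h
- sql/sql_yacc.yy
Add support for the Clone syntax.
- sql_admin.cc
Adds client-side handling of the SQL statement (clone instance) and server-side handling of the COM_XXX command.
- clone_handler.cc
Adds the calls into the plugin that carry out the work requested by the SQL layer.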
Plugin layer
- clone_plugin.cc : plugin interface; the entry point of the Clone plugin, its initialization, and its system variables.
- clone_local.cc : the actual local Clone operation.
- clone_os.cc : OS-level helper functions, including OS [sendfile/read/write].
- clone_hton.cc : the interface to the storage engine layer.
- clone_client.cc and clone_server.cc : the Clone client and server.
- clone_status.cc : overall task progress and status during a Clone. A Clone_Task_Manager records the status information.
Figure: Plugin layer directory
InnoDB engine layer
- Clone: storage/innobase/clone
  - clone0clone.cc : clone task and runtime operation
  - clone0snapshot.cc : snapshot management
  - clone0copy.cc : copy specific methods
  - clone0apply.cc : apply specific methods
  - clone0desc.cc : serialized data descriptor
- Archiver: storage/innobase/arch : archiver code, including Page Tracking.
  - arch0arch.cc
  - arch0page.cc
  - arch0log.cc
Function call stack for a local Clone:
Clone_Handle::process_chunk(Clone_Task*, unsigned int, unsigned int, Ha_clone_cbk*) (/mysql-8.0.33/storage/innobase/clone/clone0copy.cc:1440)
Clone_Handle::copy(unsigned int, Ha_clone_cbk*) (/mysql-8.0.33/storage/innobase/clone/clone0copy.cc:1379)
innodb_clone_copy(handlerton*, THD*, unsigned char const*, unsigned int, unsigned int, Ha_clone_cbk*) (/mysql-8.0.33/storage/innobase/clone/clone0api.cc:561)
hton_clone_copy(THD*, std::__1::vector<myclone::Locator, std::__1::allocator<myclone::Locator>>&, std::__1::vector<unsigned int, std::__1::allocator<unsigned int>>&, Ha_clone_cbk*) (/mysql-8.0.33/plugin/clone/src/clone_hton.cc:152)
myclone::Local::clone_exec() (/mysql-8.0.33/plugin/clone/src/clone_local.cc:172)
myclone::Local::clone() (/mysql-8.0.33/plugin/clone/src/clone_local.cc:73)
plugin_clone_local(THD*, char const*) (/mysql-8.0.33/plugin/clone/src/clone_plugin.cc:456)
Clone_handler::clone_local(THD*, char const*) (/mysql-8.0.33/sql/clone_handler.cc:135)
Sql_cmd_clone::execute(THD*) (/mysql-8.0.33/sql/sql_admin.cc:2017)
mysql_execute_command(THD*, bool) (/mysql-8.0.33/sql/sql_parse.cc:4714)
dispatch_sql_command(THD*, Parser_state*) (/mysql-8.0.33/sql/sql_parse.cc:5363)
dispatch_command(THD*, COM_DATA const*, enum_server_command) (/mysql-8.0.33/sql/sql_parse.cc:2050)
do_command(THD*) (/mysql-8.0.33/sql/sql_parse.cc:1439)
handle_connection(void*) (/mysql-8.0.33/sql/conn_handler/connection_handler_per_thread.cc:302)
pfs_spawn_thread(void*) (/mysql-8.0.33/storage/perfschema/pfs.cc:3042)
_pthread_start (@_pthread_start:40)
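Figure: Clone function calls
4. Page Archiving system
Page Archiving has no counterpart in Xtrabackup, so its implementation is described here in more detail.
To reduce the amount of redo log copied during a Clone, the Clone plugin tracks and collects dirty pages. It tracks pages dirtied while the tablespaces are being copied and, when File Copy ends, sends those pages to the target.
Dirty page tracking (Page Tracking) could be implemented in two ways:
- Collect the pages when an mtr commits.
- Collect the pages when dirty pages are flushed to disk.
So as not to block MySQL transaction commits, the Clone plugin currently uses option 2.
The entry point for flushing a dirty page is the buf_flush_page function.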
buf0flu.cc
if (flush) {
  /* We are committed to flushing by the time we get here */
  mutex_enter(&buf_pool->flush_state_mutex);
  ....
  arch_page_sys->track_page(bpage, buf_pool->track_page_lsn, frame_lsn,
                            false);
}
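When a dirty page is flushed to disk, a page that needs tracking is added to arch_page_sys. If the current block fills up while the page is being added and a new block has to be obtained, the flushing thread can be blocked.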
/** Check and add page ID to archived data.
Check for duplicate page.
@param[in]      bpage           page to track
@param[in]      track_lsn       LSN when tracking started
@param[in]      frame_lsn       current LSN of the page
@param[in]      force           if true, add page ID without check */
void Arch_Page_Sys::track_page(buf_page_t *bpage, lsn_t track_lsn,
                               lsn_t frame_lsn, bool force) {
  Arch_Block *cur_blk;
  uint count = 0;
  ... ...
  /* We need to track this page. */
  arch_oper_mutex_enter();
  while (true) {
    if (m_state != ARCH_STATE_ACTIVE) {
      break;
    }
    ... ...
    cur_blk = m_data.get_block(&m_write_pos, ARCH_DATA_BLOCK);
    if (cur_blk->get_state() == ARCH_BLOCK_ACTIVE) {
      if (cur_blk->add_page(bpage, &m_write_pos)) {
        /* page added successfully. */
        break;
      }
      /* Current block is full. Move to next block. */
      cur_blk->end_write();
      m_write_pos.set_next();
      /* Writing to a new file so move to the next reset block. */
      if (m_write_pos.m_block_num % ARCH_PAGE_FILE_DATA_CAPACITY == 0) {
        Arch_Block *reset_block =
            m_data.get_block(&m_reset_pos, ARCH_RESET_BLOCK);
        reset_block->end_write();
        m_reset_pos.set_next();
      }
      os_event_set(page_archiver_thread_event);
      ++count;
      continue;
    } else if (cur_blk->get_state() == ARCH_BLOCK_INIT ||
               cur_blk->get_state() == ARCH_BLOCK_FLUSHED) {
      ut_ad(m_write_pos.m_offset == ARCH_PAGE_BLK_HEADER_LENGTH);
      cur_blk->begin_write(m_write_pos);
      if (!cur_blk->add_page(bpage, &m_write_pos)) {
        /* Should always succeed. */
        ut_d(ut_error);
      }
      /* page added successfully. */
      break;
    } else {
      bool success;
      ... ...
      /* Might release operation mutex temporarily. Need to
      loop again verifying the state. */
      success = wait_flush_archiver(cbk);
      count = success ? 0 : 2;
      continue;
    }
  }
  arch_oper_mutex_exit();
}
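The block handling in track_page can be easier to follow in a stripped-down model. The sketch below is a hypothetical simplification, not the InnoDB data structure: `TrackBuffer`, `BlockState`, and `flush_ready` are invented names, there is no mutex or background thread, and "would have to wait for the archiver" is reported as a `false` return instead of an actual wait. It assumes at least one block and a capacity of at least one page ID per block.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical block states, loosely following the states seen in track_page.
enum class BlockState { Init, Active, ReadyToFlush, Flushed };

struct Block {
  BlockState state = BlockState::Init;
  std::vector<std::uint64_t> page_ids;
};

// Toy ring of fixed-capacity blocks that collect tracked page IDs.
class TrackBuffer {
 public:
  TrackBuffer(std::size_t n_blocks, std::size_t pages_per_block)
      : blocks_(n_blocks), capacity_(pages_per_block) {}

  // Adds a page ID. Returns false when the next block has not been flushed
  // yet, which is the case where the real flusher would have to wait.
  bool track(std::uint64_t page_id) {
    for (;;) {
      Block &blk = blocks_[write_pos_ % blocks_.size()];
      if (blk.state == BlockState::Active) {
        if (blk.page_ids.size() < capacity_) {
          blk.page_ids.push_back(page_id);
          return true;
        }
        // Current block is full: hand it to the archiver, move to the next.
        blk.state = BlockState::ReadyToFlush;
        ++write_pos_;
        continue;
      }
      if (blk.state == BlockState::Init || blk.state == BlockState::Flushed) {
        blk.page_ids.clear();
        blk.state = BlockState::Active;
        blk.page_ids.push_back(page_id);
        return true;
      }
      return false;  // ReadyToFlush: the archiver has not caught up.
    }
  }

  // Stand-in for the archiver thread: marks every full block as flushed and
  // returns how many blocks were flushed.
  std::size_t flush_ready() {
    std::size_t flushed = 0;
    for (Block &blk : blocks_) {
      if (blk.state == BlockState::ReadyToFlush) {
        blk.state = BlockState::Flushed;
        ++flushed;
      }
    }
    return flushed;
  }

 private:
  std::vector<Block> blocks_;
  std::size_t capacity_;
  std::size_t write_pos_ = 0;
};
```

With two blocks of two page IDs each, four calls to `track` succeed, the fifth returns `false` because both blocks are full and unflushed, and tracking resumes after `flush_ready` runs. That is the situation in which a slow archiver can hold up page flushing.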
The overall entry points for dirty page collection are Page_Arch_Client_Ctx::start and Arch_Page_Sys::start.
Note that a checkpoint is forced when Page Archiving is started, so on a heavily loaded system (for example, one with high IO wait) this can cause stalls.
int Page_Arch_Client_Ctx::start(bool recovery, uint64_t *start_id) {
  ... ...
  /* Start archiving. */
  err = arch_page_sys->start(&m_group, &m_last_reset_lsn, &m_start_pos,
                             m_is_durable, reset, recovery);
  ... ...
}
int Arch_Page_Sys::start(Arch_Group **group, lsn_t *start_lsn,
                         Arch_Page_Pos *start_pos, bool is_durable,
                         bool restart, bool recovery) {
  ... ...
  log_sys_lsn = (recovery ? m_last_lsn : log_get_lsn(*log_sys));
  /* Enable/Reset buffer pool page tracking. */
  set_tracking_buf_pool(log_sys_lsn);  // page_id
  ... ...
  auto err = start_page_archiver_background();  // sp_id, page_id
  ... ...
  if (!recovery) {
    /* Request checkpoint */
    log_request_checkpoint(*log_sys, true);  // checkpoint
  }
}
Dirty page archiving is carried out by the page_archiver_thread thread:
/** Archiver background thread */
void page_archiver_thread() {
  bool page_wait = false;
  ... ...
  while (true) {
    /* Archive in memory data blocks to disk. */
    auto page_abort = arch_page_sys->archive(&page_wait);
    if (page_abort) {
      ib::info(ER_IB_MSG_14) << "Exiting Page Archiver";
      break;
    }
    if (page_wait) {
      /* Nothing to archive. Wait until next trigger. */
      os_event_wait(page_archiver_thread_event);
      os_event_reset(page_archiver_thread_event);
    }
  }
}
bool Arch_Page_Sys::archive(bool *wait) {
  ... ...
  db_err = flush_blocks(wait);
  if (db_err != DB_SUCCESS) {
    is_abort = true;
  }
  ... ...
  return (is_abort);
}
dberr_t Arch_Page_Sys::flush_blocks(bool *wait) {
  ... ...
  err = flush_inactive_blocks(cur_pos, end_pos);
  ... ...
}
dberr_t Arch_Page_Sys::flush_inactive_blocks(Arch_Page_Pos &cur_pos,
                                             Arch_Page_Pos end_pos) {
  /* Write all blocks that are ready for flushing. */
  while (cur_pos.m_block_num < end_pos.m_block_num) {
    cur_blk = m_data.get_block(&cur_pos, ARCH_DATA_BLOCK);
    err = cur_blk->flush(m_current_group, ARCH_FLUSH_NORMAL);
    if (err != DB_SUCCESS) {
      break;
    }
    ... ...
  }
  return (err);
}
Finally, Arch_Block is called to archive the tracked pages. Note that this archiving step also goes through a doublewrite buffer.
/** Flush this block to the file group.
@param[in]      file_group      current archive group
@param[in]      type            flush type
@return error code. */
dberr_t Arch_Block::flush(Arch_Group *file_group, Arch_Blk_Flush_Type type) {
  ... ...
  switch (m_type) {
    case ARCH_RESET_BLOCK:
      err = file_group->write_file_header(m_data, m_size);
      break;
    case ARCH_DATA_BLOCK: {
      bool is_partial_flush = (type == ARCH_FLUSH_PARTIAL);
      /* Callback responsible for setting up file's header starting at offset 0.
      This header is left empty within this flush operation. */
      auto get_empty_file_header_cbk = [](uint64_t, byte *) {
        return DB_SUCCESS;
      };
      /* We allow partial flush to happen even if there were no pages added
      since the last partial flush as the block's header might contain some
      useful info required during recovery. */
      err = file_group->write_to_file(nullptr, m_data, m_size, is_partial_flush,
                                      true, get_empty_file_header_cbk);
      break;
    }
    default:
      ut_d(ut_error);
  }
  return (err);
}
dberr_t Arch_Group::write_to_file(Arch_File_Ctx *from_file, byte *from_buffer,
                                  uint length, bool partial_write,
                                  bool do_persist,
                                  Get_file_header_callback get_header) {
  ... ...
  if (do_persist) {
    Arch_Page_Dblwr_Offset dblwr_offset =
        (partial_write ? ARCH_PAGE_DBLWR_PARTIAL_FLUSH_PAGE
                       : ARCH_PAGE_DBLWR_FULL_FLUSH_PAGE);
    /** Write to the doublewrite buffer before writing archived data to a file.
    The source is either a file context or buffer. Caller must ensure that data
    is in single file in source file context. **/
    Arch_Group::write_to_doublewrite_file(from_file, from_buffer, write_size,
                                          dblwr_offset);
  }
  ... ...
  return (DB_SUCCESS);
}
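5. Summary
- Compared with using Xtrabackup to bring up a replica, Clone is more convenient.
- Compared with Xtrabackup, Clone copies less redo log and is less likely to fail (arch_log_sys throttles log writes so that redo that has not been archived yet is not overwritten).
- Judging from the source code, a checkpoint (CK) is forced when a Clone starts, and the log write volume is throttled during Redo Log Archiving. In principle, then, running a Clone on a heavily loaded primary may affect the system.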