自拍偷在线精品自拍偷,亚洲欧美中文日韩v在线观看不卡

51CTO首頁(yè)

AI.x社區(qū)

軟考社區(qū)

免費(fèi)課

企業(yè)培訓(xùn)

鴻蒙開(kāi)發(fā)者社區(qū)

WOT技術(shù)大會(huì)

公眾號(hào)矩陣

移動(dòng)端

視頻課免費(fèi)課排行榜短視頻直播課軟考學(xué)堂

全部課程軟考華為認(rèn)證廠商認(rèn)證 IT技術(shù)PMP項(xiàng)目管理免費(fèi)題庫(kù)

在線學(xué)習(xí)

文章資源問(wèn)答課堂專欄直播

51CTO

鴻蒙開(kāi)發(fā)者社區(qū)

51CTO技術(shù)棧

51CTO官微

51CTO學(xué)堂

51CTO博客

CTO訓(xùn)練營(yíng)

鴻蒙開(kāi)發(fā)者社區(qū)訂閱號(hào)

51CTO軟考

51CTO學(xué)堂APP

51CTO學(xué)堂企業(yè)版APP

鴻蒙開(kāi)發(fā)者社區(qū)視頻號(hào)

51CTO軟考題庫(kù)

賬號(hào)設(shè)置退出

MySQL 記錄、頁(yè)、索引的數(shù)據(jù)結(jié)構(gòu)簡(jiǎn)析

作者：鳥(niǎo)山明 2023-12-28 10:54:58

數(shù)據(jù)庫(kù) MySQL

MySQL 中有行格式有三種存儲(chǔ)方式，包括 Server 層的格式、索引元組格式（邏輯記錄，tuple）、物理存儲(chǔ)記錄格式（record）。

引言

本文在介紹 MySQL 內(nèi)存中記錄、頁(yè)、索引、游標(biāo)的數(shù)據(jù)結(jié)構(gòu)的基礎(chǔ)上，通過(guò)簡(jiǎn)單分析插入操作過(guò)程中行格式的轉(zhuǎn)換介紹了不同數(shù)據(jù)結(jié)構(gòu)的關(guān)系，其中不涉及加鎖相關(guān)邏輯。

原理

記錄

概念

InnoDB 存儲(chǔ)引擎基于記錄（record）存儲(chǔ)，表明記錄是根據(jù)行（row）格式存儲(chǔ)。

MySQL 中有行格式有以下三種存儲(chǔ)方式：

Server 層的格式，這種格式與存儲(chǔ)引擎沒(méi)關(guān)系，適用于所有存儲(chǔ)引擎，這種格式是操作數(shù)據(jù)必然要經(jīng)過(guò)的一種格式，也是 Row 模式下 binlog 存儲(chǔ)所使用的一種格式；
索引元組格式，這種格式是 InnoDB 存儲(chǔ)引擎在存取記錄時(shí)一種記錄格式的中間狀態(tài)，它要么是從 Server 層轉(zhuǎn)換而來(lái)，要是么轉(zhuǎn)換為 Server 層的格式。索引元組格式與索引一一對(duì)應(yīng)，是內(nèi)存中一種用來(lái)存儲(chǔ)索引中所有列數(shù)據(jù)的數(shù)據(jù)結(jié)構(gòu)，對(duì)應(yīng)邏輯記錄（tuple）；
物理存儲(chǔ)記錄格式，這種格式是一條記錄在物理頁(yè)面中的存儲(chǔ)格式，也就是 Compact 格式。這種格式與索引元組格式一一對(duì)應(yīng)，對(duì)應(yīng)物理記錄（record）。

因此：

存數(shù)據(jù)時(shí)，發(fā)生以下轉(zhuǎn)換：Server 層的格式 -> 索引元組格式 -> 物理存儲(chǔ)記錄格式；
取數(shù)據(jù)時(shí)，發(fā)生以下轉(zhuǎn)換：物理存儲(chǔ)記錄格式 -> 索引元組格式 -> Server 層的格式。

個(gè)人理解存取數(shù)據(jù)時(shí)完成二進(jìn)制到“字符串”的相互轉(zhuǎn)換（忽略其他數(shù)據(jù)類型），相當(dāng)于客戶端與服務(wù)端交互時(shí)的編碼與解碼。

數(shù)據(jù)結(jié)構(gòu)

內(nèi)存中邏輯記錄與其中的字段分別對(duì)應(yīng)數(shù)據(jù)結(jié)構(gòu) dtuple_t 與 dfield_t。

結(jié)構(gòu)體定義如下所示。

/** Structure for an SQL data tuple of fields (logical record) */
struct dtuple_t {
 ulint  n_fields; /*!< number of fields in dtuple */
 ulint  n_fields_cmp; /*!< number of fields which should
     be used in comparison services
     of rem0cmp.*; the index search
     is performed by comparing only these
     fields, others are ignored; the
     default value in dtuple creation is
     the same value as n_fields */
 dfield_t* fields;  /*!< fields */
 UT_LIST_NODE_T(dtuple_t) tuple_list;
     /*!< data tuples can be linked into a
     list using this field */
};

其中：

ulint n_fields，表示列的數(shù)量；
dfield_t* fields，表示記錄中的列，存儲(chǔ)指向每一列真實(shí)數(shù)據(jù)的指針；
UT_LIST_NODE_T(dtuple_t) tuple_list，連接多個(gè) dtuple_t，原因是每個(gè)索引對(duì)應(yīng)一個(gè) dtuple_t，比如 INSERT SQL 需要向主鍵索引和二級(jí)索引中都插入時(shí)。

數(shù)據(jù)結(jié)構(gòu) dfield_t 中保存列的信息。

/** Structure for an SQL data field */
struct dfield_t{
 void*  data; /*!< pointer to data */
 unsigned ext:1; /*!< TRUE=externally stored, FALSE=local */
 unsigned len; /*!< data length; UNIV_SQL_NULL if SQL null */
 dtype_t  type; /*!< type of data */
};

其中：

data，表示真實(shí)列數(shù)據(jù)的指針；
ext:1，如果是大記錄（blob），則在外部頁(yè)存儲(chǔ)；
len，表示列數(shù)據(jù)的長(zhǎng)度；
type，表示列數(shù)據(jù)的類型。

物理記錄由 rec_t 數(shù)據(jù)結(jié)構(gòu)表示，字節(jié)類型。

/* We define the physical record simply as an array of bytes */
typedef byte rec_t;

頁(yè)

概念

頁(yè)和記錄類似同樣可以分為以下兩種形式：

物理頁(yè)（block），表示存儲(chǔ)在外部存儲(chǔ)設(shè)備上，一般是持久的。block 是文件系統(tǒng)中一次 IO 的大??；
內(nèi)存頁(yè)（page），表示存儲(chǔ)在緩沖池（buffer pool）中，當(dāng)物理頁(yè)與內(nèi)存頁(yè)不同時(shí)表明是臟頁(yè)。

頁(yè)分多種類型，比如 index page、undo log page 等，本文關(guān)注的是 index page。

由于 InnoDB 是索引組織樹(shù)，索引即數(shù)據(jù)，因此可以將索引頁(yè)理解為數(shù)據(jù)頁(yè)。

InnoDB 中頁(yè)是一個(gè)無(wú)序堆，因此頁(yè)中的記錄無(wú)序存放，記錄間通過(guò) record header 中的 next record 組成單向鏈表。

數(shù)據(jù)結(jié)構(gòu)

buffer pool 中與頁(yè)相關(guān)的有兩個(gè)數(shù)據(jù)結(jié)構(gòu)，包括 buf_block_t 和 buf_page_t 。

buf_block_t 是數(shù)據(jù)頁(yè)控制體的一種。

/** The buffer control block structure */

struct buf_block_t{

 buf_page_t page;  /*!< page information; this must
     be the first field, so that
     buf_pool->page_hash can point
     to buf_page_t or buf_block_t */
 byte*  frame;
 BPageLock lock;  /*!< read-write lock of the buffer frame */

其中：

block->buf_page_t，數(shù)據(jù)頁(yè)控制塊指針，要求是 buf_block_t 的第一個(gè)成員，便于 buf_block_t 隨時(shí)強(qiáng)制類型轉(zhuǎn)換為 buf_page_t；
block->frame，數(shù)據(jù)頁(yè)指針，指向頁(yè)對(duì)應(yīng)在緩沖池中的內(nèi)存地址，保存頁(yè)的真實(shí)內(nèi)容；
block->lock，讀寫鎖用于保護(hù) page 的內(nèi)容。

buf_page_t 稱為數(shù)據(jù)頁(yè)控制體。

/** The common buffer control block structure
for compressed and uncompressed frames */

class buf_page_t {
public:
  
  page_id_t id; // page id
  buf_page_t* hash;  /*!< node used in chaining to
     buf_pool->page_hash or
     buf_pool->zip_hash */
  lsn_t  newest_modification; // 當(dāng)前Page最新修改lsn
  lsn_t  oldest_modification; // 當(dāng)前Page最老修改lsn，即第一條修改lsn
  
};

其中：

oldest_modification，最初修改該頁(yè) redo log record 的 end LSN（或者是 mtr end LSN），oldest_modification 不為 0 表示是臟頁(yè)；
newest_modification，最近修改該頁(yè) redo log record 的 end LSN（或者是 mtr end LSN）。

索引

概念

索引是基于以空間換時(shí)間的思想用于加速查詢的數(shù)據(jù)結(jié)構(gòu)。

InnoDB 中索引基于 B+ 樹(shù)實(shí)現(xiàn)，因此每個(gè)索引都是一棵 B+ 樹(shù)。索引用于組織頁(yè)，頁(yè)用于組織行記錄。

數(shù)據(jù)結(jié)構(gòu)

每個(gè) B+ 樹(shù)索引在內(nèi)存中對(duì)應(yīng)數(shù)據(jù)結(jié)構(gòu) dict_index_t。

/** Data structure for an index.  Most fields will be
initialized to 0, NULL or FALSE in dict_mem_index_create(). */
struct dict_index_t{
 index_id_t id; /*!< id of the index */
 mem_heap_t* heap; /*!< memory heap */
 id_name_t name; /*!< index name */
 const char* table_name;/*!< table name */
 dict_table_t* table; /*!< back pointer to table */
  unsigned space:32;
    /*!< space where the index tree is placed */
 unsigned page:32;/*!< index tree root page number */
  unsigned allow_duplicates:1;
      /*!< if true, allow duplicate values
      even if index is created with unique
      constraint */
  unsigned n_uniq:10;/*!< number of fields from the beginning
    which are enough to determine an index
    entry uniquely */
 unsigned n_def:10;/*!< number of fields defined so far */
 unsigned n_fields:10;/*!< number of fields in the index */
 unsigned n_nullable:10;/*!< number of nullable fields */
  dict_field_t* fields; /*!< array of field descriptions */
};

其中：

index->table，指向當(dāng)前索引對(duì)應(yīng)的表；
index->n_fields，表示當(dāng)前索引中包含的列數(shù)；
index->page，表示當(dāng)前索引 root 頁(yè)的編號(hào)（偏移量）；
index->fields，表示當(dāng)前索引列的描述信息。

通過(guò) B+ 樹(shù)索引進(jìn)行查找（lookup）是最為常見(jiàn)的操作，具體通過(guò)游標(biāo)實(shí)現(xiàn)。

游標(biāo)

概念

游標(biāo)是一個(gè)邏輯概念，無(wú)論是寫入還是查詢，都需要在 B+ 樹(shù)上遍歷至某個(gè)位置，具體到某個(gè) block 上的某個(gè) record。

InnoDB 中通過(guò) cursor search 實(shí)現(xiàn) B+ 樹(shù)的查找，每個(gè) open 一個(gè) cursor 都會(huì)開(kāi)啟一個(gè)從 B+ 樹(shù) root 結(jié)點(diǎn)搜索到指定層級(jí)的 record 的搜索過(guò)程。

cursor 分為以下兩種：

B-tree cursor
page cursor，表示指向記錄所在位置的游標(biāo)。

page cursor 在定位（lookup）到記錄后，再通過(guò)查詢模式（search mode）再進(jìn)行向前或向后的記錄掃描（scan）。

比如對(duì)于一個(gè)主鍵的范圍查詢，首先需要定位第一條記錄，然后進(jìn)行記錄的順序掃描。

InnoDB 中定義了四種查詢模式：

PAGE_CUR_G: > 查詢，查詢第一個(gè)大于 dtuple 的 rec_t
PAGE_CUR_GE: >=，> 查詢，查詢第一個(gè)大于等于 dtuple 的 rec_t
PAGE_CUR_L: < 查詢，查詢最后一個(gè)小于 dtuple 的 rec_t
PAGE_CUR_LE: <= 查詢，查詢最后一個(gè)小于等于 dtuple 的 rec_t

插入操作的 search_mode 默認(rèn)是 PAGE_CUR_LE，即插在最后一個(gè)小于等于該 dtuple 的 rec_t 后。

為了提高查詢效率，InnoDB 做了以下優(yōu)化：

record buffer，也就是預(yù)讀，連續(xù)讀取多條記錄；
persistent cursor（持久化游標(biāo)，簡(jiǎn)稱 pcur），用于保存每次查詢到的記錄，待查詢下一條記錄時(shí)，首先恢復(fù)上一次查詢的記錄，然后再獲取下一條記錄；
Adaptive Hash Index（自適應(yīng)哈希索引，簡(jiǎn)稱 AHI），根據(jù)熱點(diǎn)行記錄進(jìn)行索引，但是如果記錄發(fā)生變化，也需要維護(hù)自適應(yīng)哈希索引。

數(shù)據(jù)結(jié)構(gòu)

B-tree cursor 對(duì)應(yīng) btr_cur_t 結(jié)構(gòu)體。

/** The tree cursor: the definition appears here only for the compiler
to know struct size! */
struct btr_cur_t {
 btr_cur_t() { memset(this, 0, sizeof(*this)); }

 dict_index_t* index;  /*!< index where positioned */
 page_cur_t page_cur; /*!< page cursor */
 purge_node_t* purge_node; /*!< purge node, for BTR_DELETE */
 buf_block_t* left_block; /*!< this field is used to store
     a pointer to the left neighbor
     page, in the cases
     BTR_SEARCH_PREV and
     BTR_MODIFY_PREV */
 /*------------------------------*/
 que_thr_t* thr;  /*!< this field is only used
     when btr_cur_search_to_nth_level
     is called for an index entry
     insertion: the calling query
     thread is passed here to be
     used in the insert buffer */
 /*------------------------------*/

};

其中：

page_cur_t page_cur，保存 page cursor 。

page cursor 對(duì)應(yīng) page_cur_t 結(jié)構(gòu)體。

/** Index page cursor */

struct page_cur_t{
 const dict_index_t* index;
 rec_t*  rec; /*!< pointer to a record on page */
 ulint*  offsets;
 buf_block_t* block; /*!< pointer to the block containing rec */
};

其中：

rec_t* rec，表示游標(biāo)查詢得到的記錄；
buf_block_t* block，表示表示游標(biāo)查詢得到的記錄所在塊信息。

persistent cursor 對(duì)應(yīng) btr_pcur_t 數(shù)據(jù)結(jié)構(gòu)。

/* The persistent B-tree cursor structure. This is used mainly for SQL
selects, updates, and deletes. */

struct btr_pcur_t{
 btr_pcur_t() { memset(this, 0, sizeof(*this)); }

 /** a B-tree cursor */
 btr_cur_t btr_cur;
  
  rec_t*  old_rec;
   
  /** Return the index of this persistent cursor */
 dict_index_t* index() const { return(btr_cur.index); }
};

其中：

rec_t* old_rec，持久化游標(biāo)中保存查詢得到的記錄。

下面以 insert 語(yǔ)句的執(zhí)行為例，分析不同數(shù)據(jù)結(jié)構(gòu)之間的關(guān)系。

insert 語(yǔ)句

build_template

從 server 層調(diào)用存儲(chǔ)引擎的接口 handler 開(kāi)始，具體是 ha_innobase::write_row 函數(shù)。

// storage/innbase/handler/ha_innodb.cc

/* Stores a row in an InnoDB database, to the table specified in this handle. */
int
ha_innobase::write_row(
/*===================*/
 uchar* record) /*!< in: a row in MySQL format */
{
  /* Step-4: Prepare INSERT graph that will be executed for actual INSERT
 (This is a one time operation) */
 if (m_prebuilt->mysql_template == NULL
     || m_prebuilt->template_type != ROW_MYSQL_WHOLE_ROW) {

  /* Build the template used in converting quickly between
  the two database formats */

  build_template(true);
 }
  
  /* Step-5: Execute insert graph that will result in actual insert. */
 // 插入數(shù)據(jù)
 error = row_insert_for_mysql((byte*) record, m_prebuilt);
}

其中：

入?yún)?uchar* record 指向 table->record[0]，其中不包含表列的類型、列的長(zhǎng)度、列數(shù)等信息，完全是客戶端插入的數(shù)據(jù)；

unsigned char 數(shù)據(jù)類型經(jīng)常被用于處理原始字節(jié)數(shù)據(jù)，比如在進(jìn)行二進(jìn)制文件讀寫、網(wǎng)絡(luò)通信、處理圖像數(shù)據(jù)、編碼轉(zhuǎn)換或者其他需要以字節(jié)為單位操作的場(chǎng)景中。利用 unsigned char 可以確保不會(huì)有負(fù)值出現(xiàn)，這在與二進(jìn)制數(shù)據(jù)和位操作打交道時(shí)非常有用。

調(diào)用 build_template 函數(shù)初始化 m_prebuilt 變量；
調(diào)用 row_insert_for_mysql 函數(shù)插入數(shù)據(jù)，入?yún)⒅袑?record 指針從 uchar 轉(zhuǎn)換成 byte 類型。

typedef unsigned char uchar; /* Short for unsigned char */

/* Note that inside MySQL 'byte' is defined as char on Linux! */
#define byte   unsigned char

m_prebuilt 變量是 row_prebuilt_t 結(jié)構(gòu)體指針類型，其中部分成員如下所示，包括表、索引、游標(biāo)等各種信息。

/** Save CPU time with prebuilt/cached data structures */
row_prebuilt_t*  m_prebuilt;

/** A struct for (sometimes lazily) prebuilt structures in an Innobase table
handle used within MySQL; these are used to save CPU time. */

struct row_prebuilt_t {
 dict_table_t* table;  /*!< Innobase table handle */
 dict_index_t* index;  /*!< current index for a search, if any */
 trx_t*  trx;  /*!< current transaction handle */
  unsigned n_template:10; /*!< number of elements in the
        template */
  ins_node_t* ins_node; /*!< Innobase SQL insert node
        used to perform inserts
        to the table */
  byte*  ins_upd_rec_buff;/*!< buffer for storing data converted
        to the Innobase format from the MySQL
        format */
  mysql_row_templ_t* mysql_template;/*!< template used to transform
        rows fast between MySQL and Innobase
        formats; memory for this template
        is not allocated from 'heap' */
  btr_pcur_t* pcur;  /*!< persistent cursor used in selects
        and updates */
};

其中：

ins_node_t* ins_node，該結(jié)構(gòu)體很重要，下文中將介紹；
btr_pcur_t* pcur，持久化游標(biāo)用于保存當(dāng)前查詢的記錄；
mysql_row_templ_t* mysql_template，輔助結(jié)構(gòu)，主要保存列的元數(shù)據(jù)信息，用于加快行記錄格式的轉(zhuǎn)換。

row_prebuilt_t 結(jié)構(gòu)體與 build_template 函數(shù)的作用參考 chatgpt：

row_prebuilt_t:row_prebuilt_t結(jié)構(gòu)體是 InnoDB 存儲(chǔ)引擎用來(lái)預(yù)存儲(chǔ)有關(guān)某個(gè)表操作所需的信息的數(shù)據(jù)結(jié)構(gòu)。這個(gè)結(jié)構(gòu)體包括了眾多和查詢執(zhí)行相關(guān)的字段，如指向 InnoDB 內(nèi)部諸如表描述（dict_table_t）、索引描述（dict_index_t）的指針；控制當(dāng)前操作的游標(biāo)狀態(tài)；查詢的模板信息；用于讀寫操作的緩沖區(qū)；鎖定信息等。簡(jiǎn)言之，row_prebuilt_t是一個(gè)操作上下文，它保存了InnoDB在執(zhí)行諸如插入、更新、刪除、查詢等操作時(shí)所需要的所有上下文信息。
build_template:build_template函數(shù)的主要工作是為 SQL 語(yǔ)句的執(zhí)行準(zhǔn)備模板數(shù)據(jù)，這些數(shù)據(jù)用于確定當(dāng)從存儲(chǔ)引擎獲取數(shù)據(jù)時(shí)應(yīng)該返回哪些列，以及如何快速地從 InnoDB 的內(nèi)部格式轉(zhuǎn)換為 MySQL 格式。例如，如果一個(gè) SELECT 語(yǔ)句只查詢了表中的一些列，則build_template將生成相應(yīng)的模板以確保只有這些被查詢的列的數(shù)據(jù)被讀取和轉(zhuǎn)換。這個(gè)過(guò)程包括決定哪些列可以跳過(guò)、哪些列需要轉(zhuǎn)換等。這樣就可以避免不必要的數(shù)據(jù)轉(zhuǎn)換和傳輸，提高查詢效率。
在 InnoDB 存儲(chǔ)引擎的設(shè)計(jì)中，row_prebuilt_t和build_template都是實(shí)現(xiàn) SQL 層與存儲(chǔ)引擎層之間高效數(shù)據(jù)交互的重要組成部分。一方面，row_prebuilt_t作為一個(gè)操作上下文，保存了完成某個(gè)表操作所需的所有信息；另一方面，build_template則通過(guò)預(yù)先構(gòu)建模板來(lái)優(yōu)化數(shù)據(jù)列的讀取和轉(zhuǎn)換過(guò)程。

byte -> dtuple_t* row

row_insert_for_mysql 函數(shù)中調(diào)用 row_insert_for_mysql_using_ins_graph 函數(shù)，其中將 Server 層的記錄格式轉(zhuǎn)換為 InnoDB 的記錄格式，具體是從 byte 轉(zhuǎn)換為 dtuple_t。

// storage/innobase/row/row0mysql.cc

static
dberr_t
row_insert_for_mysql_using_ins_graph(
 const byte* mysql_rec,   /* row in the MySQL format */
 row_prebuilt_t* prebuilt)  /* prebuilt struct in MySQL handle */
{
  que_thr_t* thr;
  ins_node_t* node  = prebuilt->ins_node;
  
  // 主要構(gòu)造用于執(zhí)行插入操作的 2 個(gè)對(duì)象：
 // 1. ins_node_t 對(duì)象，保存在 prebuilt->ins_node 中
 // 2. que_fork_t 對(duì)象，保存在 prebuilt->ins_graph 中
 row_get_prebuilt_insert_row(prebuilt);
 node = prebuilt->ins_node;

 // 把 server 層的記錄格式轉(zhuǎn)換為 InnoDB 的記錄格式
 row_mysql_convert_row_to_innobase(node->row, prebuilt, mysql_rec,
       &blob_heap);
  
  thr->run_node = node;
 thr->prev_node = node;
  
  // 執(zhí)行插入操作，插入記錄到主鍵索引、二級(jí)索引（包含唯一索引、非唯一索引）
 /*插入記錄*/
 row_ins_step(thr);
}

其中：

調(diào)用 row_get_prebuilt_insert_row 函數(shù)給 m_prebuilt->ins_node 成員賦初值，如 dtuple_t；
調(diào)用 row_mysql_convert_row_to_innobase 函數(shù)轉(zhuǎn)換行記錄格式；
調(diào)用 row_ins_step 函數(shù)插入記錄，其中入?yún)?que_thr_t* thr，thr->run_node = node = prebuilt->ins_node，具體 que_thr_t 結(jié)構(gòu)體暫不介紹。

這里又出現(xiàn)了一個(gè)重要的結(jié)構(gòu)體 ins_node_t，該結(jié)構(gòu)體部分成員如下所示。

/* Insert node structure */

struct ins_node_t{
 dtuple_t* row; /*!< row to insert */
 dict_table_t* table; /*!< table where to insert */
 sel_node_t* select; /*!< select in searched insert */
 que_node_t* values_list;/* list of expressions to evaluate and
    insert in an INS_VALUES insert */
 ulint  state; /*!< node execution state */
 dict_index_t* index; /*!< NULL, or the next index where the index
    entry should be inserted */
 dtuple_t* entry; /*!< NULL, or entry to insert in the index;
    after a successful insert of the entry,
    this should be reset to NULL */
}

其中：

dtuple_t* row，其中保存完整的行數(shù)據(jù)；
dtuple_t* entry，給每個(gè) index 創(chuàng)建一個(gè) dtuple_t，用于保存索引中的數(shù)據(jù)，比如二級(jí)索引中只需要保存索引列和主鍵列，因此可以理解為針對(duì)特定索引優(yōu)化后的列版本。

row_mysql_convert_row_to_innobase 函數(shù)中遍歷索引的所有列（prebuilt->n_template）進(jìn)行賦值。

static
void
row_mysql_convert_row_to_innobase(
/*==============================*/
 dtuple_t* row,  /*!< in/out: Innobase row where the
     field type information is already
     copied there! */
 row_prebuilt_t* prebuilt, /*!< in: prebuilt struct where template
     must be of type ROW_MYSQL_WHOLE_ROW */
 const byte* mysql_rec, /*!< in: row in the MySQL format; */
 mem_heap_t** blob_heap) /*!< in: FIX_ME, remove this after
     server fixes its issue */
{
 const mysql_row_templ_t*templ;
  
  for (i = 0; i < prebuilt->n_template; i++) {

  templ = prebuilt->mysql_template + i;
    
    // 獲取 field
    dfield = dtuple_get_nth_field(row, n_col);
    
    // 格式轉(zhuǎn)換，從 rec 寫入 dtuple_t
  row_mysql_store_col_in_innobase_format(
      /* dfield_t* dfield, call to set field */
   dfield,
      /* byte* buf, buffer for a converted integer value */
   prebuilt->ins_upd_rec_buff + templ->mysql_col_offset,
      /* ibool row_format_col, TRUE if the mysql_data is from a MySQL row, FALSE if from a MySQL key value; */
   TRUE, 
      /* byte* mysql_data, MySQL row format data */
   mysql_rec + templ->mysql_col_offset,
      /* ulint col_len, MySQL column length */
   templ->mysql_col_len,        
      /*!< ulint comp, nonzero=compact format */
   dict_table_is_comp(prebuilt->table));
}

其中：

調(diào)用 row_mysql_store_col_in_innobase_format 函數(shù)將列值保存在 dtuple_t::dfield_t 中。

const byte* ptr = mysql_data;
 const dtype_t* dtype;
 ulint  type;

 dtype = dfield_get_type(dfield);

 type = dtype->mtype;
  
  // 計(jì)算 col_len
  ...
  
  dfield_set_data(dfield, ptr, col_len);

其中：

獲取 dtype、計(jì)算 col_len
調(diào)用 dfield_set_data 函數(shù)保存數(shù)據(jù)完成行記錄格式的轉(zhuǎn)換

/* Sets pointer to the data and length in a field. */
UNIV_INLINE
void
dfield_set_data(
/*============*/
 dfield_t* field, /*!< in: field */
 const void* data, /*!< in: data */
 ulint  len) /*!< in: length or UNIV_SQL_NULL */
{
 field->data = (void*) data;
 field->ext = 0;
 field->len = static_cast<unsigned int>(len);
}

其中：

向 dfield_t 結(jié)構(gòu)體中保存字段數(shù)據(jù)、字段長(zhǎng)度

到這里完整的行數(shù)據(jù)索引元組格式 dtuple_t* row dtuple_t 準(zhǔn)備好了，但由于 InnoDB 是索引組織樹(shù)，因此還需要將數(shù)據(jù)保存到索引元組格式 dtuple_t* entry 中。

dtuple_t* row -> dtuple_t* entry

row_ins_step 函數(shù)中創(chuàng)建 ins_node_t 并調(diào)用 row_ins 函數(shù)。

// 創(chuàng)建 ins_node_t
 node = static_cast<ins_node_t*>(thr->run_node); 

 // 給表加上意向鎖
  err = lock_table(0, node->table, LOCK_IX, thr);
  
  /* DO THE CHECKS OF THE CONSISTENCY CONSTRAINTS HERE */
 // 調(diào)用 row_ins() 插入記錄到主鍵索引、二級(jí)索引
 err = row_ins(node, thr);

row_ins 函數(shù)中遍歷表的每個(gè) index，插入 index entry。

// 遍歷所有索引，向每個(gè)索引中插入記錄
 while (node->index != NULL) {
    
    /* 向索引中插入記錄 */
    err = row_ins_index_entry_step(node, thr);

  // 插入記錄到主鍵索引或二級(jí)索引成功后獲取下一個(gè)索引
    // node->index、entry 指向表中的下一個(gè)索引
  node->index = dict_table_get_next_index(node->index);
  node->entry = UT_LIST_GET_NEXT(tuple_list, node->entry);

其中：

循環(huán)調(diào)用 row_ins_index_entry_step 函數(shù)插入記錄，其中入?yún)?node 就是 row_mysql_convert_row_to_innobase 函數(shù)轉(zhuǎn)換后的 row_prebuilt_t->ins_node_t；
每次插入索引時(shí)更新 ins_node_t->index 與 ins_node_t->entry，因此行數(shù)據(jù)對(duì)應(yīng)的所有索引復(fù)用同一個(gè) ins_node_t 變量，其中 ins_node_t->row 保持不變。

row_ins_index_entry_step 函數(shù)中向指定索引中插入記錄。

// storage/innobase/row/row0insert.cc

/* Inserts a single index entry to the table. */
static MY_ATTRIBUTE((nonnull, warn_unused_result))
dberr_t
row_ins_index_entry_step(
/*=====================*/
 ins_node_t* node, /*!< in: row insert node */
 que_thr_t* thr) /*!< in: query thread */
{
  /*給索引項(xiàng)賦值*/
 err = row_ins_index_entry_set_vals(node->index, node->entry, node->row);

 /*插入索引項(xiàng)*/
 err = row_ins_index_entry(node->index, node->entry, thr);
}

其中：

調(diào)用 row_ins_index_entry_set_vals 函數(shù)遍歷索引的每個(gè)字段并賦值，其中將 innobase format field 的值賦給對(duì)應(yīng)的 index entry field。

/** Sets the values of the dtuple fields in entry from the values of appropriate columns in row. */
dberr_t
row_ins_index_entry_set_vals(
 const dict_index_t* index,
 dtuple_t*  entry,
 const dtuple_t*  row)
{
 n_fields = dtuple_get_n_fields(entry);

 for (i = 0; i < n_fields + num_v; i++) {
  dfield_t* field;
  const dfield_t* row_field;
    ulint  len;
    
    // 獲取 field
    field = dtuple_get_nth_field(entry, i);
    row_field = dtuple_get_nth_field(
    row, ind_field->col->ind);
    len = dfield_get_len(row_field);
    
    // 寫入 field
  dfield_set_data(field, dfield_get_data(row_field), len);
}

其中：

與 row 相同，循環(huán)調(diào)用 dfield_set_data 方法給索引 entry 中的每個(gè)字段 field 賦值。

到這里每個(gè)索引的數(shù)據(jù)也準(zhǔn)備好了，但是還不知道數(shù)據(jù)插入的位置，理論上是插在最后一個(gè)小于等于該 dtuple 的 rec_t 后。

cursor

row_ins_index_entry 函數(shù)中以插入主鍵索引為例。

row_ins_clust_index_entry_low 函數(shù)中以樂(lè)觀插入為例。

// storage/innbase/row/row0ins.cc

dberr_t
row_ins_clust_index_entry_low(
/*==========================*/
 ulint  flags, /*!< in: undo logging and locking flags */
 ulint  mode, /*!< in: BTR_MODIFY_LEAF or BTR_MODIFY_TREE,
    depending on whether we wish optimistic or
    pessimistic descent down the index tree */
 dict_index_t* index, /*!< in: clustered index */
 ulint  n_uniq, /*!< in: 0 or index->n_uniq */
 dtuple_t* entry, /*!< in/out: index entry to insert */
 ulint  n_ext, /*!< in: number of externally stored columns */
 que_thr_t* thr, /*!< in: query thread */
 bool  dup_chk_only)
    /*!< in: if true, just do duplicate check
    and return. don't execute actual insert. */
{
  btr_pcur_t pcur;
  btr_cur_t* cursor;
  
  // offsets
  ulint           offsets_[REC_OFFS_NORMAL_SIZE];
 ulint*          offsets         = offsets_;
  
  /* Note that we use PAGE_CUR_LE as the search mode, because then
 the function will return in both low_match and up_match of the
 cursor sensible values */
 // 插入操作的 search_mode 默認(rèn)是 PAGE_CUR_LE，即插在最后一個(gè)小于等于該 dtuple 的 rec_t 后
 // btr_pcur_open 函數(shù)內(nèi)部調(diào)用 btr_cur_search_to_nth_level 函數(shù)
 btr_pcur_open(index, entry, PAGE_CUR_LE, mode, &pcur, &mtr);
  // 從 btr_pcur_t 中獲取 btr_cur_t（tree cursor）
 cursor = btr_pcur_get_btr_cur(&pcur);
 cursor->thr = thr;
  
  rec_t* insert_rec;
  
  /* 樂(lè)觀插入 */
  err = btr_cur_optimistic_insert(
    flags, cursor, &offsets, &offsets_heap,
    entry, &insert_rec, &big_rec,
    n_ext, thr, &mtr);
}

其中：

入?yún)⒅邪?dict_index_t* index 與 dtuple_t* entry，分別對(duì)應(yīng) ins_node_t->index 與 ins_node_t->entry，其中 entry 中保存索引的數(shù)據(jù)；
調(diào)用 btr_pcur_open 函數(shù)將 cursor 移動(dòng)到索引上待插入的位置，也就是定位頁(yè)與行，注意入?yún)?mode = PAGE_CUR_LE；
調(diào)用 btr_cur_optimistic_insert 函數(shù)樂(lè)觀插入。

btr_pcur_open 函數(shù)中調(diào)用 btr_cur_search_to_nth_level 函數(shù)用于自頂向下查找整個(gè) B-tree。

page_cursor = btr_cur_get_page_cur(cursor);
 
 // 初始的，獲得索引的根節(jié)點(diǎn)（space_id，page_no）
 const ulint  space = dict_index_get_space(index);
 const page_size_t page_size(dict_table_page_size(index->table));

 /* Start with the root page. */
 // 從 dict_index_t 元信息中拿到 root page 在物理文件的 page no（默認(rèn)是 4）
 page_id_t  page_id(space, dict_index_get_page(index));
  
  // 從 buffer_pool 中根據(jù) page no 拿 page（buf_page_get_gen），buffer pool 模塊會(huì)根據(jù) page 是否被緩存來(lái)決定是從內(nèi)存中還是磁盤中讀取，并根據(jù)加鎖策略對(duì) page 加鎖
 block = buf_page_get_gen(page_id, page_size, rw_latch, guess,
     buf_mode, file, line, mtr);
 tree_blocks[n_blocks] = block;
  
  /* Search for complete index fields. */
  up_bytes = low_bytes = 0;
  // 對(duì) page 內(nèi)部的 record list 進(jìn)行二分搜索 (page_cur_search_with_match)
  // page_cur_search_with_match：在一個(gè)數(shù)據(jù)頁(yè)內(nèi)“二分查找”（使用數(shù)據(jù)頁(yè)中的 directory slot），定位到 record
  page_cur_search_with_match(
    block, index, tuple, page_mode, &up_match,
    &low_match, page_cursor,
    need_path ? cursor->rtr_info : NULL);

其中：

調(diào)用 buf_page_get_gen 函數(shù)根據(jù) page no 獲取 page（block），如果 page 在 buffer poo 中就直接讀取，否則從文件讀取到 buffer pool 中；
調(diào)用 page_cur_search_with_match 函數(shù)在 page 內(nèi)部首先通過(guò) Page Directory 二分查找確定 slot，然后線性查找確定行 record，其中 mode = PAGE_CUR_LE 。

/* Perform binary search until the lower and upper limit directory
 slots come to the distance 1 of each other */
  
  // 在索引頁(yè)內(nèi)查找對(duì)于指定的 tuple，滿足某種條件（依賴于傳入的 mode，比如 PAGE_CUR_L）的 record
 // PAGE_CUR_G（>），PAGE_CUR_GE（>=），PAGE_CUR_L（<），PAGE_CUR_LE（<=）
 // 1. 二分查找
 // 在稀疏的 Page Directory 內(nèi)使用二分查找
 low = 0;
 up = page_dir_get_n_slots(page) - 1;
  
  /* Perform linear search until the upper and lower records come to
 distance 1 of each other. */

 // 二分查找結(jié)束后，low / up是臨近的兩個(gè) slot，這兩個(gè) slot 指向的 record 記為
  // low_rec 和 up_rec，滿足：low_rec <= tuple <= up_rec，切記 tuple 為待插入的（邏輯）記錄

  // 2. 線性查找，漸漸夾逼
  // 在兩個(gè)相鄰的 directory 內(nèi)，進(jìn)行線性查找。線性查找的實(shí)現(xiàn)即不斷"增大 low"，"減小 up"

其中：

二分查找與線性查找過(guò)程中都需要比較大小，具體調(diào)用 cmp_dtuple_rec_with_match_low 函數(shù)比較物理記錄與邏輯記錄的大小。

// storage/innobase/rem/rem0cmp.cc

/** Compare a data tuple to a physical record.
@param[in] dtuple data tuple
@param[in] rec B-tree record
@param[in] offsets rec_get_offsets(rec)
@param[in] n_cmp number of fields to compare
@param[in,out] matched_fields number of completely matched fields
@return the comparison result of dtuple and rec
@retval 0 if dtuple is equal to rec
@retval negative if dtuple is less than rec
@retval positive if dtuple is greater than rec */
int
cmp_dtuple_rec_with_match_low(
 const dtuple_t* dtuple, 
 const rec_t* rec,
 const ulint* offsets,
 ulint  n_cmp,
 ulint*  matched_fields)
{
  /* Match fields in a loop */
  // 遍歷記錄的每個(gè)字段
 for (; cur_field < n_cmp; cur_field++) {
    
    // 獲取邏輯記錄
    const dfield_t* dtuple_field
   = dtuple_get_nth_field(dtuple, cur_field);
    
    // 將邏輯記錄轉(zhuǎn)換為物理記錄
    const byte* dtuple_b_ptr
   = static_cast<const byte*>(
    dfield_get_data(dtuple_field));
    
    // 根據(jù) offset 獲取物理記錄
  rec_b_ptr = rec_get_nth_field(rec, offsets, cur_field,
           &rec_f_len);

  // 比較大小
  ret = cmp_data(type->mtype, type->prtype,
          dtuple_b_ptr, dtuple_f_len,
          rec_b_ptr, rec_f_len);
  if (ret) {
   goto order_resolved;
  }
    
  }
}

其中：

入?yún)⒅邪ㄟ壿嬘涗?dtuple、物理記錄 rec、記錄每個(gè) field 偏移量的數(shù)組 offsets；
出參表示大小關(guān)系；
函數(shù)內(nèi)部遍歷記錄的每個(gè)字段，先將邏輯記錄轉(zhuǎn)換為物理記錄，然后與根據(jù) offset 獲取到的物理記錄比較大小。

到這里就定位到了數(shù)據(jù)插入的位置，開(kāi)始真正的數(shù)據(jù)插入。

dtuple_t* entry -> byte

調(diào)用 btr_cur_optimistic_insert 函數(shù)樂(lè)觀插入。

page_cur_t* page_cursor;
 buf_block_t* block;
 
 // 通過(guò) cursor 獲取 block
 block = btr_cur_get_block(cursor);
 // buf_block_t::frame 中保存 page
 page = buf_block_get_frame(block);
 index = cursor->index;

 // btr_cur_t->page_cursor
 page_cursor = btr_cur_get_page_cur(cursor);

 /* Check locks and write to the undo log, if specified */
  // 真正插入 entry 前，會(huì)檢查事務(wù)鎖和記錄 undo
  // 其中修改 inherit 值，如果下一條記錄上沒(méi)有鎖，就不需要鎖分裂
  err = btr_cur_ins_lock_and_undo(flags, cursor, entry,
          thr, mtr, &inherit);

  if (err != DB_SUCCESS) {
    goto fail_err;
  }

  // 插入數(shù)據(jù)
  *rec = page_cur_tuple_insert(
    page_cursor, entry, index, offsets, heap,
    n_ext, mtr);

其中：

buf_block_t::frame 是記錄要插入的數(shù)據(jù)頁(yè)，page_cursor 是記錄要插入的位置；
插入數(shù)據(jù)前調(diào)用 btr_cur_ins_lock_and_undo 函數(shù)檢查是否有與插入意向鎖沖突的鎖、是否需要發(fā)生鎖分裂；
調(diào)用 page_cur_tuple_insert 函數(shù)插入數(shù)據(jù)，邏輯記錄保存在 entry 中。

page_cur_tuple_insert 函數(shù)中調(diào)用 rec_convert_dtuple_to_rec_new 函數(shù)將邏輯記錄轉(zhuǎn)換成物理記錄。

/*********************************************************//**
Builds a new-style physical record out of a data tuple and
stores it beginning from the start of the given buffer.
@return pointer to the origin of physical record */
static
rec_t*
rec_convert_dtuple_to_rec_new(
/*==========================*/
 byte*   buf, /*!< in: start address of the physical record */
 const dict_index_t* index, /*!< in: record descriptor */
 const dtuple_t*  dtuple) /*!< in: data tuple */
{
 ulint extra_size;
 ulint status;
 rec_t* rec;

 // 計(jì)算記錄頭大小，記錄頭大小用 extra_size 表示
 rec_get_converted_size_comp(
  index, status, dtuple->fields, dtuple->n_fields, &extra_size);
 // 因?yàn)?buf 是用來(lái)存儲(chǔ)整個(gè)記錄的開(kāi)始位置的，這里的 buf + extra_size 表示存儲(chǔ)的第一個(gè)列的位置，即 rec 所指的位置
 rec = buf + extra_size;

 // 真正轉(zhuǎn)換格式
 rec_convert_dtuple_to_rec_comp(
  rec, index, dtuple->fields, dtuple->n_fields, NULL,
  status, false);
}

其中：

調(diào)用 rec_convert_dtuple_to_rec_comp 函數(shù)將 tuple 邏輯記錄轉(zhuǎn)換成物理記錄。

rec_convert_dtuple_to_rec_comp 函數(shù)中將每個(gè)列的值，以及 null 和 len 的信息存儲(chǔ)到對(duì)應(yīng)的位置。

/* Store the data and the offsets */

 // 將每個(gè)列的值，以及 null 和 len 的信息存儲(chǔ)到對(duì)應(yīng)的位置
 for (i = 0; i < n_fields; i++) {
  const dict_field_t* ifield;
  dict_col_t*  col = NULL;
    
    // 獲取每個(gè) field 的值
  field = &fields[i];
    
    // 計(jì)算 null 信息，因?yàn)檫@個(gè)標(biāo)志是通過(guò)位來(lái)存儲(chǔ)的，所以對(duì)每一個(gè)字節(jié)都需要做位處理
    // 計(jì)算列的大小和存儲(chǔ)其長(zhǎng)度字節(jié)數(shù)動(dòng)態(tài)匹配的位置，比如判斷變長(zhǎng)列的長(zhǎng)度占用1個(gè)字節(jié)或2個(gè)字節(jié)
    
    // 將元組列信息，寫入到 compact 記錄的對(duì)應(yīng)列中，len為其對(duì)應(yīng)的存儲(chǔ)長(zhǎng)度
  memcpy(end, dfield_get_data(field), len);
  }

最終調(diào)用 page_cur_insert_rec_low 函數(shù)將物理記錄寫入文件。

// 真正插入記錄
  rec = page_cur_insert_rec_low(cursor->rec,
              index, rec, *offsets, mtr);

到這里就完成了一條數(shù)據(jù)的插入。

總結(jié)

insert 過(guò)程中用到了多個(gè)數(shù)據(jù)結(jié)構(gòu)進(jìn)行數(shù)據(jù)的傳遞。

其中：

row_prebuilt_t結(jié)構(gòu)體中保存有關(guān)某個(gè)表操作相關(guān)信息，其中包括ins_node_t；
ins_node_t結(jié)構(gòu)體中保存插入操作相關(guān)的相關(guān)信息，其中包括dtuple_t；
dtuple_t結(jié)構(gòu)體中保存邏輯記錄，也就是索引元組，其中包括dfield_t；
dfield_t結(jié)構(gòu)體中保存字段信息，dfield_t::data中保存真實(shí)列數(shù)據(jù)的指針；
btr_pcur_t結(jié)構(gòu)體中包括btr_cur_t；
btr_cur_t結(jié)構(gòu)體中包括page_cur_t；
page_cur_t結(jié)構(gòu)體中保存查詢得到的記錄 rec，其中包括buf_block_t；
buf_block_t結(jié)構(gòu)體中保存數(shù)據(jù)頁(yè)指針 frame，其中包括buf_page_t。

insert 過(guò)程中行記錄格式發(fā)生了多次轉(zhuǎn)換。

其中：

首先從 byte 轉(zhuǎn)換成 dtuple_t* row，其中保存完整的行數(shù)據(jù)；
然后將 dtuple_t* row 轉(zhuǎn)換成 dtuple_t* entry，其中保存每個(gè)索引的列數(shù)據(jù)；
最后將 dtuple_t* entry 轉(zhuǎn)換成 byte，用于寫入文件。

插入流程的具體函數(shù)堆?？梢詤⒖嘉恼隆綢nnoDB --insert 操作分析】。

插入流程可以簡(jiǎn)化為下圖。

其中：

首先進(jìn)行數(shù)據(jù)行格式轉(zhuǎn)換
然后定位要插入的位置
最后進(jìn)行插入操作

結(jié)論

MySQL 中有行格式有三種存儲(chǔ)方式，包括 Server 層的格式、索引元組格式（邏輯記錄，tuple）、物理存儲(chǔ)記錄格式（record）。

類似的，數(shù)據(jù)頁(yè)分為兩種形式，包括物理頁(yè)（block）、內(nèi)存頁(yè)（page）。

通過(guò) B+ 樹(shù)索引進(jìn)行查找（lookup）是最為常見(jiàn)的操作，具體通過(guò)游標(biāo)實(shí)現(xiàn)。游標(biāo)分為三種類型，包括 B-tree cursor、page cursor、persistent cursor。

存取數(shù)據(jù)時(shí)需要進(jìn)行行格式的轉(zhuǎn)換，原因是 IO 時(shí)使用二進(jìn)制，內(nèi)存操作時(shí)使用邏輯記錄。

以插入數(shù)據(jù)為例，主要流程為：

首先進(jìn)行數(shù)據(jù)行格式轉(zhuǎn)換
然后定位要插入的位置，基于 cursor 實(shí)現(xiàn)
最后進(jìn)行插入操作

其中數(shù)據(jù)格式的轉(zhuǎn)換包括：

首先從 byte 轉(zhuǎn)換成 dtuple_t* row，其中保存完整的行數(shù)據(jù)；
然后將 dtuple_t* row 轉(zhuǎn)換成 dtuple_t* entry，其中保存每個(gè)索引的列數(shù)據(jù)；
最后將 dtuple_t* entry 轉(zhuǎn)換成 byte，用于寫入文件。

數(shù)據(jù)格式轉(zhuǎn)換過(guò)程中涉及較多數(shù)據(jù)結(jié)構(gòu)，其中：

dtuple_t結(jié)構(gòu)體中保存邏輯記錄，也就是索引元組，其中包括dfield_t；
dfield_t結(jié)構(gòu)體中保存字段信息，dfield_t::data中保存真實(shí)列數(shù)據(jù)的指針；
page_cur_t結(jié)構(gòu)體中保存查詢得到的記錄 rec，其中包括buf_block_t；
buf_block_t結(jié)構(gòu)體中保存數(shù)據(jù)頁(yè)指針 frame，其中包括buf_page_t。

待辦

row format
cursor
block

參考教程

《MySQL 內(nèi)核 InnoDB 存儲(chǔ)引擎》
《MySQL 運(yùn)維內(nèi)參》
InnoDB --insert 操作分析

https://zhuanlan.zhihu.com/p/103933731

InnoDB：B-tree index（1）

https://zhuanlan.zhihu.com/p/164705538

InnoDB：B-tree index（2）

https://zhuanlan.zhihu.com/p/164728032

InnoDB：Buffer Pool

https://zhuanlan.zhihu.com/p/270343437

MySQL · 源碼分析 · 一條insert語(yǔ)句的執(zhí)行過(guò)程
MySQL · 內(nèi)核分析 · InnoDB主鍵約束和唯一約束的實(shí)現(xiàn)分析

http://mysql.taobao.org/monthly/2021/04/05/

責(zé)任編輯：武曉燕來(lái)源：丹柿小院

MySQL 記錄存儲(chǔ)

點(diǎn)贊

51CTO技術(shù)棧公眾號(hào)

業(yè)務(wù)
速覽

媒體

51CTO CIOAge HC3i

社區(qū)

51CTO博客鴻蒙開(kāi)發(fā)者社區(qū) AI.x社區(qū)

教育

51CTO學(xué)堂精培企業(yè)培訓(xùn) CTO訓(xùn)練營(yíng)

<blockquote id="qqsvw"><dfn id="qqsvw"></dfn></blockquote>

<sub id="qqsvw"></sub>