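Exploration: Is an Elasticsearch document's _id the same as Lucene's docid?
1. Introduction
A while ago, while I was working with the development team on optimizing our use of ES, a developer asked me an interesting question after a meeting: are the _id field we write to ES and the docid used inside Lucene the same thing?
Is there any relationship between the two?
I did not know much about Lucene at the time, so I answered honestly: they are probably not the same concept, but I had not yet worked out whether they are related, and I would follow up once I had a conclusion.
Now let's work through the question.
2. Lucene's docid
First, let's look at how the official Lucene documentation defines the docid.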
Lucene's index is composed of segments, each of which contains a subset of all the documents in the index, and is a complete searchable index in itself, over that subset. As documents are written to the index, new segments are created and flushed to directory storage. Segments are immutable; updates and deletions may only create new segments and do not modify existing ones. Over time, the writer merges groups of smaller segments into single larger ones in order to maintain an index that is efficient to search, and to reclaim dead space left behind by deleted (and updated) documents.
Each document is identified by a 32-bit number, its "docid," and is composed of a collection of Field values of diverse types (postings, stored fields, doc values, and points). Docids come in two flavors: global and per-segment. A document's global docid is just the sum of its per-segment docid and that segment's base docid offset. External, high-level APIs only handle global docids, but internal APIs that reference a LeafReader, which is a reader for a single segment, deal in per-segment docids.
Docids are assigned sequentially within each segment (starting at 0). Thus the number of documents in a segment is the same as its maximum docid; some may be deleted, but their docids are retained until the segment is merged. When segments merge, their documents are assigned new sequential docids. Accordingly, docid values must always be treated as internal implementation, not exposed as part of an application, nor stored or referenced outside of Lucene's internal APIs.
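The key points:
1. A docid is a 32-bit number whose only job is to identify a document.
2. Each document (not the docid itself) is composed of a collection of Field values of various types: postings, stored fields, doc values, and points.
3. Docids come in two flavors: global and per-segment. A document's global docid is the sum of its per-segment docid and that segment's base docid offset.
4. Docids are assigned sequentially within each segment, starting at 0.
5. A deleted document's docid is retained until its segment is merged; when segments merge, the surviving documents are assigned new sequential docids.
Finally, the documentation stresses that docids must be treated as a Lucene internal implementation detail: they should not be exposed as part of an application, nor stored or referenced outside Lucene's internal APIs.
The documentation describes how docids are assigned and used in detail, but that last sentence sums it all up: a docid is only how Lucene labels documents internally, and it has nothing to do with the outside world.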
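The docid rules above can be made concrete with a small simulation. This is a minimal sketch in Python, not Lucene code: the `Segment` class and the `doc_bases`, `global_docid`, and `merge` helpers are hypothetical names made up for illustration. It only assumes what the quoted documentation states: per-segment docids start at 0, a global docid is the per-segment docid plus the segment's base offset, and a merge renumbers the surviving documents.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Toy stand-in for one Lucene segment (not Lucene's real data structure)."""
    docs: list                                   # list position = per-segment docid
    deleted: set = field(default_factory=set)    # per-segment docids marked deleted


def doc_bases(segments):
    """Base offset of each segment: the number of docid slots before it."""
    bases, total = [], 0
    for seg in segments:
        bases.append(total)
        total += len(seg.docs)   # deleted docs keep their docid until a merge
    return bases


def global_docid(segments, seg_index, local_docid):
    """Global docid = per-segment docid + that segment's base offset."""
    return doc_bases(segments)[seg_index] + local_docid


def merge(segments):
    """Merge into one segment: drop deleted docs, renumber the rest from 0."""
    live = [doc for seg in segments
            for i, doc in enumerate(seg.docs) if i not in seg.deleted]
    return Segment(docs=live)


segments = [Segment(docs=["a", "b", "c"], deleted={1}), Segment(docs=["d", "e"])]
before = global_docid(segments, 1, 0)   # document "d": base 3 + local docid 0
merged = merge(segments)
after = merged.docs.index("d")          # position of "d" once "b" is purged
print(before, after)                    # prints: 3 2
```

The same document ends up with a different docid after the merge, which is why the documentation warns against storing docids outside Lucene.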
3. The ES _id
Next, let's look at the ES _id field.
Each document has an _id that uniquely identifies it, which is indexed so that documents can be looked up either with the GET API or the ids query. The _id can either be assigned at indexing time, or a unique _id can be generated by Elasticsearch. This field is not configurable in the mappings.
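ES defines the _id field clearly: it is the document's unique identifier. A value can be assigned when the document is indexed; if none is given, ES generates one itself. In other words, _id serves as the unique primary key of an ES document (a unique identifier, or UUID in the loose sense).
ES provides this because Lucene itself does not require indexed data to carry a unique primary key. This is confirmed in the blog of Michael McCandless, one of Lucene's main developers. The original text reads: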
Most search applications using Apache Lucene assign a unique id, or primary key, to each indexed document. While Lucene itself does not require this (it could care less!), the application usually needs it to later replace, delete or retrieve that one document by its external id.
From Lucene's point of view, a metadata field such as _id is just a field that ES defines on top of Lucene; it is not fundamentally different from any other field.
public AbstractIdFieldType() {
    super(NAME, /* isIndexed */ true, /* isStored */ true, /* hasDocValue */ false,
          TextSearchInfo.SIMPLE_MATCH_ONLY, Collections.emptyMap());
}
As the ES source shows, _id is a special field that is indexed (inverted index) and stored, but has no columnar doc values.
Whether we fetch a document with a GET by _id or query by _id through the ES search API, it is the same thing to Lucene: both are lookups on the _id field.
# GET operation
GET my-index-000001/_doc/2
# search operation
GET my-index-000001/_search
{
  "query": {
    "terms": {
      "_id": [ "2" ]
    }
  }
}
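To illustrate why both requests come down to a lookup on the _id field, here is a minimal sketch in Python. It is a toy model, not ES or Lucene code: `ToyIndex` and its methods are hypothetical names, and the only assumptions are the ones stated above, namely that _id is an indexed term (so it can be looked up) and a stored field (so it can be returned), while the docid is just the internal position the term resolves to.

```python
class ToyIndex:
    """Toy single-segment index; names are illustrative, not Lucene or ES APIs."""

    def __init__(self):
        self.stored = []      # docid -> stored fields (docid = list position)
        self.id_terms = {}    # inverted index for "_id": term -> docid

    def add(self, _id, source):
        docid = len(self.stored)              # next sequential docid
        self.stored.append({"_id": _id, "_source": source})
        self.id_terms[_id] = docid
        return docid

    def lookup(self, _id):
        """Term lookup on _id, then fetch the stored fields by docid."""
        docid = self.id_terms.get(_id)
        return None if docid is None else self.stored[docid]

    def rebuilt_without(self, removed_id):
        """Mimic a merge that purges one document and renumbers the rest."""
        fresh = ToyIndex()
        for doc in self.stored:
            if doc["_id"] != removed_id:
                fresh.add(doc["_id"], doc["_source"])
        return fresh


index = ToyIndex()
index.add("1", {"title": "first"})
index.add("2", {"title": "second"})
docid_before = index.id_terms["2"]
merged = index.rebuilt_without("1")
docid_after = merged.id_terms["2"]
same_doc = index.lookup("2") == merged.lookup("2")
print(docid_before, docid_after, same_doc)    # prints: 1 0 True
```

The caller only ever supplies the _id; the docid it maps to changes after the simulated merge, yet the document returned is the same.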
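4. Summary
In summary, it is easy to see that although both the _id field and the docid identify documents, the two are not coupled.
The _id field is the special field ES uses as the unique primary key of its data. The docid is a Lucene internal implementation detail that serves Lucene's own lookup process.
Of course, ES does not rely on the _id field alone to keep data unique across multiple versions; more details are in
"Elasticsearch内核解析 - 数据模型篇" (Elasticsearch Internals: Data Model):
https://zhuanlan.zhihu.com/p/34680841
Readers interested in optimizing how the _id field is used can refer to
"Choosing a fast unique identifier (UUID) for Lucene":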
https://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html
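About the author
金多安: Elastic certified expert, senior Elastic operations engineer, guest of the "死磕 Elasticsearch" Knowledge Planet community and one of its top active technical experts, and managing editor of the 搜索客 community daily.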