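RAG Text Splitting Level 3: Easily Customizable Markdown Splitting
The previous article covered how to convert documents to Markdown with the help of LLMs and OCR: "A tool that upends traditional OCR and easily handles complex PDFs". This article covers how to split Markdown effectively.
We previously introduced five levels of text splitting; the method in this article belongs to the third level: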
Level 1: Character Splitting - simple splitting by character count
Level 2: Recursive Character Text Splitting - split on separators, then merge recursively
Level 3: Document Specific Splitting - splitting tailored to the document format (PDF, Python, Markdown)
Level 4: Semantic Splitting - splitting by meaning
Level 5: Agentic Splitting - using an agent to split automatically
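Basic concepts and environment
Chunking generally aims to keep text that shares a common context together. With that in mind, we may want to respect the structure of the document itself. For example, a Markdown file is organized by headers, so creating chunks within specific header groups is an intuitive idea. For this we can use MarkdownHeaderTextSplitter, which splits a Markdown file by a specified set of headers.
The package used in this article is installed as follows: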
pip install langchain-text-splitters
Splitting implementation
We can specify the headers to split on with headers_to_split_on; after splitting, the content is grouped by header:
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_document = "# Foo\n\n ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"
# Each pair is (header marker, metadata key stored on the resulting chunk)
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
print(md_header_splits)
The result is as follows:
[Document(page_content='Hi this is Jim \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]
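To make the grouping behavior concrete, here is a simplified pure-Python sketch of header-based splitting. It is not the MarkdownHeaderTextSplitter implementation: the function name split_by_headers and the dict-based chunk format are assumptions made for illustration, and the sketch ignores details such as fenced code blocks.

```python
def split_by_headers(text, headers_to_split_on):
    """Group Markdown lines under their enclosing headers (simplified sketch)."""
    # Check longer markers first so "##" is never mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    active = {}   # header level (number of '#') -> (metadata key, title)
    chunks = []   # each chunk is {"content": str, "metadata": dict}
    buffer = []   # content lines collected since the last header

    def flush():
        # Emit the buffered lines as one chunk tagged with the active headers.
        if buffer:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            chunks.append({"content": "\n".join(buffer), "metadata": metadata})
            buffer.clear()

    for raw in text.split("\n"):
        line = raw.strip()
        if not line:
            continue
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes every header at the same or a deeper level.
                for closed in [lv for lv in active if lv >= level]:
                    del active[closed]
                active[level] = (name, line[len(marker):].strip())
                break
        else:
            # Not a tracked header: treat the line as content.
            buffer.append(line)
    flush()
    return chunks


markdown_document = "# Foo\n\n ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
chunks = split_by_headers(markdown_document, headers_to_split_on)
for chunk in chunks:
    print(chunk)
```

For the sample document this produces three chunks carrying the same metadata as the result shown above. The key design point is the header stack: when "## Baz" appears, the earlier "## Bar" and its child "### Boo" are dropped, which is why the last chunk only carries Foo and Baz.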
By default, MarkdownHeaderTextSplitter strips the headers it splits on from the content of the output chunks. This can be disabled by setting strip_headers=False.
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on,
    strip_headers=False,
)
md_header_splits = markdown_splitter.split_text(markdown_document)
print(md_header_splits)
As you can see, the headers are now kept in the content:
[Document(page_content='# Foo \n## Bar \nHi this is Jim \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='### Boo \nHi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
Document(page_content='## Baz \nHi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]
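How to return Markdown lines as separate documents
By default, MarkdownHeaderTextSplitter aggregates lines under the headers specified in headers_to_split_on. We can disable this by setting return_each_line=True, so that each line becomes its own document: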
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on,
    return_each_line=True,
)
md_header_splits = markdown_splitter.split_text(markdown_document)
print(md_header_splits)
[Document(page_content='Hi this is Jim', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='Hi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]
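Conceptually, returning each line separately amounts to expanding every aggregated chunk into one entry per line while copying its header metadata. The sketch below illustrates that idea with plain dicts standing in for Document objects; the helper name one_document_per_line and the sample data are hypothetical, and this is not how the library implements return_each_line.

```python
def one_document_per_line(chunks):
    """Expand aggregated chunks so each non-empty line becomes its own entry."""
    entries = []
    for chunk in chunks:
        for line in chunk["content"].split("\n"):
            if line.strip():
                # Copy the metadata so entries do not share one dict object.
                entries.append(
                    {"content": line.strip(), "metadata": dict(chunk["metadata"])}
                )
    return entries


# Hypothetical aggregated chunks, shaped like the header-grouped results above.
aggregated = [
    {"content": "Hi this is Jim\nHi this is Joe",
     "metadata": {"Header 1": "Foo", "Header 2": "Bar"}},
    {"content": "Hi this is Molly",
     "metadata": {"Header 1": "Foo", "Header 2": "Baz"}},
]
per_line = one_document_per_line(aggregated)
for entry in per_line:
    print(entry)
```

Each line keeps the full header path of the group it came from, so a retriever can still tell which section a single line belongs to.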
How to limit chunk size
Within each Markdown group we can then apply any text splitter we like, such as RecursiveCharacterTextSplitter, which gives further control over chunk size.
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

markdown_document = "# Intro \n\n ## History \n\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n ## Rise and divergence \n\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n #### Standardization \n\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n ## Implementations \n\n Implementations of Markdown are available for over a dozen programming languages."
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
# MD splits: group the text by header first, keeping headers in the content
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)
# Char-level splits: cap the size of each header group
chunk_size = 250
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
# Split each header group further; header metadata is carried onto the sub-chunks
splits = text_splitter.split_documents(md_header_splits)
print(splits)
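The two-stage idea (group by header, then cap the size inside each group while keeping the header metadata) can be sketched without the library. The version below uses a greedy word-packing stand-in with no overlap, which is much simpler than what RecursiveCharacterTextSplitter does; the helper names pack_words and split_groups, the dict-based chunk format, and the sample data are all assumptions for illustration.

```python
def pack_words(text, chunk_size):
    """Greedily pack whitespace-separated words into pieces of at most chunk_size characters.

    A single word longer than chunk_size is kept whole rather than cut.
    """
    pieces, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def split_groups(groups, chunk_size):
    """Split every header group into size-capped pieces that inherit its metadata."""
    out = []
    for group in groups:
        for piece in pack_words(group["content"], chunk_size):
            out.append({"content": piece, "metadata": dict(group["metadata"])})
    return out


# Hypothetical header groups, shaped like the output of a header-based split.
groups = [
    {"content": "Markdown is a lightweight markup language for creating formatted text using a plain-text editor.",
     "metadata": {"Header 1": "Intro", "Header 2": "History"}},
    {"content": "Implementations of Markdown are available for over a dozen programming languages.",
     "metadata": {"Header 1": "Intro", "Header 2": "Implementations"}},
]
sub_chunks = split_groups(groups, chunk_size=40)
for sub_chunk in sub_chunks:
    print(len(sub_chunk["content"]), sub_chunk)
```

Because the size cap is applied per group, no piece ever mixes text from two different headers, and each piece can still be traced back to its section through the metadata.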
This article is reposted from the WeChat official account 哎呀AIYA.
