RAG Text Splitting LV3: Easily Customizing Markdown Splitting

Published 2024-9-18 14:55

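The previous article covered how to convert documents to Markdown with the help of an LLM and OCR: "A tool that upends traditional OCR and handles complex PDFs with ease". This article covers how to split the resulting Markdown effectively.

Earlier articles introduced five levels of text splitting. The method in this article belongs to the third level: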

Level 1: Character Splitting - simple splitting by character length

Level 2: Recursive Character Text Splitting - split on separators, then merge recursively

Level 3: Document Specific Splitting - splitting tailored to the document format (PDF, Python, Markdown)

Level 4: Semantic Splitting - splitting by meaning

Level 5: Agentic Splitting - automatic splitting driven by an agent

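Basic concepts and environment

Chunking generally aims to keep text that shares a common context together. With that in mind, we may want to specifically honor the structure of the document itself. For example, a Markdown file is organized by headers, so creating chunks within specific header groups is an intuitive idea. To do this, we can use MarkdownHeaderTextSplitter, which splits a Markdown file on a specified set of headers.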
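To make the idea of header-based grouping concrete, here is a minimal standard-library sketch. It is not the implementation used by MarkdownHeaderTextSplitter; the function name split_by_headers and the dict-based chunk format are assumptions made for illustration, and it ignores edge cases such as "#" characters inside fenced code blocks.

```python
import re

# An ATX-style header: one to six "#" characters, whitespace, then the title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headers(text, headers_to_split_on):
    """Group Markdown lines under the headers that precede them (illustrative only)."""
    names = dict(headers_to_split_on)  # e.g. {"#": "Header 1", "##": "Header 2"}
    active = {}                        # header depth -> (metadata key, title)
    lines, chunks = [], []

    def flush():
        # Emit the lines collected so far, tagged with the currently active headers.
        if lines:
            metadata = {key: title for _, (key, title) in sorted(active.items())}
            chunks.append({"content": "\n".join(lines), "metadata": metadata})
            lines.clear()

    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER_RE.match(line)
        if match and match.group(1) in names:
            flush()
            depth = len(match.group(1))
            # A new header closes every header at the same or a deeper level.
            for stale in [d for d in active if d >= depth]:
                del active[stale]
            active[depth] = (names[match.group(1)], match.group(2))
        elif line:
            # Body text, or a header level we were not asked to split on.
            lines.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    doc = "# Foo\n\n## Bar\n\nHi this is Jim\n\n## Baz\n\nHi this is Molly"
    for chunk in split_by_headers(doc, [("#", "Header 1"), ("##", "Header 2")]):
        print(chunk)
```

The key design point is the `active` mapping: each chunk inherits every header above it, and a header at the same or a shallower depth resets the deeper ones.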

The package used in this article is installed as follows:

pip install langchain-text-splitters

Splitting implementation

We can specify the headers to split on with headers_to_split_on. After splitting, the content is grouped by header:

from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_document = "# Foo\n\n    ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

# Each tuple pairs a header marker with the metadata key its title is stored under
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
print(md_header_splits)

The result is as follows:

[Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]
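The metadata attached to each chunk records the header path it came from. One possible use, sketched below, is to filter chunks by section or to prefix each chunk with its header path before embedding. The sketch uses plain dicts that mirror the result above instead of real Document objects, and the helper names with_header_path and under_header are hypothetical. It also assumes the metadata keys appear in header-depth order, as they do in the output shown.

```python
# Plain dicts standing in for the Document objects in the result above.
chunks = [
    {"page_content": "Hi this is Jim  \nHi this is Joe",
     "metadata": {"Header 1": "Foo", "Header 2": "Bar"}},
    {"page_content": "Hi this is Lance",
     "metadata": {"Header 1": "Foo", "Header 2": "Bar", "Header 3": "Boo"}},
    {"page_content": "Hi this is Molly",
     "metadata": {"Header 1": "Foo", "Header 2": "Baz"}},
]


def with_header_path(chunk, separator=" > "):
    """Prefix the chunk text with its header path, e.g. 'Foo > Bar'."""
    path = separator.join(chunk["metadata"].values())
    if not path:
        return chunk["page_content"]
    return f"{path}\n{chunk['page_content']}"


def under_header(chunks, key, title):
    """Keep only the chunks whose metadata has the given header title."""
    return [c for c in chunks if c["metadata"].get(key) == title]


if __name__ == "__main__":
    print(with_header_path(chunks[1]))
    print(len(under_header(chunks, "Header 2", "Bar")))
```

Whether a header-path prefix helps retrieval depends on the embedding model and the corpus, so treat it as something to evaluate, not a default.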

By default, MarkdownHeaderTextSplitter strips the headers it splits on from the content of the output chunks. This can be disabled by setting strip_headers=False.

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on,
    strip_headers=False,
)
md_header_splits = markdown_splitter.split_text(markdown_document)
print(md_header_splits)

As you can see, the headers are now kept in the content:

[Document(page_content='# Foo  \n## Bar  \nHi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='### Boo  \nHi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
Document(page_content='## Baz  \nHi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]

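How to return Markdown lines as separate documents

By default, MarkdownHeaderTextSplitter aggregates lines according to the headers specified in headers_to_split_on. We can disable this by setting return_each_line=True, so that each line becomes its own document: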

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on,
    return_each_line=True,
)
md_header_splits = markdown_splitter.split_text(markdown_document)
print(md_header_splits)

[Document(page_content='Hi this is Jim', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='Hi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]

How to constrain chunk size:

Within each Markdown group we can then apply any text splitter we like, such as RecursiveCharacterTextSplitter, which gives further control over chunk size.


markdown_document = "# Intro \n\n    ## History \n\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n ## Rise and divergence \n\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n #### Standardization \n\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n ## Implementations \n\n Implementations of Markdown are available for over a dozen programming languages."

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# First pass: split on Markdown headers, keeping the headers in the chunk text
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)

# Second pass: character-level splits within each header group
chunk_size = 250
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split the header-grouped documents further; header metadata is carried over
splits = text_splitter.split_documents(md_header_splits)
print(splits)
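To build intuition for what chunk_size and chunk_overlap control, the sketch below uses a plain fixed-size sliding window. This is a deliberately simplified stand-in and not the algorithm RecursiveCharacterTextSplitter uses, which prefers to break on separators such as paragraph and line breaks. The function name sliding_window_chunks is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters, where each
    window repeats the last chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for piece in sliding_window_chunks("0123456789", chunk_size=4, chunk_overlap=1):
        print(piece)
```

A larger overlap lowers the chance that a sentence is cut off from its context at a chunk boundary, at the cost of storing and embedding more duplicated text.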

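This article is reposted from the WeChat official account 哎呀AIYA.

Original link: https://mp.weixin.qq.com/s/58OJQoi-xuxdFhU02Q6uZg

© Copyright belongs to the author. If you repost this article, please credit the source; otherwise legal liability will be pursued.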
