Translator | Li Rui (李睿)
Reviewer | Chong Lou (重樓)
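This article examines several approaches to chunking for LLMs, including fixed-size chunking, recursive chunking, semantic chunking, and agentic chunking, each with its own strengths.
Large language models (LLMs) have transformed natural language processing (NLP) through their ability to generate human-like text, answer complex questions, and analyze large volumes of information with remarkable accuracy. From customer service to medical research, their ability to handle diverse queries and produce detailed responses makes them valuable across many fields. As LLMs are scaled up to handle ever-growing amounts of data, however, they face challenges in managing long documents and in efficiently retrieving the most relevant information.
Although LLMs excel at processing and generating human-like text, their context window is limited. They can hold only a bounded amount of information at once, which makes very long documents hard to handle. LLMs also struggle to find the most relevant information quickly in large datasets. On top of that, they are trained on a fixed dataset, so they gradually become outdated as new information emerges; staying accurate and useful requires regular updates to the data.
Retrieval-augmented generation (RAG) addresses these challenges. A RAG workflow has many components, such as querying, embedding, and indexing. This article focuses on chunking strategies.
By splitting documents into smaller, meaningful parts and embedding them in a vector database, a RAG system can search for and retrieve the chunks most relevant to each query. This lets the LLM focus on specific information, which improves the accuracy and efficiency of its responses.
The sections below look more closely at the different chunking methods and strategies for LLMs, and at their role in optimizing LLMs for real-world applications.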
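What is chunking?
Chunking means splitting a large data source into smaller, manageable parts, or "chunks". The chunks are stored in a vector database, which allows fast and effective similarity-based search. When a user submits a query, the vector database finds the most relevant chunks and sends them to the LLM. The model can then attend only to the most relevant information, which makes its responses faster and more accurate.
Chunking helps language models handle large datasets more smoothly and give precise answers by narrowing the range of data they need to look at.
For applications that need fast, precise answers, such as customer support or legal document search, chunking is a basic strategy for improving performance and reliability.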
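The retrieval step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, an in-memory list plays the role of the vector database, and the sample chunks and function names are made up for this example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=1):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


# Hypothetical chunks that would normally live in a vector database
chunks = [
    "Refunds are issued within 14 days of a returned order.",
    "Our support team is available by chat on weekdays.",
    "Shipping is free for orders above 50 dollars.",
]
print(retrieve("How long do refunds take?", chunks))
```

Only the retrieved chunk, rather than the whole corpus, would then be passed to the LLM as context.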
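The main chunking strategies used in RAG are:
- Fixed-size chunking
- Recursive chunking
- Semantic chunking
- Agentic chunking
The following sections look at each strategy in detail.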
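1. Fixed-size chunking
Fixed-size chunking splits the data into parts of equal size, which makes large documents easier to handle.
Developers sometimes add a small overlap between chunks, so that a short piece of one chunk is repeated at the start of the next. The overlap helps the model retain context across chunk boundaries and keeps key information from being lost at the edges. This is especially useful for tasks that need a continuous flow of information, because it lets the model interpret the text more accurately and understand how passages relate to each other, producing more coherent, context-aware responses.
The figure above is a clear example of fixed-size chunking, with each chunk shown in its own color. The green sections mark the overlap between chunks, which ensures that the model has access to relevant context from the previous chunk when it processes the next one.
This overlap strategy improves the model's ability to process and understand the full text, which leads to better performance on tasks such as summarization or translation, where preserving the flow of information across chunk boundaries is essential.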
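The sliding-window idea behind fixed-size chunking with overlap can be sketched without any library. The function name `fixed_size_chunks` and the alphabet sample are assumptions made for illustration; sizes are counted in characters.

```python
def fixed_size_chunks(text, chunk_size, overlap):
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters, so the end of one
    chunk is repeated at the start of the next.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in fixed_size_chunks(sample, chunk_size=10, overlap=3):
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Because this version cuts purely by position, it can split words in half; splitters that respect word or sentence boundaries trade exact sizes for cleaner cuts.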
Code example
Now let's recreate this example in code, using LangChain to implement fixed-size chunking.
Python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split text into chunks of at most chunk_size characters, with overlap
def split_text_with_overlap(text, chunk_size, overlap_size):
    # Create a text splitter with overlap
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap_size
    )

    # Split the text
    chunks = text_splitter.split_text(text)

    return chunks

# Example text
text = """Artificial Intelligence (AI) simulates human intelligence in machines for tasks like visual perception, speech recognition, and language translation. It has evolved from rule-based systems to data-driven models, enhancing performance through machine learning and deep learning."""

# Define chunk size and overlap size
chunk_size = 80    # at most 80 characters per chunk
overlap_size = 10  # up to 10 characters of overlap between chunks

# Get the chunks with overlap
chunks = split_text_with_overlap(text, chunk_size, overlap_size)

# Print the chunks and overlaps
for i in range(len(chunks)):
    print(f"Chunk {i+1}:")
    print(chunks[i])  # Print the chunk itself

    # If there's a next chunk, print the tail of the current chunk,
    # i.e. the region that may be repeated at the start of the next chunk
    if i < len(chunks) - 1:
        overlap = chunks[i][-overlap_size:]
        print(f"Overlap with Chunk {i+2}:")
        print(overlap)

    print("\n" + "="*50 + "\n")
Running the code above produces the following output:
Text
Chunk 1:
Artificial Intelligence (AI) simulates human intelligence in machines for tasks
Overlap with Chunk 2:
for tasks

==================================================

Chunk 2:
for tasks like visual perception, speech recognition, and language translation.
Overlap with Chunk 3:
anslation.

==================================================

Chunk 3:
It has evolved from rule-based systems to data-driven models, enhancing
Overlap with Chunk 4:
enhancing

==================================================

Chunk 4:
enhancing performance through machine learning and deep learning.
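Note that the "overlap" lines in this output are simply the last 10 characters of each chunk, not necessarily text that the next chunk repeats: Chunk 3 begins with "It has", not "anslation.". One way to measure the text two chunks really share is to look for the longest suffix of one chunk that is also a prefix of the next. The helper name `shared_overlap` is an assumption for this sketch and is not part of LangChain.

```python
def shared_overlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


print(repr(shared_overlap("in machines for tasks", "for tasks like visual perception")))
# 'for tasks'
print(repr(shared_overlap("and language translation.", "It has evolved")))
# ''
```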
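2. Recursive chunking
Recursive chunking breaks a large body of text into more manageable parts by repeatedly splitting it into smaller sub-chunks. It is particularly effective for complex or hierarchically structured documents, because each resulting part stays coherent and keeps its context. The process continues until the text has been split into pieces small enough for the model to process effectively.
Take a long document that has to be processed by a language model with a limited context window. Recursive chunking first splits the document into a few major sections. If those sections are still too large, it divides them into smaller subsections, and keeps going until every chunk fits within what the model can handle. This hierarchical splitting preserves the logical flow and context of the original document and lets the LLM process long text more effectively.
In practice, recursive chunking can be implemented with different strategies depending on the structure of the document and the needs of the task, for example by splitting on headings, paragraphs, or sentences.
In the figure above, recursive chunking splits the text into four chunks shown in different colors. Each chunk is a smaller, more manageable part of at most 80 characters, and the chunks do not overlap. The color coding shows how the content is divided into logical parts, which makes long text easier for the model to process and understand without losing important context.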
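The hierarchy of separators (paragraphs, then lines, then sentences, then words) can be sketched in plain Python as follows. This is a simplified illustration, not LangChain's implementation; the function name `recursive_split` and the separator list are assumptions, and separators consumed by a split (such as the ". " between sentences) are not restored.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_len characters.

    Coarse separators are tried first; a part that is still too long
    is split again with the finer separators that remain.
    """
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_len:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_len:
            current = part
        else:
            # The part alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(part, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


doc = "Title\n\nFirst sentence here. Second sentence here.\n\nShort end."
for chunk in recursive_split(doc, max_len=25):
    print(repr(chunk))
# 'Title'
# 'First sentence here'
# 'Second sentence here.'
# 'Short end.'
```

The middle paragraph is too long for one chunk, so it alone is split again at the sentence level, while the short paragraphs are kept whole.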
Code example
The following example shows how to implement recursive chunking.
Python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Function to split text into chunks using recursive chunking
def split_text_recursive(text, chunk_size=80):
    # Initialize the RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,  # Maximum size of each chunk (80 characters)
        chunk_overlap=0         # No overlap between chunks
    )

    # Split the text into chunks
    chunks = text_splitter.split_text(text)

    return chunks

# Example text
text = """Artificial Intelligence (AI) simulates human intelligence in machines for tasks like visual perception, speech recognition, and language translation. It has evolved from rule-based systems to data-driven models, enhancing performance through machine learning and deep learning."""

# Split the text using recursive chunking
chunks = split_text_recursive(text, chunk_size=80)

# Print the resulting chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:")
    print(chunk)
    print("="*50)
The code above produces the following output:
Text
Chunk 1:
Artificial Intelligence (AI) simulates human intelligence in machines for tasks
==================================================
Chunk 2:
like visual perception, speech recognition, and language translation. It has
==================================================
Chunk 3:
evolved from rule-based systems to data-driven models, enhancing performance
==================================================
Chunk 4:
through machine learning and deep learning.
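Having covered these two length-based strategies, it is time to look at one that pays more attention to the meaning and context of the text.
3. Semantic chunking
Semantic chunking splits text into chunks according to the meaning or context of the content. It typically uses machine learning or natural language processing (NLP) techniques, such as sentence embeddings, to identify parts of the text that share similar meaning or semantic structure.
In the figure above, each chunk is shown in a different color: blue for artificial intelligence and yellow for prompt engineering. The chunks are kept separate because they cover different ideas. This gives the model a clear, accurate view of each topic and avoids confusion and interference between topics.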
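The core idea, starting a new chunk wherever the similarity between neighboring sentences drops, can be sketched as follows. This is a simplified illustration: `toy_embed` is a bag-of-words stand-in for real sentence embeddings, and the function names, the 0.2 threshold, and the sample sentences are all assumptions for this example.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Toy sentence 'embedding': word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(text, threshold=0.2):
    """Group consecutive sentences; break where adjacent similarity falls below threshold."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:
            groups.append([cur])    # topic shift: start a new chunk
        else:
            groups[-1].append(cur)  # same topic: extend the current chunk
    return [" ".join(g) for g in groups]


text = (
    "AI models learn patterns from data. AI models improve as data grows. "
    "Prompt design guides the output. Good prompt design improves the output."
)
for chunk in semantic_chunks(text):
    print(chunk)
# AI models learn patterns from data. AI models improve as data grows.
# Prompt design guides the output. Good prompt design improves the output.
```

With real embeddings the comparison works on meaning rather than shared words, but the breakpoint logic stays the same.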
Code example
The following example implements semantic chunking.
Python
import os
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Set the OpenAI API key as an environment variable (replace with your actual API key)
os.environ["OPENAI_API_KEY"] = "replace with your actual OpenAI API key"

# Function to split text into semantic chunks
def split_text_semantically(text, breakpoint_type="percentile"):
    # Initialize the SemanticChunker with OpenAI embeddings
    text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type=breakpoint_type)

    # Create documents (chunks)
    docs = text_splitter.create_documents([text])

    # Return the list of chunks
    return [doc.page_content for doc in docs]

def main():
    # Example content (replace with your own text)
    document_content = """
    Artificial Intelligence (AI) simulates human intelligence in machines for tasks like visual perception, speech recognition, and language translation. It has evolved from rule-based systems to data-driven models, enhancing performance through machine learning and deep learning.

    Prompt Engineering involves designing input prompts to guide AI models in producing accurate and relevant responses, improving tasks such as text generation and summarization.
    """

    # Split text using the chosen threshold type (percentile)
    threshold_type = "percentile"
    print(f"\nChunks using {threshold_type} threshold:")
    chunks = split_text_semantically(document_content, breakpoint_type=threshold_type)

    # Print each chunk's content
    for idx, chunk in enumerate(chunks):
        print(f"Chunk {idx + 1}:\n{chunk}\n")

if __name__ == "__main__":
    main()
The code above produces the following output:
Text
Chunks using percentile threshold:
Chunk 1:
Artificial Intelligence (AI) simulates human intelligence in machines for tasks like visual perception, speech recognition, and language translation. It has evolved from rule-based systems to data-driven models, enhancing performance through machine learning and deep learning.

Chunk 2:
Prompt Engineering involves designing input prompts to guide AI models in producing accurate and relevant responses, improving tasks such as text generation and summarization.
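4. Agentic chunking
Among these strategies, agentic chunking is a powerful one. It uses an LLM such as GPT as an agent in the chunking process. Rather than relying on hand-written rules to decide how content is split, the LLM draws on its own understanding to actively organize or partition the input. It decides, based on the context of the task, how to break the content into manageable parts, and so arrives at a suitable split.
The figure above shows a chunking agent splitting a large text into smaller, meaningful parts. The agent is AI-driven, which helps it understand the text and divide it into meaningful chunks. This is called agentic chunking, and it is a smarter way of handling text than simply cutting it into equal parts.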
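Reduced to its essentials, the idea is to hand the model a numbered list of sentences and let it choose where the chunk boundaries go. The sketch below is one possible shape under stated assumptions: `ask_llm` is a hypothetical callable (prompt in, reply string out) that you would back with a real model, the prompt wording and reply format are made up for this example, and `fake_llm` is a stub so that the code runs offline.

```python
import re


def agentic_chunks(text, ask_llm):
    """Let a model choose chunk boundaries.

    ask_llm is a hypothetical callable (prompt -> reply string). The reply is
    expected to list the numbers of the sentences that start a new chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    numbered = "\n".join(f"{i}: {s}" for i, s in enumerate(sentences, start=1))
    prompt = (
        "Group the numbered sentences into topically coherent chunks.\n"
        "Reply only with the numbers of the sentences that START a new chunk, "
        "comma-separated.\n\n" + numbered
    )
    reply = ask_llm(prompt)
    # Keep only valid sentence numbers; sentence 1 always starts a chunk.
    proposed = {int(n) for n in re.findall(r"\d+", reply)}
    starts = sorted({1} | {n for n in proposed if 1 < n <= len(sentences)})
    bounds = starts + [len(sentences) + 1]
    return [" ".join(sentences[a - 1:b - 1]) for a, b in zip(bounds, bounds[1:])]


def fake_llm(prompt):
    """Stub standing in for a real model call (assumption: reply format '1, 3')."""
    return "1, 3"


text = (
    "AI helps doctors read medical images. It also predicts patient outcomes. "
    "Banks use AI to detect fraud. It automates customer service too."
)
for chunk in agentic_chunks(text, fake_llm):
    print(chunk)
# AI helps doctors read medical images. It also predicts patient outcomes.
# Banks use AI to detect fraud. It automates customer service too.
```

Validating the reply before using it matters here, since a model's output may be malformed or name sentences that do not exist.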
The following code example shows one way to implement it.
Python
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.agents import initialize_agent, Tool, AgentType

# Initialize OpenAI chat model (replace with your API key)
llm = ChatOpenAI(model="gpt-3.5-turbo", api_key="replace with your actual OpenAI API key")

# Step 1: Define Chunking and Summarization Prompt Template
chunk_prompt_template = """
You are given a large piece of text. Your job is to break it into smaller parts (chunks) if necessary and summarize each chunk.
Once all parts are summarized, combine them into a final summary.
If the text is already small enough to process at once, provide a full summary in one step.
Please summarize the following text:\n{input}
"""
chunk_prompt = PromptTemplate(input_variables=["input"], template=chunk_prompt_template)

# Step 2: Define Chunk Processing Tool
def chunk_processing_tool(query):
    """Processes text chunks and generates summaries using the defined prompt."""
    chunk_chain = LLMChain(llm=llm, prompt=chunk_prompt)
    print(f"Processing chunk:\n{query}\n")  # Show the chunk being processed
    return chunk_chain.run(input=query)

# Step 3: Define External Tool (Optional, can be used to fetch extra information if needed)
def external_tool(query):
    """Simulates an external tool that could fetch additional information."""
    return f"External response based on the query: {query}"

# Step 4: Initialize the agent with tools
tools = [
    Tool(
        name="Chunk Processing",
        func=chunk_processing_tool,
        description="Processes text chunks and generates summaries."
    ),
    Tool(
        name="External Query",
        func=external_tool,
        description="Fetches additional data to enhance chunk processing."
    )
]

# Initialize the agent with defined tools and zero-shot capabilities
# (the agent type is passed through the `agent` keyword)
agent = initialize_agent(
    tools=tools,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    llm=llm,
    verbose=True
)

# Step 5: Agentic Chunk Processing Function
def agent_process_chunks(text):
    """Uses the agent to process text chunks and generate a final output."""
    # Step 1: Chunking the text into smaller, manageable sections
    def chunk_text(text, chunk_size=500):
        """Splits large text into smaller chunks of chunk_size characters."""
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks = chunk_text(text)

    # Step 2: Process each chunk with the agent
    chunk_results = []
    for idx, chunk in enumerate(chunks):
        print(f"Processing chunk {idx + 1}/{len(chunks)}...")
        response = agent.invoke({"input": chunk})  # Process chunk using the agent
        chunk_results.append(response['output'])   # Collect the chunk result

    # Step 3: Combine the chunk results into a final output
    final_output = "\n".join(chunk_results)
    return final_output

# Step 6: Running the agent on an example large text input
if __name__ == "__main__":
    # Example large text content
    text_to_process = """
    Artificial intelligence (AI) is transforming industries by enabling machines to perform tasks that
    previously required human intelligence. From healthcare to finance, AI is driving innovation and improving
    efficiency. For instance, in healthcare, AI algorithms assist doctors in diagnosing diseases, interpreting
    medical images, and predicting patient outcomes. Meanwhile, in finance, AI helps detect fraud, manage
    investments, and automate customer service.

    However, the widespread adoption of AI also raises ethical concerns. Issues like privacy invasion,
    algorithmic bias, and the potential loss of jobs due to automation are significant challenges. Experts
    argue that it's essential to develop AI responsibly to ensure that it benefits society as a whole.
    Proper regulations, transparency, and accountability can help address these issues, ensuring that AI
    technologies are used for the greater good.

    Beyond individual industries, AI is also impacting the global economy. Nations are investing heavily
    in AI research and development to maintain a competitive edge. This technological race could redefine
    global power dynamics, with countries that excel in AI leading the way in economic and military strength.
    Despite the potential for AI to contribute positively to society, its development and application require
    careful consideration of ethical, legal, and societal implications.
    """

    # Process the text and print the final result
    final_result = agent_process_chunks(text_to_process)
    print("\nFinal Output:\n", final_result)
The code above produces the following output:
Text
Processing chunk 1/3...


> Entering new AgentExecutor chain...
I should use Chunk Processing to extract the key information from the text provided.
Action: Chunk Processing
Action Input: Artificial intelligence (AI) is transforming industries by enabling machines to perform tasks that previously required human intelligence. From healthcare to finance, AI is driving innovation and improving efficiency. For instance, in healthcare, AI algorithms assist doctors in diagnosing diseases, interpreting medical images, and predicting patient outcomes. Meanwhile, in finance, AI helps detect fraud, manage investments, and automate customer service.Processing chunk:
Artificial intelligence (AI) is transforming industries by enabling machines to perform tasks that previously required human intelligence. From healthcare to finance, AI is driving innovation and improving efficiency. For instance, in healthcare, AI algorithms assist doctors in diagnosing diseases, interpreting medical images, and predicting patient outcomes. Meanwhile, in finance, AI helps detect fraud, manage investments, and automate customer service.

Observation: Artificial intelligence (AI) is revolutionizing various industries by allowing machines to complete tasks that once needed human intelligence. In healthcare, AI algorithms aid doctors in diagnosing illnesses, analyzing medical images, and forecasting patient results. In finance, AI is used to identify fraud, oversee investments, and streamline customer service. AI is playing a vital role in enhancing efficiency and driving innovation across different sectors.
Thought:I need more specific information about the impact of AI in different industries.
Action: External Query
Action Input: Impact of artificial intelligence in healthcare
Observation: External response based on the query: Impact of artificial intelligence in healthcare
Thought:I should now look for information on the impact of AI in finance.
Action: External Query
Action Input: Impact of artificial intelligence in finance
Observation: External response based on the query: Impact of artificial intelligence in finance
Thought:I now have a better understanding of how AI is impacting healthcare and finance.
Final Answer: Artificial intelligence is revolutionizing industries like healthcare and finance by enhancing efficiency, driving innovation, and enabling machines to perform tasks that previously required human intelligence. In healthcare, AI aids in diagnosing diseases, interpreting medical images, and predicting patient outcomes, while in finance, it helps detect fraud, manage investments, and automate customer service.

> Finished chain.
Processing chunk 2/3...

> Entering new AgentExecutor chain...
This question is discussing ethical concerns related to the widespread adoption of AI and the need to develop AI responsibly.
Action: Chunk Processing
Action Input: The text providedProcessing chunk:
The text provided

Observation: I'm sorry, but you haven't provided any text to be summarized. Could you please provide the text so I can assist you with summarizing it?
Thought:I need to provide the text for chunk processing to summarize.
Action: External Query
Action Input: Retrieve the text related to the ethical concerns of AI adoption and responsible development
Observation: External response based on the query: Retrieve the text related to the ethical concerns of AI adoption and responsible development
Thought:Now that I have the text related to ethical concerns of AI adoption and responsible development, I can move forward with chunk processing.
Action: Chunk Processing
Action Input: The retrieved textProcessing chunk:
The retrieved text

Observation: I'm sorry, but it seems like you have not provided any text for me to summarize. Could you please provide the text you would like me to summarize? Thank you!
Thought:I need to ensure that the text related to ethical concerns of AI adoption and responsible development is provided for chunk processing to generate a summary.
Action: External Query
Action Input: Retrieve the text related to the ethical concerns of AI adoption and responsible development
Observation: External response based on the query: Retrieve the text related to the ethical concerns of AI adoption and responsible development
Thought:Now that I have the text related to ethical concerns of AI adoption and responsible development, I can proceed with chunk processing to generate a summary.
Action: Chunk Processing
Action Input: The retrieved textProcessing chunk:
The retrieved text

Observation: I'm sorry, but you haven't provided any text to be summarized. Can you please provide the text so I can help you with the summarization?
Thought:I need to make sure that the text related to ethical concerns of AI adoption and responsible development is entered for chunk processing to summarize.
Action: Chunk Processing
Action Input: Text related to ethical concerns of AI adoption and responsible developmentProcessing chunk:
Text related to ethical concerns of AI adoption and responsible development

Observation: The text discusses the ethical concerns surrounding the adoption of artificial intelligence (AI) and the importance of responsible development. It highlights issues such as bias in AI algorithms, privacy violations, and the potential for autonomous AI systems to make harmful decisions. The text emphasizes the need for transparency, accountability, and ethical guidelines to ensure that AI technologies are developed and deployed in a responsible manner.
Thought:The text provides information on ethical concerns related to AI adoption and responsible development, emphasizing the need for regulation, transparency, and accountability.
Final Answer: The text discusses the ethical concerns surrounding the adoption of artificial intelligence (AI) and the importance of responsible development.

> Finished chain.
Processing chunk 3/3...

> Entering new AgentExecutor chain...
This question seems to be about the impact of AI on the global economy and the potential implications.
Action: Chunk Processing
Action Input: The text providedProcessing chunk:
The text provided

Observation: I'm sorry, but you did not provide any text for me to summarize. Please provide the text that you would like me to summarize.
Thought:I need to provide the text for Chunk Processing to summarize.
Action: External Query
Action Input: Fetch the text about the impact of AI on the global economy and its implications.
Observation: External response based on the query: Fetch the text about the impact of AI on the global economy and its implications.
Thought:Now that I have the text about the impact of AI on the global economy and its implications, I can proceed with Chunk Processing.
Action: Chunk Processing
Action Input: The text about the impact of AI on the global economy and its implications.Processing chunk:
The text about the impact of AI on the global economy and its implications.

Observation: The text discusses the significant impact that artificial intelligence (AI) is having on the global economy. It highlights how AI is revolutionizing industries by increasing productivity, reducing costs, and creating new job opportunities. However, there are concerns about job displacement and the need for retraining workers to adapt to the changing landscape. Overall, AI is reshaping the economy and prompting a shift in the way businesses operate.
Thought:Based on the summary generated by Chunk Processing, the impact of AI on the global economy seems to be significant, with both positive and negative implications.
Final Answer: The impact of AI on the global economy is significant, revolutionizing industries, increasing productivity, reducing costs, creating new job opportunities, but also raising concerns about job displacement and the need for worker retraining.

> Finished chain.

Final Output:
Artificial intelligence is revolutionizing industries like healthcare and finance by enhancing efficiency, driving innovation, and enabling machines to perform tasks that previously required human intelligence. In healthcare, AI aids in diagnosing diseases, interpreting medical images, and predicting patient outcomes, while in finance, it helps detect fraud, manage investments, and automate customer service.
The text discusses the ethical concerns surrounding the adoption of artificial intelligence (AI) and the importance of responsible development.
The impact of AI on the global economy is significant, revolutionizing industries, increasing productivity, reducing costs, creating new job opportunities, but also raising concerns about job displacement and the need for worker retraining.
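Comparison of chunking strategies
To make the different approaches easier to compare, the table below summarizes how fixed-size, recursive, semantic, and agentic chunking work, when to use them, and what their limitations are.
Chunking type | Description | Method | Best suited for | Limitations |
Fixed-size chunking | Splits text into equal-sized chunks regardless of content. | Chunks are created from a fixed word or character limit. | Simple, structured text where context continuity matters little. | May lose context or split sentences and ideas. |
Recursive chunking | Keeps splitting text into smaller chunks until they reach a manageable size. | Hierarchical splitting; parts that are too large are split further. | Long, complex, or hierarchical documents (for example, technical manuals). | May still lose context if the parts are too broad. |
Semantic chunking | Splits text into chunks by meaning or related topic. | Uses NLP techniques such as sentence embeddings to separate related content. | Context-sensitive tasks where coherence and topic continuity are essential. | Requires NLP techniques; more complex to implement. |
Agentic chunking | Uses an AI model (such as GPT) to split content autonomously into meaningful parts. | AI-driven splitting based on the model's understanding and the context of the task. | Complex tasks with varied content structure, where AI can optimize the chunking. | Can be unpredictable and may need tuning. |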
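Conclusion
Chunking strategies and retrieval-augmented generation (RAG) are essential to improving LLM performance. Chunking reduces complex data to smaller, more manageable parts so that it can be processed more efficiently, while RAG improves LLMs by adding real-time data retrieval to the generation workflow. Together, these methods let LLMs deliver more precise, context-aware responses by combining organized data with fresh, real-time information.
Original title: Chunking Strategies for Optimizing Large Language Models (LLMs), by Usama Jamil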