Advanced RAG Optimization: Enhancing Document Retrieval with Question Generation (Original)

Published 2024-9-14 14:18


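This article introduces a text augmentation technique that uses additional question generation to improve document retrieval from a vector database. Questions related to each text fragment are generated and indexed alongside the fragment. This augments the standard retrieval process and raises the chance of finding the relevant documents, which can then serve as context for generative question answering.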
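The intuition can be illustrated with a toy example. The sketch below is not the article's code: it uses a bag-of-words cosine similarity as a crude stand-in for a real embedding model, and the fragment, generated question, and user query are all hand-written assumptions. It only shows that a query phrased as a question tends to sit closer to another question than to a declarative fragment.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hand-written stand-ins (assumptions, not data from the article).
fragment = "Alexander Graham Bell was granted a patent for the telephone in 1876."
generated_question = "Who was granted the patent for the telephone?"
user_query = "Who got the patent for the telephone?"

query_vec = bag_of_words(user_query)
score_fragment = cosine(query_vec, bag_of_words(fragment))
score_question = cosine(query_vec, bag_of_words(generated_question))

print(f"query vs. fragment:           {score_fragment:.2f}")
print(f"query vs. generated question: {score_question:.2f}")
```

With a real embedding model the gap may be smaller, since embeddings capture meaning rather than word overlap, but the same effect is what the technique relies on.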

Implementation steps

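By enriching text fragments with related questions, we aim to make retrieval more accurate at identifying the part of the document that contains the answer to the user's query. An implementation generally consists of the following steps: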

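Document parsing and text chunking: process the PDF document and split it into manageable text fragments.
Question augmentation: use a language model to generate relevant questions at the document level or the fragment level.
Vector store creation: compute embeddings for the documents with an embedding model and create a FAISS vector store.
Retrieval and answer generation: use FAISS to find the most relevant document and generate an answer from the provided context.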
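The chunking step can be sketched as follows. The article's code calls a helper named split_document(text, max_tokens, overlap_tokens) but does not show its body, so the version below is a hypothetical implementation. It assumes whitespace-separated words as tokens, whereas a real pipeline would more likely count model tokens with a tokenizer.

```python
from typing import List


def split_document(text: str, max_tokens: int, overlap_tokens: int) -> List[str]:
    """Split text into chunks of at most max_tokens whitespace tokens.

    Consecutive chunks share overlap_tokens tokens so that a sentence cut
    at a chunk boundary still appears whole in one of the chunks.
    """
    if max_tokens <= 0 or not 0 <= overlap_tokens < max_tokens:
        raise ValueError("require max_tokens > 0 and 0 <= overlap_tokens < max_tokens")
    tokens = text.split()
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this chunk already reaches the end of the text
    return chunks


chunks = split_document(
    "one two three four five six seven eight nine ten",
    max_tokens=4,
    overlap_tokens=1,
)
for chunk in chunks:
    print(chunk)
```

The same helper is applied twice in the main flow: once with large limits to obtain documents, and again with small limits to obtain the fragments inside each document.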

A setting controls whether questions are generated at the document level or at the fragment level.

from enum import Enum


class QuestionGeneration(Enum):
    """
    Enum class to specify the level of question generation for document processing.

    Attributes:
        DOCUMENT_LEVEL (int): Represents question generation at the entire document level.
        FRAGMENT_LEVEL (int): Represents question generation at the individual text fragment level.
    """
    DOCUMENT_LEVEL = 1
    FRAGMENT_LEVEL = 2
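The code later in this article reads several module-level settings whose values are not stated. A possible configuration is sketched below. Every number is an illustrative assumption, not a value taken from the article, and the enum is repeated only so that the snippet runs on its own.

```python
from enum import Enum


class QuestionGeneration(Enum):
    DOCUMENT_LEVEL = 1
    FRAGMENT_LEVEL = 2


# Illustrative values only; the article does not say which ones it uses.
DOCUMENT_MAX_TOKENS = 4000      # size of a "document" chunk
DOCUMENT_OVERLAP_TOKENS = 100   # tokens shared by neighbouring documents
FRAGMENT_MAX_TOKENS = 128       # size of a fragment inside a document
FRAGMENT_OVERLAP_TOKENS = 16    # tokens shared by neighbouring fragments
QUESTIONS_PER_DOCUMENT = 40     # minimum number of questions requested per LLM call

# Switch to QuestionGeneration.FRAGMENT_LEVEL to generate questions per fragment.
QUESTION_GENERATION = QuestionGeneration.DOCUMENT_LEVEL

print(QUESTION_GENERATION.name)
```

Fragment-level generation yields questions tied more closely to each indexed fragment, at the price of one model call per fragment instead of one per document.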


Implementation

Question generation

from typing import List

from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# QuestionList, QUESTIONS_PER_DOCUMENT and clean_and_filter_questions are
# defined elsewhere and are not shown in this article.


def generate_questions(text: str) -> List[str]:
    """
    Generates a list of questions based on the provided text using OpenAI.

    Args:
        text (str): The context data from which questions are generated.

    Returns:
        List[str]: A list of unique, filtered questions.
    """
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = PromptTemplate(
        input_variables=["context", "num_questions"],
        template="Using the context data: {context}\n\nGenerate a list of at least {num_questions} "
                 "possible questions that can be asked about this context. Ensure the questions are "
                 "directly answerable within the context and do not include any answers or headers. "
                 "Separate the questions with a new line character."
    )
    # Ask the model to return a structured QuestionList instead of free text
    chain = prompt | llm.with_structured_output(QuestionList)
    input_data = {"context": text, "num_questions": QUESTIONS_PER_DOCUMENT}
    result = chain.invoke(input_data)

    # Extract the list of questions from the QuestionList object
    questions = result.question_list

    filtered_questions = clean_and_filter_questions(questions)
    # set() removes duplicates; the original order of the questions is not preserved
    return list(set(filtered_questions))
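generate_questions relies on a helper, clean_and_filter_questions, whose body is not shown. The following is a minimal sketch of what such a helper might do, assuming its job is to strip list numbering and keep only entries that look like questions. The exact rules are an assumption. QuestionList is likewise not shown; it is presumably a structured-output schema exposing a question_list field, as the code above reads that attribute.

```python
import re
from typing import List


def clean_and_filter_questions(questions: List[str]) -> List[str]:
    """Hypothetical helper: normalise model output and keep real questions.

    Removes a leading "1." or "2)" style number, trims whitespace, and
    drops any entry that does not end with a question mark.
    """
    cleaned = []
    for question in questions:
        text = re.sub(r"^\s*\d+[.)]\s*", "", question).strip()
        if text.endswith("?"):
            cleaned.append(text)
    return cleaned


raw = ["1. What is RAG?", "  2) How are fragments indexed?  ", "Questions:", ""]
print(clean_and_filter_questions(raw))
```

Filtering on the trailing question mark is a cheap way to discard headers or stray commentary that the prompt asks the model not to produce but that can still slip through.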

Main processing flow

from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# split_document, print_document and the *_TOKENS / QUESTION_GENERATION
# settings are defined elsewhere and are not shown in this article.


def process_documents(content: str, embedding_model: OpenAIEmbeddings):
    """
    Process the document content, split it into fragments, generate questions,
    create a FAISS vector store, and return a retriever.

    Args:
        content (str): The content of the document to process.
        embedding_model (OpenAIEmbeddings): The embedding model to use for vectorization.

    Returns:
        VectorStoreRetriever: A retriever for the most relevant FAISS document.
    """
    # Split the whole text content into text documents
    text_documents = split_document(content, DOCUMENT_MAX_TOKENS, DOCUMENT_OVERLAP_TOKENS)
    print(f'Text content split into: {len(text_documents)} documents')

    documents = []
    counter = 0
    for i, text_document in enumerate(text_documents):
        # Split each text document into smaller fragments
        text_fragments = split_document(text_document, FRAGMENT_MAX_TOKENS, FRAGMENT_OVERLAP_TOKENS)
        print(f'Text document {i} - split into: {len(text_fragments)} fragments')

        for j, text_fragment in enumerate(text_fragments):
            # Index the original fragment; metadata["text"] keeps the parent document
            documents.append(Document(
                page_content=text_fragment,
                metadata={"type": "ORIGINAL", "index": counter, "text": text_document}
            ))
            counter += 1

            if QUESTION_GENERATION == QuestionGeneration.FRAGMENT_LEVEL:
                # One LLM call per fragment; each question becomes its own entry
                questions = generate_questions(text_fragment)
                documents.extend([
                    Document(page_content=question, metadata={"type": "AUGMENTED", "index": counter + idx, "text": text_document})
                    for idx, question in enumerate(questions)
                ])
                counter += len(questions)
                print(f'Text document {i} Text fragment {j} - generated: {len(questions)} questions')

        if QUESTION_GENERATION == QuestionGeneration.DOCUMENT_LEVEL:
            # One LLM call per document; each question becomes its own entry
            questions = generate_questions(text_document)
            documents.extend([
                Document(page_content=question, metadata={"type": "AUGMENTED", "index": counter + idx, "text": text_document})
                for idx, question in enumerate(questions)
            ])
            counter += len(questions)
            print(f'Text document {i} - generated: {len(questions)} questions')

    for document in documents:
        print_document("Dataset", document)

    print(f'Creating store, calculating embeddings for {len(documents)} FAISS documents')
    vectorstore = FAISS.from_documents(documents, embedding_model)

    print("Creating retriever returning the most relevant FAISS document")
    return vectorstore.as_retriever(search_kwargs={"k": 1})
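The article stops at the retriever, so the answer-generation step is sketched here under stated assumptions. Because every indexed entry, whether an original fragment or a generated question, stores its parent document under metadata["text"], the context for answering can be taken from there even when the best match is a question. The Hit class below is a hypothetical stand-in for a retrieved document, and the prompt wording is illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Hit:
    """Hypothetical stand-in for a retrieved document (content plus metadata)."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def context_for(hit: Hit) -> str:
    """Return the text to pass to the answering model.

    Prefers the parent document stored under metadata["text"], and falls
    back to the matched content itself when that key is missing.
    """
    return str(hit.metadata.get("text", hit.page_content))


def build_answer_prompt(context: str, question: str) -> str:
    """Assemble an illustrative prompt that restricts the answer to the context."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# The best match is a generated question, but the context is its parent document.
hit = Hit(
    page_content="What does FAISS store?",
    metadata={"type": "AUGMENTED", "index": 3, "text": "FAISS stores embedding vectors."},
)
prompt = build_answer_prompt(context_for(hit), "What is kept in the FAISS index?")
print(prompt)
```

Passing the parent document rather than the matched question matters here: a generated question contains no answer, so using its own content as context would leave the model with nothing to answer from.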

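This technique offers a way to improve the quality of information retrieval in vector-based document retrieval systems. The implementation calls a large language model API, which may incur costs depending on usage.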
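The number of model calls follows directly from the main flow: generate_questions runs once per document at the document level and once per fragment at the fragment level. A small sketch makes the difference concrete. The function name and the fragment counts are hypothetical, and the result counts calls only, not tokens or price.

```python
from typing import List


def count_question_generation_calls(fragments_per_document: List[int], fragment_level: bool) -> int:
    """Number of generate_questions calls made while building the index.

    fragments_per_document holds, for each document, how many fragments
    it was split into.
    """
    if fragment_level:
        return sum(fragments_per_document)   # one call per fragment
    return len(fragments_per_document)       # one call per document


# Hypothetical corpus: three documents split into 30, 30 and 12 fragments.
fragments_per_document = [30, 30, 12]
document_level_calls = count_question_generation_calls(fragments_per_document, fragment_level=False)
fragment_level_calls = count_question_generation_calls(fragments_per_document, fragment_level=True)
print(document_level_calls, fragment_level_calls)
```

Every generated question is also embedded when the vector store is built, so embedding usage grows with the number of questions as well.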

This article is reposted from the WeChat official account 哎呀AIYA.

Original link: https://mp.weixin.qq.com/s/bjI02uOeAGXSelCApb0yOQ

Copyright belongs to the author. Reposts must credit the source; otherwise legal liability will be pursued.


Last modified 2024-9-14 14:18:55
