Using Knowledge Graphs to Greatly Improve RAG Accuracy

Author: 學研君 2024-08-06 08:43:17

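With the introduction of LLMGraphTransformer, generating knowledge graphs becomes smoother and more accessible, making it easier for anyone who wants to enhance a RAG application with the depth and context that knowledge graphs provide.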

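Graph retrieval-augmented generation (GraphRAG) is gaining momentum and has become a strong complement to traditional vector-search retrieval. The approach exploits the structure of graph databases, which organize data as nodes and relationships, to add depth and contextual relevance to the retrieved information.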

Figure: Example of a knowledge graph.

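Graphs excel at representing and storing heterogeneous, interconnected information in a structured way, and can easily capture complex relationships and attributes across different data types. Vector databases, by contrast, often struggle with this kind of structured information, because their strength lies in handling unstructured data through high-dimensional vectors. In a RAG application, you can combine structured graph data with vector search over unstructured text to get the best of both. That is what this post demonstrates.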

1. Knowledge graphs are great, but how do you create one?

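Building the knowledge graph is usually the most challenging step. It involves collecting and structuring data, which requires a deep understanding of both the domain and graph modeling.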

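To simplify this process, you can try using a large language model (LLM). With their strong grasp of language and context, LLMs can automate much of the knowledge graph creation process. By analyzing text data, these models can identify entities, understand the relationships between them, and suggest how best to represent them in a graph structure.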

As a result of these experiments, we have added a first version of the graph construction module to LangChain, which this post demonstrates.

The code is available on GitHub.

[GitHub]: https://github.com/tomasonjo/blogs/blob/master/llm/enhancing_rag_with_graph.ipynb

Neo4j environment setup

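To follow the examples in this post, you need a Neo4j instance. The easiest way is to start a free instance on Neo4j Aura (https://neo4j.com/cloud/platform/aura-graph-database/), which provides cloud instances of the Neo4j database. Alternatively, you can set up a local instance by downloading the Neo4j Desktop application (https://neo4j.com/download/) and creating a local database instance.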

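import os

from langchain_community.graphs import Neo4jGraph

# Placeholder credentials; replace them with your own values
os.environ["OPENAI_API_KEY"] = "sk-"
os.environ["NEO4J_URI"] = "bolt://localhost:7687"
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = "password"

# Neo4jGraph reads the NEO4J_* connection settings from the environment
graph = Neo4jGraph()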

You also need to provide an OpenAI key, because this post uses OpenAI models.
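Because the setup depends on four environment variables, it can help to check that they are all present before connecting. The following is a minimal sketch using only the standard library; the helper name missing_settings is hypothetical and not part of LangChain or Neo4j.

```python
import os

# The four settings used in the setup above
REQUIRED_VARS = ("OPENAI_API_KEY", "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def missing_settings(env=os.environ, required=REQUIRED_VARS):
    """Return the names of required settings that are unset or empty."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All settings present")
```

Running such a check first turns a confusing connection error into a clear message about which setting is absent.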

2. Data ingestion

This demonstration uses the Wikipedia page of Elizabeth I. LangChain loaders make it easy to fetch and split documents from Wikipedia.

[Elizabeth I on Wikipedia]: https://en.wikipedia.org/wiki/Elizabeth_I

[LangChain loaders]: https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/

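from langchain.text_splitter import TokenTextSplitter
from langchain_community.document_loaders import WikipediaLoader

# Read the Wikipedia article
raw_documents = WikipediaLoader(query="Elizabeth I").load()
# Define the chunking strategy
text_splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=24)
# Only the first three returned documents are split and used
documents = text_splitter.split_documents(raw_documents[:3])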
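To make the chunk_size and chunk_overlap parameters concrete, here is a minimal sketch of overlapping window splitting. It is an illustration only and not TokenTextSplitter's implementation: the function name split_with_overlap is hypothetical, and whitespace-separated words stand in for real tokenizer output.

```python
def split_with_overlap(tokens, chunk_size=512, chunk_overlap=24):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Whitespace "tokens" stand in for real tokenizer output (assumption)
tokens = "a b c d e f g h i j".split()
for chunk in split_with_overlap(tokens, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts chunk_size minus chunk_overlap tokens after the previous one, so a sentence that straddles a boundary still appears in one piece in at least one neighboring chunk.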

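Now it is time to build a graph from the retrieved documents. For this purpose, we implemented an LLMGraphTransformer module that greatly simplifies building and storing a knowledge graph in a graph database.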

from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model_name="gpt-4-0125-preview")
llm_transformer = LLMGraphTransformer(llm=llm)

# Extract graph data
graph_documents = llm_transformer.convert_to_graph_documents(documents)
# Store it in Neo4j
graph.add_graph_documents(
    graph_documents,
    baseEntityLabel=True,
    include_source=True
)

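You can choose which LLM the knowledge graph generation chain uses. At present, we only support function-calling models from OpenAI and Mistral, but we plan to widen the choice of LLMs in the future. This example uses the latest GPT-4. Note that the quality of the generated graph depends heavily on the model you use, so in principle you always want the most capable one. The LLM graph transformer returns graph documents, which can be imported into Neo4j with the add_graph_documents method. The baseEntityLabel parameter assigns an additional __Entity__ label to each node, which improves indexing and query performance. The include_source parameter links nodes to their source documents, which makes it easier to trace data back and understand its context.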
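The shape of a graph document, and the effect of the two import options, can be sketched with plain dataclasses. These classes are simplified stand-ins invented for illustration, not LangChain's own types. The MENTIONS link type is assumed from the relationship type that the graph retriever query later excludes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    id: str
    labels: tuple


@dataclass(frozen=True)
class Relationship:
    source: str
    type: str
    target: str


@dataclass
class GraphDoc:
    source_text: str
    nodes: list = field(default_factory=list)
    relationships: list = field(default_factory=list)


def with_base_label(doc, base_label="__Entity__"):
    """Return a copy in which every node also carries the shared base label."""
    nodes = [
        n if base_label in n.labels else Node(n.id, n.labels + (base_label,))
        for n in doc.nodes
    ]
    return GraphDoc(doc.source_text, nodes, list(doc.relationships))


def source_links(doc, source_id="source-chunk"):
    """Build MENTIONS links from the source chunk to each extracted node."""
    return [Relationship(source_id, "MENTIONS", n.id) for n in doc.nodes]


doc = GraphDoc(
    source_text="Elizabeth I was a member of the House of Tudor.",
    nodes=[Node("Elizabeth I", ("Person",)), Node("House Of Tudor", ("Organization",))],
    relationships=[Relationship("Elizabeth I", "MEMBER_OF", "House Of Tudor")],
)
labelled = with_base_label(doc)
print(labelled.nodes[0].labels)
print([r.target for r in source_links(labelled)])
```

A shared label gives a single index something to cover across all entity types, and the source links are what allow an answer to be traced back to the text it came from.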

You can inspect the generated graph in the Neo4j Browser.

Figure: Part of the generated graph.

Note that this image shows only part of the generated graph.

3. Hybrid retrieval for RAG

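After the graph is generated, we use a hybrid retrieval approach that combines vector and keyword indexes with graph retrieval for the RAG application.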

Figure: Combining hybrid (vector + keyword) retrieval with graph retrieval.

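The figure illustrates the retrieval process. A user asks a question, which is routed to the RAG retriever. The retriever uses keyword and vector search over the unstructured text data and combines the results with information gathered from the knowledge graph. Because Neo4j supports both keyword and vector indexes, all three retrieval options can be implemented with a single database system. The data collected from these sources is passed to the LLM, which generates and returns the final answer.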

3.1 Unstructured data retriever

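You can use the Neo4jVector.from_existing_graph method to add keyword and vector retrieval over the documents. The method configures keyword and vector search indexes for a hybrid search approach, targeting nodes labeled Document. It also computes text embedding values where they are missing.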

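from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings

vector_index = Neo4jVector.from_existing_graph(
    OpenAIEmbeddings(),
    search_type="hybrid",
    node_label="Document",
    text_node_properties=["text"],
    embedding_node_property="embedding"
)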

The vector index can then be called with the similarity_search method.
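To give a feel for what a hybrid search combines, here is a toy sketch that merges a vector ranking with a keyword ranking. It shows one plausible merging scheme (scale each ranking so its best hit scores 1.0, then keep the higher score per document) and should not be read as Neo4jVector's actual implementation. All function names, documents, and embeddings below are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query, text):
    """Count how many distinct query words occur in the text."""
    words = set(text.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in words)


def normalize(scores):
    """Scale scores so the best hit is 1.0 and drop non-matching documents."""
    top = max(scores.values(), default=0.0)
    if top <= 0:
        return {}
    return {doc: s / top for doc, s in scores.items() if s > 0}


def hybrid_search(query, query_vec, docs, k=2):
    """docs maps doc id -> (text, embedding). Returns the top-k (doc id, score)."""
    vec = normalize({d: cosine(query_vec, emb) for d, (_, emb) in docs.items()})
    kw = normalize({d: keyword_score(query, text) for d, (text, _) in docs.items()})
    merged = {d: max(vec.get(d, 0.0), kw.get(d, 0.0)) for d in set(vec) | set(kw)}
    # Sort by descending score, then by id so ties are deterministic
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))[:k]


docs = {
    "d1": ("Elizabeth I was Queen of England", [1.0, 0.0]),
    "d2": ("The House of Tudor ruled England", [0.6, 0.8]),
    "d3": ("Neo4j is a graph database", [0.0, 1.0]),
}
print(hybrid_search("Tudor house", [1.0, 0.0], docs))
```

In this toy data, d1 wins on vector similarity while d2 wins on keyword overlap, so both reach the final result. That is the practical appeal of hybrid search: each signal can surface documents the other misses.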

3.2 Graph retriever

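Configuring the graph retriever, on the other hand, is more involved but offers more freedom. This example uses a full-text index to identify relevant nodes and returns their direct neighborhood.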

Figure: The graph retriever.

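The graph retriever starts by identifying the relevant entities in the input. For simplicity, we instruct the LLM to identify people, organizations, and locations. To do this, we use LCEL and the newly added with_structured_output method.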

from typing import List

from langchain_core.prompts import ChatPromptTemplate
# Matches the v0.1 docs linked above; newer releases import these from pydantic
from langchain_core.pydantic_v1 import BaseModel, Field

# Extract entities from text
class Entities(BaseModel):
    """Identifying information about entities."""

    names: List[str] = Field(
        ...,
        description="All the person, organization, or business entities that "
        "appear in the text",
    )

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are extracting organization and person entities from the text.",
        ),
        (
            "human",
            "Use the given format to extract information from the following "
            "input: {question}",
        ),
    ]
)

entity_chain = prompt | llm.with_structured_output(Entities)

Let's test it.

entity_chain.invoke({"question": "Where was Amelia Earhart born?"}).names
# ['Amelia Earhart']

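Great, now that we can detect entities in the question, let's use a full-text index to map them to the knowledge graph. First, we need to define a full-text index and a function that generates full-text queries that tolerate some misspellings; we won't go into the details here.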

from langchain_community.vectorstores.neo4j_vector import remove_lucene_chars

# Create a full-text index over the id property of all entity nodes
graph.query(
    "CREATE FULLTEXT INDEX entity IF NOT EXISTS FOR (e:__Entity__) ON EACH [e.id]"
)

def generate_full_text_query(input: str) -> str:
    """
    Generate a full-text search query for a given input string.

    This function constructs a query string suitable for a full-text search.
    It processes the input string by splitting it into words and appending a
    similarity threshold (~2 changed characters) to each word, then combines
    them using the AND operator. Useful for mapping entities from user questions
    to database values, and allows for some misspellings.
    """
    words = [el for el in remove_lucene_chars(input).split() if el]
    # Joining also handles empty input, which would otherwise raise an IndexError
    return " AND ".join(f"{word}~2" for word in words)
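The ~2 suffix asks the full-text index to accept terms within two character edits of the query word. The sketch below illustrates that idea with plain Levenshtein distance as a stand-in; the index's own fuzzy matching differs in details, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def fuzzy_match(term, candidate, max_edits=2):
    """Case-insensitive match that tolerates up to max_edits changed characters."""
    return edit_distance(term.lower(), candidate.lower()) <= max_edits


print(fuzzy_match("Elizabth", "Elizabeth"))   # one missing letter
print(fuzzy_match("Elisabet", "Elizabeth"))   # one substitution, one missing letter
print(fuzzy_match("Victoria", "Elizabeth"))   # a different name altogether
```

A threshold of two edits is forgiving enough for common typos in a user's question while still rejecting unrelated names.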

Now let's put it all together.

# Full-text index query
def structured_retriever(question: str) -> str:
    """
    Collects the neighborhood of entities mentioned
    in the question
    """
    lines = []
    entities = entity_chain.invoke({"question": question})
    for entity in entities.names:
        response = graph.query(
            """CALL db.index.fulltext.queryNodes('entity', $query, {limit:2})
            YIELD node,score
            CALL {
              // Import the matched node; a subquery cannot see outer variables otherwise
              WITH node
              MATCH (node)-[r:!MENTIONS]->(neighbor)
              RETURN node.id + ' - ' + type(r) + ' -> ' + neighbor.id AS output
              UNION
              WITH node
              MATCH (node)<-[r:!MENTIONS]-(neighbor)
              RETURN neighbor.id + ' - ' + type(r) + ' -> ' +  node.id AS output
            }
            RETURN output LIMIT 50
            """,
            {"query": generate_full_text_query(entity)},
        )
        lines.extend(el["output"] for el in response)
    # One relationship per line, including across different entities
    return "\n".join(lines)

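The structured_retriever function first detects the entities in the user's question. It then iterates over the detected entities and uses a Cypher template to retrieve the neighborhood of the relevant nodes. Let's test it!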

print(structured_retriever("Who is Elizabeth I?"))
# Elizabeth I - BORN_ON -> 7 September 1533
# Elizabeth I - DIED_ON -> 24 March 1603
# Elizabeth I - TITLE_HELD_FROM -> Queen Of England And Ireland
# Elizabeth I - TITLE_HELD_UNTIL -> 17 November 1558
# Elizabeth I - MEMBER_OF -> House Of Tudor
# Elizabeth I - CHILD_OF -> Henry Viii
# and more...
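The neighborhood lookup can be pictured without a database. The sketch below applies the same idea to an in-memory list of triples: collect every relationship that touches the entity in either direction, skip the MENTIONS links to source documents, and format each one as a line. The triples are toy data and the function name is hypothetical.

```python
# Toy (source, relationship type, target) triples, invented for illustration
TRIPLES = [
    ("Elizabeth I", "MEMBER_OF", "House Of Tudor"),
    ("Elizabeth I", "CHILD_OF", "Henry Viii"),
    ("Mary I", "SIBLING_OF", "Elizabeth I"),
    ("Document 1", "MENTIONS", "Elizabeth I"),
]


def neighborhood(entity, triples=TRIPLES, excluded=("MENTIONS",), limit=50):
    """Format every non-excluded relationship that touches the entity."""
    lines = []
    for source, rel_type, target in triples:
        if rel_type in excluded:
            continue
        # Keep outgoing and incoming relationships alike
        if entity in (source, target):
            lines.append(f"{source} - {rel_type} -> {target}")
    return lines[:limit]


print("\n".join(neighborhood("Elizabeth I")))
```

Writing each relationship as a short "source - TYPE -> target" line keeps the graph context compact and easy for an LLM to read.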

3.3 Final retriever

As mentioned at the beginning, we combine the unstructured retriever and the graph retriever to create the final context passed to the LLM.

def retriever(question: str):
    print(f"Search query: {question}")
    structured_data = structured_retriever(question)
    unstructured_data = [el.page_content for el in vector_index.similarity_search(question)]
    final_data = f"""Structured data:
{structured_data}
Unstructured data:
{"#Document ".join(unstructured_data)}
    """
    return final_data

Since we are working in Python, we can simply concatenate the outputs with an f-string.

4. Defining the RAG chain

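We have successfully implemented the retrieval component of the RAG application. Next, we introduce a prompt that uses the context provided by the integrated hybrid retriever to generate a response, which completes the RAG chain.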

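from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# _search_query is the query-rewriting step; it is not defined in this post
# (see the linked notebook)
chain = (
    RunnableParallel(
        {
            "context": _search_query | retriever,
            "question": RunnablePassthrough(),
        }
    )
    | prompt
    | llm
    | StrOutputParser()
)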

Finally, we can test our hybrid RAG implementation.

chain.invoke({"question": "Which house did Elizabeth I belong to?"})
# Search query: Which house did Elizabeth I belong to?
# 'Elizabeth I belonged to the House of Tudor.'

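This example also includes a query rewriting feature, which lets the RAG chain handle conversations with follow-up questions. Because we use vector and keyword search, follow-up questions must be rewritten into standalone questions for the search to work well.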

chain.invoke(
    {
        "question": "When was she born?",
        "chat_history": [("Which house did Elizabeth I belong to?", "House Of Tudor")],
    }
)
# Search query: When was Elizabeth I born?
# 'Elizabeth I was born on 7 September 1533.'

As you can see, "When was she born?" is first rewritten to "When was Elizabeth I born?". The rewritten query is then used to retrieve the relevant context and answer the question.
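The rewriting step itself is not shown in this post. The following is a minimal sketch of the idea under stated assumptions: the function names and prompt wording are hypothetical, and a fixed stand-in function replaces the LLM call so that the example runs offline. The notebook's actual step may differ.

```python
def format_history(chat_history):
    """Render (question, answer) pairs as alternating Human/Assistant lines."""
    lines = []
    for human, ai in chat_history:
        lines.append(f"Human: {human}")
        lines.append(f"Assistant: {ai}")
    return "\n".join(lines)


def build_search_query(inputs, rewrite):
    """Return a standalone search query for the retriever.

    rewrite is any callable that maps a prompt string to a rewritten question;
    in a real chain it would wrap an LLM call (assumption).
    """
    question = inputs["question"]
    history = inputs.get("chat_history")
    if not history:
        # No conversation so far: the question is already standalone
        return question
    prompt = (
        "Rephrase the follow-up question as a standalone question.\n"
        f"Chat history:\n{format_history(history)}\n"
        f"Follow-up question: {question}\n"
        "Standalone question:"
    )
    return rewrite(prompt)


def fake_rewrite(prompt):
    """Stand-in for an LLM so the sketch runs offline."""
    return "When was Elizabeth I born?"


print(build_search_query({"question": "Which house did Elizabeth I belong to?"}, fake_rewrite))
print(build_search_query(
    {
        "question": "When was she born?",
        "chat_history": [("Which house did Elizabeth I belong to?", "House Of Tudor")],
    },
    fake_rewrite,
))
```

Without chat history the question passes through unchanged, which avoids a needless model call. With history, the pronoun "she" is resolved before the query reaches the vector and keyword indexes, neither of which could match it to Elizabeth I.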

5. Enhancing RAG applications with ease

[GitHub]: https://github.com/tomasonjo/blogs/blob/master/llm/enhancing_rag_with_graph.ipynb

Editor: 武曉燕. Source: Python學研大本營
