Comparing Four Methods of Building Knowledge Graphs from Text with LLMs
我們?cè)谝郧暗奈恼轮幸呀?jīng)介紹了使用大語言模型將非結(jié)構(gòu)化文本轉(zhuǎn)換為知識(shí)圖譜。但是對(duì)于知識(shí)圖譜的創(chuàng)建是一個(gè)很復(fù)雜的過程,比如需要對(duì)屬性增加限制,創(chuàng)建符合特定主題/模式的圖譜,并且有時(shí)文檔非常大,無法作為單個(gè)提示處理,所以在切分后的提示中創(chuàng)建的圖譜需要前后一致。
所以本文將介紹和比較使用LLM轉(zhuǎn)換非結(jié)構(gòu)化文本的四種方法,這些方法在不同的場(chǎng)景中都可能會(huì)用到。
Using the LLM's pre-trained ontologies
LLMs appear to have been pre-trained on the various standard ontologies, such as SCHEMA.ORG, FOAF, SKOS, RDF, RDFS, and OWL. So a suitable system prompt directing the model to use these pre-trained ontologies, combined with a user prompt containing the unstructured text, is enough to obtain the desired graph structure.
The prompt is as follows:
Translate the following user text to an RDF graph using the RDF, RDFS, and OWL ontologies formatted as TTL.
Use the prefix ex: with IRI <http://example.com/> for any created entities.
The LLM's response is as follows:
@prefix ex: <http://example.com/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

ex:PeterJones rdf:type ex:Person ;
    rdf:label "Peter Jones" .

ex:BethHampton rdf:type ex:Person ;
    rdf:label "Beth Hampton" .

ex:BrunelUniversity rdf:type ex:University ;
    rdf:label "Brunel University" .

ex:George rdf:type ex:Person ;
    rdf:label "George" .

ex:James rdf:type ex:Person ;
    rdf:label "James" .

ex:Harold rdf:type ex:Person ;
    rdf:label "Harold" .

ex:PeterJones ex:studiedAt ex:BrunelUniversity ;
    ex:marriedTo ex:BethHampton ;
    ex:hasChild ex:George, ex:James, ex:Harold .
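For reference, a response like this can be obtained with the OpenAI Python client; the following is a minimal sketch (the client code is not part of the original article, and an OPENAI_API_KEY environment variable is assumed):

from openai import OpenAI

client = OpenAI()

# System prompt exactly as above; the user prompt carries the unstructured text.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content":
            "Translate the following user text to an RDF graph using the RDF, RDFS, "
            "and OWL ontologies formatted as TTL. "
            "Use the prefix ex: with IRI <http://example.com/> for any created entities."},
        {"role": "user", "content":
            "Peter Jones and Beth Hampton both studied at Brunel University, then got "
            "married, and have three children: George, James, and Harold."},
    ],
)
print(response.choices[0].message.content)  # the TTL graph shown above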
gpt-3.5-turbo has no problem inventing new properties here (ex:studiedAt, ex:marriedTo, ex:hasChild). If we change the system prompt to use only SCHEMA.ORG:
Translate the following user text to an RDF graph using the SCHEMA.ORG ontologies formatted as TTL.
Use the prefix ex: with IRI <http://example.com/> for any created entities.
we get the corresponding output:
@prefix ex: <http://example.com/> .
@prefix schema: <https://schema.org/> .

ex:PeterJones a schema:Person ;
    schema:spouse ex:BethHampton ;
    schema:alumniOf ex:BrunelUniversity .

ex:BethHampton a schema:Person ;
    schema:spouse ex:PeterJones .

ex:BrunelUniversity a schema:EducationalOrganization .

ex:George a schema:Person .
ex:James a schema:Person .
ex:Harold a schema:Person .

ex:PeterJones schema:children ex:George, ex:James, ex:Harold .
ex:BethHampton schema:children ex:George, ex:James, ex:Harold .
This is the simplest method: it requires no extra work on our side, relies entirely on what the LLM has already been trained on, and still produces good output. The prompt is also very concise (~41 tokens), so it consumes few tokens.
However, the conversion is limited to the "standard" ontologies the LLM was pre-trained on. If you ask ChatGPT which standard ontologies it was trained on, it will not give you a useful answer, so for us this is a black box. In addition, the entities generated by the text-to-graph conversion still need to be aligned across graphs.
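Since the model returns free-form text, it is also worth checking that the response parses as valid Turtle before using it. A small sketch, assuming rdflib is installed and `ttl` holds the LLM response:

from rdflib import Graph

g = Graph()
try:
    g.parse(data=ttl, format="turtle")  # raises on malformed Turtle
    print(f"Parsed {len(g)} triples")
except Exception as e:
    print(f"Invalid Turtle returned by the LLM: {e}")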
Including the ontology in the LLM prompt
In most cases we want to use a non-standard or custom ontology. The LLM is unlikely to have been pre-trained on such an ontology, so we need to include the complete ontology in the prompt.
Translate the following user text to an RDF graph using the following schema1: <http://inova8.com/schema/1/> ontologies formatted as TTL.
Use the prefix ex: with IRI <http://example.com/> for any created entities.
Only use pre-defined classes and properties from the schema1: <http://inova8.com/schema/1/> ontology.
Use the properties and classes in the schema1: ontology.
Include individuals, their data, and relationships.
... the full ontology in TTL format ...
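Because the ontology is too long to paste by hand each time, the system prompt is best assembled programmatically. A sketch, assuming the custom ontology sits in a local file named schema1.ttl (a hypothetical name):

# Build the long system prompt: fixed instructions plus the full ontology text.
instructions = """Translate the following user text to an RDF graph using the following schema1: <http://inova8.com/schema/1/> ontologies formatted as TTL.
Use the prefix ex: with IRI <http://example.com/> for any created entities.
Only use pre-defined classes and properties from the schema1: <http://inova8.com/schema/1/> ontology."""

with open("schema1.ttl", encoding="utf-8") as f:
    ontology_ttl = f.read()

system_prompt = instructions + "\n\n" + ontology_ttl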
We also need detailed instructions about the conversion, which pushes the prompt up to ~3,567 tokens. With the same input text as before, the LLM still converts it well:
ex:PeterJones rdf:type schema1:CC ;
    :dc "Peter" ;
    :de "Jones" ;
    :oa ex:BrunelUniversity ;
    :oh ex:BethHampton ;
    :of ex:business .

ex:BethHampton rdf:type schema1:CC ;
    :dc "Beth" ;
    :de "Hampton" ;
    :oa ex:BrunelUniversity ;
    :oh ex:PeterJones ;
    :of ex:business .

ex:BrunelUniversity rdf:type schema1:CA ;
    rdfs:label "Brunel University" .

ex:George rdf:type schema1:CC ;
    rdfs:label "George" ;
    :od ex:PeterJones ;
    :od ex:BethHampton .

ex:James rdf:type schema1:CC ;
    rdfs:label "James" ;
    :od ex:PeterJones ;
    :od ex:BethHampton .

ex:Harold rdf:type schema1:CC ;
    rdfs:label "Harold" ;
    :od ex:PeterJones ;
    :od ex:BethHampton .

ex:PeterJones :oh ex:BethHampton .
ex:BethHampton :oh ex:PeterJones .
When the custom ontology is included in the prompt, the LLM appears to understand an ontology expressed in RDF, RDFS, and OWL, and can convert unstructured text into that custom ontology.
The downside is that the prompt is now very long, because the system prompt carries a large token overhead. This raises cost and slows the response, since processing time grows with the number of tokens. And the resulting entities still need alignment.
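The prompt-size claim is easy to reproduce with tiktoken (our way of counting; the article does not say how the ~3,567 figure was obtained):

import tiktoken

# Count the tokens of the ontology-laden system prompt assembled above.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(len(enc.encode(system_prompt)))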
Fine-tuning with the ontology
The main problems with the first two methods are that we are either limited to pre-trained ontologies or pay a heavy overhead to include a custom ontology in every prompt. The alternative is to fine-tune the LLM. Fine-tuning an LLM with a KG is straightforward, because a graph is in essence a set of triples:
{:subject :predicate :object}
These can be mapped into training prompts, and messages like the following can be generated automatically from the graph:
{"messages": [
{"role": "system", "content": "Complete the following graph edge"},
{"role": "user", "content": "What is <:subject> <predicate>?"},
{"role": "assistant", "content": " <:subject> is <:predicate> <:object>."}]
}
…
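A sketch of how such edge-completion messages might be generated mechanically with rdflib (file names are illustrative, not from the article):

import json
from rdflib import Graph

g = Graph()
g.parse("schema1-data.ttl", format="turtle")  # hypothetical source graph

with open("edges.jsonl", "w", encoding="utf-8") as out:
    for s, p, o in g:  # every triple becomes one training message
        record = {"messages": [
            {"role": "system", "content": "Complete the following graph edge"},
            {"role": "user", "content": f"What is <{s}> <{p}>?"},
            {"role": "assistant", "content": f"<{s}> is <{p}> <{o}>."},
        ]}
        out.write(json.dumps(record) + "\n")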
The problem then becomes one of training the LLM to translate one language (unstructured English text) into another (a highly structured RDF graph that uses the semantics of the fine-tuned ontology).
{"messages": [
{"role": "system", "content": "Translate the following user text to an RDF graph using the Schema1 ontology."},
{"role": "user", "content": "{example unstructured text}"},
{"role": "assistant", "content": "{RDF graph using custom Schema1 ontology semantics}"}]
}
…
So the task reduces to creating a representative fine-tuning set (100-200 examples) of text-to-RDF conversions. But how should we design those prompts? Let's just ask the LLM directly; after all, it knows itself best :)
I want to fine-tune an LLM on this Schema1 ontology.
This training aims to take an example user text and the assistant translates it into an RDF graph based on the Schema1 ontology.
Please create a comprehensive set of 50 example system, user, and assistant messages in JSONL message conversational chat format for fine-tuning an LLM to translate text to Schema1 ontology RDF.
The system content should be the instruction to translate the user text to the assistant RDF graph response using the Schema1 ontology.
The user text should be examples that cover all the concepts within the Schema1 ontology.
The assistant content should be an RDF graph using the Schema1 ontology of the translation of the user text examples.
Use the prefix ex: with IRI <http://example.com/> for any created entities.
Use the message format as follows:
{"messages": [
{"role": "system", "content": "<system_content>"},
{"role": "user", "content": "<user_content>},
{"role": "assistant", "content": "<assistant_content>"}
]}
gpt-3.5-turbo responds in the correct format:
{"messages": [
    {"role": "system", "content": "Translate the following user text to an RDF graph using the Schema1 ontology."},
    {"role": "user", "content": "John Smith is a professor who graduated from a reputed college."},
    {"role": "assistant", "content": "
        @prefix ex: <http://example.com/> .
        @prefix schema1: <http://inova8.com/schema/1/> .
        ex:JohnSmith rdf:type schema1:CC ;
            schema1:dc \"John\" ;
            schema1:de \"Smith\" ;
            schema1:of schema1:professor ;
            schema1:oa ex:ReputedCollege .
        ex:ReputedCollege rdf:type schema1:CA ."}]}
… and so on
Let's try it out. We used this approach to generate a reasonably sized fine-tuning dataset of 160 conversions, and fine-tuned the LLM for both 3 and 10 epochs.
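The fine-tuning step itself can be run through the OpenAI fine-tuning API; a minimal sketch (not from the article), assuming the 160 examples are saved in train.jsonl:

from openai import OpenAI

client = OpenAI()

# Upload the JSONL training set, then start the fine-tuning job.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={"n_epochs": 3},  # repeated with n_epochs=10 for comparison
)
print(job.id)

Once the job completes, the tuned model can be prompted as concisely as below.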
System-prompt (34 tokens):
Translate the following user text to an RDF graph using the Schema1 ontology.
Use the prefix ex: with IRI <http://example.com/> for any created entities.
User-prompt (28 tokens):
Peter Jones and Beth Hampton both studied at Brunel University, then got married, and have three children: George, James, and Harold.
The LLM's result:
@prefix ex: <http://example.com/> .

ex:PeterJones a schema1:CC ;
    schema1:dc "Peter" ;
    schema1:de "Jones" ;
    schema1:of schema1:student .

ex:BethHampton a schema1:CC ;
    schema1:dc "Beth" ;
    schema1:de "Hampton" ;
    schema1:of schema1:student .

ex:JonesHamptonMarriage a schema1:CG ;
    schema1:og ex:PeterJones ;
    schema1:oh ex:BethHampton .

ex:GeorgeJones a schema1:CD ;
    schema1:dc "George" ;
    schema1:od ex:PeterJones ;
    schema1:of ex:BethHampton .

ex:JamesJones a schema1:CD ;
    schema1:dc "James" ;
    schema1:od ex:PeterJones ;
    schema1:of ex:BethHampton .

ex:HaroldJones a schema1:CD ;
    schema1:dc "Harold" ;
    schema1:od ex:PeterJones ;
    schema1:of ex:BethHampton .
As you can see, it works rather well: the LLM has effectively taught itself, and all we did was ask. There also appears to be no noticeable difference in conversion quality between the 3-epoch and the 10-epoch fine-tuned models.
An LLM can thus be fine-tuned with a custom ontology, and the prompt for the fine-tuned model can be very concise (~41 tokens), so most of the token cost goes to the unstructured text being converted.
However, fine-tuning calls for metrics that measure how accurately the LLM has learned the custom ontology and how accurate the graph conversions are; judging quality is itself a question that needs careful study. And the generated graphs still need entity alignment.
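As an illustration of what such a metric could look like (our sketch, not the article's method), generated output can be compared with a hand-built reference graph at the triple level:

from rdflib import Graph

def triple_prf(reference_ttl: str, generated_ttl: str):
    # Treat each graph as a set of (s, p, o) triples and compare the sets.
    ref = set(Graph().parse(data=reference_ttl, format="turtle"))
    gen = set(Graph().parse(data=generated_ttl, format="turtle"))
    tp = len(ref & gen)
    precision = tp / len(gen) if gen else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Caveat: exact matching penalizes differently named IRIs and blank nodes,
# which is exactly why entity alignment matters.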
Improving the fine-tuning prompts
Often a single ontology or schema is not enough to capture the full semantics of a text, and two or more ontologies are needed. The good news is that after fine-tuning the LLM has not forgotten its pre-trained ontologies, so we can combine them.
Translate the following user text to an RDF graph using both the FOAF, and Schema1 ontologies.
Use the prefix ex: with IRI <http://example.com/> for any created entities.
Peter Jones and Beth Hampton both studied at Brunel University, then got married, and have three children: George, James, and Harold.
Peter, email address peter.jones@inova8.com, has been friends with Paul Williams because of their shared interest in ornithology.
The result is as follows:
@prefix ex: <http://example.com/> .

ex:PeterJones a schema1:Thing, foaf:Person ;
    schema1:email "peter.jones@inova8.com" ;
    foaf:name "Peter Jones" ;
    foaf:interest ex:Ornithology .

ex:BethHampton a schema1:Thing, foaf:Person ;
    foaf:name "Beth Hampton" .

ex:GeorgeJones a schema1:Thing, foaf:Person ;
    foaf:name "George Jones" ;
    foaf:maker ex:PeterJones .

ex:JamesJones a schema1:Thing, foaf:Person ;
    foaf:name "James Jones" ;
    foaf:maker ex:PeterJones .

ex:HaroldJones a schema1:Thing, foaf:Person ;
    foaf:name "Harold Jones" ;
    foaf:maker ex:PeterJones .

ex:BrunelUniversity a schema1:Thing, foaf:Document ;
    foaf:name "Brunel University" .

ex:JonesHamptonMarriage a schema1:Thing ;
    schema1:spouse ex:PeterJones, ex:BethHampton .

ex:JonesHamptonFamily a schema1:Thing ;
    schema1:parent ex:PeterJones, ex:BethHampton ;
    schema1:child ex:GeorgeJones, ex:JamesJones, ex:HaroldJones .

ex:PeterJones foaf:knows ex:PaulWilliams .

ex:PaulWilliams a schema1:Thing, foaf:Person ;
    foaf:name "Paul Williams" .
As you can see, the response contains not only terms from our fine-tuned ontology (schema1:) but also terms from the model's pre-training (foaf:).
There is a problem, though: when the same concept overlaps between ontologies, we need a way to control which one the LLM uses in its response.
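One possible mitigation, which we have not tested, is to state the preference explicitly in the system prompt, for example:

Translate the following user text to an RDF graph using both the FOAF, and Schema1 ontologies.
When both ontologies define an equivalent class or property, prefer the Schema1 term.
Use the prefix ex: with IRI <http://example.com/> for any created entities.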
Summary
To compare the methods above:

- Pre-trained ontologies: no setup and concise prompts (~41 tokens), but limited to the standard ontologies the LLM was trained on.
- Ontology included in the prompt: supports custom ontologies, but the prompt grows to ~3,567 tokens, raising cost and latency.
- Fine-tuning with the ontology: concise prompts (~41 tokens) plus custom-ontology support, at the price of building a training set and accuracy metrics.
- Fine-tuning combined with pre-trained ontologies: captures richer semantics, but overlapping concepts between ontologies must be controlled.

In all four cases the generated entities still need alignment.
LLMs can convert unstructured text into RDF graphs effectively. A model fine-tuned on a custom ontology is far more token-efficient, because it avoids the overhead of supplying the full ontology in every conversion request; when many texts need converting, this lowers conversion costs in production.
We have not yet covered how to build an "accuracy" test for text-to-KG conversion, nor how to align entities after conversion; we will return to these topics in future articles.