
A Brief Look at a Three-Stage LLM-Based Method for Automatic Knowledge Graph Construction

Published 2024-11-22 10:37

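The paper points out that earlier methods for generating triplets with an LLM required a predefined schema. When the schema is large or complex, it can easily exceed the LLM's context window. In some cases, moreover, no fixed predefined schema is available at all.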

Method

I. The EDC Framework

Figure: The EDC framework

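The paper proposes a three-stage framework called Extract-Define-Canonicalize (EDC) for knowledge graph construction: first perform open information extraction, then define the schema, and finally canonicalize it.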

1. Open Information Extraction: An LLM is prompted with a few examples (few-shot) to identify and extract relational triplets ([subject, relation, object]) from the input text, without depending on any specific schema.

Example OIE prompt:

Given a piece of text, extract relational triplets in
the form of [Subject, Relation, Object] from it.
Here are some examples:
Example 1:
Text: The 17068.8 millimeter long ALCO RS-3
has a diesel-electric transmission.
Triplets: [[‘ALCO RS-3’, ‘powerType’, ‘Diesel-electric transmission’], [‘ALCO RS-3’, ‘length’,
‘17068.8 (millimetres)’]] ...
Now please extract triplets from the following
text: Alan Shepard was born on Nov 18, 1923
and selected by NASA in 1959. He was a member of the Apollo 14 crew.

Extracted triplets: [‘Alan Shepard’, ‘bornOn’, ‘Nov 18, 1923’], [‘Alan Shepard’, ‘participatedIn’, ‘Apollo 14’]
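The extraction step can be sketched in Python as a prompt builder plus a tolerant parser. This is a minimal illustration, not the authors' implementation: the helper names `build_oie_prompt` and `parse_triplets` are hypothetical, no LLM is actually called, and the reply string is hard-coded under the assumption that the model answers with a Python-style list of triplets.

```python
import ast

# Few-shot example copied from the prompt above.
FEW_SHOT = [
    (
        "The 17068.8 millimeter long ALCO RS-3 has a diesel-electric transmission.",
        [["ALCO RS-3", "powerType", "Diesel-electric transmission"],
         ["ALCO RS-3", "length", "17068.8 (millimetres)"]],
    ),
]


def build_oie_prompt(text, examples=FEW_SHOT):
    """Assemble a few-shot open information extraction prompt (hypothetical helper)."""
    lines = [
        "Given a piece of text, extract relational triplets in the form of "
        "[Subject, Relation, Object] from it.",
        "Here are some examples:",
    ]
    for i, (ex_text, ex_triplets) in enumerate(examples, start=1):
        lines.append(f"Example {i}:")
        lines.append(f"Text: {ex_text}")
        lines.append(f"Triplets: {ex_triplets}")
    lines.append("Now please extract triplets from the following text: " + text)
    return "\n".join(lines)


def parse_triplets(llm_output):
    """Parse a list of triplets from the reply; malformed entries are dropped."""
    # Assumption: curly quotes only ever act as string delimiters in the reply.
    cleaned = llm_output.strip().replace("‘", "'").replace("’", "'")
    try:
        parsed = ast.literal_eval(cleaned)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, (list, tuple)):
        return []
    if len(parsed) == 3 and all(isinstance(x, str) for x in parsed):
        parsed = [parsed]  # the reply was a single bare triplet
    return [
        tuple(item)
        for item in parsed
        if isinstance(item, (list, tuple))
        and len(item) == 3
        and all(isinstance(x, str) for x in item)
    ]


text = ("Alan Shepard was born on Nov 18, 1923 and selected by NASA in 1959. "
        "He was a member of the Apollo 14 crew.")
prompt = build_oie_prompt(text)
# Hard-coded stand-in for the model's reply.
reply = ("[‘Alan Shepard’, ‘bornOn’, ‘Nov 18, 1923’], "
         "[‘Alan Shepard’, ‘participatedIn’, ‘Apollo 14’]")
triplets = parse_triplets(reply)
```

Parsing with `ast.literal_eval` rather than `eval` keeps the step safe when the model returns something unexpected: anything that is not a literal list of three strings is simply ignored.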

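2. Schema Definition: The LLM is prompted to write a natural-language definition for each schema component extracted in the first stage (such as entity types and relation types). These definitions are passed to the next stage as side information for canonicalization.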

Example Schema Definition prompt:

Given a piece of text and a list of relational triplets
extracted from it, write a definition for each relation present.
Example 1:
Text: The 17068.8 millimeter long ALCO RS-3
has a diesel-electric transmission.
Triplets: [[‘ALCO RS-3’, ‘powerType’, ‘Diesel-electric transmission’], [‘ALCO RS-3’, ‘length’,
‘17068.8 (millimetres)’]]
Definitions:
powerType: The subject entity uses the type of
power or energy source specified by the object
entity.
...
Now write a definition for each relation present
in the triplets extracted from the following text:
Text: Alan Shepard was an American who was
born on Nov 18, 1923 in New Hampshire, was
selected by NASA in 1959, was a member of the
Apollo 14 crew and died in California
Triplets: [[‘Alan Shepard’, ‘bornOn’, ‘Nov 18,
1923’], [‘Alan Shepard’, ‘participatedIn’, ‘Apollo 14’]]

Result: (bornOn: The subject entity was born on the date specified by the object entity.) and (participatedIn: The subject entity took part in the event or mission specified by the object entity.)
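A small sketch of how the definition stage's output might be collected. The helper names are hypothetical, and the code assumes the model returns one `relation: definition` pair per line, which is only one of several plausible reply formats.

```python
def relations_in(triplets):
    """Return the distinct relations in order of first appearance."""
    seen = []
    for _, relation, _ in triplets:
        if relation not in seen:
            seen.append(relation)
    return seen


def parse_definitions(llm_output, relations):
    """Parse 'relation: definition' lines, keeping only the requested relations."""
    definitions = {}
    for line in llm_output.splitlines():
        name, sep, definition = line.partition(":")
        name = name.strip()
        if sep and name in relations and definition.strip():
            definitions[name] = definition.strip()
    return definitions


triplets = [
    ("Alan Shepard", "bornOn", "Nov 18, 1923"),
    ("Alan Shepard", "participatedIn", "Apollo 14"),
]
# Hard-coded stand-in for the model's reply.
reply = (
    "bornOn: The subject entity was born on the date specified by the object entity.\n"
    "participatedIn: The subject entity took part in the event or mission "
    "specified by the object entity."
)
relations = relations_in(triplets)
definitions = parse_definitions(reply, relations)
# Relations the model failed to define could be re-prompted or dropped.
missing = [r for r in relations if r not in definitions]
```

Checking for `missing` relations matters because the next stage embeds the definitions: a relation without a definition has nothing to compare against.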

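3. Schema Canonicalization: The third stage refines the open knowledge graph (KG) into a canonical form, removing redundancy and ambiguity. Each schema component's definition is first embedded with a sentence transformer. Canonicalization then proceeds in one of two ways, depending on whether a target schema is available: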

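  • Target Alignment: when a predefined target schema exists, the components of the target schema most closely related to each element are identified for canonicalization. The LLM then assesses whether each candidate transformation is feasible, which guards against over-generalization.
  • Self Canonicalization: when no predefined target schema exists, the goal is to merge semantically similar components (by vector similarity) and canonicalize them to a single representation. Merge candidates are found by vector search and then verified by the LLM. Unlike target alignment, components judged non-transformable are added to the canonical schema, thereby expanding it.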
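The two modes can be sketched with one function. This is an illustration under several assumptions: the two-dimensional vectors are toy stand-ins for sentence-transformer embeddings of the definitions, `verify` is a hypothetical callback standing in for the LLM check, and returning `None` for an unmatched relation in target alignment is this sketch's own choice for "not added to the schema".

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_candidates(query_vec, schema_vecs, k=2):
    """Rank schema components by similarity of their definition embeddings."""
    ranked = sorted(
        schema_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


def canonicalize(relation, query_vec, schema_vecs, verify, has_target_schema, k=2):
    """Return the canonical relation name, or None if nothing fits a fixed schema."""
    for candidate in top_k_candidates(query_vec, schema_vecs, k):
        if verify(relation, candidate):  # stand-in for LLM verification
            return candidate
    if has_target_schema:
        return None  # assumption: unmatched relations are left out
    schema_vecs[relation] = query_vec  # self canonicalization expands the schema
    return relation


target = {"mission": [1.0, 0.0], "season": [0.0, 1.0]}  # toy embeddings
aligned = canonicalize(
    "participatedIn", [0.9, 0.1], target,
    verify=lambda rel, cand: cand == "mission", has_target_schema=True,
)

open_schema = dict(target)
added = canonicalize(
    "almaMater", [0.5, 0.5], open_schema,
    verify=lambda rel, cand: False, has_target_schema=False,
)
```

Here `aligned` resolves to `mission`, while `almaMater` finds no acceptable merge and is appended to `open_schema`, which is the behavioral difference between the two modes described above.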

Example Schema Canonicalization prompt:

Given a piece of text, a relational triplet extracted
from it, and the definition of the relation in it,
choose the most appropriate relation to replace it
in this context if there is any.
Text: Alan Shepard was born on Nov 18, 1923
and selected by NASA in 1959. He was a member
of the Apollo 14 crew.
Triplets: [‘Alan Shepard’, ‘participatedIn’,
‘Apollo 14’]
Definition of ‘participatedIn’: The subject entity took part in the event or mission specified by the
object entity.
Choices:
A. ‘mission’: The subject entity participated in
the event or operation specified by the object entity.
B. ‘season’: The subject entity participated in the
season of a series specified by the object entity.
...
F. None of the above

Result: [‘Alan Shepard’, ‘birthDate’, ‘Nov 18, 1923’] and [‘Alan Shepard’, ‘mission’, ‘Apollo 14’], which together form the canonicalized knowledge graph.
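The multiple-choice verification can be sketched as a prompt builder and an answer mapper. Both helper names are hypothetical, the sketch assumes fewer than 26 candidates, and it assumes the model replies with a single letter, which a real system would need to check more carefully.

```python
import string


def build_choice_prompt(text, triplet, definition, candidates):
    """Build the multiple-choice prompt; candidates is a list of (relation, definition)."""
    relation = triplet[1]
    lines = [
        "Given a piece of text, a relational triplet extracted from it, and the "
        "definition of the relation in it, choose the most appropriate relation "
        "to replace it in this context if there is any.",
        f"Text: {text}",
        f"Triplet: {list(triplet)}",
        f"Definition of '{relation}': {definition}",
        "Choices:",
    ]
    letters = string.ascii_uppercase
    for letter, (name, cand_def) in zip(letters, candidates):
        lines.append(f"{letter}. '{name}': {cand_def}")
    lines.append(f"{letters[len(candidates)]}. None of the above")
    return "\n".join(lines)


def apply_choice(triplet, candidates, answer):
    """Map the answer letter to a canonical triplet; None means no replacement."""
    letter = answer.strip()[:1].upper()
    index = string.ascii_uppercase.find(letter) if letter else -1
    if 0 <= index < len(candidates):
        subject, _, obj = triplet
        return (subject, candidates[index][0], obj)
    return None  # "None of the above" or an unparseable answer


candidates = [
    ("mission", "The subject entity participated in the event or operation "
                "specified by the object entity."),
    ("season", "The subject entity participated in the season of a series "
               "specified by the object entity."),
]
triplet = ("Alan Shepard", "participatedIn", "Apollo 14")
prompt = build_choice_prompt(
    "Alan Shepard was born on Nov 18, 1923 and selected by NASA in 1959. "
    "He was a member of the Apollo 14 crew.",
    triplet,
    "The subject entity took part in the event or mission specified by the object entity.",
    candidates,
)
canonical = apply_choice(triplet, candidates, "A")  # hard-coded stand-in reply
```

Keeping an explicit "None of the above" option is what lets the LLM reject every candidate, so a poor nearest neighbor in embedding space does not force an over-general replacement.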

II. EDC+R: Iteratively Refining EDC with a Schema Retriever

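EDC+R improves on EDC by adding an iterative refinement step to further raise the quality of the knowledge graph. The process resembles RAG: previously extracted triplets and the relevant parts of the schema are supplied in the prompt of the initial extraction stage. The goal is to use the data produced by the EDC run to improve the quality of the extracted triplets.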

The refinement process consists of two main elements:

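  • Candidate entities: entities extracted by EDC in the previous iteration, plus entities extracted from the text by the LLM.
  • Candidate relations: relations extracted by EDC in the previous iteration, plus relations retrieved from the predefined or canonicalized schema by a trained Schema Retriever.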
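How the two candidate lists might be merged into a hint for the next extraction prompt is sketched below. The function names, the hint wording, and the `selectedByNasa` relation are all hypothetical; the source only states which sources feed each list.

```python
def merge_unique(*groups):
    """Concatenate groups, dropping duplicates while keeping first-seen order."""
    merged = []
    for group in groups:
        for item in group:
            if item not in merged:
                merged.append(item)
    return merged


def build_refinement_hint(previous_triplets, llm_entities, retrieved_relations):
    """Combine previous EDC output with freshly extracted and retrieved candidates."""
    prev_entities = [e for s, _, o in previous_triplets for e in (s, o)]
    prev_relations = [r for _, r, _ in previous_triplets]
    entities = merge_unique(prev_entities, llm_entities)
    relations = merge_unique(prev_relations, retrieved_relations)
    # Semicolons separate items because entities such as dates contain commas.
    return (
        "Candidate entities: " + "; ".join(entities) + "\n"
        "Candidate relations: " + "; ".join(relations)
    )


previous = [
    ("Alan Shepard", "birthDate", "Nov 18, 1923"),
    ("Alan Shepard", "mission", "Apollo 14"),
]
hint = build_refinement_hint(
    previous,
    llm_entities=["Alan Shepard", "NASA"],             # stand-in for LLM entity extraction
    retrieved_relations=["mission", "selectedByNasa"],  # stand-in for the Schema Retriever
)
```

The hint would be appended to the extraction prompt of the next iteration, giving the model a chance to recover triplets it missed the first time, such as one involving NASA in this example.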

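Role of the Schema Retriever: The Schema Retriever is trainable. It projects schema components and the input text into a shared vector space so that cosine similarity captures the relevance between the two, that is, how likely a schema component is to be expressed in the input text.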

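The training dataset consists of texts paired with the defined relations that occur in them. The model being fine-tuned is an embedding model, and the objective is to distinguish the correct relations associated with a given text from other, unrelated ones.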
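One common way to express such an objective is a contrastive (InfoNCE-style) loss over cosine similarities. The sketch below assumes that form for illustration only, since the text above does not name the exact loss; the toy vectors and the `temperature` value are also assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def info_nce_loss(text_vec, positive_vec, negative_vecs, temperature=0.05):
    """Negative log-probability of the positive relation among all candidates."""
    logits = [cosine(text_vec, positive_vec) / temperature]
    logits += [cosine(text_vec, neg) / temperature for neg in negative_vecs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


text_vec = [1.0, 0.0]  # toy embedding of the input text
# The correct relation is close to the text, so the loss is near zero.
good = info_nce_loss(text_vec, positive_vec=[1.0, 0.0], negative_vecs=[[0.0, 1.0]])
# The correct relation is far from the text, so the loss is large.
bad = info_nce_loss(text_vec, positive_vec=[0.0, 1.0], negative_vecs=[[1.0, 0.0]])
```

Minimizing a loss of this shape pulls a text's embedding toward the relations it actually expresses and pushes it away from the others, which is what lets cosine similarity serve as the retrieval score at inference time.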

Figure: Results

References

  • Paper: Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction, https://arxiv.org/pdf/2404.03868v2
  • Code: https://github.com/clear-nus/edc


This article is reposted from the WeChat public account 大模型自然語言處理. Author: 余俊暉

Original link: https://mp.weixin.qq.com/s/RITUjcKiEy66SL3PK8mFGg

© Copyright belongs to the author. If you repost this article, please credit the source; otherwise legal action will be pursued.
Last modified 2024-11-28 18:51:17