

NLP and Python: A Hands-On Case Study in Building a Knowledge Graph

Author: MobotStone 2023-04-26 06:22:45


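Overview

It has been a week or two since I last took notes. Today I will share what I learned in practice over those two weeks: how to build a knowledge graph with Python and natural language processing.

A network graph is a mathematical structure that represents relationships between points, and it can be visualized as an undirected or directed graph. It is a form of database that maps linked nodes.

A knowledge base is a centralized repository of information from different sources, such as Wikipedia or Baidu Baike.

A knowledge graph is a knowledge base that uses a graph data model. Put simply, it is a particular type of network graph that shows the relationships between real-world entities, facts, concepts, and events. Google first used the term "Knowledge Graph" in 2012 to introduce its model.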
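To make the definition concrete, here is a minimal sketch of a knowledge graph stored as (subject, relation, object) triples and indexed as a directed graph. The three facts and the helper name `build_graph` are illustrative assumptions of mine, not data produced later in this tutorial.

```python
from collections import defaultdict

# Hypothetical facts, written as (subject, relation, object) triples.
triples = [
    ("Google", "introduced", "Knowledge Graph"),
    ("Knowledge Graph", "is a", "knowledge base"),
    ("Wikipedia", "is a", "knowledge base"),
]

def build_graph(triples):
    """Index triples as a directed graph: node -> list of (relation, neighbor)."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
    return dict(graph)

graph = build_graph(triples)
for node, edges in graph.items():
    for rel, neighbor in edges:
        print(f"{node} --[{rel}]--> {neighbor}")
```

Only nodes with outgoing edges become keys, which is what makes the structure directed: "knowledge base" is a target here but never a source.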

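Today, most companies are building data lakes: central repositories that collect raw data of every type (structured and unstructured) from different sources. People therefore need tools to make sense of all this disparate information. Knowledge graphs are growing in popularity because they simplify the exploration and discovery of large datasets. Simply put, a knowledge graph connects data with its associated metadata, so it can be used to build a comprehensive representation of an organization's information assets. For example, a knowledge graph could replace all the files you would otherwise have to go through to find a specific piece of information.

Knowledge graphs are considered part of the natural language processing landscape because building "knowledge" requires a process called "semantic enrichment". Since nobody wants to do that by hand, we need machines and NLP algorithms to do it for us.

I will parse Wikipedia and extract one page to use as the dataset for this tutorial (link below).

Russo-Ukrainian War - Wikipedia. The Russo-Ukrainian War is an ongoing international conflict between Russia, together with Russian-backed separatists, and ...... en.wikipedia.org

In particular, I will go through:

Setup: import the packages and read the data by scraping the web with the Wikipedia API.
NLP with SpaCy: sentence segmentation, POS tagging, dependency parsing, and named entity recognition on the text.
Extraction of entities and their relations: use the Textacy library to identify entities and the relations between them.
Network graph building: use the NetworkX library to create and manipulate graph structures.
Timeline graph: use the DateParser library to parse date information and draw a timeline.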

Setup

First, import the following libraries:

## for data
import pandas as pd  #1.1.5
import numpy as np  #1.21.0

## for plotting
import matplotlib.pyplot as plt  #3.3.2

## for text
import wikipediaapi  #0.5.8
import nltk  #3.8.1
import re   

## for nlp
import spacy  #3.5.0
from spacy import displacy
import textacy  #0.12.0

## for graph
import networkx as nx  #3.0 (also pygraphviz==1.10)

## for timeline
import dateparser #1.1.7

Wikipedia-api is a Python library that makes it easy to parse Wikipedia pages. We will use it to extract the page we need, excluding all the "notes" and "references" at the bottom of the page.

Simply write the name of the page:

topic = "Russo-Ukrainian War"

wiki = wikipediaapi.Wikipedia('en')
page = wiki.page(topic)
txt = page.text[:page.text.find("See also")]
txt[0:500] + " ..."
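One caveat about trimming with `str.find`: it returns -1 when the marker is missing, and slicing with `[:-1]` silently drops the last character. The sketch below shows a defensive variant; the helper name `cut_before` and the sample strings are my own assumptions.

```python
def cut_before(text, marker):
    """Return the part of text before marker, or all of text if marker is absent."""
    idx = text.find(marker)
    return text if idx == -1 else text[:idx]

sample = "Intro paragraph.\nSee also\nNotes\nReferences"
print(repr(cut_before(sample, "See also")))
print(repr(cut_before("No trailing sections.", "See also")))
```

For the page used here the marker is expected to be present, so this only matters when the same code is reused on other pages.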

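We will map the relationships among historical events by identifying and extracting subject-action-object triples from the text (so the verb is the relation).

NLP

To build a knowledge graph, we first need to identify entities and their relations. The text dataset therefore has to be processed with natural language processing techniques.

At present, the most commonly used library for this kind of task is SpaCy, open-source software for advanced NLP that uses Cython (C + Python) for speed. SpaCy tokenizes the text with a pre-trained language model and turns it into a "document" object that holds all the annotations predicted by the model.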

#python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp(txt)

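The first output of the NLP model is sentence segmentation (Chinese text follows its own segmentation rules): the problem of deciding where a sentence begins and ends. It is usually done by splitting paragraphs on punctuation. Let's see how many sentences SpaCy split the text into: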

# from text to a list of sentences
lst_docs = [sent for sent in doc.sents]
print("tot sentences:", len(lst_docs))
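To illustrate what "splitting on punctuation" means, here is a deliberately naive sketch using only a regular expression. It is not how SpaCy segments sentences, and the function name `naive_sentences` and the sample text are assumptions of mine.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = "The war began in 2014. It escalated in 2022! What happens next?"
for s in naive_sentences(sample):
    print(s)

# The rule is too simple for abbreviations, which is why a trained model helps.
print(naive_sentences("Dr. Smith arrived."))
```

The second call shows the weakness of the rule: the abbreviation is treated as the end of a sentence.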

Now, for each sentence, we are going to extract the entities and their relations. To do that, we first need to understand part-of-speech (POS) tagging: the process of labeling each word in a sentence with its appropriate grammar tag. Here is the complete list of possible tags (as of today):

ADJ: adjective, e.g. big, old, green, incomprehensible, first
ADP: adposition, e.g. in, to, during
ADV: adverb, e.g. very, tomorrow, down, where, there
AUX: auxiliary, e.g. is, has (done), will (do), should (do)
CONJ: conjunction, e.g. and, or, but
CCONJ: coordinating conjunction, e.g. and, or, but
DET: determiner, e.g. a, an, the
INTJ: interjection, e.g. psst, ouch, bravo, hello
NOUN: noun, e.g. girl, cat, tree, air, beauty
NUM: numeral, e.g. 1, 2017, one, seventy-seven, IV, MMXIV
PART: particle, e.g. 's, not
PRON: pronoun, e.g. I, you, he, she, myself, themselves, somebody
PROPN: proper noun, e.g. Mary, John, London, NATO, HBO
PUNCT: punctuation, e.g. ., (, ), ?
SCONJ: subordinating conjunction, e.g. if, while, that
SYM: symbol, e.g. $, %, §, ?, +, -, ×, ÷, =, :), emoji
VERB: verb, e.g. run, runs, running, eat, ate, eating
X: other, e.g. sfpksdpsxmsa
SPACE: space, e.g.

POS tags alone are not enough; the model also tries to understand the relationship between pairs of words. This task is called dependency parsing (DEP). Here is the complete list of possible labels (as of today).

ACL: clausal modifier of noun
ACOMP: adjectival complement
ADVCL: adverbial clause modifier
ADVMOD: adverbial modifier
AGENT: agent
AMOD: adjectival modifier
APPOS: appositional modifier
ATTR: attribute
AUX: auxiliary
AUXPASS: auxiliary (passive)
CASE: case marker
CC: coordinating conjunction
CCOMP: clausal complement
COMPOUND: compound modifier
CONJ: conjunct
CSUBJ: clausal subject
CSUBJPASS: clausal subject (passive)
DATIVE: dative
DEP: unclassified dependent
DET: determiner
DOBJ: direct object
EXPL: expletive
INTJ: interjection
MARK: marker
META: meta modifier
NEG: negation modifier
NOUNMOD: modifier of nominal
NPMOD: noun phrase as adverbial modifier
NSUBJ: nominal subject
NSUBJPASS: nominal subject (passive)
NUMMOD: number modifier
OPRD: object predicate
PARATAXIS: parataxis
PCOMP: complement of preposition
POBJ: object of preposition
POSS: possession modifier
PRECONJ: pre-correlative conjunction
PREDET: pre-determiner
PREP: prepositional modifier
PRT: particle
PUNCT: punctuation
QUANTMOD: modifier of quantifier
RELCL: relative clause modifier
ROOT: root
XCOMP: open clausal complement

Let's look at an example to understand POS tagging and dependency parsing:

# take a sentence
i = 3
lst_docs[i]

Check the POS and DEP tags predicted by the NLP model:

for token in lst_docs[i]:
    print(token.text, "-->", "pos: "+token.pos_, "|", "dep: "+token.dep_, "")

SpaCy also provides a graphic tool to visualize these annotations:

from spacy import displacy

displacy.render(lst_docs[i], style="dep", options={"distance":100})

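The most important token is the verb (POS=VERB) because it is the root of the meaning in a sentence (DEP=ROOT).

Auxiliary elements such as adverbs and adpositions (POS=ADV/ADP) are often linked to the verb as modifiers (DEP=*mod) because they can change its meaning. For example, "travel to" and "travel from" mean different things even though the root is the same ("travel").

Among the words linked to the verb, there must be some nouns (POS=PROPN/NOUN) acting as the subject and object of the sentence (DEP=nsubj/*obj).

Nouns are often found near an adjective (POS=ADJ) that modifies their meaning (DEP=amod). For example, in "good person" and "bad person" the adjectives give opposite meanings to the noun "person".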
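The roles described above can be sketched without running a model at all. In the example below the (text, pos, dep) annotations are written by hand to resemble typical parser output; they are an assumption, not something SpaCy produced, and `first_with` is a hypothetical helper.

```python
# Hand-written (text, pos, dep) annotations; an assumption, not real model output.
tokens = [
    ("Russia", "PROPN", "nsubj"),
    ("annexed", "VERB", "ROOT"),
    ("Crimea", "PROPN", "dobj"),
    ("in", "ADP", "prep"),
    ("2014", "NUM", "pobj"),
]

def first_with(tokens, predicate):
    """Return the text of the first token whose (pos, dep) satisfies predicate."""
    for text, pos, dep in tokens:
        if predicate(pos, dep):
            return text
    return None

root = first_with(tokens, lambda pos, dep: dep == "ROOT")
subject = first_with(tokens, lambda pos, dep: dep.endswith("subj"))
direct_object = first_with(tokens, lambda pos, dep: dep == "dobj")
print((subject, root, direct_object))
```

The resulting tuple is exactly the subject-action-object shape that the rest of the tutorial tries to recover from real sentences.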

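Another cool task SpaCy performs is named entity recognition (NER). A named entity is a "real-world object" (e.g. a person, country, product, date), and the model can recognize various types in a document. Here is the complete list of possible tags (as of today):

PERSON: people, including fictional.
NORP: nationalities or religious or political groups.
FAC: buildings, airports, highways, bridges, etc.
ORG: companies, agencies, institutions, etc.
GPE: countries, cities, states.
LOC: non-GPE locations, mountain ranges, bodies of water.
PRODUCT: objects, vehicles, foods, etc. (not services).
EVENT: named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART: titles of books, songs, etc.
LAW: named documents made into laws.
LANGUAGE: any named language.
DATE: absolute or relative dates or periods.
TIME: times smaller than a day.
PERCENT: percentage, including "%".
MONEY: monetary values, including unit.
QUANTITY: measurements, as of weight or distance.
ORDINAL: "first", "second", etc.
CARDINAL: numerals that do not fall under another type.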

for tag in lst_docs[i].ents:
    print(tag.text, f"({tag.label_})")

Or, even better, with the SpaCy graphic tool:

displacy.render(lst_docs[i], style="ent")

This is useful when we want to add several attributes to our knowledge graph.
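As a rough sketch of how recognized entities can become attributes, the snippet below groups entity texts by their label. The (text, label) pairs are hand-written stand-ins for what iterating over the entities of a sentence might yield (GPE and DATE are SpaCy labels for geopolitical entities and dates), and `group_by_label` is a hypothetical helper.

```python
from collections import defaultdict

# Hand-written (text, label) pairs; an assumption standing in for real NER output.
entities = [
    ("Russia", "GPE"),
    ("Crimea", "GPE"),
    ("February 2014", "DATE"),
    ("2022", "DATE"),
]

def group_by_label(entities):
    """Collect entity texts under their NER label, keeping the original order."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

attributes = group_by_label(entities)
print(attributes.get("DATE", []))
```

Each label then becomes one possible attribute column, with DATE being the one used later for the timeline.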

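Next, using the tags predicted by the NLP model, we can extract entities and their relations.

Entity and Relation Extraction

The idea is simple, but the implementation can be tricky. For each sentence, we will extract the subject and the object together with their modifiers, compound words, and the punctuation between them.

This can be done in two ways:

Manually: you can start from the baseline code below, which will probably need to be slightly modified and adapted to your specific dataset or use case.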

def extract_entities(doc):
    a, b, prev_dep, prev_txt, prefix, modifier = "", "", "", "", "", ""
    for token in doc:
        if token.dep_ != "punct":
            ## prefix --> prev_compound + compound
            if token.dep_ == "compound":
                prefix = prev_txt +" "+ token.text if prev_dep == "compound" else token.text
            
            ## modifier --> prev_compound + %mod
            if token.dep_.endswith("mod"):
                modifier = prev_txt +" "+ token.text if prev_dep == "compound" else token.text
            
            ## subject --> modifier + prefix + %subj
            if "subj" in token.dep_:
                a = modifier +" "+ prefix + " "+ token.text
                prefix, modifier, prev_dep, prev_txt = "", "", "", ""
            
            ## if object --> modifier + prefix + %obj
            if "obj" in token.dep_:
                b = modifier +" "+ prefix +" "+ token.text
            
            prev_dep, prev_txt = token.dep_, token.text
    
    # clean
    a = " ".join([i for i in a.split()])
    b = " ".join([i for i in b.split()])
    return (a.strip(), b.strip())


# Relation extraction uses spaCy's rule-based Matcher, which works on token
# attributes rather than on raw text as regular expressions do.
from spacy.matcher import Matcher

def extract_relation(doc, nlp):
    matcher = Matcher(nlp.vocab)
    ## ROOT verb, optionally followed by a preposition, an agent, and an adjective
    p1 = [{'DEP':'ROOT'}, 
          {'DEP':'prep', 'OP':"?"},
          {'DEP':'agent', 'OP':"?"},
          {'POS':'ADJ', 'OP':"?"}] 
    matcher.add("matching_1", [p1]) 
    matches = matcher(doc)
    if not matches:   ## nothing matched: return an empty relation
        return ""
    ## keep the last match returned by the matcher
    k = len(matches) - 1
    span = doc[matches[k][1]:matches[k][2]] 
    return span.text
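The entity extraction above decides what each token contributes by looking at substrings of its dependency label. The sketch below spells out that bucketing on its own; `role_of` is a hypothetical helper and the label list is just a sample.

```python
def role_of(dep):
    """Map a dependency label to the role used when assembling entities."""
    if dep == "compound":
        return "prefix"
    if dep.endswith("mod"):
        return "modifier"
    if "subj" in dep:
        return "subject"
    if "obj" in dep:
        return "object"
    return "other"

for dep in ["compound", "amod", "nsubj", "nsubjpass", "dobj", "pobj", "ROOT"]:
    print(dep, "->", role_of(dep))
```

A membership test (`in`) is used here rather than comparing `str.find(...)` to `True`: that comparison only holds when the match starts at index 1, which happens to be the case for labels such as nsubj or dobj but not in general.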

Let's try it on the dataset and look at the usual example:

## extract entities
lst_entities = [extract_entities(i) for i in lst_docs]

## example
lst_entities[i]

## extract relations
lst_relations = [extract_relation(i,nlp) for i in lst_docs]

## example
lst_relations[i]

## extract attributes (NER)
lst_attr = []
for x in lst_docs:
    attr = ""
    for tag in x.ents:
        attr = attr+tag.text if tag.label_=="DATE" else attr+""
    lst_attr.append(attr)

## example
lst_attr[i]

The second way is to use Textacy, a library built on top of SpaCy to extend its core functionality. This approach is more user-friendly and generally more accurate.

## extract entities and relations
dic = {"id":[], "text":[], "entity":[], "relation":[], "object":[]}

for n,sentence in enumerate(lst_docs):
    lst_generators = list(textacy.extract.subject_verb_object_triples(sentence))  
    for sent in lst_generators:
        subj = "_".join(map(str, sent.subject))
        obj  = "_".join(map(str, sent.object))
        relation = "_".join(map(str, sent.verb))
        dic["id"].append(n)
        dic["text"].append(sentence.text)
        dic["entity"].append(subj)
        dic["object"].append(obj)
        dic["relation"].append(relation)


## create dataframe
dtf = pd.DataFrame(dic)

## example
dtf[dtf["id"]==i]

Let's also extract the attributes using the NER tags (i.e. dates):

## extract attributes
attribute = "DATE"
dic = {"id":[], "text":[], attribute:[]}

for n,sentence in enumerate(lst_docs):
    lst = list(textacy.extract.entities(sentence, include_types={attribute}))
    if len(lst) > 0:
        for attr in lst:
            dic["id"].append(n)
            dic["text"].append(sentence.text)
            dic[attribute].append(str(attr))
    else:
        dic["id"].append(n)
        dic["text"].append(sentence.text)
        dic[attribute].append(np.nan)

dtf_att = pd.DataFrame(dic)
dtf_att = dtf_att[~dtf_att[attribute].isna()]

## example
dtf_att[dtf_att["id"]==i]

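Now that we have extracted the "knowledge", we can build the graph.

Network Graph

NetworkX is the de facto standard Python library (a third-party package, not part of the standard library) for creating and manipulating graph networks. We could create the graph from the whole dataset, but with too many nodes the visualization becomes messy: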

## create full graph
G = nx.from_pandas_edgelist(dtf, source="entity", target="object", 
                            edge_attr="relation", 
                            create_using=nx.DiGraph())


## plot
plt.figure(figsize=(15,10))

pos = nx.spring_layout(G, k=1)
node_color = "skyblue"
edge_color = "black"

nx.draw(G, pos=pos, with_labels=True, node_color=node_color, 
        edge_color=edge_color, cmap=plt.cm.Dark2, 
        node_size=2000, connectionstyle='arc3,rad=0.1')

nx.draw_networkx_edge_labels(G, pos=pos, label_pos=0.5, 
                         edge_labels=nx.get_edge_attributes(G,'relation'),
                         font_size=12, font_color='black', alpha=0.6)
plt.show()
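One thing to keep in mind: a directed graph such as `nx.DiGraph` holds a single edge per ordered pair of nodes, so when the same entity/object pair appears with several relations, only the last one survives as the edge attribute. The sketch below imitates that with plain dictionaries; the rows are made up and both function names are hypothetical.

```python
# Made-up rows in the same shape as the entity/relation/object columns.
rows = [
    ("Russia", "annexed", "Crimea"),
    ("Russia", "invaded", "Ukraine"),
    ("Russia", "recognized", "Crimea"),
]

def single_edge_labels(rows):
    """Keep one relation per (source, target) pair; later rows overwrite earlier ones."""
    labels = {}
    for source, relation, target in rows:
        labels[(source, target)] = relation
    return labels

def multi_edge_labels(rows):
    """Keep every relation per (source, target) pair, as a multigraph would."""
    labels = {}
    for source, relation, target in rows:
        labels.setdefault((source, target), []).append(relation)
    return labels

print(single_edge_labels(rows))
print(multi_edge_labels(rows))
```

If every relation has to be kept, NetworkX also offers `nx.MultiDiGraph`, although the plotting code above would probably need adjusting for it.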

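Knowledge graphs let us see how everything is related at the big-picture level, but looking at the whole graph directly is not very useful. It is better to apply filters based on the information we are looking for. For this example, I will take only the part of the graph involving the most frequent entity (basically the most connected node):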

dtf["entity"].value_counts().head()

## filter
f = "Russia"
tmp = dtf[(dtf["entity"]==f) | (dtf["object"]==f)]


## create small graph
G = nx.from_pandas_edgelist(tmp, source="entity", target="object", 
                            edge_attr="relation", 
                            create_using=nx.DiGraph())


## plot
plt.figure(figsize=(15,10))

pos = nx.nx_agraph.graphviz_layout(G, prog="neato")
node_color = ["red" if node==f else "skyblue" for node in G.nodes]
edge_color = ["red" if edge[0]==f else "black" for edge in G.edges]

nx.draw(G, pos=pos, with_labels=True, node_color=node_color, 
        edge_color=edge_color, cmap=plt.cm.Dark2, 
        node_size=2000, node_shape="o", connectionstyle='arc3,rad=0.1')

nx.draw_networkx_edge_labels(G, pos=pos, label_pos=0.5, 
                        edge_labels=nx.get_edge_attributes(G,'relation'),
                        font_size=12, font_color='black', alpha=0.6)
plt.show()
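The filtering step can be sketched independently of pandas: count how often each subject occurs, pick the most frequent one, and keep only the triples that mention it. The triples below are made up, and `most_common_entity` and `involving` are hypothetical helpers.

```python
from collections import Counter

# Made-up triples standing in for the rows of the entity/relation/object dataframe.
triples = [
    ("Russia", "annexed", "Crimea"),
    ("Ukraine", "requested", "aid"),
    ("Russia", "invaded", "Ukraine"),
    ("NATO", "condemned", "Russia"),
]

def most_common_entity(triples):
    """Return the subject that appears most often."""
    counts = Counter(subj for subj, _, _ in triples)
    return counts.most_common(1)[0][0]

def involving(triples, name):
    """Keep triples where name is either the subject or the object."""
    return [t for t in triples if t[0] == name or t[2] == name]

focus = most_common_entity(triples)
print(focus)
print(involving(triples, focus))
```

Filtering on both the subject and the object keeps incoming as well as outgoing edges of the chosen node.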

The result above already looks good. To make it 3D, use the following code:

from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(15,10))
ax = fig.add_subplot(111, projection="3d")
pos = nx.spring_layout(G, k=2.5, dim=3)

nodes = np.array([pos[v] for v in sorted(G) if v!=f])
center_node = np.array([pos[v] for v in sorted(G) if v==f])

edges = np.array([(pos[u],pos[v]) for u,v in G.edges() if v!=f])
center_edges = np.array([(pos[u],pos[v]) for u,v in G.edges() if v==f])

ax.scatter(*nodes.T, s=200, ec="w", c="skyblue", alpha=0.5)
ax.scatter(*center_node.T, s=200, c="red", alpha=0.5)

for link in edges:
    ax.plot(*link.T, color="grey", lw=0.5)
for link in center_edges:
    ax.plot(*link.T, color="red", lw=0.5)
    
for v in sorted(G):
    ax.text(*pos[v].T, s=v)
for u,v in G.edges():
    attr = nx.get_edge_attributes(G, "relation")[(u,v)]
    ax.text(*((pos[u]+pos[v])/2).T, s=attr)

ax.set(xlabel=None, ylabel=None, zlabel=None, 
       xticklabels=[], yticklabels=[], zticklabels=[])
ax.grid(False)
for dim in (ax.xaxis, ax.yaxis, ax.zaxis):
    dim.set_ticks([])
plt.show()

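Note that a graph network can be useful and pretty, but it is not the main focus of this tutorial. The most important part of a knowledge graph is the "knowledge" (text processing); the results can then be shown in a dataframe, a graph, or another kind of plot. For example, I can use the dates recognized by NER to build a timeline graph.

Timeline Graph

First, the strings identified as "dates" need to be converted to datetime format. The DateParser library parses dates in almost any string format commonly found on web pages.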

def utils_parsetime(txt):
    x = re.match(r'.*([1-3][0-9]{3})', txt) #<--check if there is a year
    if x is not None:
        try:
            dt = dateparser.parse(txt)
        except Exception:
            dt = np.nan
    else:
        dt = np.nan
    return dt
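The year check at the top of `utils_parsetime` can be tried on its own. The sketch below reuses the same regular expression; `has_year` is a hypothetical helper and the sample strings are made up.

```python
import re

YEAR = re.compile(r".*([1-3][0-9]{3})")

def has_year(text):
    """True when text contains four consecutive digits starting with 1, 2 or 3."""
    return YEAR.match(text) is not None

for s in ["24 February 2022", "2014", "last week", "the 1990s", "5 March", "12345"]:
    print(repr(s), has_year(s))
```

The pattern has no word boundaries, so longer digit runs such as "12345" also pass; it is a cheap pre-filter for relative or year-less dates, not a validator.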

Apply it to the dataframe of attributes:

dtf_att["dt"] = dtf_att[attribute].apply(utils_parsetime)  # the column is named after attribute ("DATE")

## example
dtf_att[dtf_att["id"]==i]

Now join it with the main dataframe of entities and relations:

tmp = dtf.copy()
tmp["y"] = tmp["entity"]+" "+tmp["relation"]+" "+tmp["object"]

dtf_att = dtf_att.merge(tmp[["id","y"]], how="left", on="id")
dtf_att = dtf_att[~dtf_att["y"].isna()].sort_values("dt", 
                 ascending=True).drop_duplicates("y", keep='first')
dtf_att.head()

Finally, I can plot the timeline (the full plot is probably not very useful):

dates = dtf_att["dt"].values
names = dtf_att["y"].values
l = [10,-10, 8,-8, 6,-6, 4,-4, 2,-2]
levels = np.tile(l, int(np.ceil(len(dates)/len(l))))[:len(dates)]

fig, ax = plt.subplots(figsize=(20,10))
ax.set(title=topic, yticks=[], yticklabels=[])

ax.vlines(dates, ymin=0, ymax=levels, color="tab:red")
ax.plot(dates, np.zeros(len(dates)), "-o", color="k", markerfacecolor="w")

for d,l,r in zip(dates,levels,names):
    ax.annotate(r, xy=(d,l), xytext=(-3, np.sign(l)*3), 
                textcoords="offset points",
                horizontalalignment="center",
                verticalalignment="bottom" if l>0 else "top")

plt.xticks(rotation=90) 
plt.show()
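The `levels` array simply repeats a fixed pattern of stem heights so that neighbouring labels do not overlap. A standard-library sketch of the same idea follows; `stem_levels` is a hypothetical helper and the default pattern copies the list used above.

```python
from itertools import cycle, islice

def stem_levels(n, pattern=(10, -10, 8, -8, 6, -6, 4, -4, 2, -2)):
    """Repeat pattern until there is one stem height per event."""
    return list(islice(cycle(pattern), n))

print(stem_levels(3))
print(stem_levels(12))
```

Alternating signs put consecutive events above and below the axis, and the shrinking magnitudes stagger their labels.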

Filter to a specific period:

yyyy = "2022"
dates = dtf_att[dtf_att["dt"]>yyyy]["dt"].values
names = dtf_att[dtf_att["dt"]>yyyy]["y"].values
l = [10,-10, 8,-8, 6,-6, 4,-4, 2,-2]
levels = np.tile(l, int(np.ceil(len(dates)/len(l))))[:len(dates)]

fig, ax = plt.subplots(figsize=(20,10))
ax.set(title=topic, yticks=[], yticklabels=[])

ax.vlines(dates, ymin=0, ymax=levels, color="tab:red")
ax.plot(dates, np.zeros(len(dates)), "-o", color="k", markerfacecolor="w")

for d,l,r in zip(dates,levels,names):
    ax.annotate(r, xy=(d,l), xytext=(-3, np.sign(l)*3), 
                textcoords="offset points",
                horizontalalignment="center",
                verticalalignment="bottom" if l>0 else "top")

plt.xticks(rotation=90) 
plt.show()
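As far as I understand it, comparing a datetime column with the string "2022" amounts to comparing against midnight on 1 January 2022, so the strict `>` keeps everything after that instant. The sketch below reproduces that cutoff with plain `datetime` objects; the events are made up and `after_start_of_year` is a hypothetical helper.

```python
from datetime import datetime

# Made-up (timestamp, label) events standing in for the dt and y columns.
events = [
    (datetime(2014, 3, 18), "event A"),
    (datetime(2022, 1, 1), "event B"),
    (datetime(2022, 2, 24), "event C"),
]

def after_start_of_year(events, year):
    """Keep events strictly later than 1 January of year at 00:00."""
    cutoff = datetime(year, 1, 1)
    return [(ts, label) for ts, label in events if ts > cutoff]

recent = after_start_of_year(events, 2022)
print(recent)
```

An event stamped exactly at the cutoff is excluded; use `>=` if dates parsed as 1 January should be kept.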

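With the "knowledge" extracted, you can re-plot it in whatever style you like.

Conclusion

This article has been a tutorial on how to build a knowledge graph with Python. I applied several NLP techniques to data parsed from Wikipedia to extract the "knowledge" (i.e. entities and relations) and stored it in a network graph object.

Companies now use NLP and knowledge graphs to map relevant data from multiple sources and find insights useful for the business. Just imagine how much value could be extracted by applying this kind of model to all the documents (e.g. financial reports, news, tweets) related to a single entity (e.g. Apple Inc). You could quickly understand all the facts, people, and companies directly related to that entity. Then, by expanding the network, you could also reach information that is not directly connected to the starting entity (A -> B -> C).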
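The A -> B -> C idea can be sketched as a breadth-first walk that stops after a given number of hops. The edge list is made up and `reachable` is a hypothetical helper, not part of NetworkX.

```python
from collections import deque

# Made-up directed edges: A -> B -> C, plus an unrelated D -> E.
edges = [("A", "B"), ("B", "C"), ("D", "E")]

def reachable(edges, start, max_hops):
    """Return {node: hops} for nodes reachable from start within max_hops."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, []).append(v)
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    del seen[start]
    return seen

print(reachable(edges, "A", 1))
print(reachable(edges, "A", 2))
```

With one hop only the direct neighbour is found; raising the limit to two brings in C, which is connected to A only through B.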

Editor: Jiang Hua. Source: Toutiao
