Named Entity Recognition (NER) in Practice with Keras + LSTM + CRF

Author: Anonymous  2020-08-28 12:00:47

Artificial Intelligence, Deep Learning

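Named entity recognition (NER) is a sequence labeling task, though in practice it behaves much like per-character classification: given a piece of text, NER identifies the entities that belong to a set of predefined types.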

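Word segmentation, part-of-speech tagging, and named entity recognition are all basic tasks in natural language processing, and their accuracy bounds the accuracy of downstream tasks. Until now I had never really worked on named entity recognition in earnest. I took part in such projects on and off during graduate school, but after graduating I always felt I only half understood it. Recently I decided to pick it up again, using hands-on practice as the main way to understand, learn, and apply this kind of task more systematically.

Hardly any application today is untouched by deep learning, and many subtasks have followed a strikingly similar path: early research and applications were concentrated in classical machine learning, and deep learning models were widely adopted as they matured. Sequence labeling tasks such as named entity recognition are no exception, and LSTM+CRF approaches to entity recognition have been studied for a long time. They simply did not match my earlier research direction, so I never spent much time on them. Now that I am studying NER, and having already tried simple NER with machine learning methods in earlier articles, this article implements NER with a deep learning model.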

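Because NER is a sequence labeling problem, its data is annotated the way sequence labeling data usually is, mainly with the BIO or BIOES scheme. This article goes straight to BIOES; once BIOES is clear, BIO follows.

Here is what each letter of BIOES stands for: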

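B, Begin: the start of an entity  
I, Intermediate (often called Inside): the middle of an entity  
E, End: the end of an entity  
S, Single: an entity made of a single character  
O, Other: everything else, used to mark irrelevant characters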

Take the following sentence as an example (roughly, "Yao Ming went to the Harbin Institute of Technology gymnasium to play ball"):

姚明去哈爾濱工業大學體育館打球了

The annotation is:

姚明 去 哈爾濱工業大學 體育館 打球 了  
B-PER E-PER O B-ORG I-ORG I-ORG I-ORG I-ORG I-ORG E-ORG B-LOC I-LOC E-LOC O O O
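
The tagging rule above can be sketched in a few lines of Python. This is only an illustration: the function names `spans_to_bioes` and `bioes_to_bio` and the `(start, end, type)` span format are assumptions made for this example, not part of the original article. It reproduces the annotation of the sample sentence and shows how BIOES collapses to BIO.

```python
def spans_to_bioes(length, spans):
    """Tag a sentence of `length` characters given (start, end, type) spans.

    `end` is exclusive. Characters outside every span get 'O'.
    """
    tags = ['O'] * length
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = 'S-' + etype          # single-character entity
        else:
            tags[start] = 'B-' + etype          # first character
            for i in range(start + 1, end - 1):
                tags[i] = 'I-' + etype          # middle characters
            tags[end - 1] = 'E-' + etype        # last character
    return tags


def bioes_to_bio(tags):
    """Collapse BIOES to BIO: S- becomes B-, E- becomes I-."""
    mapping = {'S': 'B', 'E': 'I'}
    out = []
    for tag in tags:
        if tag == 'O':
            out.append(tag)
        else:
            prefix, etype = tag.split('-', 1)
            out.append(mapping.get(prefix, prefix) + '-' + etype)
    return out


sentence = '姚明去哈爾濱工業大學體育館打球了'
# Hypothetical gold spans for the sample sentence (end index is exclusive)
spans = [(0, 2, 'PER'), (3, 10, 'ORG'), (10, 13, 'LOC')]
bioes = spans_to_bioes(len(sentence), spans)
bio = bioes_to_bio(bioes)
print(' '.join(bioes))
print(' '.join(bio))
```

In the BIO version the only change is that `E-` tags become `I-` and `S-` tags become `B-`, which is why knowing BIOES is enough to read BIO data.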

That is enough of a refresher. On to the hands-on part, starting with the dataset, which was obtained from the web. A quick look at some sample data:

Sample of train_data:

當 O  
希 O  
望 O  
工 O  
程 O  
救 O  
助 O  
的 O  
百 O  
萬 O  
兒 O  
童 O  
成 O  
長 O  
起 O  
來 O  
， O  
科 O  
教 O  
興 O  
國 O  
蔚 O  
然 O  
成 O  
風 O  
時 O  
， O  
今 O  
天 O  
有 O  
收 O  
藏 O  
價 O  
值 O  
的 O  
書 O  
你 O  
沒 O  
買 O  
， O  
明 O  
日 O  
就 O  
叫 O  
你 O  
悔 O  
不 O  
當 O  
初 O  
！ O

Sample of test_data:

高 O  
舉 O  
愛 O  
國 O  
主 O  
義 O  
和 O  
社 O  
會 O  
主 O  
義 O  
兩 O  
面 O  
旗 O  
幟 O  
， O  
團 O  
結 O  
全 O  
體 O  
成 O  
員 O  
以 O  
及 O  
所 O  
聯 O  
系 O  
的 O  
歸 O  
僑 O  
、 O  
僑 O  
眷 O  
， O  
發 O  
揚 O  
愛 O  
國 O  
革 O  
命 O  
的 O  
光 O  
榮 O  
傳 O  
統 O  
， O  
為 O  
統 O  
一 O  
祖 O  
國 O  
、 O  
振 O  
興 O  
中 B-LOC  
華 I-LOC  
而 O  
努 O  
力 O  
奮 O  
斗 O  
； O

With the structure of the training and test sets in mind, we can move on to data processing, whose main goal is to generate feature data. The core code is as follows:

import json  

# Read the tab-separated "char<TAB>label" lines, skipping blank lines  
with open('test_data.txt',encoding='utf-8') as f:  
    test_data_list=[one.strip().split('\t') for one in f.readlines() if one.strip()]  
with open('train_data.txt',encoding='utf-8') as f:  
    train_data_list=[one.strip().split('\t') for one in f.readlines() if one.strip()]  
char_list=[one[0] for one in test_data_list]+[one[0] for one in train_data_list]  
label_list=[one[-1] for one in test_data_list]+[one[-1] for one in train_data_list]  
print('char_list_length: ', len(char_list))  
print('label_list_length: ', len(label_list))  
print('char_num: ', len(list(set(char_list))))  
print('label_num: ', len(list(set(label_list))))  
char_count,label_count={},{}  
# Count how often each character and each label occurs  
for one in char_list:  
    if one in char_count:  
        char_count[one]+=1  
    else:  
        char_count[one]=1  
for one in label_list:  
    if one in label_count:  
        label_count[one]+=1  
    else:  
        label_count[one]=1  
# Sort by frequency, most frequent first  
sorted_char=sorted(char_count.items(),key=lambda e:e[1],reverse=True)  
sorted_label=sorted(label_count.items(),key=lambda e:e[1],reverse=True)  
# Build the character-to-id and id-to-character mappings in one dict.  
# Caveat: a digit character in the corpus (e.g. '1') shares its key with str(i).  
char_map_dict={}  
label_map_dict={}  
for i in range(len(sorted_char)):  
    char_map_dict[sorted_char[i][0]]=i  
    char_map_dict[str(i)]=sorted_char[i][0]  
for i in range(len(sorted_label)):  
    label_map_dict[sorted_label[i][0]]=i  
    label_map_dict[str(i)]=sorted_label[i][0]  
# Save the mappings  
with open('charMap.json','w') as f:  
    f.write(json.dumps(char_map_dict))  
with open('labelMap.json','w') as f:  
    f.write(json.dumps(label_map_dict))

The code is straightforward and the key parts are commented, so I won't explain much. The core idea is to map characters and label classes to integer indices. I did not apply a frequency threshold here; some implementations drop items that appear only once, and you can adjust this to your needs.
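
For readers who do want a frequency threshold, here is one possible sketch. It is not the article's implementation: the names `build_char_map`, `PAD`, and `UNK` are hypothetical, and it differs from the listing above by keeping two separate dictionaries and reserving ids 0 and 1 for padding and unknown characters.

```python
from collections import Counter

PAD, UNK = '<PAD>', '<UNK>'  # hypothetical reserved tokens, not in the original


def build_char_map(chars, min_count=1):
    """Map characters to ids by descending frequency.

    Ids 0 and 1 are reserved for padding and unknown characters; characters
    seen fewer than `min_count` times are left out and fall back to UNK.
    """
    counts = Counter(chars)
    char_to_id = {PAD: 0, UNK: 1}
    for char, count in counts.most_common():
        if count >= min_count:
            char_to_id[char] = len(char_to_id)
    id_to_char = {i: c for c, i in char_to_id.items()}
    return char_to_id, id_to_char


chars = list('今天有書，明日有書，你沒買')
char_to_id, id_to_char = build_char_map(chars, min_count=2)
# Characters below the threshold (or never seen) map to the UNK id
ids = [char_to_id.get(c, char_to_id[UNK]) for c in '今日有書']
print(char_to_id)
print(ids)
```

Keeping the two directions in separate dictionaries avoids any clash between a digit character and a stringified id, at the cost of storing two JSON objects instead of one.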

[Figure: sample of the charMap data]

[Figure: sample of the labelMap data]

Once these mappings exist, the raw text can be converted into the feature data we need. The core code is as follows:

import json  

# trainData and testData are not defined in the original listing. They are assumed  
# to be lists of sentences, each sentence being a list of "char<TAB>label" lines.  
X_train,y_train,X_test,y_test=[],[],[],[]  
# Training set  
for sample in trainData:  
    one_sample=[one.strip().split('\t') for one in sample]  
    char_vec=[char_map_dict[pair[0]] for pair in one_sample]  
    label_vec=[label_map_dict[pair[1]] for pair in one_sample]  
    X_train.append(char_vec)  
    y_train.append(label_vec)  
# Test set  
for sample in testData:  
    one_sample=[one.strip().split('\t') for one in sample]  
    char_vec=[char_map_dict[pair[0]] for pair in one_sample]  
    label_vec=[label_map_dict[pair[1]] for pair in one_sample]  
    X_test.append(char_vec)  
    y_test.append(label_vec)  
feature={}  
feature['X_train'],feature['y_train']=X_train,y_train  
feature['X_test'],feature['y_test']=X_test,y_test  
# Save the features  
with open('feature.json','w') as f:  
    f.write(json.dumps(feature))

At this point we have the feature data we need, already split into training and test sets.
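
The feature-generation listing iterates over `trainData` and `testData` without showing how they are built. One plausible way to get them, assuming sentences in the raw files are separated by blank lines (the article does not say so explicitly), is sketched below; `read_sentences` is a hypothetical helper name.

```python
def read_sentences(lines):
    """Group 'char<TAB>label' lines into sentences, splitting on blank lines."""
    sentences, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.rstrip('\n'))
        elif current:
            # A blank line closes the sentence collected so far
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


# Tiny in-memory stand-in for the lines of train_data.txt
raw = ['中\tB-LOC\n', '華\tI-LOC\n', '\n', '你\tO\n', '好\tO\n']
trainData = read_sentences(raw)
print(trainData)
```

With real files, the same helper could be fed an open file handle, for example `read_sentences(open('train_data.txt', encoding='utf-8'))`, since it only needs an iterable of lines.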

Next comes the model. To keep the implementation simple I used Keras, which has a lower barrier to entry than raw TensorFlow. The core code is as follows:

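import json  
import os  
import numpy as np  
import matplotlib.pyplot as plt  
from keras.models import Sequential  
from keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense  
from keras.preprocessing.sequence import pad_sequences  
from keras.utils import plot_model  
from keras_contrib.layers import CRF  # assumption: the keras-contrib CRF layer, which matches the calls below  

# Load the dataset  
with open('feature.json') as f:  
    F=json.load(f)  
X_train,X_test,y_train,y_test=F['X_train'],F['X_test'],F['y_train'],F['y_test']  
# The original listing does not define the next four values; these are assumptions  
saveDir='results/'  
max_len=max(len(one) for one in X_train+X_test)  
voc_size=max(max(one) for one in X_train+X_test)+1  
tag_size=max(max(one) for one in y_train+y_test)+1  
# Align all sequences to the same length  
X_train = pad_sequences(X_train, maxlen=max_len, value=0)  
y_train = pad_sequences(y_train, maxlen=max_len, value=-1)  
y_train = np.expand_dims(y_train, 2)  
X_test = pad_sequences(X_test, maxlen=max_len, value=0)  
y_test = pad_sequences(y_test, maxlen=max_len, value=-1)  
y_test = np.expand_dims(y_test, 2)  
# Create the output directory  
if not os.path.exists(saveDir):  
    os.makedirs(saveDir)  
# Build the model: Embedding -> BiLSTM -> Dropout -> Dense -> CRF  
model = Sequential()  
model.add(Embedding(voc_size, 256, mask_zero=True))  
model.add(Bidirectional(LSTM(128, return_sequences=True)))  
model.add(Dropout(rate=0.5))  
model.add(Dense(tag_size))  
crf = CRF(tag_size, sparse_target=True)  
model.add(crf)  
model.summary()  
model.compile('adam', loss=crf.loss_function, metrics=[crf.accuracy])  
# Train  
history=model.fit(X_train,y_train,batch_size=100,epochs=500,validation_data=(X_test,y_test))  
model.save(saveDir+'model.h5')  
# Plot the model structure  
try:  
    plot_model(model,to_file=saveDir+"model_structure.png",show_shapes=True)  
except Exception as e:  
    print('Exception: ', e)  
# Plot the training curves.  
# The accuracy key depends on the metric's name, so look it up instead of assuming 'acc'.  
acc_key=[k for k in history.history if 'acc' in k and not k.startswith('val_')][0]  
plt.clf()  
plt.plot(history.history[acc_key])  
plt.plot(history.history['val_'+acc_key])  
plt.title('model accuracy')  
plt.ylabel('accuracy')  
plt.xlabel('epochs')  
plt.legend(['train','test'], loc='upper left')  
plt.savefig(saveDir+'train_validation_acc.png')  
plt.clf()  
plt.plot(history.history['loss'])  
plt.plot(history.history['val_loss'])  
plt.title('model loss')  
plt.ylabel('loss')  
plt.xlabel('epochs')  
plt.legend(['train', 'test'], loc='upper left')  
plt.savefig(saveDir+'train_validation_loss.png')  
scores=model.evaluate(X_test,y_test,verbose=0)  
print("Accuracy: %.2f%%" % (scores[1]*100))  
# Save the architecture and the weights separately  
model_json=model.to_json()  
with open(saveDir+'structure.json','w') as f:  
    f.write(model_json)  
model.save_weights(saveDir+'weight.h5')  
print('===Finish====')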
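
The alignment step in the listing relies on Keras `pad_sequences`, which by default pads and truncates at the front of each sequence. The pure-Python sketch below mimics that default behavior for illustration only; `pad_pre` is a hypothetical name, and the real function returns a NumPy array rather than nested lists.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) each sequence to exactly `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


# Character ids are padded with 0 and label ids with -1, as in the listing
X = pad_pre([[5, 3], [7, 8, 9, 4]], maxlen=3, value=0)
y = pad_pre([[1, 2], [0, 0, 1, 2]], maxlen=3, value=-1)
print(X)
print(y)
```

One thing to watch: in the mapping built earlier, the most frequent character receives id 0, so with `mask_zero=True` it cannot be told apart from padding. Shifting all character ids up by one is one way to avoid that.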

[Figure: directory structure of the output files after training]

[Figure: model structure diagram]

[Figure: accuracy curve during training]

[Figure: loss curve during training]

Training is resource-intensive and slow, so I ran only 20 epochs here (note that the listing above passes epochs=500). Set a higher or lower number of epochs to suit your own situation.

[Figure: a simple prediction example]
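
As a rough illustration of the post-processing a prediction needs, the sketch below turns a tagged character sequence back into entity strings. It assumes BIO tags as in the sample data and assumes the predicted ids have already been mapped back to tag names; `extract_entities` is a hypothetical helper, not code from the article.

```python
def extract_entities(chars, tags):
    """Collect (text, type) pairs from a BIO-tagged character sequence."""
    entities, current, current_type = [], [], None
    for char, tag in zip(chars, tags):
        if tag.startswith('B-'):
            # A new entity starts; flush the one in progress, if any
            if current:
                entities.append((''.join(current), current_type))
            current, current_type = [char], tag[2:]
        elif tag.startswith('I-') and current and tag[2:] == current_type:
            current.append(char)
        else:
            # 'O' or an inconsistent I- tag ends the current entity
            if current:
                entities.append((''.join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((''.join(current), current_type))
    return entities


chars = list('振興中華而努力')
tags = ['O', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O']
entities = extract_entities(chars, tags)
print(entities)
```

An `I-` tag that does not continue an open entity of the same type is simply dropped here; stricter or more lenient handling is a design choice that depends on how noisy the model's output is.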

That wraps up the hands-on part of this article. I will dig deeper when time allows. I hope it helps, and I wish you success in your work and studies!

Editor: 龐桂玉  Source: Python中文社區
