Text Preprocessing Methods for Deep Learning
Deep learning has generated a great deal of interest lately, especially in natural language processing (NLP). Not long ago, Kaggle ran an NLP competition called the Quora Question Insincerity Challenge. The competition posed a text classification problem, and the aim here is to make that problem easier to understand by working through the competition along with the valuable kernels shared by Kaggle experts.
Let's start by explaining the text classification problem in the competition.
Text classification is a common task in natural language processing: it maps a text sequence of indefinite length to a text category. What can text classification do? You can:
- Understand the sentiment of a review
- Find toxic comments on platforms like Facebook
- Find insincere questions on Quora, the subject of a competition currently running on Kaggle
- Find fake reviews on websites
- Determine whether a text ad will be clicked
These problems all have something in common. From a machine learning perspective they are essentially the same problem; only the target labels change, nothing else. That said, adding business knowledge can help make these models more robust, and that is exactly what we want to incorporate when preprocessing data for text classification.
Although the preprocessing pipeline this article focuses on is built mainly around deep learning, most of it also applies to traditional machine learning models.
First, before going through all the steps, let's look at the flow of a deep learning pipeline for text data to get a better sense of the whole process.
We usually start by cleaning the text data and performing basic exploratory data analysis (EDA). Here we try to improve data quality by cleaning the data, and we also try to improve the quality of our Word2Vec embeddings by removing out-of-vocabulary (OOV) words. There is usually no strict order between these first two steps, and we often move back and forth between them.
Next, we create a representation of the text that can be fed into a deep learning model, then build the models and train them. Finally, we evaluate the models with suitable metrics and get leadership approval to deploy them. If these terms don't mean much to you yet, don't worry; the rest of this article works through them.
在這里,先談?wù)剢卧~嵌入。在為深度學(xué)習(xí)模型預(yù)處理數(shù)據(jù)時(shí),就必須考慮一下。
Getting Started with Word2Vec Embeddings
We need a way to represent the words in our vocabulary. One option is one-hot encoded word vectors, but that is not a great choice. A major reason is that one-hot word vectors cannot accurately express the similarity between different words, for example through cosine similarity.
Given the structure of one-hot encoded vectors, the similarity between any two different words is always 0. Another reason is that as the vocabulary grows, these one-hot encoded vectors become very large.
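To make this concrete, here is a tiny sketch (not from the original article) showing that one-hot vectors of two different words always have a cosine similarity of 0, no matter how related the words are:

```python
# Toy example: cosine similarity between one-hot vectors of distinct words is 0.
import numpy as np

vocab = ["good", "great", "bad"]      # toy vocabulary
one_hot = np.eye(len(vocab))          # each row is a one-hot word vector

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(one_hot[0], one_hot[1]))  # "good" vs "great" -> 0.0
print(cosine(one_hot[0], one_hot[0]))  # "good" vs "good"  -> 1.0
```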
Word2Vec overcomes these difficulties by providing fixed-length vector representations of words that capture the similarity and analogy relationships between them.
Word2vec word vectors are learned in a way that allows them to encode analogies, which lets us do algebra with words in a way that was not possible before. For example: what is king - man + woman? The answer is queen.
Word2Vec vectors also help us find similar words. If we look for words similar to "good", we will find "awesome", "great", and so on. It is this property of word2vec that makes it so valuable for text classification: with it, a deep learning network can understand that "good" and "great" are words with essentially similar meanings.
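As a quick hedged sketch (not part of the competition kernels), the analogy and similarity queries above can be reproduced with gensim, assuming you have pretrained vectors in word2vec format on disk (the GoogleNews file below is the same one the spell-checker script later in this article loads; the local path is an assumption):

```python
import gensim

# Point this at wherever your pretrained word2vec-format vectors live.
vectors = gensim.models.KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman ~= queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Words closest to "good" (expect things like "great", "awesome", ...)
print(vectors.most_similar("good", topn=5))
```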
So, simply put, word2vec creates a vector for each word: a d-dimensional vector for every word in the dictionary. We usually use pretrained word vectors that others have trained on large text corpora such as Wikipedia or Twitter and made available. The most commonly used pretrained vectors are GloVe and fastText, each with 300-dimensional word vectors; this article uses GloVe.
Basic Preprocessing Techniques for Text Data
In most cases the text data we observe is not completely clean. Data from different sources has different characteristics, and that makes text preprocessing one of the most important steps in the classification pipeline.
For example, text data from Twitter is completely different from text on Quora or on news/blog platforms, and it needs to be treated differently. Helpfully, the techniques discussed in this article are general enough for just about any kind of data you may encounter in NLP.
(1) Cleaning special characters and removing punctuation
Our preprocessing pipeline depends a great deal on the word2vec embeddings we will use for the classification task. In principle, the preprocessing should match the preprocessing used before the word embeddings were trained. Since most embeddings do not provide vector values for punctuation and other special characters, the first thing to do is strip the special characters from the text data. These are some of the special characters present in the Quora insincere questions data; we use a replace function to get rid of them.
```python
# Some preprocessing that will be used by every text classification method we look at.
puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•', '~', '@', '£', '·', '_', '{', '}', '©', '^', '®', '`', '<', '→', '°', '€', '™', '›', '♥', '←', '×', '§', '″', '′', ' ', '█', '½', 'à', '…', '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', '▒', ':', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', '∙', ')', '↓', '、', '│', '(', '»', ',', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√']  # list truncated in the source

def clean_text(x):
    x = str(x)
    for punct in puncts:
        if punct in x:
            x = x.replace(punct, '')
    return x
```
This could also be done with a simple regular expression, but many people prefer the approach above because it helps them see exactly which characters are being removed from the data.
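For completeness, here is a hedged sketch of the regex version (the pattern and helper name are mine, not from the original kernels); it builds one character class from the puncts list above:

```python
import re

# Escape every character so the list can be used safely inside a character class.
punct_pattern = re.compile('[%s]' % re.escape(''.join(puncts)))

def clean_text_re(x):
    return punct_pattern.sub('', str(x))

# clean_text_re("Hello, world! (test)") -> "Hello world test"
```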
(2) Cleaning numbers
Why replace numbers with #s? Because most embeddings preprocessed their text this way.
A small Python trick: the code below uses an if statement to check in advance whether the text contains any digits. An if check is always faster than running the re.sub commands, and most of the text does not contain numbers.
```python
import re

def clean_numbers(x):
    if bool(re.search(r'\d', x)):
        x = re.sub('[0-9]{5,}', '#####', x)
        x = re.sub('[0-9]{4}', '####', x)
        x = re.sub('[0-9]{3}', '###', x)
        x = re.sub('[0-9]{2}', '##', x)
    return x
```
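A quick illustration of the behavior (this example call is mine, not from the article):

```python
print(clean_numbers("In 2018 I answered 123456 questions"))
# -> "In #### I answered ##### questions"
```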
(3) Removing misspellings
It always helps to find the misspellings in the data. Since no embeddings exist for these words in word2vec, we should replace them with their correct spellings to get better embedding coverage.
The following code is an adaptation of Peter Norvig's spell checker. It uses the word rankings from word2vec as a proxy for word probabilities, since Google's word2vec apparently orders words in decreasing order of frequency in the training corpus. We can use it to find some of the misspelled words in our data.
The following is based on a CPMP script from the Quora question similarity challenge.
```python
import re
from collections import Counter
import gensim
import heapq
from operator import itemgetter
from multiprocessing import Pool

model = gensim.models.KeyedVectors.load_word2vec_format(
    '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin',
    binary=True)
words = model.index2word

w_rank = {}
for i, word in enumerate(words):
    w_rank[word] = i

WORDS = w_rank

def words(text):
    return re.findall(r'\w+', text.lower())

def P(word):
    "Probability of `word`."
    # use inverse of rank as proxy
    # returns 0 if the word isn't in the dictionary
    return - WORDS.get(word, 0)

def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def build_vocab(texts):
    sentences = texts.apply(lambda x: x.split()).values
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

vocab = build_vocab(train.question_text)
top_90k_words = dict(heapq.nlargest(90000, vocab.items(), key=itemgetter(1)))

pool = Pool(4)
corrected_words = pool.map(correction, list(top_90k_words.keys()))

for word, corrected_word in zip(top_90k_words, corrected_words):
    if word != corrected_word:
        print(word, ":", corrected_word)
```
Once we have found the misspellings in the data, the next step is to replace them using a misspelling mapping and a regex function.
```python
mispell_dict = {'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization'}  # dictionary truncated in the source
```
```python
def _get_mispell(mispell_dict):
    mispell_re = re.compile('(%s)' % '|'.join(mispell_dict.keys()))
    return mispell_dict, mispell_re

mispellings, mispellings_re = _get_mispell(mispell_dict)

def replace_typical_misspell(text):
    def replace(match):
        return mispellings[match.group(0)]
    return mispellings_re.sub(replace, text)

# Usage
replace_typical_misspell("Whta is demonitisation")
```
(4) Removing contractions
Contractions are words written with an apostrophe, such as "ain't" or "aren't". Since we want to standardize the text, it makes sense to expand these contractions. We do this below using a contraction mapping and a regex function.
```python
contraction_dict = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is", "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are", "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have"}
```
```python
def _get_contractions(contraction_dict):
    contraction_re = re.compile('(%s)' % '|'.join(contraction_dict.keys()))
    return contraction_dict, contraction_re

contractions, contractions_re = _get_contractions(contraction_dict)

def replace_contractions(text):
    def replace(match):
        return contractions[match.group(0)]
    return contractions_re.sub(replace, text)

# Usage
replace_contractions("this's a text with contraction")
```
Beyond the techniques above, there are other text preprocessing techniques such as stemming, lemmatization, and stop-word removal. Since those techniques are not used together with deep learning NLP models, we will not discuss them here.
Representation: Creating Sequences
One of the things that makes deep learning the go-to choice for NLP is that we don't really have to hand-engineer features from the text data: deep learning algorithms take a sequence of text as input and learn its structure much like humans do. Since machines cannot understand words, they expect data in numerical form, so we want to represent the text data as a sequence of numbers.
To understand how this is done, we need some familiarity with the Keras Tokenizer. Any other tokenizer could be used, but the Keras tokenizer is a popular choice.
(1) The tokenizer
Simply put, a tokenizer is a utility function that splits a sentence into words. keras.preprocessing.text.Tokenizer tokenizes (splits) the text into tokens (words) while keeping only the words that occur most frequently in the text corpus.
```python
# Signature:
# Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True,
#           split=' ', char_level=False, oov_token=None, document_count=0, **kwargs)
```
The num_words parameter keeps only a pre-specified number of words in the text. This is helpful because we don't want the model to pick up a lot of noise by considering words that occur very rarely; in real-world data, most of the words that num_words leaves out are usually misspellings. By default, the tokenizer also filters out some unwanted tokens and converts the text to lowercase.
Once fit on the data, the tokenizer also keeps a word index (a dictionary that assigns a unique number to each word), which can be accessed via:
- tokenizer.word_index
The words in the index dictionary are ordered by frequency.
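A toy illustration (not from the article's code) of what fitting the tokenizer produces:

```python
from keras.preprocessing.text import Tokenizer

toy_tokenizer = Tokenizer(num_words=10)
toy_tokenizer.fit_on_texts(["good movie", "great great movie"])

print(toy_tokenizer.word_index)
# e.g. {'movie': 1, 'great': 2, 'good': 3} -- the most frequent words get the smallest indices
print(toy_tokenizer.texts_to_sequences(["good great movie"]))
# [[3, 2, 1]]
```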
So the complete code for using the tokenizer on our data is as follows:
```python
from keras.preprocessing.text import Tokenizer

## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_X) + list(test_X))
train_X = tokenizer.texts_to_sequences(train_X)
test_X = tokenizer.texts_to_sequences(test_X)
```
Here, train_X and test_X are lists of documents in the corpus.
(2) Sequence preprocessing
Models usually expect every sequence (every training example) to have the same length (the same number of words/tokens). This can be controlled with the maxlen parameter.
For example:
```python
from keras.preprocessing.sequence import pad_sequences

train_X = pad_sequences(train_X, maxlen=maxlen)
test_X = pad_sequences(test_X, maxlen=maxlen)
```
Now the training data is a list of lists of numbers, each list having the same length. We also have the word_index, a dictionary of the most frequently occurring words in the text corpus.
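A quick sketch of what pad_sequences does (toy input, not from the article): by default it pads and truncates at the front of each sequence.

```python
from keras.preprocessing.sequence import pad_sequences

print(pad_sequences([[1, 2], [3, 4, 5, 6]], maxlen=4))
# [[0 0 1 2]
#  [3 4 5 6]]
```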
(3) Enriching the embeddings
As mentioned above, we will use GloVe Word2Vec embeddings to illustrate enrichment. The GloVe pretrained vectors are trained on very large text corpora (the 840B version used below was trained on Common Crawl).
This means some of the words that appear in our data may not be present in the embeddings. How do we handle that? First, let's load the GloVe embeddings.
```python
import numpy as np

def load_glove_index():
    EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')[:300]
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))
    return embeddings_index

glove_embedding_index = load_glove_index()
```
Make sure to set the path to the folder where you downloaded the GloVe vectors.
What does this glove_embedding_index contain? It is just a dictionary in which the keys are words and the values are the word vectors, each a np.array of length 300, and the dictionary contains roughly 2.2 million words. Since we only need embeddings for the words in our word_index, we will create a matrix that contains just the embeddings we need.
```python
from tqdm import tqdm

def create_glove(word_index, embeddings_index):
    emb_mean, emb_std = -0.005838499, 0.48782197
    all_embs = np.stack(embeddings_index.values())
    embed_size = all_embs.shape[1]
    nb_words = min(max_features, len(word_index))
    # initialize with random values drawn from the embedding distribution
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    count_found = nb_words
    for word, i in tqdm(word_index.items()):
        if i >= max_features:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
        else:
            count_found -= 1
    print("Got embedding for ", count_found, " words.")
    return embedding_matrix
```
The code above works fine, but is there a way to take advantage of the preprocessing that went into GloVe?
Yes. When GloVe was preprocessed, its creators did not convert the words to lowercase. That means it contains multiple variants of words such as "USA", "usa", and "Usa". It also means that in some cases, a word like "Word" is present while its lowercase analog, "word", is not.
We can handle this situation with the code below.
```python
def create_glove(word_index, embeddings_index):
    emb_mean, emb_std = -0.005838499, 0.48782197
    all_embs = np.stack(embeddings_index.values())
    embed_size = all_embs.shape[1]
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    count_found = nb_words
    for word, i in tqdm(word_index.items()):
        if i >= max_features:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
        else:
            if word.islower():
                # try to get the embedding of word in titlecase if lowercase is not present
                embedding_vector = embeddings_index.get(word.capitalize())
                if embedding_vector is not None:
                    embedding_matrix[i] = embedding_vector
                else:
                    count_found -= 1
            else:
                count_found -= 1
    print("Got embedding for ", count_found, " words.")
    return embedding_matrix
```
The above is just one example of how knowledge about the embeddings can be used to get better coverage. Sometimes, depending on the problem, one can also gain value by adding extra information to the embeddings using domain knowledge and NLP skills.
For example, we can add external knowledge to the embeddings themselves by adding each word's polarity and subjectivity from the TextBlob package in Python.
```python
from textblob import TextBlob

word_sent = TextBlob("good").sentiment
print(word_sent.polarity, word_sent.subjectivity)
# 0.7 0.6
```
We can get the polarity and subjectivity of any word using TextBlob, so we can try to add this extra information to the embeddings.
```python
def create_glove(word_index, embeddings_index):
    emb_mean, emb_std = -0.005838499, 0.48782197
    all_embs = np.stack(embeddings_index.values())
    embed_size = all_embs.shape[1]
    nb_words = min(max_features, len(word_index))
    # 2 extra dimensions for the polarity and subjectivity features added below
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size + 2))
    count_found = nb_words
    for word, i in tqdm(word_index.items()):
        if i >= max_features:
            continue
        embedding_vector = embeddings_index.get(word)
        word_sent = TextBlob(word).sentiment
        # Extra information we are passing to our embeddings
        extra_embed = [word_sent.polarity, word_sent.subjectivity]
        if embedding_vector is not None:
            embedding_matrix[i] = np.append(embedding_vector, extra_embed)
        else:
            if word.islower():
                embedding_vector = embeddings_index.get(word.capitalize())
                if embedding_vector is not None:
                    embedding_matrix[i] = np.append(embedding_vector, extra_embed)
                else:
                    embedding_matrix[i, 300:] = extra_embed
                    count_found -= 1
            else:
                embedding_matrix[i, 300:] = extra_embed
                count_found -= 1
    print("Got embedding for ", count_found, " words.")
    return embedding_matrix
```
Engineering the embeddings is an essential part of getting better performance out of the deep learning model later on. This part of the code is usually revisited multiple times over the course of a project while trying to improve the model further. There is a lot of room for creativity here, both in improving coverage of the word_index and in including extra features in the embeddings.
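One way to guide that creativity is to measure how much of the vocabulary the embeddings actually cover. The helper below is a hedged sketch of such a coverage check (it is not part of the article's code; vocab is the frequency dictionary produced by build_vocab earlier):

```python
def check_coverage(vocab, embeddings_index):
    # Report the share of vocabulary (and of all word occurrences) found in the embeddings.
    known_words, unknown_words = {}, {}
    nb_known, nb_unknown = 0, 0
    for word, count in vocab.items():
        if word in embeddings_index:
            known_words[word] = count
            nb_known += count
        else:
            unknown_words[word] = count
            nb_unknown += count
    print("Found embeddings for {:.2%} of the vocab".format(len(known_words) / len(vocab)))
    print("Found embeddings for {:.2%} of all text".format(nb_known / (nb_known + nb_unknown)))
    # The most frequent out-of-vocabulary words are good candidates for new cleaning rules.
    return sorted(unknown_words.items(), key=lambda x: x[1], reverse=True)

# oov = check_coverage(vocab, glove_embedding_index)
# oov[:10]
```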
More Feature Engineering
You can always add sentence-specific features, such as sentence length or the number of unique words, as another input layer that gives the deep neural network extra information.
For example, the extra features below were created as part of the feature engineering pipeline for the Quora insincerity classification challenge.
```python
# Note: progress_apply assumes tqdm's pandas integration has been registered via tqdm.pandas().
def add_features(df):
    df['question_text'] = df['question_text'].progress_apply(lambda x: str(x))
    df["lower_question_text"] = df["question_text"].apply(lambda x: x.lower())
    df['total_length'] = df['question_text'].progress_apply(len)
    df['capitals'] = df['question_text'].progress_apply(lambda comment: sum(1 for c in comment if c.isupper()))
    df['caps_vs_length'] = df.progress_apply(lambda row: float(row['capitals']) / float(row['total_length']), axis=1)
    df['num_words'] = df.question_text.str.count(r'\S+')
    df['num_unique_words'] = df['question_text'].progress_apply(lambda comment: len(set(w for w in comment.split())))
    df['words_vs_unique'] = df['num_unique_words'] / df['num_words']
    return df
```
Conclusion
NLP remains a very interesting problem space in deep learning, so I hope more people run plenty of experiments to see what works and what doesn't. This article has tried to provide a useful perspective on the preprocessing steps for deep learning neural networks applied to any NLP problem.
Original title: Text Preprocessing Methods for Deep Learning. Author: Kevin Vu