How to Clean Text Data with Python
Not all data comes in a tabular format. As we enter the era of big data, data takes many forms, including images, text, graphs, and more.
Because these formats vary so widely from one dataset to another, preprocessing the data into a format a computer can read is essential.
In this article, I will show how to preprocess text data with Python, using the NLTK library and Python's built-in `re` (regular expressions) module.

The Process
1. Lowercase the text
Before we start processing the text, it is best to lowercase all of its characters, so that later steps do not have to deal with case sensitivity.
Suppose we want to remove the stopwords from a string, keeping only the non-stopwords and joining them back into a sentence. If the text is not lowercased first, capitalized stopwords such as "The" will not match the lowercase entries in the stopword list, so they survive the filter and the string comes out unchanged. That is why lowercasing the text matters.
This is easy to do in Python:
```python
# Example
x = "Watch This Airport Get Swallowed Up By A Sandstorm In Under A Minute http://t.co/TvYQczGJdy"

# Lowercase the text
x = x.lower()
print(x)
>>> watch this airport get swallowed up by a sandstorm in under a minute http://t.co/tvyqczgjdy
```
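To see why lowercasing matters for the stopword step described above, here is a minimal sketch using a tiny hand-made stopword set (NLTK's real list is lowercase as well):

```python
stop_words = {"the", "is", "a"}  # tiny illustrative stopword set; NLTK's list is also all lowercase

x = "The Storm Is Coming"

# Without lowercasing, "The" and "Is" do not match the lowercase stopword list
print(" ".join(w for w in x.split() if w not in stop_words))
# >>> The Storm Is Coming

# After lowercasing, the stopwords are detected and removed
print(" ".join(w for w in x.lower().split() if w not in stop_words))
# >>> storm coming
```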
2. Remove Unicode characters
Some texts may contain Unicode characters that are unreadable when viewed as ASCII; most often these are emoji or other non-ASCII symbols. To remove them, we can use code like this:
```python
# Example
x = "Reddit Will Now QuarantineÛ_ http://t.co/pkUAMXw6pm #onlinecommunities #reddit #amageddon #freespeech #Business http://t.co/PAWvNJ4sAP"

# Remove unicode characters
x = x.encode('ascii', 'ignore').decode()
print(x)
>>> Reddit Will Now Quarantine_ http://t.co/pkUAMXw6pm #onlinecommunities #reddit #amageddon #freespeech #Business http://t.co/PAWvNJ4sAP
```
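Note that `encode('ascii', 'ignore')` throws accented letters away entirely. If you would rather keep an ASCII approximation of accented text, one option (not used in this article's pipeline) is to decompose the characters first with the standard library's `unicodedata` module:

```python
import unicodedata

x = "café naïve résumé"

# encode/ignore alone drops the accented letters
print(x.encode('ascii', 'ignore').decode())
# >>> caf nave rsum

# NFKD normalization decomposes é into e plus a combining accent,
# so the base letter survives the ASCII round trip
print(unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode())
# >>> cafe naive resume
```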
3. Remove stopwords
Stopwords are words that do not contribute significantly to the meaning of a text, so we can remove them. To obtain the stopword list, we can download it from the NLTK library. Here is the code:
```python
import nltk
nltk.download('stopwords')  # or nltk.download() to fetch everything
from nltk.corpus import stopwords

stop_words = stopwords.words("english")

# Example
x = "America like South Africa is a traumatised sick country - in different ways of course - but still messed up."

# Remove stopwords
x = ' '.join([word for word in x.split(' ') if word not in stop_words])
print(x)
>>> America like South Africa traumatised sick country - different ways course - still messed up.
```
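Notice that "up." survives in the output above: splitting on spaces leaves punctuation attached to each token, so "up." does not equal the stopword "up". A minimal sketch of this effect, using a tiny hand-made stopword set:

```python
stop_words = {"up", "is", "a", "but"}  # tiny illustrative stopword set

x = "but still messed up."

# "up." keeps its trailing period, so it does not match the stopword "up"
print(" ".join(w for w in x.split() if w not in stop_words))
# >>> still messed up.

# Stripping punctuation from each token first lets the stopword match
print(" ".join(w for w in x.split() if w.strip(".,!?") not in stop_words))
# >>> still messed
```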
4. Remove mentions, hashtags, links, and similar terms
Besides Unicode characters and stopwords, there are several other terms we need to remove, including mentions, hashtags, links, and punctuation.
These are hard to remove if we rely only on a fixed set of characters, so instead we use regular expressions (regex) to match the patterns of the terms we want to remove.
A regex is a special string that describes a pattern and matches the words associated with that pattern. We can search for or remove such patterns using Python's built-in `re` module. Here is the code:
```python
import re

# Remove mentions
x = "@DDNewsLive @NitishKumar and @ArvindKejriwal can't survive without referring @@narendramodi . Without Mr Modi they are BIG ZEROS"
x = re.sub(r"@\S+", " ", x)
print(x)
>>> and can't survive without referring . Without Mr Modi they are BIG ZEROS

# Remove URLs
x = "Severe Thunderstorm pictures from across the Mid-South http://t.co/UZWLgJQzNS"
x = re.sub(r"https*\S+", " ", x)
print(x)
>>> Severe Thunderstorm pictures from across the Mid-South

# Remove hashtags
x = "Are people not concerned that after #SLAB's obliteration in Scotland #Labour UK is ripping itself apart over #Labourleadership contest?"
x = re.sub(r"#\S+", " ", x)
print(x)
>>> Are people not concerned that after obliteration in Scotland UK is ripping itself apart over contest?

# Remove apostrophes and the characters that follow them
x = "Notley's tactful yet very direct response to Harper's attack on Alberta's gov't. Hell YEAH Premier! http://t.co/rzSUlzMOkX #ableg #cdnpoli"
x = re.sub(r"\'\w+", '', x)
print(x)
>>> Notley tactful yet very direct response to Harper attack on Alberta gov. Hell YEAH Premier! http://t.co/rzSUlzMOkX #ableg #cdnpoli
```
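One caveat with the `\'\w+` pattern: it strips the apostrophe and everything after it, which also rewrites contractions. "can't" becomes "can", which can flip the meaning of a sentence, so decide whether that trade-off is acceptable for your task. A quick check:

```python
import re

x = "Harper's attack can't stop"
# Removing the apostrophe and trailing letters turns the negation "can't" into "can"
print(re.sub(r"\'\w+", "", x))
# >>> Harper attack can stop
```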
```python
import string

# Remove punctuation
x = "In 2014 I will only smoke crqck if I becyme a mayor. This includes Foursquare."
x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
print(x)
>>> In 2014 I will only smoke crqck if I becyme a mayor  This includes Foursquare

# Remove numbers (and any word containing a digit)
x = "C-130 specially modified to land in a stadium and rescue hostages in Iran in 1980... http://t.co/tNI92fea3u http://t.co/czBaMzq3gL"
x = re.sub(r'\w*\d+\w*', '', x)
print(x)
>>> C- specially modified to land in a stadium and rescue hostages in Iran in ... http://t.co/ http://t.co/

# Collapse repeated whitespace into a single space
x = " and can't survive without referring . Without Mr Modi they are BIG ZEROS"
x = re.sub(r'\s{2,}', " ", x)
print(x)
>>> and can't survive without referring . Without Mr Modi they are BIG ZEROS
```
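When cleaning many tweets, the same patterns run thousands of times, so it can be worth compiling each one once with `re.compile` instead of re-parsing it on every call. A small sketch chaining the mention, URL, and hashtag patterns (the pattern names here are my own):

```python
import re

# Compile each pattern once; compiled patterns expose the same .sub() method
MENTION = re.compile(r"@\S+")
URL = re.compile(r"https*\S+")
HASHTAG = re.compile(r"#\S+")

x = "@user check https://t.co/abc #news now"
for pattern in (MENTION, URL, HASHTAG):
    x = pattern.sub(" ", x)
x = re.sub(r"\s{2,}", " ", x).strip()
print(x)
# >>> check now
```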
5. Combine everything into one function
Now that we have seen each preprocessing step, let's apply them to a list of texts. If you look closely, you will notice that the steps build on one another, so we should wrap them in a single function that runs them in sequence. Here are the sample texts before preprocessing:
- Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
- Forest fire near La Ronge Sask. Canada
- All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
- 13,000 people receive #wildfires evacuation orders in California
- Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school
To preprocess the list of texts, we take two steps:
- Create a function that contains all of the preprocessing steps and returns a preprocessed string.
- Apply the function to every text, here using the pandas method called `apply`, which maps the function over each entry of a column.
The code looks like this:
```python
# In case any import fails:
# ! pip install nltk
# ! pip install textblob
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# If a corpus is missing, download all of NLTK:
# nltk.download()

df = pd.read_csv('train.csv')
stop_words = stopwords.words("english")
wordnet = WordNetLemmatizer()

def text_preproc(x):
    x = x.lower()
    x = ' '.join([word for word in x.split(' ') if word not in stop_words])
    x = x.encode('ascii', 'ignore').decode()
    x = re.sub(r'https*\S+', ' ', x)
    x = re.sub(r'@\S+', ' ', x)
    x = re.sub(r'#\S+', ' ', x)
    x = re.sub(r'\'\w+', '', x)
    x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
    x = re.sub(r'\w*\d+\w*', '', x)
    x = re.sub(r'\s{2,}', ' ', x)
    return x.strip()  # drop the leading/trailing spaces left by the substitutions

df['clean_text'] = df.text.apply(text_preproc)
```
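pandas' `apply` simply runs `text_preproc` on every entry of the `text` column; the same pipeline also works on a plain Python list with a list comprehension. A self-contained sketch, using a small hand-made stopword set so the snippet does not depend on the NLTK download:

```python
import re
import string

stop_words = {"are", "the", "of", "this", "is", "to", "in"}  # illustrative subset of NLTK's list

def text_preproc(x):
    # Same sequence of steps as the pandas version above
    x = x.lower()
    x = ' '.join(word for word in x.split(' ') if word not in stop_words)
    x = x.encode('ascii', 'ignore').decode()
    x = re.sub(r'https*\S+', ' ', x)
    x = re.sub(r'@\S+', ' ', x)
    x = re.sub(r'#\S+', ' ', x)
    x = re.sub(r'\'\w+', '', x)
    x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
    x = re.sub(r'\w*\d+\w*', '', x)
    x = re.sub(r'\s{2,}', ' ', x)
    return x.strip()

texts = ["Forest fire near La Ronge Sask. Canada"]
print([text_preproc(t) for t in texts])
# >>> ['forest fire near la ronge sask canada']
```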
The preprocessed texts now look like this:
- deeds reason may allah forgive us
- forest fire near la ronge sask canada
- residents asked place notified officers evacuation shelter place orders expected
- people receive evacuation orders california
- got sent photo ruby smoke pours school
Final Thoughts
Those are the steps for preprocessing text data with Python. I hope they help you tackle problems involving text data, making your text more consistent and your models more accurate.