How to Clean Text Data with Python
Not all data comes in a tabular format. As we enter the era of big data, data takes many forms, including images, text, graphs, and more.
Because these formats vary so widely from one dataset to another, preprocessing the data into a format a computer can read is essential.
In this article, I will show how to preprocess text data with Python, using the NLTK library and Python's built-in `re` (regular expressions) module.

The Process
1. Lowercase the text
Before we start processing the text, it is best to lowercase all of its characters, so that later steps do not have to deal with case sensitivity.
Suppose we want to remove the stopwords from a string, keeping only the non-stopwords and joining them back into a sentence. If the text is not lowercased first, capitalized stopwords such as "The" will not match the lowercase entries in the stopword list, so they survive the filter and the string comes out unchanged. That is why lowercasing the text matters.
This is easy to do in Python:
```python
# Example
x = "Watch This Airport Get Swallowed Up By A Sandstorm In Under A Minute http://t.co/TvYQczGJdy"

# Lowercase the text
x = x.lower()
print(x)
>>> watch this airport get swallowed up by a sandstorm in under a minute http://t.co/tvyqczgjdy
```
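To see why lowercasing matters for the stopword step described above, here is a minimal sketch using a tiny hand-made stopword set (NLTK's real list is lowercase as well):

```python
stop_words = {"the", "is", "a"}  # tiny illustrative stopword set; NLTK's list is also all lowercase

x = "The Storm Is Coming"

# Without lowercasing, "The" and "Is" do not match the lowercase stopword list
print(" ".join(w for w in x.split() if w not in stop_words))
# >>> The Storm Is Coming

# After lowercasing, the stopwords are detected and removed
print(" ".join(w for w in x.lower().split() if w not in stop_words))
# >>> storm coming
```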
2. Remove Unicode characters
Some texts may contain Unicode characters that are unreadable when viewed as ASCII; most often these are emoji or other non-ASCII symbols. To remove them, we can use code like this:
```python
# Example
x = "Reddit Will Now QuarantineÛ_ http://t.co/pkUAMXw6pm #onlinecommunities #reddit #amageddon #freespeech #Business http://t.co/PAWvNJ4sAP"

# Remove unicode characters
x = x.encode('ascii', 'ignore').decode()
print(x)
>>> Reddit Will Now Quarantine_ http://t.co/pkUAMXw6pm #onlinecommunities #reddit #amageddon #freespeech #Business http://t.co/PAWvNJ4sAP
```
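Note that `encode('ascii', 'ignore')` throws accented letters away entirely. If you would rather keep an ASCII approximation of accented text, one option (not used in this article's pipeline) is to decompose the characters first with the standard library's `unicodedata` module:

```python
import unicodedata

x = "café naïve résumé"

# encode/ignore alone drops the accented letters
print(x.encode('ascii', 'ignore').decode())
# >>> caf nave rsum

# NFKD normalization decomposes é into e plus a combining accent,
# so the base letter survives the ASCII round trip
print(unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode())
# >>> cafe naive resume
```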
3. Remove stopwords
Stopwords are words that do not contribute significantly to the meaning of a text, so we can remove them. To obtain the stopword list, we can download it from the NLTK library. Here is the code:
```python
import nltk
nltk.download('stopwords')  # or nltk.download() to fetch everything
from nltk.corpus import stopwords

stop_words = stopwords.words("english")

# Example
x = "America like South Africa is a traumatised sick country - in different ways of course - but still messed up."

# Remove stopwords
x = ' '.join([word for word in x.split(' ') if word not in stop_words])
print(x)
>>> America like South Africa traumatised sick country - different ways course - still messed up.
```
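Notice that "up." survives in the output above: splitting on spaces leaves punctuation attached to each token, so "up." does not equal the stopword "up". A minimal sketch of this effect, using a tiny hand-made stopword set:

```python
stop_words = {"up", "is", "a", "but"}  # tiny illustrative stopword set

x = "but still messed up."

# "up." keeps its trailing period, so it does not match the stopword "up"
print(" ".join(w for w in x.split() if w not in stop_words))
# >>> still messed up.

# Stripping punctuation from each token first lets the stopword match
print(" ".join(w for w in x.split() if w.strip(".,!?") not in stop_words))
# >>> still messed
```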
4. Remove mentions, hashtags, links, and similar terms
Besides Unicode characters and stopwords, there are several other terms we need to remove, including mentions, hashtags, links, and punctuation.
These are hard to remove if we rely only on a fixed set of characters, so instead we use regular expressions (regex) to match the patterns of the terms we want to remove.
A regex is a special string that describes a pattern and matches the words associated with that pattern. We can search for or remove such patterns using Python's built-in `re` module. Here is the code:
```python
import re

# Remove mentions
x = "@DDNewsLive @NitishKumar and @ArvindKejriwal can't survive without referring @@narendramodi . Without Mr Modi they are BIG ZEROS"
x = re.sub(r"@\S+", " ", x)
print(x)
>>> and can't survive without referring . Without Mr Modi they are BIG ZEROS

# Remove URLs
x = "Severe Thunderstorm pictures from across the Mid-South http://t.co/UZWLgJQzNS"
x = re.sub(r"https*\S+", " ", x)
print(x)
>>> Severe Thunderstorm pictures from across the Mid-South

# Remove hashtags
x = "Are people not concerned that after #SLAB's obliteration in Scotland #Labour UK is ripping itself apart over #Labourleadership contest?"
x = re.sub(r"#\S+", " ", x)
print(x)
>>> Are people not concerned that after obliteration in Scotland UK is ripping itself apart over contest?

# Remove apostrophes and the characters that follow them
x = "Notley's tactful yet very direct response to Harper's attack on Alberta's gov't. Hell YEAH Premier! http://t.co/rzSUlzMOkX #ableg #cdnpoli"
x = re.sub(r"\'\w+", '', x)
print(x)
>>> Notley tactful yet very direct response to Harper attack on Alberta gov. Hell YEAH Premier! http://t.co/rzSUlzMOkX #ableg #cdnpoli
```
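One caveat with the `\'\w+` pattern: it strips the apostrophe and everything after it, which also rewrites contractions. "can't" becomes "can", which can flip the meaning of a sentence, so decide whether that trade-off is acceptable for your task. A quick check:

```python
import re

x = "Harper's attack can't stop"
# Removing the apostrophe and trailing letters turns the negation "can't" into "can"
print(re.sub(r"\'\w+", "", x))
# >>> Harper attack can stop
```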
```python
import string

# Remove punctuation
x = "In 2014 I will only smoke crqck if I becyme a mayor. This includes Foursquare."
x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
print(x)
>>> In 2014 I will only smoke crqck if I becyme a mayor  This includes Foursquare

# Remove numbers (and any word containing a digit)
x = "C-130 specially modified to land in a stadium and rescue hostages in Iran in 1980... http://t.co/tNI92fea3u http://t.co/czBaMzq3gL"
x = re.sub(r'\w*\d+\w*', '', x)
print(x)
>>> C- specially modified to land in a stadium and rescue hostages in Iran in ... http://t.co/ http://t.co/

# Collapse repeated whitespace into a single space
x = " and can't survive without referring . Without Mr Modi they are BIG ZEROS"
x = re.sub(r'\s{2,}', " ", x)
print(x)
>>> and can't survive without referring . Without Mr Modi they are BIG ZEROS
```
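When cleaning many tweets, the same patterns run thousands of times, so it can be worth compiling each one once with `re.compile` instead of re-parsing it on every call. A small sketch chaining the mention, URL, and hashtag patterns (the pattern names here are my own):

```python
import re

# Compile each pattern once; compiled patterns expose the same .sub() method
MENTION = re.compile(r"@\S+")
URL = re.compile(r"https*\S+")
HASHTAG = re.compile(r"#\S+")

x = "@user check https://t.co/abc #news now"
for pattern in (MENTION, URL, HASHTAG):
    x = pattern.sub(" ", x)
x = re.sub(r"\s{2,}", " ", x).strip()
print(x)
# >>> check now
```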
5. Combine everything into one function
Now that we have seen each preprocessing step, let's apply them to a list of texts. If you look closely, you will notice that the steps build on one another, so we should wrap them in a single function that runs them in sequence. Here are the sample texts before preprocessing:
- Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
- Forest fire near La Ronge Sask. Canada
- All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
- 13,000 people receive #wildfires evacuation orders in California
- Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school
To preprocess the list of texts, we take two steps:
- Create a function that contains all of the preprocessing steps and returns a preprocessed string.
- Apply the function to every text, here using the pandas method called `apply`, which maps the function over each entry of a column.
The code looks like this:
```python
# In case any import fails:
# ! pip install nltk
# ! pip install textblob
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# If a corpus is missing, download all of NLTK:
# nltk.download()

df = pd.read_csv('train.csv')
stop_words = stopwords.words("english")
wordnet = WordNetLemmatizer()

def text_preproc(x):
    x = x.lower()
    x = ' '.join([word for word in x.split(' ') if word not in stop_words])
    x = x.encode('ascii', 'ignore').decode()
    x = re.sub(r'https*\S+', ' ', x)
    x = re.sub(r'@\S+', ' ', x)
    x = re.sub(r'#\S+', ' ', x)
    x = re.sub(r'\'\w+', '', x)
    x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
    x = re.sub(r'\w*\d+\w*', '', x)
    x = re.sub(r'\s{2,}', ' ', x)
    return x.strip()  # drop the leading/trailing spaces left by the substitutions

df['clean_text'] = df.text.apply(text_preproc)
```
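pandas' `apply` simply runs `text_preproc` on every entry of the `text` column; the same pipeline also works on a plain Python list with a list comprehension. A self-contained sketch, using a small hand-made stopword set so the snippet does not depend on the NLTK download:

```python
import re
import string

stop_words = {"are", "the", "of", "this", "is", "to", "in"}  # illustrative subset of NLTK's list

def text_preproc(x):
    # Same sequence of steps as the pandas version above
    x = x.lower()
    x = ' '.join(word for word in x.split(' ') if word not in stop_words)
    x = x.encode('ascii', 'ignore').decode()
    x = re.sub(r'https*\S+', ' ', x)
    x = re.sub(r'@\S+', ' ', x)
    x = re.sub(r'#\S+', ' ', x)
    x = re.sub(r'\'\w+', '', x)
    x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
    x = re.sub(r'\w*\d+\w*', '', x)
    x = re.sub(r'\s{2,}', ' ', x)
    return x.strip()

texts = ["Forest fire near La Ronge Sask. Canada"]
print([text_preproc(t) for t in texts])
# >>> ['forest fire near la ronge sask canada']
```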
The preprocessed texts now look like this:
- deeds reason may allah forgive us
- forest fire near la ronge sask canada
- residents asked place notified officers evacuation shelter place orders expected
- people receive evacuation orders california
- got sent photo ruby smoke pours school
Final Thoughts
Those are the steps for preprocessing text data with Python. I hope they help you tackle problems involving text data, making your text more consistent and your models more accurate.