15 Techniques for Text Cleaning and Preprocessing in Python

Author: 手把手PythonAI編程 2024-12-20 13:00:00

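This article walks through 15 techniques for cleaning and preprocessing text in Python, each with a working code example.

Text cleaning and preprocessing are essential steps in natural language processing (NLP). Whether you work with social media data, news articles, or user reviews, the text has to be cleaned and preprocessed before any analysis or modeling can go smoothly. The sections below cover 15 techniques one by one, with code examples to help you understand and apply them.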

1. Removing whitespace

Whitespace characters include spaces, tabs, and newlines. They rarely change the meaning of a text, but they add noise to the data. The strip() and replace() methods handle the common cases.

text = "  Hello, World! \n"
clean_text = text.strip()  # Remove leading and trailing whitespace
print(clean_text)  # Output: Hello, World!

text_with_tabs = "Hello\tWorld!"
clean_text = text_with_tabs.replace("\t", " ")  # Replace tabs with spaces
print(clean_text)  # Output: Hello World!
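strip() only touches the ends of a string, and replace() handles one character at a time. When a text mixes spaces, tabs, and newlines, a single pass that collapses every whitespace run can be simpler. The following is a minimal sketch; the helper name normalize_whitespace is an illustrative choice, not something from the original article.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim both ends."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and never returns empty strings.
    return " ".join(text.split())


messy = "  Hello,\t\tWorld!\n\nThis   is a test. \r\n"
print(normalize_whitespace(messy))  # Output: Hello, World! This is a test.
```

Note that this also flattens paragraph breaks, so it fits best when line structure does not matter downstream.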

2. Converting to lowercase

Converting all text to lowercase avoids mismatches caused by differences in letter case.

text = "Hello, World!"
lower_text = text.lower()
print(lower_text)  # Output: hello, world!
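For text that is not plain ASCII, lower() does not always make equivalent spellings equal. Python's built-in str.casefold() applies more aggressive case folding meant for caseless matching. A small sketch, using German spellings of "street" as sample input chosen for illustration:

```python
words = ["Straße", "STRASSE", "strasse"]

# lower() keeps the sharp s, so two distinct spellings remain
lowered = {w.lower() for w in words}

# casefold() maps "ß" to "ss", so all three spellings collapse into one
folded = {w.casefold() for w in words}

print(len(lowered))  # Output: 2
print(folded)        # Output: {'strasse'}
```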

3. Removing punctuation

Punctuation usually contributes little to the meaning of a text, although in some tasks (sentiment analysis, for example) it can matter. The punctuation constant in the string module makes it easy to strip ASCII punctuation.

import string

text = "Hello, World!"
clean_text = text.translate(str.maketrans("", "", string.punctuation))
print(clean_text)  # Output: Hello World
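Because punctuation can carry signal in tasks such as sentiment analysis, it is sometimes useful to remove most of it while keeping a few characters. The sketch below assumes a hypothetical helper, strip_punctuation, with a keep parameter; both names are illustrative.

```python
import string


def strip_punctuation(text: str, keep: str = "") -> str:
    """Remove ASCII punctuation except for the characters listed in `keep`."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", to_remove))


review = "Great phone!!! But... why is the battery so bad?"
print(strip_punctuation(review))
# Output: Great phone But why is the battery so bad
print(strip_punctuation(review, keep="!?"))
# Output: Great phone!!! But why is the battery so bad?
```

string.punctuation covers ASCII characters only, so curly quotes or full-width punctuation pass through unchanged.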

4. Tokenization

Tokenization splits text into words or phrases. The word_tokenize function from the nltk library does this.

import nltk
from nltk.tokenize import word_tokenize

# Newer NLTK releases may ask for the 'punkt_tab' resource instead of 'punkt'
nltk.download('punkt')
text = "Hello, World! This is a test."
tokens = word_tokenize(text)
print(tokens)  # Output: ['Hello', ',', 'World', '!', 'This', 'is', 'a', 'test', '.']

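5. Removing stop words

Stop words are words that occur frequently but contribute little meaning, such as "the" and "is". The stopwords corpus in nltk provides a ready-made list. The list is all lowercase, so each token is lowercased for the comparison; otherwise a capitalized word such as "This" would slip through.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
tokens = ['Hello', 'World', 'This', 'is', 'a', 'test']
# Compare in lowercase because the stop word list is lowercase
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print(filtered_tokens)  # Output: ['Hello', 'World', 'test']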

6. Stemming

Stemming strips suffixes to reduce a word to its stem. The stem is not always a dictionary word, as 'easili' shows below. The PorterStemmer class in nltk implements this.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['running', 'jumps', 'easily']
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)  # Output: ['run', 'jump', 'easili']

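7. Lemmatization

Lemmatization reduces a word to its dictionary form. The WordNetLemmatizer class in nltk implements this. lemmatize() treats every word as a noun unless a part-of-speech tag is passed, which is why 'running' comes back unchanged below.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
words = ['running', 'jumps', 'easily']
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)  # Output: ['running', 'jump', 'easily']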

8. Removing digits

In many tasks, digits contribute little to the meaning of a text. A regular expression removes them easily.

import re

text = "Hello, World! 123"
clean_text = re.sub(r'\d+', '', text)
print(clean_text)  # Output: Hello, World! (a trailing space is left behind)
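The pattern \d+ removes digits wherever they occur, including inside tokens such as version names. If only standalone numbers should go, word boundaries can restrict the match. A brief sketch with a made-up sample sentence:

```python
import re

text = "Python3 released 2 updates in 2024"

# Removes every digit, including the one inside "Python3"
all_digits_removed = re.sub(r'\d+', '', text)

# Removes only tokens made up entirely of digits
standalone_removed = re.sub(r'\b\d+\b', '', text)

# Collapse the gaps left behind by the removed numbers
tidy = " ".join(standalone_removed.split())

print(repr(all_digits_removed))  # Output: 'Python released  updates in '
print(tidy)                      # Output: Python3 released updates in
```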

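9. Removing special characters

Special characters such as @, #, and $ usually contribute little to the meaning of a text. A regular expression removes them easily. The pattern below drops every character that is neither a word character nor whitespace, so ordinary punctuation is removed as well.

import re

text = "Hello, @World! #Python $123"
clean_text = re.sub(r'[^\w\s]', '', text)
print(clean_text)  # Output: Hello World Python 123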
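In social media text, the @ and # symbols mark mentions and hashtags, which may be worth keeping as separate features before the symbols are stripped. The sketch below is one possible approach; the function name split_social_markers is hypothetical.

```python
import re


def split_social_markers(text: str):
    """Return (body, mentions, hashtags) for a social media message."""
    mentions = re.findall(r'@(\w+)', text)
    hashtags = re.findall(r'#(\w+)', text)
    # Drop whole @mention and #hashtag tokens first.
    # Caveat: an email address such as user@example.com would also match.
    body = re.sub(r'[@#]\w+', '', text)
    # Then remove the remaining special characters and tidy the spacing
    body = re.sub(r'[^\w\s]', '', body)
    body = " ".join(body.split())
    return body, mentions, hashtags


body, mentions, hashtags = split_social_markers("Hello, @World! #Python $123")
print(body)      # Output: Hello 123
print(mentions)  # Output: ['World']
print(hashtags)  # Output: ['Python']
```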

10. Removing HTML tags

Text that comes from web pages may contain HTML tags. The BeautifulSoup library removes them easily.

from bs4 import BeautifulSoup

html_text = "<html><body><h1>Hello, World!</h1></body></html>"
soup = BeautifulSoup(html_text, 'html.parser')
clean_text = soup.get_text()
print(clean_text)  # Output: Hello, World!
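Where installing BeautifulSoup is not an option, the standard library's html.parser module can do a rougher version of the same job. This is a swapped-in technique rather than the article's method, and the class and function names (TextExtractor, strip_html) are illustrative. It is a sketch that skips script and style content and decodes entities, not a full HTML sanitizer.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html_text: str) -> str:
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Join the text nodes and collapse whitespace between them
    return " ".join(" ".join(parser.parts).split())


print(strip_html("<h1>Hello, World!</h1><script>alert(1)</script><p>Tom &amp; Jerry</p>"))
# Output: Hello, World! Tom & Jerry
```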

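11. Removing URLs

URLs usually contribute little to the meaning of a text. A regular expression removes them easily. The dot after www is escaped so that it matches a literal dot rather than any character.

import re

text = "Check out this link: https://example.com"
clean_text = re.sub(r'http\S+|www\.\S+', '', text)
print(clean_text)  # Output: Check out this link: (a trailing space is left behind)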

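12. Removing duplicate words

Duplicate words can add noise to a text. A set removes them easily. A set does not preserve order, so the result is sorted here to make the output reproducible.

tokens = ['Hello', 'World', 'Hello', 'Python', 'Python']
unique_tokens = sorted(set(tokens))  # set() drops duplicates, sorted() fixes the order
print(unique_tokens)  # Output: ['Hello', 'Python', 'World']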
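When the original word order matters, two standard-library alternatives may fit better than a set. The sketch below shows order-preserving de-duplication with dict.fromkeys and, separately, collapsing only consecutive repeats with itertools.groupby; the sample token lists are made up for illustration.

```python
from itertools import groupby

tokens = ['Hello', 'World', 'Hello', 'Python', 'Python']

# Dicts keep insertion order, so this keeps the first occurrence of each word
unique_in_order = list(dict.fromkeys(tokens))
print(unique_in_order)  # Output: ['Hello', 'World', 'Python']

# Collapse only adjacent repeats ("the the cat") and keep later reuse of a word
stutter = ['the', 'the', 'cat', 'sat', 'sat', 'the', 'mat']
collapsed = [key for key, _group in groupby(stutter)]
print(collapsed)  # Output: ['the', 'cat', 'sat', 'the', 'mat']
```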

13. Removing short words

Very short words usually contribute little to the meaning of a text. Set a threshold and drop every word shorter than it.

tokens = ['Hello', 'World', 'a', 'is', 'Python']
min_length = 3
filtered_tokens = [token for token in tokens if len(token) >= min_length]
print(filtered_tokens)  # Output: ['Hello', 'World', 'Python']

14. Removing rare words

Rare words can add noise to a text. Set a frequency threshold and drop every word that occurs fewer times than it.

from collections import Counter

tokens = ['Hello', 'World', 'Hello', 'Python', 'Python', 'test', 'test', 'test']
word_counts = Counter(tokens)
min_frequency = 2
filtered_tokens = [token for token in tokens if word_counts[token] >= min_frequency]
print(filtered_tokens)  # Output: ['Hello', 'Hello', 'Python', 'Python', 'test', 'test', 'test']
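In a dataset with many documents, frequencies are usually more meaningful when counted across the whole collection, and some pipelines replace rare words with a placeholder instead of deleting them. The following sketch assumes a hypothetical helper, replace_rare, and an "<UNK>" placeholder; both are illustrative choices.

```python
from collections import Counter


def replace_rare(docs, min_frequency=2, unk="<UNK>"):
    """Replace tokens seen fewer than `min_frequency` times across all docs."""
    counts = Counter(token for doc in docs for token in doc)
    return [
        [token if counts[token] >= min_frequency else unk for token in doc]
        for doc in docs
    ]


docs = [['hello', 'world'], ['hello', 'python'], ['python', 'rocks']]
print(replace_rare(docs))
# Output: [['hello', '<UNK>'], ['hello', 'python'], ['python', '<UNK>']]
```

Keeping a placeholder preserves sentence length and position, which some downstream models rely on.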

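15. Complex cleaning with regular expressions

Regular expressions are a powerful tool for more complex cleaning tasks, such as finding strings that follow a specific pattern and removing them or replacing them with a placeholder.

import re

text = "Hello, World! 123-456-7890"
clean_text = re.sub(r'\d{3}-\d{3}-\d{4}', 'PHONE', text)
print(clean_text)  # Output: Hello, World! PHONE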
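The same idea extends to several patterns applied in sequence. The sketch below masks email addresses, URLs, and phone numbers with placeholder tokens. The patterns are deliberately simple illustrations rather than complete validators, and the names PATTERNS and mask_patterns are hypothetical.

```python
import re

# (compiled pattern, placeholder) pairs, applied in this order
PATTERNS = [
    (re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'), 'EMAIL'),
    (re.compile(r'https?://\S+'), 'URL'),
    (re.compile(r'\b\d{3}-\d{3}-\d{4}\b'), 'PHONE'),
]


def mask_patterns(text: str) -> str:
    """Replace each matched pattern with its placeholder token."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


sample = "Mail bob@example.com or call 123-456-7890, see https://example.com/help"
print(mask_patterns(sample))
# Output: Mail EMAIL or call PHONE, see URL
```

Compiling the patterns once avoids re-parsing them on every call when the function runs over a large dataset.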

Case study: cleaning social media comments

Suppose you have a dataset of social media comments that needs cleaning and preprocessing. The example below combines the techniques above to do the job.

import re
from collections import Counter

import nltk
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the required NLTK resources
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Sample data
data = {
    'comment': [
        "Check out this link: https://example.com",
        "Hello, @World! #Python $123",
        "<html><body><h1>Hello, World!</h1></body></html>",
        "Running jumps easily 123-456-7890"
    ]
}

df = pd.DataFrame(data)

# Build these once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Remove HTML tags
    text = BeautifulSoup(text, 'html.parser').get_text()

    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)

    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)

    # Remove digits
    text = re.sub(r'\d+', '', text)

    # Convert to lowercase
    text = text.lower()

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stop words
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Remove short words and return the token list
    return [token for token in tokens if len(token) >= 3]

# Apply the cleaning function to every comment
token_lists = df['comment'].apply(clean_text)

# Remove rare words. Counts are taken over the whole dataset, because inside
# a single short comment almost every word occurs only once.
word_counts = Counter(token for tokens in token_lists for token in tokens)
min_frequency = 2
df['cleaned_comment'] = token_lists.apply(
    lambda tokens: ' '.join(t for t in tokens if word_counts[t] >= min_frequency)
)
print(df)

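Summary

This article covered 15 techniques for cleaning and preprocessing text in Python: removing whitespace, converting to lowercase, removing punctuation, tokenization, removing stop words, stemming, lemmatization, removing digits, removing special characters, removing HTML tags, removing URLs, removing duplicate words, removing short words, removing rare words, and complex cleaning with regular expressions. Each technique came with a code example, and the case study combined them to clean and preprocess social media comments.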

Editor: 趙寧寧 Source: 手把手PythonAI編程
