12 Practical Natural Language Processing Cases with Python
Natural language processing (NLP) is a major branch of artificial intelligence that enables computers to understand, interpret, and generate human language. Python is one of the preferred languages for NLP thanks to its rich ecosystem of libraries and tools. In this article we walk through 12 practical NLP cases to help you understand and apply NLP techniques.
1. Text Preprocessing
Text preprocessing is the first step in NLP and includes operations such as removing punctuation, lowercasing, and tokenization.
import string

def preprocess_text(text):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Convert to lowercase
    text = text.lower()
    # Tokenize with a naive whitespace split
    words = text.split()
    return words

text = "Hello, World! This is a test."
preprocessed_text = preprocess_text(text)
print(preprocessed_text)  # Output: ['hello', 'world', 'this', 'is', 'a', 'test']
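The whitespace split above is intentionally naive. As an alternative sketch, the same result can be obtained with NLTK's word_tokenize, which handles punctuation and contractions more robustly (this assumes nltk is installed and its 'punkt' tokenizer data has been downloaded):

import string
import nltk
# nltk.download('punkt')  # required once before first use
from nltk.tokenize import word_tokenize

text = "Hello, World! This is a test."
tokens = [t.lower() for t in word_tokenize(text) if t not in string.punctuation]
print(tokens)  # ['hello', 'world', 'this', 'is', 'a', 'test']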
2. Stemming and Lemmatization
Stemming and lemmatization are techniques for reducing words to a base form.
from nltk.stem import PorterStemmer, WordNetLemmatizer
# nltk.download('wordnet')  # required once before first use

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["running", "jumps", "better", "worse"]
stemmed_words = [stemmer.stem(word) for word in words]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(stemmed_words)     # Output: ['run', 'jump', 'better', 'wors']
print(lemmatized_words)  # Output: ['running', 'jump', 'better', 'worse']
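Note that the stemmer chops suffixes mechanically ("worse" becomes "wors"), while the lemmatizer defaults to treating every word as a noun, which is why "running" passes through unchanged. Supplying a part-of-speech tag gives the lemmatizer much better results:

# Lemmatization is POS-sensitive: pos='v' (verb) and pos='a' (adjective)
# let WordNet map inflected forms back to their dictionary entries
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("worse", pos="a"))    # bad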
3. Stop Word Removal
Stop words are words that appear frequently in text but contribute little to its meaning, such as "the" and "is".
from nltk.corpus import stopwords
# nltk.download('stopwords')  # required once before first use

stop_words = set(stopwords.words('english'))

def remove_stopwords(words):
    filtered_words = [word for word in words if word not in stop_words]
    return filtered_words

words = ["this", "is", "a", "test", "of", "stop", "words"]
filtered_words = remove_stopwords(words)
print(filtered_words)  # Output: ['test', 'stop', 'words']
4. Word Frequency Counting
Word frequency counts help identify the most common words in a text.
from collections import Counter
words = ["apple", "banana", "apple", "orange", "banana", "banana"]
word_counts = Counter(words)
print(word_counts)  # Output: Counter({'banana': 3, 'apple': 2, 'orange': 1})
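Counter also provides most_common for retrieving the top-k words directly:

# The two most frequent words, as (word, count) pairs
print(word_counts.most_common(2))  # [('banana', 3), ('apple', 2)]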
5. TF-IDF Computation
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to a document within a collection: a term's weight grows with its frequency in the document and shrinks the more documents it appears in.
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
    "The cat in the hat.",
    "A cat is a fine pet.",
    "Dogs and cats make good pets."
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
print(tfidf_matrix.toarray())
# Output: a 3x13 array of TF-IDF weights
# (one row per document, one column per vocabulary term)
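To interpret the columns, get_feature_names_out (available in scikit-learn 1.0+) maps each column index back to its term:

# Map matrix columns back to vocabulary terms (alphabetical order)
print(vectorizer.get_feature_names_out())
# ['and' 'cat' 'cats' 'dogs' 'fine' 'good' 'hat' 'in' 'is' 'make' 'pet'
#  'pets' 'the']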
6. Named Entity Recognition (NER)
Named entity recognition (NER) identifies specific types of entities in text, such as person names, place names, and organization names.
import spacy
# Run `python -m spacy download en_core_web_sm` once if the model is missing
nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
# Output:
# Apple ORG
# U.K. GPE
# $1 billion MONEY
7. Sentiment Analysis
Sentiment analysis determines the emotional orientation of a text: positive, negative, or neutral.
from textblob import TextBlob
text = "I love this movie!"
blob = TextBlob(text)
sentiment = blob.sentiment.polarity
print(sentiment)  # A positive score; polarity ranges from -1.0 (negative) to 1.0 (positive)
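The continuous polarity score is often bucketed into discrete labels. The threshold of 0.1 below is an arbitrary choice for illustration, not a TextBlob convention:

def polarity_to_label(polarity, threshold=0.1):
    # Bucket the continuous score into three coarse classes;
    # the 0.1 cutoff is arbitrary and should be tuned on real data
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

print(polarity_to_label(blob.sentiment.polarity))  # positive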
8. Text Classification
Text classification assigns text to predefined categories, e.g. spam detection or news categorization.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Load the dataset (downloaded automatically on first use)
data = fetch_20newsgroups(subset='all')
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
# Build a pipeline: vectorization followed by classification
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
# Train the model
pipeline.fit(X_train, y_train)
# Predict on the test set
y_pred = pipeline.predict(X_test)
# Evaluate
print(classification_report(y_test, y_pred, target_names=data.target_names))
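Because vectorization is part of the pipeline, the trained model can classify raw strings directly. The sample sentence and its expected category below are illustrative:

# Classify a new, unseen document; the pipeline vectorizes it automatically
sample = ["The rocket launch was delayed due to bad weather."]
predicted = pipeline.predict(sample)
print(data.target_names[predicted[0]])  # likely 'sci.space'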
9. Text Clustering
Text clustering groups similar texts together.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
documents = [
    "The cat in the hat.",
    "A cat is a fine pet.",
    "Dogs and cats make good pets.",
    "I love my pet dog.",
    "My dog loves to play with cats."
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)  # fixed seed for reproducible clusters
kmeans.fit(tfidf_matrix)
labels = kmeans.labels_
print(labels)  # e.g. [0 0 1 1 1]; cluster IDs are arbitrary and depend on initialization
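To see what each cluster is about, you can rank the vocabulary terms by their weight in each cluster centroid:

import numpy as np

# Show the heaviest vocabulary terms in each cluster centroid
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(kmeans.cluster_centers_):
    top = np.argsort(center)[::-1][:3]  # indices of the 3 largest weights
    print(f"Cluster {i}: {[terms[j] for j in top]}")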
10. Machine Translation
Machine translation automatically translates text from one language into another.
from transformers import MarianMTModel, MarianTokenizer
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
text = "I love this movie!"
translated = model.generate(**tokenizer(text, return_tensors="pt", padding=True))
translated_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
print(translated_text)  # Output: ['Ich liebe diesen Film!']
11. Question Answering
A question answering system answers questions based on a given context.
from transformers import pipeline
qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
context = "The Transformer is a deep learning model introduced in 2017 by Vaswani et al. that revolutionized natural language processing."
question = "What year was the Transformer model introduced?"
answer = qa_pipeline(question=question, context=context)
print(answer['answer'])  # Output: 2017
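The pipeline returns more than the answer string; the full result also carries a confidence score and character offsets into the context:

# The full result includes a confidence score and character offsets
print(answer)
# {'score': <confidence>, 'start': <char offset>, 'end': <char offset>, 'answer': '2017'}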
12. Text Generation
Text generation uses a model to produce new text, such as poems or stories.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
prompt = "Once upon a time"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_length=50, num_return_sequences=1)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
# Output: a plausible continuation of the prompt. Greedy decoding is
# deterministic for a given model, but the exact text depends on the model weights.
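Greedy decoding always yields the same continuation. Enabling sampling produces varied text on each run; the top_k and temperature values below are common starting points, not tuned settings:

# Sample from the model's distribution instead of taking the argmax token
output = model.generate(
    input_ids,
    max_length=50,
    do_sample=True,        # enable sampling for varied output
    top_k=50,              # restrict sampling to the 50 most likely tokens
    temperature=0.9,       # slightly soften the distribution
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
)
print(tokenizer.decode(output[0], skip_special_tokens=True))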
Hands-On Case Study: Spam Detection
Suppose you are an email service provider and need a system that detects and filters spam. We will implement it with a naive Bayes classifier.
Data Preparation
First we need training data. Since the 20newsgroups dataset provided by sklearn contains no real spam, we use two of its categories as stand-ins for "ham" and "spam" to illustrate the workflow.
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

# Load two categories as stand-ins for 'ham' (label 0) and 'spam' (label 1)
data = fetch_20newsgroups(subset='all', categories=['alt.atheism', 'soc.religion.christian'])
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
Feature Extraction
Use CountVectorizer to convert the text into feature vectors.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
Model Training
Train a naive Bayes classifier.
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)
Model Evaluation
Evaluate the model's performance on the held-out test set.
from sklearn.metrics import classification_report
y_pred = model.predict(X_test_vectorized)
print(classification_report(y_test, y_pred, target_names=data.target_names))
Application
We can now run the trained model on new emails. Keep in mind that because the training data is a stand-in, the prediction below is purely illustrative; a real spam filter must be trained on labeled spam/ham emails.

def predict_spam(email_text):
    email_vectorized = vectorizer.transform([email_text])
    prediction = model.predict(email_vectorized)
    # In this stand-in setup, class 1 plays the role of 'spam'
    return "Spam" if prediction[0] == 1 else "Not Spam"

email_text = "Get rich quick! Click here to win a million dollars!"
print(predict_spam(email_text))  # Illustrative; the label depends on the stand-in training data
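MultinomialNB also exposes class probabilities, which lets you apply a filtering threshold instead of a hard spam/not-spam decision:

# Class probabilities allow thresholding instead of a hard decision
probs = model.predict_proba(vectorizer.transform([email_text]))
print(probs)  # e.g. [[0.98 0.02]], i.e. P(class 0) and P(class 1)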
Summary
This article covered 12 practical natural language processing (NLP) cases: text preprocessing, stemming and lemmatization, stop word removal, word frequency counting, TF-IDF computation, named entity recognition, sentiment analysis, text classification, text clustering, machine translation, question answering, and text generation, plus a hands-on spam detection example. These cases should help you better understand and apply NLP techniques.