12 Practical Natural Language Processing Cases with Python
Natural language processing (NLP) is a major branch of artificial intelligence that enables computers to understand, interpret, and generate human language. Python is one of the preferred languages for NLP thanks to its rich ecosystem of libraries and tools. In this article we walk through 12 practical NLP cases to help you understand and apply NLP techniques.
1. Text Preprocessing
Text preprocessing is the first step in NLP and includes operations such as removing punctuation, lowercasing, and tokenization.
import string

def preprocess_text(text):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Convert to lowercase
    text = text.lower()
    # Tokenize with a naive whitespace split
    words = text.split()
    return words

text = "Hello, World! This is a test."
preprocessed_text = preprocess_text(text)
print(preprocessed_text)  # Output: ['hello', 'world', 'this', 'is', 'a', 'test']
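The whitespace split above is intentionally naive. As an alternative sketch, the same result can be obtained with NLTK's word_tokenize, which handles punctuation and contractions more robustly (this assumes nltk is installed and its 'punkt' tokenizer data has been downloaded):

import string
import nltk
# nltk.download('punkt')  # required once before first use
from nltk.tokenize import word_tokenize

text = "Hello, World! This is a test."
tokens = [t.lower() for t in word_tokenize(text) if t not in string.punctuation]
print(tokens)  # ['hello', 'world', 'this', 'is', 'a', 'test']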
2. Stemming and Lemmatization
Stemming and lemmatization are techniques for reducing words to a base form.
from nltk.stem import PorterStemmer, WordNetLemmatizer
# nltk.download('wordnet')  # required once before first use

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["running", "jumps", "better", "worse"]
stemmed_words = [stemmer.stem(word) for word in words]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(stemmed_words)     # Output: ['run', 'jump', 'better', 'wors']
print(lemmatized_words)  # Output: ['running', 'jump', 'better', 'worse']
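Note that the stemmer chops suffixes mechanically ("worse" becomes "wors"), while the lemmatizer defaults to treating every word as a noun, which is why "running" passes through unchanged. Supplying a part-of-speech tag gives the lemmatizer much better results:

# Lemmatization is POS-sensitive: pos='v' (verb) and pos='a' (adjective)
# let WordNet map inflected forms back to their dictionary entries
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("worse", pos="a"))    # bad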
3. Stop Word Removal
Stop words are words that appear frequently in text but contribute little to its meaning, such as "the" and "is".
from nltk.corpus import stopwords
# nltk.download('stopwords')  # required once before first use

stop_words = set(stopwords.words('english'))

def remove_stopwords(words):
    filtered_words = [word for word in words if word not in stop_words]
    return filtered_words

words = ["this", "is", "a", "test", "of", "stop", "words"]
filtered_words = remove_stopwords(words)
print(filtered_words)  # Output: ['test', 'stop', 'words']
4. Word Frequency Counting
Word frequency counts help identify the most common words in a text.
from collections import Counter
words = ["apple", "banana", "apple", "orange", "banana", "banana"]
word_counts = Counter(words)
print(word_counts)  # Output: Counter({'banana': 3, 'apple': 2, 'orange': 1})
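Counter also provides most_common for retrieving the top-k words directly:

# The two most frequent words, as (word, count) pairs
print(word_counts.most_common(2))  # [('banana', 3), ('apple', 2)]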
5. TF-IDF Computation
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to a document within a collection: a term's weight grows with its frequency in the document and shrinks the more documents it appears in.
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
    "The cat in the hat.",
    "A cat is a fine pet.",
    "Dogs and cats make good pets."
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
print(tfidf_matrix.toarray())
# Output: a 3x13 array of TF-IDF weights
# (one row per document, one column per vocabulary term)
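To interpret the columns, get_feature_names_out (available in scikit-learn 1.0+) maps each column index back to its term:

# Map matrix columns back to vocabulary terms (alphabetical order)
print(vectorizer.get_feature_names_out())
# ['and' 'cat' 'cats' 'dogs' 'fine' 'good' 'hat' 'in' 'is' 'make' 'pet'
#  'pets' 'the']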
6. Named Entity Recognition (NER)
Named entity recognition (NER) identifies specific types of entities in text, such as person names, place names, and organization names.
import spacy
# Run `python -m spacy download en_core_web_sm` once if the model is missing
nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
# Output:
# Apple ORG
# U.K. GPE
# $1 billion MONEY
7. Sentiment Analysis
Sentiment analysis determines the emotional orientation of a text: positive, negative, or neutral.
from textblob import TextBlob
text = "I love this movie!"
blob = TextBlob(text)
sentiment = blob.sentiment.polarity
print(sentiment)  # A positive score; polarity ranges from -1.0 (negative) to 1.0 (positive)
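The continuous polarity score is often bucketed into discrete labels. The threshold of 0.1 below is an arbitrary choice for illustration, not a TextBlob convention:

def polarity_to_label(polarity, threshold=0.1):
    # Bucket the continuous score into three coarse classes;
    # the 0.1 cutoff is arbitrary and should be tuned on real data
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

print(polarity_to_label(blob.sentiment.polarity))  # positive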
8. Text Classification
Text classification assigns text to predefined categories, e.g. spam detection or news categorization.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Load the dataset (downloaded automatically on first use)
data = fetch_20newsgroups(subset='all')
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
# Build a pipeline: vectorization followed by classification
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
# Train the model
pipeline.fit(X_train, y_train)
# Predict on the test set
y_pred = pipeline.predict(X_test)
# Evaluate
print(classification_report(y_test, y_pred, target_names=data.target_names))
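Because vectorization is part of the pipeline, the trained model can classify raw strings directly. The sample sentence and its expected category below are illustrative:

# Classify a new, unseen document; the pipeline vectorizes it automatically
sample = ["The rocket launch was delayed due to bad weather."]
predicted = pipeline.predict(sample)
print(data.target_names[predicted[0]])  # likely 'sci.space'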
9. Text Clustering
Text clustering groups similar texts together.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
documents = [
    "The cat in the hat.",
    "A cat is a fine pet.",
    "Dogs and cats make good pets.",
    "I love my pet dog.",
    "My dog loves to play with cats."
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)  # fixed seed for reproducible clusters
kmeans.fit(tfidf_matrix)
labels = kmeans.labels_
print(labels)  # e.g. [0 0 1 1 1]; cluster IDs are arbitrary and depend on initialization
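To see what each cluster is about, you can rank the vocabulary terms by their weight in each cluster centroid:

import numpy as np

# Show the heaviest vocabulary terms in each cluster centroid
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(kmeans.cluster_centers_):
    top = np.argsort(center)[::-1][:3]  # indices of the 3 largest weights
    print(f"Cluster {i}: {[terms[j] for j in top]}")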
10. Machine Translation
Machine translation automatically translates text from one language into another.
from transformers import MarianMTModel, MarianTokenizer
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
text = "I love this movie!"
translated = model.generate(**tokenizer(text, return_tensors="pt", padding=True))
translated_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
print(translated_text)  # Output: ['Ich liebe diesen Film!']
11. Question Answering
A question answering system answers questions based on a given context.
from transformers import pipeline
qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
context = "The Transformer is a deep learning model introduced in 2017 by Vaswani et al. that revolutionized natural language processing."
question = "What year was the Transformer model introduced?"
answer = qa_pipeline(question=question, context=context)
print(answer['answer'])  # Output: 2017
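The pipeline returns more than the answer string; the full result also carries a confidence score and character offsets into the context:

# The full result includes a confidence score and character offsets
print(answer)
# {'score': <confidence>, 'start': <char offset>, 'end': <char offset>, 'answer': '2017'}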
12. Text Generation
Text generation uses a model to produce new text, such as poems or stories.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
prompt = "Once upon a time"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_length=50, num_return_sequences=1)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
# Output: a plausible continuation of the prompt. Greedy decoding is
# deterministic for a given model, but the exact text depends on the model weights.
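Greedy decoding always yields the same continuation. Enabling sampling produces varied text on each run; the top_k and temperature values below are common starting points, not tuned settings:

# Sample from the model's distribution instead of taking the argmax token
output = model.generate(
    input_ids,
    max_length=50,
    do_sample=True,        # enable sampling for varied output
    top_k=50,              # restrict sampling to the 50 most likely tokens
    temperature=0.9,       # slightly soften the distribution
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
)
print(tokenizer.decode(output[0], skip_special_tokens=True))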
Hands-On Case Study: Spam Detection
Suppose you are an email service provider and need a system that detects and filters spam. We will implement it with a naive Bayes classifier.
Data Preparation
First we need training data. Since the 20newsgroups dataset provided by sklearn contains no real spam, we use two of its categories as stand-ins for "ham" and "spam" to illustrate the workflow.
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

# Load two categories as stand-ins for 'ham' (label 0) and 'spam' (label 1)
data = fetch_20newsgroups(subset='all', categories=['alt.atheism', 'soc.religion.christian'])
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
Feature Extraction
Use CountVectorizer to convert the text into feature vectors.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
Model Training
Train a naive Bayes classifier.
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)
Model Evaluation
Evaluate the model's performance on the held-out test set.
from sklearn.metrics import classification_report
y_pred = model.predict(X_test_vectorized)
print(classification_report(y_test, y_pred, target_names=data.target_names))
Application
We can now run the trained model on new emails. Keep in mind that because the training data is a stand-in, the prediction below is purely illustrative; a real spam filter must be trained on labeled spam/ham emails.

def predict_spam(email_text):
    email_vectorized = vectorizer.transform([email_text])
    prediction = model.predict(email_vectorized)
    # In this stand-in setup, class 1 plays the role of 'spam'
    return "Spam" if prediction[0] == 1 else "Not Spam"

email_text = "Get rich quick! Click here to win a million dollars!"
print(predict_spam(email_text))  # Illustrative; the label depends on the stand-in training data
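MultinomialNB also exposes class probabilities, which lets you apply a filtering threshold instead of a hard spam/not-spam decision:

# Class probabilities allow thresholding instead of a hard decision
probs = model.predict_proba(vectorizer.transform([email_text]))
print(probs)  # e.g. [[0.98 0.02]], i.e. P(class 0) and P(class 1)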
Summary
This article covered 12 practical natural language processing (NLP) cases: text preprocessing, stemming and lemmatization, stop word removal, word frequency counting, TF-IDF computation, named entity recognition, sentiment analysis, text classification, text clustering, machine translation, question answering, and text generation, plus a hands-on spam detection example. These cases should help you better understand and apply NLP techniques.