Crystal Clear! Eight Word Embedding Techniques Commonly Used in NLP

Author: 程序員小寒, 2024-07-15 08:13:12

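Hi everyone, I'm Xiaohan.

Today I'd like to share the word embedding techniques commonly used in natural language processing.

Word embedding is a natural language processing (NLP) technique that maps words to a continuous vector space so that text data can be processed and analyzed more effectively.

These vectors (embeddings) can capture semantic relationships between words as well as contextual information.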

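Common word embedding techniques

1. One-Hot Encoding

One-hot encoding is the simplest word embedding method. Each word is represented as a vector whose length equals the vocabulary size; exactly one position is 1 and all the others are 0.

Pros

Simple and easy to implement.
Makes no assumptions and needs no learning process.

Cons

Very high dimensionality: the larger the vocabulary, the longer the vectors.
Cannot capture semantic relationships between words.
Sparse representation, which is inefficient.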

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample data
corpus = ['dog', 'cat', 'dog', 'fish']

# Reshape data to fit the model
corpus = np.array(corpus).reshape(-1, 1)

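# One-hot encode the data
# (sparse_output replaced the sparse argument in scikit-learn 1.2;
# on older versions use OneHotEncoder(sparse=False))
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = onehot_encoder.fit_transform(corpus)

print(onehot_encoded)

# Output (columns are ordered cat, dog, fish):
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]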
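
To make the "cannot capture semantic relationships" drawback concrete, here is a minimal sketch in plain Python. The vocabulary and the helper names `one_hot` and `cosine` are illustrative assumptions, not part of the example above. The cosine similarity between any two different one-hot vectors is always 0, so "dog" is exactly as far from "cat" as it is from "fish".

```python
import math

# Illustrative vocabulary (an assumption for this sketch), in alphabetical order
vocab = ['cat', 'dog', 'fish']
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector of `word` over `vocab`."""
    vec = [0.0] * len(vocab)
    vec[word_to_index[word]] = 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(one_hot('dog'), one_hot('cat')))   # 0.0
print(cosine(one_hot('dog'), one_hot('fish')))  # 0.0
print(cosine(one_hot('dog'), one_hot('dog')))   # 1.0
```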

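2. Bag of Words (BoW)

Bag of Words (BoW) is a simple NLP technique for representing text documents as numeric vectors.

The idea is to treat each document as a bag, or collection, of words and count how often each word occurs in the document.

It ignores word order, but it offers a direct way to turn text into vectors.

Pros

Simple and easy to implement.
Effective for small amounts of text.

Cons

With a large vocabulary, the vectors are high-dimensional.
Cannot capture word order or semantic relationships.
Treats common and rare words alike, so it cannot tell which words matter.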

from sklearn.feature_extraction.text import CountVectorizer

# Sample data
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

print(X.toarray())
print(vectorizer.get_feature_names_out())

# Output of the above code:
# [[0 1 1 1 0 0 1 0 1]
#  [0 2 0 1 0 1 1 0 1]
#  [1 0 0 1 1 0 1 1 1]
#  [0 1 1 1 0 0 1 0 1]]
# ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
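
The loss of word order can be seen in a small sketch that builds the counts by hand. The helper `bag_of_words` and the two sentences are illustrative assumptions; tokenization here is a plain lowercase whitespace split, which is simpler than what `CountVectorizer` does. Two sentences with opposite meanings end up with the same vector.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and sentences, chosen only for illustration
vocabulary = ['bites', 'dog', 'man']
vec_a = bag_of_words('dog bites man', vocabulary)
vec_b = bag_of_words('man bites dog', vocabulary)

print(vec_a)           # [1, 1, 1]
print(vec_b)           # [1, 1, 1]
print(vec_a == vec_b)  # True: the order information is gone
```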

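3. TF-IDF

TF-IDF improves on BoW: it accounts for word importance by lowering the weight of common words and raising the weight of rare ones.

The idea behind TF-IDF is to score a word's importance in a document using two factors:

Term frequency (TF): measures how often a word occurs in a document. The higher the frequency, the more important the word is to that document.
Inverse document frequency (IDF): measures how informative a word is across all documents in the corpus. It rests on the intuition that a word appearing in many documents carries less information than one appearing in few.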

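Formulas:

TF: term frequency, the number of times term t occurs in document d divided by the total number of terms in d: $\mathrm{TF}(t, d) = \frac{\text{number of times } t \text{ occurs in } d}{\text{total number of terms in } d}$
IDF: inverse document frequency, which measures how rare the term is across the whole corpus: $\mathrm{IDF}(t) = \log \frac{\text{total number of documents}}{\text{number of documents containing } t}$
TF-IDF: the product of TF and IDF: $\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$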
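
The textbook definitions can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the helper names `tf`, `idf` and `tf_idf` are made up for this sketch, the natural logarithm is used for IDF, and every queried term is assumed to occur in at least one document.

```python
import math

# Pre-tokenized toy corpus
corpus = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document'],
]

def tf(term, doc):
    """Occurrences of `term` in `doc` divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / number of documents containing `term`)."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, docs)

print(tf_idf('the', corpus[1], corpus))     # 0.0, because 'the' occurs in every document
print(tf_idf('second', corpus[1], corpus))  # (1/6) * ln(4), roughly 0.231
```

A word that appears in every document gets a weight of zero, while a word unique to one document gets the highest IDF.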

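Pros

Emphasizes important words and dampens the influence of common ones.
Well suited to information retrieval and text mining.

Cons

Vectors are still sparse and high-dimensional.
Cannot capture word order or semantic relationships.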

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

print(X.toarray())
print(vectorizer.get_feature_names_out())

# Output:
# [[0.         0.46979139 0.58028582 0.38408524 0.         0.
#   0.38408524 0.         0.38408524]
#  [0.         0.6876236  0.         0.28108867 0.         0.53864762
#   0.28108867 0.         0.28108867]
#  [0.51184851 0.         0.         0.26710379 0.51184851 0.
#   0.26710379 0.51184851 0.26710379]
#  [0.         0.46979139 0.58028582 0.38408524 0.         0.
#   0.38408524 0.         0.38408524]]
# ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
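
The numbers printed by `TfidfVectorizer` differ from what the plain TF × IDF definition would give, because with its default settings scikit-learn uses raw counts as TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then L2-normalizes each row. The sketch below reproduces the first row by hand in plain Python; the helper names `smoothed_idf` and `tfidf_row` are assumptions made for this illustration.

```python
import math

# The same corpus, lowercased and tokenized by hand
docs = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document'],
]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

def smoothed_idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """Raw count times smoothed IDF for each vocabulary word, then L2 normalization."""
    raw = [doc.count(term) * smoothed_idf(term) for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

row = tfidf_row(docs[0])
# Agrees with the first row of the TfidfVectorizer output up to rounding
print([round(value, 8) for value in row])
```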

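4. Word2Vec

Word2Vec is a neural-network-based model that produces dense vector representations of words.

The basic idea is to train a neural network to predict the context words of a given target word (or the reverse), and then use the learned vectors to capture word semantics.

It captures semantic relationships between words with two main approaches: Continuous Bag of Words (CBOW) and Skip-gram.

Continuous Bag of Words (CBOW): predicts the target word from the surrounding context words.
Skip-gram: predicts the surrounding context words from the target word.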
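
The difference between the two approaches shows up in how the training examples are formed. The sketch below only builds those examples for a fixed window; the function names `skipgram_pairs` and `cbow_pairs` are assumptions for this illustration, and a real Word2Vec implementation adds further steps (such as negative sampling and subsampling of frequent words) that are not shown here.

```python
def skipgram_pairs(tokens, window):
    """(target, context) pairs: the target word is used to predict each context word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window):
    """(context_words, target) pairs: the context words jointly predict the target word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

tokens = ['this', 'is', 'the', 'first', 'document']

print(skipgram_pairs(tokens, window=1)[:3])
# [('this', 'is'), ('is', 'this'), ('is', 'the')]

print(cbow_pairs(tokens, window=1)[1])
# (['this', 'the'], 'is')
```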

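Pros

Captures semantic relationships between words.
Produces dense, relatively low-dimensional word vectors.
Works notably well when trained on large corpora.

Cons

Needs a large corpus for training.
Demands considerable computing resources.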

from gensim.models import Word2Vec

# Sample data
sentences = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document']
]

# Initialize and train the Word2Vec model.
# Passing sentences to the constructor builds the vocabulary and trains the model,
# so a separate model.train() call is not needed.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4, epochs=10)

# Get vector for a word
print(model.wv['document'])

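5. GloVe

GloVe (Global Vectors for Word Representation) is a word embedding technique proposed by researchers at Stanford University in 2014.

It combines count-based global matrix factorization with prediction-based local context windows, learning word vectors from the global co-occurrence statistics of word pairs in a large corpus.

GloVe builds a word-word co-occurrence matrix and learns word vectors by factorizing it. Each entry of the matrix is the number of times two words appear together within a given window. The model looks for vectors such that the dot product of two word vectors (plus bias terms) closely approximates the logarithm of their co-occurrence count.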
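
The first step, building the co-occurrence counts, can be sketched in plain Python. This is a simplified illustration: the function name `cooccurrence_counts` is an assumption, and every co-occurrence inside the window counts as 1, whereas the reference GloVe implementation down-weights more distant pairs. Training then fits word vectors and biases so that their dot product plus biases approximates log X[i, j]; that optimization step is not shown.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1.0
    return counts

sentences = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
]
X = cooccurrence_counts(sentences, window=1)

print(X[('is', 'the')])         # 2.0
print(X[('the', 'is')])         # 2.0, the counts are symmetric
print(X[('this', 'document')])  # 1.0
```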

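Pros

Captures semantic relationships between words along with global statistics.
Produces dense, relatively low-dimensional word vectors.
Performs well on large corpora.

Cons

Needs a large corpus for training.
Demands considerable computing resources.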

import gensim.downloader as api

# Download pre-trained GloVe model (choose the size you need - 50, 100, 200, or 300 dimensions)
glove_vectors = api.load("glove-wiki-gigaword-100")  # Example: 100-dimensional GloVe

# Get word vectors (embeddings)
word1 = "king"
word2 = "queen"
vector1 = glove_vectors[word1]
vector2 = glove_vectors[word2]

# Compute cosine similarity between the two word vectors
similarity = glove_vectors.similarity(word1, word2)

print(f"Word vectors for '{word1}': {vector1}")
print(f"Word vectors for '{word2}': {vector2}")
print(f"Cosine similarity between '{word1}' and '{word2}': {similarity}")

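6. FastText

FastText is a word embedding technique developed by Facebook's AI Research team.

It extends Word2Vec; its main feature is that it breaks words into subwords, which lets it handle out-of-vocabulary (OOV) words and misspellings better.

The core idea is to split each word into a set of subwords, or character n-grams, and learn a vector for each of them. A word is represented by combining its subword vectors, which captures the word's internal structure.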
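
The subword decomposition can be sketched as follows. The helper `char_ngrams` is an illustrative assumption, not the library's API; it wraps the word in `<` and `>` boundary markers and enumerates character n-grams, with lengths 3 to 6 as the default range. The real implementation additionally hashes the n-grams into a fixed number of buckets and keeps a vector for the whole word, which this sketch leaves out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f'<{word}>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

print(char_ngrams('where', min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# A word never seen in training still shares n-grams with known words,
# which is how a vector can be composed for it
shared = set(char_ngrams('document')) & set(char_ngrams('documents'))
print(len(shared) > 0)  # True
```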

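Pros

Handles out-of-vocabulary words: because it uses subword information, FastText copes well with new words outside the vocabulary.
Better generalization: it captures the internal structure of words, which improves how well the embeddings generalize.
Efficient: it trains quickly on large datasets and produces high-quality word vectors.

Cons

Larger model than Word2Vec: vectors are stored for the character n-grams as well as for the words, so memory use is higher.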

from gensim.models import FastText

# Sample data
sentences = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document']
]

# Initialize and train the FastText model.
# Passing sentences to the constructor builds the vocabulary and trains the model,
# so a separate model.train() call is not needed.
model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4, epochs=10)

# Get vector for a word
print(model.wv['document'])

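7. ELMo

ELMo is a context-dependent word embedding technique developed by the AllenNLP team.

Unlike traditional word embedding methods, ELMo produces word vectors that depend on context: even within one sentence, the same word gets different embeddings at different positions.

ELMo uses a bidirectional LSTM language model to learn contextual representations of words from text. The language model is pre-trained and then fine-tuned on a specific task to produce dynamic, context-dependent word embeddings.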

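Pros

Context-dependent: captures the different meanings a word takes in different contexts.
Adaptable: performs well on many NLP tasks, including named entity recognition (NER) and question answering.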

import tensorflow as tf
import tensorflow_hub as hub

# Load pre-trained ELMo model from TensorFlow Hub
elmo = hub.load("https://tfhub.dev/google/elmo/3")

# Sample data
sentences = ["This is the first document.", "This document is the second document."]

def elmo_vectors(sentences):
    embeddings = elmo.signatures['default'](tf.constant(sentences))['elmo']
    return embeddings

# Get ELMo embeddings
elmo_embeddings = elmo_vectors(sentences)
print(elmo_embeddings)

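8. BERT

BERT is a Transformer-based model that generates context-aware embeddings by considering the whole sentence bidirectionally (that is, both left-to-right and right-to-left).

Unlike traditional word embeddings such as Word2Vec or GloVe, which produce a single representation per word, BERT produces a different embedding for each word depending on its context.

Pros

Bidirectional context encoding: captures the context both before and after a word at the same time.
Pre-training and fine-tuning: pre-training a large language model and then fine-tuning it on a specific task improves performance markedly.
Broad applicability: performs well on many NLP tasks, such as question answering, text classification, and named entity recognition.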

from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample data
sentence = "This is the first document."

# Tokenize input
inputs = tokenizer(sentence, return_tensors='pt')

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state

print(embeddings)

Editor: 武曉燕  Source: 程序員學長
