
How to Use Feature Extraction on Tabular Data for Machine Learning

Author: 湃紳Python, 2020-07-08 15:43:26



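The most common approach to data preparation is to study a dataset, review the expectations of the machine learning algorithm, and then carefully choose the most appropriate data preparation techniques to transform the raw data so that it best meets those expectations. This is slow, expensive, and requires a great deal of expertise.

An alternative approach is to apply a suite of common, generally useful data preparation techniques to the raw data in parallel and combine the results of all the transforms into one large dataset, on which a model can be fit and evaluated.

This is a different philosophy of data preparation: it treats data transforms as a way to extract salient features from the raw data and so expose the structure of the problem to the learning algorithm. It requires a learning algorithm that can scale to many input features, weight them, and make use of those most relevant to the target being predicted.

This approach requires less expertise, is computationally efficient compared with a full grid search over data preparation methods, and can help uncover unintuitive data preparation solutions that achieve good or even the best performance on a given predictive modeling problem.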

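In this article, we look at how to use feature extraction for data preparation with tabular data. It covers the following points:

Feature extraction provides an alternative approach to data preparation for tabular data, in which all data transforms are applied in parallel to the raw input data and combined to create one large dataset.

How to use the feature extraction approach to data preparation to improve on the baseline performance for a standard classification dataset.

How to add feature selection to the feature extraction modeling pipeline to further lift modeling performance on a standard dataset.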

This article is divided into three parts:

1. Feature Extraction Technique for Data Preparation

2. Dataset and Performance Baseline

Wine Classification Dataset
Baseline Model Performance

3. Feature Extraction Approach to Data Preparation

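Feature Extraction Technique for Data Preparation

Data preparation can be challenging.

The most commonly followed approach is to analyze the dataset, review the requirements of the algorithms, and transform the raw data to best meet those requirements.

This can be effective, but it is also slow and can require deep expertise in both data analysis and machine learning algorithms.

An alternative is to treat the preparation of the input variables as a hyperparameter of the modeling pipeline and to tune it alongside the choice of algorithm and algorithm configuration.

Although this can be computationally expensive, it can be an effective way to expose unintuitive solutions, and it requires very little expertise.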
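
As a rough illustration of treating data preparation as a hyperparameter, the sketch below searches over a few candidate scalers with GridSearchCV. The synthetic dataset, the step name 'prep', and the list of candidates are assumptions made for illustration and are not part of the original article.

```python
# sketch: treat the data preparation step as a hyperparameter of the pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler

# synthetic binary classification data (an assumption, not the wine dataset)
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=1)
# pipeline with a placeholder 'prep' step (hypothetical name) that the search swaps out
pipeline = Pipeline(steps=[('prep', 'passthrough'), ('m', LogisticRegression(solver='liblinear'))])
# candidate preparations, including no preparation at all
grid = {'prep': ['passthrough', MinMaxScaler(), StandardScaler(), RobustScaler()]}
# evaluate every candidate with 5-fold cross-validation
search = GridSearchCV(pipeline, grid, scoring='accuracy', cv=5)
search.fit(X, y)
# report the best preparation found and its mean accuracy
print(search.best_params_, round(search.best_score_, 3))
```

Each additional transform or transform setting multiplies the number of pipelines that must be fit, which is where the computational cost of this approach comes from.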

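A middle ground between these two approaches is to treat the transformation of the input data as a feature engineering or feature extraction procedure. This involves applying a suite of common or generally useful data preparation techniques to the raw data, aggregating all of the resulting features into one large dataset, and then fitting and evaluating a model on that data.

The philosophy of the approach is to treat each data preparation technique as a transform that extracts salient features from the raw data and presents them to the learning algorithm. Ideally, such transforms untangle complex relationships and compound input variables, which in turn allows simpler modeling algorithms, such as linear machine learning techniques, to be used.

For lack of a better name, we will call this the "feature engineering approach" or the "feature extraction approach" to configuring data preparation for a predictive modeling project.

It allows data analysis and algorithm expertise to inform the choice of data preparation methods and can still find unintuitive solutions, but at a much lower computational cost than a full search.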
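
To make "apply the transforms in parallel and aggregate the results" concrete, here is a minimal sketch that applies two scalers to the same raw input, stacks the outputs side by side, and compares the result with scikit-learn's FeatureUnion. The small input array is made up for illustration.

```python
# sketch: applying transforms in parallel and concatenating their outputs
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# small made-up input: 5 rows, 2 columns (illustrative values only)
X = np.array([[1.0, 200.0], [2.0, 150.0], [3.0, 400.0], [4.0, 100.0], [5.0, 300.0]])
# apply each transform independently to the same raw data
X_mms = MinMaxScaler().fit_transform(X)
X_ss = StandardScaler().fit_transform(X)
# combine the results column-wise into one wider dataset
X_manual = np.hstack([X_mms, X_ss])
# the same operation expressed with FeatureUnion
fu = FeatureUnion([('mms', MinMaxScaler()), ('ss', StandardScaler())])
X_union = fu.fit_transform(X)
# both routes give 2 + 2 = 4 columns with identical values
print(X_manual.shape, X_union.shape)
print(np.allclose(X_manual, X_union))
```

Each transform sees the raw input rather than the output of the previous transform, which is what distinguishes this from a sequential chain of transforms.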

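The resulting explosion in the number of input features can be addressed explicitly with feature selection techniques, which attempt to rank the importance or value of the large number of extracted features and select only the small subset most relevant to predicting the target.

We can explore this approach to data preparation with a worked example.

Before diving into the example, let's first choose a standard dataset and establish a performance baseline.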

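Dataset and Performance Baseline

We will first select a standard machine learning dataset and establish a performance baseline on it. This provides the context for exploring the feature extraction approach to data preparation.

Wine Classification Dataset

We will use the wine classification dataset.

The dataset has 13 input variables that describe the chemical composition of wine samples, and the task is to classify each wine as one of three types.

The example below loads the dataset, splits it into input and output columns, and then summarizes the shape of the data arrays.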

# example of loading and summarizing the wine dataset 
from pandas import read_csv 
# define the location of the dataset 
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv' 
# load the dataset as a data frame 
df = read_csv(url, header=None) 
# retrieve the numpy array 
data = df.values 
# split the columns into input and output variables 
X, y = data[:, :-1], data[:, -1] 
# summarize the shape of the loaded data 
print(X.shape, y.shape) 
#(178, 13) (178,)

Running the example, we can see that the dataset was loaded correctly and that there are 178 rows of data with 13 input variables and a single target variable.
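
It can also help to check how the rows are spread across the three classes, since the evaluation later relies on stratified sampling. The sketch below uses the copy of the wine data bundled with scikit-learn (load_wine) so that it runs offline. That this copy matches the CSV loaded above is an assumption.

```python
# sketch: inspect the class balance of the wine data without a download
from collections import Counter
from sklearn.datasets import load_wine

# assumption: the bundled dataset matches the CSV used in this article
X, y = load_wine(return_X_y=True)
# summarize the shape of the inputs and the target
print(X.shape, y.shape)
# count the number of rows for each class label
counts = Counter(y.tolist())
print(sorted(counts.items()))
```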

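Next, let's evaluate a model on this dataset and establish a performance baseline.

Baseline Model Performance

We can establish a performance baseline for the wine classification task by evaluating a model on the raw input data.

In this case, we will evaluate a logistic regression model.

First, we perform minimal data preparation by ensuring that the input variables are numeric and the target variable is label encoded, as expected by the scikit-learn library.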

# minimally prepare dataset 
X = X.astype('float') 
y = LabelEncoder().fit_transform(y.astype('str'))
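
To show what label encoding does, the short sketch below encodes a handful of made-up class labels in the style of the wine target column. The label values are illustrative only.

```python
# sketch: what label encoding does to a target column
from numpy import array
from sklearn.preprocessing import LabelEncoder

# made-up class labels resembling the wine target (illustrative only)
y_raw = array([1, 2, 3, 1, 3])
# fit the encoder on string versions of the labels and encode them
encoder = LabelEncoder()
y_enc = encoder.fit_transform(y_raw.astype('str'))
# the classes seen by the encoder, in sorted order
print(encoder.classes_)
# each label replaced by its index in classes_
print(y_enc)
```

The encoded target contains consecutive integers starting at 0, and encoder.inverse_transform maps them back to the original labels.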

Next, we can define our predictive model.

# define the model 
model = LogisticRegression(solver='liblinear')

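We will evaluate the model using the standard procedure of repeated stratified k-fold cross-validation, with 10 folds and three repeats.

Model performance will be assessed using classification accuracy.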

model = LogisticRegression(solver='liblinear') 
# define the cross-validation procedure 
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) 
# evaluate model 
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
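
With 10 folds and three repeats, the procedure fits and scores the model 30 times. The sketch below confirms this on a small made-up dataset. The labels and the single dummy input column are assumptions for illustration.

```python
# sketch: how many train/test splits the evaluation procedure produces
from numpy import arange
from numpy import array
from sklearn.model_selection import RepeatedStratifiedKFold

# made-up labels: 3 classes with 20 rows each (illustrative only)
y = array([0] * 20 + [1] * 20 + [2] * 20)
# dummy single-column input with one row per label
X = arange(len(y)).reshape(-1, 1)
# same configuration as the article: 10 folds, 3 repeats
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# total number of splits, and therefore of accuracy scores
n_scores = cv.get_n_splits(X, y)
print(n_scores)
# size of the first held-out fold
train_ix, test_ix = next(cv.split(X, y))
print(len(train_ix), len(test_ix))
```

Stratification keeps the class proportions in each fold close to those of the full dataset, which matters when the classes are not perfectly balanced.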

At the end of the run, we report the mean and standard deviation of the accuracy scores collected across all repeats and folds.

# report performance 
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this together, the complete example of evaluating a logistic regression model on the raw wine classification dataset is listed below.

# baseline model performance on the wine dataset 
from numpy import mean 
from numpy import std 
from pandas import read_csv 
from sklearn.preprocessing import LabelEncoder 
from sklearn.model_selection import RepeatedStratifiedKFold 
from sklearn.model_selection import cross_val_score 
from sklearn.linear_model import LogisticRegression 
# load the dataset 
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv' 
df = read_csv(url, header=None) 
data = df.values 
X, y = data[:, :-1], data[:, -1] 
# minimally prepare dataset 
X = X.astype('float') 
y = LabelEncoder().fit_transform(y.astype('str')) 
# define the model 
model = LogisticRegression(solver='liblinear') 
# define the cross-validation procedure 
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) 
# evaluate model 
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) 
# report performance 
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores))) 
     
#Accuracy: 0.953 (0.048)


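Running the example evaluates the model and reports the mean and standard deviation of the classification accuracy.

Your results may vary given the stochastic nature of the learning algorithm and the evaluation procedure, as well as differences in numerical precision across machines. Try running the example a few times.

In this case, we can see that the logistic regression model fit on the raw input data achieved a mean classification accuracy of about 95.3 percent, providing a baseline in performance.

Next, let's explore whether a feature extraction-based approach to data preparation can improve on it.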

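Feature Extraction Approach to Data Preparation

The first step is to select a suite of common, generally useful data preparation techniques.

In this case, given that the input variables are numeric, we will use a range of transforms that change the scale of the input variables (MinMaxScaler, StandardScaler, and RobustScaler) and transforms that change the distribution of the input variables (QuantileTransformer and KBinsDiscretizer). Finally, we will also use transforms that remove linear dependencies between the input variables (PCA and TruncatedSVD).

The FeatureUnion class can be used to define a list of transforms to perform, the results of which are aggregated, that is, concatenated column-wise. This creates a new dataset with a large number of columns.

The number of columns can be worked out in advance: 13 input variables times five transforms gives 65 columns, plus 14 columns output by the PCA and SVD dimensionality reduction methods (seven components each), for a total of 79 features.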

# transforms for the feature union 
transforms = list() 
transforms.append(('mms', MinMaxScaler())) 
transforms.append(('ss', StandardScaler())) 
transforms.append(('rs', RobustScaler())) 
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal'))) 
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform'))) 
transforms.append(('pca', PCA(n_components=7))) 
transforms.append(('svd', TruncatedSVD(n_components=7))) 
# create the feature union 
fu = FeatureUnion(transforms)
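
The column arithmetic can be checked by fitting the same feature union on an array with the shape of the wine inputs. The sketch below uses random stand-in data (an assumption) so that it runs without downloading the dataset. Only the output widths are of interest here, not the values.

```python
# sketch: check how many columns the feature union produces
from numpy.random import default_rng
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler

# random stand-in with the same shape as the wine inputs (an assumption)
X = default_rng(1).normal(size=(178, 13))
# the same transforms as used in the article
transforms = list()
transforms.append(('mms', MinMaxScaler()))
transforms.append(('ss', StandardScaler()))
transforms.append(('rs', RobustScaler()))
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')))
transforms.append(('pca', PCA(n_components=7)))
transforms.append(('svd', TruncatedSVD(n_components=7)))
# number of output columns contributed by each transform on its own
widths = {name: t.fit_transform(X).shape[1] for name, t in transforms}
print(widths)
# number of columns after concatenating all outputs
Xt = FeatureUnion(transforms).fit_transform(X)
print(Xt.shape)
```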


We can then create a modeling pipeline with the FeatureUnion as the first step and the logistic regression model as the final step.

# define the model 
model = LogisticRegression(solver='liblinear') 
# define the pipeline 
steps = list() 
steps.append(('fu', fu)) 
steps.append(('m', model)) 
pipeline = Pipeline(steps=steps)

The pipeline can then be evaluated using repeated stratified k-fold cross-validation as before.

The complete example is listed below.

# data preparation as feature engineering for wine dataset 
from numpy import mean 
from numpy import std 
from pandas import read_csv 
from sklearn.model_selection import RepeatedStratifiedKFold 
from sklearn.model_selection import cross_val_score 
from sklearn.linear_model import LogisticRegression 
from sklearn.pipeline import Pipeline 
from sklearn.pipeline import FeatureUnion 
from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import MinMaxScaler 
from sklearn.preprocessing import StandardScaler 
from sklearn.preprocessing import RobustScaler 
from sklearn.preprocessing import QuantileTransformer 
from sklearn.preprocessing import KBinsDiscretizer 
from sklearn.decomposition import PCA 
from sklearn.decomposition import TruncatedSVD 
# load the dataset 
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv' 
df = read_csv(url, header=None) 
data = df.values 
X, y = data[:, :-1], data[:, -1] 
# minimally prepare dataset 
X = X.astype('float') 
y = LabelEncoder().fit_transform(y.astype('str')) 
# transforms for the feature union 
transforms = list() 
transforms.append(('mms', MinMaxScaler())) 
transforms.append(('ss', StandardScaler())) 
transforms.append(('rs', RobustScaler())) 
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal'))) 
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform'))) 
transforms.append(('pca', PCA(n_components=7))) 
transforms.append(('svd', TruncatedSVD(n_components=7))) 
# create the feature union 
fu = FeatureUnion(transforms) 
# define the model 
model = LogisticRegression(solver='liblinear') 
# define the pipeline 
steps = list() 
steps.append(('fu', fu)) 
steps.append(('m', model)) 
pipeline = Pipeline(steps=steps) 
# define the cross-validation procedure 
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) 
# evaluate model 
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) 
# report performance 
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores))) 
#Accuracy: 0.968 (0.037)


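Running the example evaluates the pipeline and reports the mean and standard deviation of the classification accuracy.

Your results may vary given the stochastic nature of the learning algorithm and the evaluation procedure, as well as differences in numerical precision across machines. Try running the example a few times.

In this case, we can see a lift in performance over the baseline, with a mean classification accuracy of about 96.8 percent.

Try adding more data preparation methods to the FeatureUnion to see whether you can improve performance further.

We can also use feature selection to reduce the 79 extracted features to a subset of those most relevant to the model. In addition to reducing the complexity of the model, this can also lift performance by removing irrelevant and redundant input features.

In this case, we will use the Recursive Feature Elimination (RFE) technique for feature selection and configure it to select the 15 most relevant features.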

# define the feature selection 
rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=15)
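
To see what RFE reports after fitting, the sketch below runs it on a small synthetic dataset and prints which columns were kept. The dataset and the choice of five features are assumptions for illustration and differ from the article's configuration.

```python
# sketch: inspect which features RFE keeps after fitting
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# synthetic binary classification data with 20 input features (an assumption)
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, n_redundant=5, random_state=1)
# keep the 5 features the wrapped model finds most useful
rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=5)
rfe.fit(X, y)
# column indexes of the selected features
selected = [i for i, keep in enumerate(rfe.support_) if keep]
print(selected)
# rank of every feature; selected features have rank 1
print(rfe.ranking_)
# reduce the data to the selected columns
X_sel = rfe.transform(X)
print(X_sel.shape)
```

RFE repeatedly fits the wrapped model and drops the weakest feature until the requested number remains, so its cost grows with the number of extracted features.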

We can then add the RFE feature selection to the modeling pipeline, after the FeatureUnion and before the LogisticRegression model.

# define the pipeline 
steps = list() 
steps.append(('fu', fu)) 
steps.append(('rfe', rfe)) 
steps.append(('m', model)) 
pipeline = Pipeline(steps=steps)

Tying this together, the complete example of the feature extraction data preparation method with feature selection is listed below.

# data preparation as feature engineering with feature selection for wine dataset 
from numpy import mean 
from numpy import std 
from pandas import read_csv 
from sklearn.model_selection import RepeatedStratifiedKFold 
from sklearn.model_selection import cross_val_score 
from sklearn.linear_model import LogisticRegression 
from sklearn.pipeline import Pipeline 
from sklearn.pipeline import FeatureUnion 
from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import MinMaxScaler 
from sklearn.preprocessing import StandardScaler 
from sklearn.preprocessing import RobustScaler 
from sklearn.preprocessing import QuantileTransformer 
from sklearn.preprocessing import KBinsDiscretizer 
from sklearn.feature_selection import RFE 
from sklearn.decomposition import PCA 
from sklearn.decomposition import TruncatedSVD 
# load the dataset 
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv' 
df = read_csv(url, header=None) 
data = df.values 
X, y = data[:, :-1], data[:, -1] 
# minimally prepare dataset 
X = X.astype('float') 
y = LabelEncoder().fit_transform(y.astype('str')) 
# transforms for the feature union 
transforms = list() 
transforms.append(('mms', MinMaxScaler())) 
transforms.append(('ss', StandardScaler())) 
transforms.append(('rs', RobustScaler())) 
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal'))) 
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform'))) 
transforms.append(('pca', PCA(n_components=7))) 
transforms.append(('svd', TruncatedSVD(n_components=7))) 
# create the feature union 
fu = FeatureUnion(transforms) 
# define the feature selection 
rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=15) 
# define the model 
model = LogisticRegression(solver='liblinear') 
# define the pipeline 
steps = list() 
steps.append(('fu', fu)) 
steps.append(('rfe', rfe)) 
steps.append(('m', model)) 
pipeline = Pipeline(steps=steps) 
# define the cross-validation procedure 
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) 
# evaluate model 
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) 
# report performance 
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores))) 
#Accuracy: 0.989 (0.022)


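Running the example evaluates the pipeline and reports the mean and standard deviation of the classification accuracy.

Your results may vary given the stochastic nature of the learning algorithm and the evaluation procedure, as well as differences in numerical precision across machines. Try running the example a few times.

Again, we can see a further lift in performance, from 96.8 percent with all extracted features to about 98.9 percent with feature selection applied before modeling.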

Editor: 未麗燕. Source: 今日頭條 (Toutiao).
