自拍偷在线精品自拍偷,亚洲欧美中文日韩v在线观看不卡

<cite id="kxqp1"><track id="kxqp1"></track></cite>

<sub id="kxqp1"><p id="kxqp1"></p></sub>

51CTO首頁(yè)

AI.x社區(qū)

軟考社區(qū)

免費(fèi)課

企業(yè)培訓(xùn)

鴻蒙開(kāi)發(fā)者社區(qū)

WOT技術(shù)大會(huì)

公眾號(hào)矩陣

移動(dòng)端

視頻課免費(fèi)課排行榜短視頻直播課軟考學(xué)堂

全部課程軟考華為認(rèn)證廠商認(rèn)證 IT技術(shù)PMP項(xiàng)目管理免費(fèi)題庫(kù)

在線學(xué)習(xí)

文章資源問(wèn)答課堂專(zhuān)欄直播

51CTO

鴻蒙開(kāi)發(fā)者社區(qū)

51CTO技術(shù)棧

51CTO官微

51CTO學(xué)堂

51CTO博客

CTO訓(xùn)練營(yíng)

鴻蒙開(kāi)發(fā)者社區(qū)訂閱號(hào)

51CTO軟考

51CTO學(xué)堂APP

51CTO學(xué)堂企業(yè)版APP

鴻蒙開(kāi)發(fā)者社區(qū)視頻號(hào)

51CTO軟考題庫(kù)

賬號(hào)設(shè)置退出

How to Classify Rare Events Using 5 Machine Learning Algorithms

Author: compiled by 布加迪, 2020-02-03 08:00:00

Artificial Intelligence, Machine Learning, Algorithms

[51CTO.com Quick Translation] Machine learning is the crown of data science, and supervised learning is the jewel in that crown.

Background

A few years ago, Harvard Business Review published an article titled "Data Scientist: The Sexiest Job of the 21st Century." After it appeared, data science and statistics departments became popular with college students, and for the first time dull data scientists were considered sexy.

In some industries, data scientists have reshaped the corporate structure and pushed many decisions down to frontline employees. Generating useful business insights from data has never been easier.

According to Andrew Ng, supervised learning algorithms contribute most of the value created for industry.

There is little doubt about why supervised learning creates so much business value. Banks use it to detect credit card fraud, traders make purchase decisions based on what their models tell them, and factories screen production lines for defective parts.

These business scenarios share two common features:

Binary outcomes: fraud vs. no fraud, buy vs. do not buy, defective vs. not defective.
Unbalanced data distribution: one majority group vs. one minority group.

As Andrew Ng recently pointed out, small data, robustness, and human factors are three obstacles to successful AI projects. To some extent, a rare-event problem with one minority group is also a small-data problem: the machine learning algorithm learns more from the majority group and can easily misclassify the small group.

Here are the questions that matter:

For these rare events, which machine learning method performs better?
Which metrics should be used?
What are the caveats?

This post tries to answer these questions by applying 5 machine learning methods to a real-life dataset, with a complete R implementation.

For the full description and the original dataset, see https://archive.ics.uci.edu/ml/datasets/bank+marketing; for the complete R code, see my GitHub: https://github.com/LeihuaYe/Machine-Learning-Classification-for-Imbalanced-Data.

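Business Question

A bank in Portugal is running a marketing strategy for a new banking service (a term deposit) and wants to know which types of clients have subscribed to it, so that the bank can adjust its marketing strategy and target specific groups in the future. Data scientists have teamed up with the sales and marketing teams to propose statistical solutions for identifying future subscribers.

R Implementation

Below are the model selection pipeline and the R implementation.

1. Importation, Data Cleaning, and Exploratory Data Analysis

Let's load and clean the raw dataset.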

# car provides the recode() function used below
library(car)

# load the dataset; stringsAsFactors=TRUE restores the pre-R 4.0 default that the later steps rely on
banking = read.csv("bank-additional-full.csv", sep=";", header=TRUE, stringsAsFactors=TRUE)

# check for missing data; zero rows returned means no missing data
banking[!complete.cases(banking),]

# re-code qualitative (factor) variables into numeric codes
banking$job = recode(banking$job, "'admin.'=1;'blue-collar'=2;'entrepreneur'=3;'housemaid'=4;'management'=5;'retired'=6;'self-employed'=7;'services'=8;'student'=9;'technician'=10;'unemployed'=11;'unknown'=12")
banking$marital = recode(banking$marital, "'divorced'=1;'married'=2;'single'=3;'unknown'=4")
banking$education = recode(banking$education, "'basic.4y'=1;'basic.6y'=2;'basic.9y'=3;'high.school'=4;'illiterate'=5;'professional.course'=6;'university.degree'=7;'unknown'=8")
banking$default = recode(banking$default, "'no'=1;'yes'=2;'unknown'=3")
banking$housing = recode(banking$housing, "'no'=1;'yes'=2;'unknown'=3")
banking$loan = recode(banking$loan, "'no'=1;'yes'=2;'unknown'=3")
# the source recoded banking$loan here, which overwrote contact with loan values
banking$contact = recode(banking$contact, "'cellular'=1;'telephone'=2;")
banking$month = recode(banking$month, "'mar'=1;'apr'=2;'may'=3;'jun'=4;'jul'=5;'aug'=6;'sep'=7;'oct'=8;'nov'=9;'dec'=10")
banking$day_of_week = recode(banking$day_of_week, "'mon'=1;'tue'=2;'wed'=3;'thu'=4;'fri'=5;")
banking$poutcome = recode(banking$poutcome, "'failure'=1;'nonexistent'=2;'success'=3;")

# remove variable "pdays", b/c it has no variation
banking$pdays = NULL

# remove variable "duration", b/c it is collinear with the DV
banking$duration = NULL

Cleaning the raw data seems tedious, since we have to recode missing variables and transform qualitative variables into quantitative ones. Cleaning real-life data takes even longer. As the saying goes, "data scientists spend 80% of their time cleaning data and 20% building models."
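The recoding idea can also be sketched in base R, without the car package. This is a minimal illustration on a made-up vector (the values and the names `marital_levels` and `marital_code` are assumptions for the example, not part of the original code): fixing the level order explicitly makes the integer codes reproducible.

```r
# Toy column standing in for banking$marital (illustrative values only)
marital <- c("married", "single", "divorced", "married", "unknown")

# Fix the level order explicitly so each category always maps to the same code
marital_levels <- c("divorced", "married", "single", "unknown")
marital_code <- as.integer(factor(marital, levels = marital_levels))

print(marital_code)
```

A value that is not listed in `marital_levels` would become `NA`, which is one quick way to spot categories that the mapping forgot.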

Next, let's explore the distribution of the outcome variable.

#EDA of the DV
plot(banking$y,main="Plot 1: Distribution of Dependent Variable")

Figure 1

As can be seen, the dependent variable (service subscription) is not evenly distributed: there are more "No" than "Yes" cases. The unbalanced distribution should raise a warning flag, because the data distribution affects the final statistical model. A model developed mostly from the majority cases can easily misclassify the minority cases.
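Alongside the plot, the imbalance can be quantified with a frequency table. The sketch below uses a made-up outcome vector with a 90/10 split (an assumption for illustration, not the actual share in the bank dataset); with the real data one would pass `banking$y` instead.

```r
# Toy outcome: 9 "no" for every "yes" (illustrative proportions only)
y <- factor(c(rep("no", 9), rep("yes", 1)))

class_counts <- table(y)               # absolute counts per class
class_share <- prop.table(class_counts) # relative share per class

print(class_counts)
print(round(class_share, 2))

# Name of the rarest class
minority <- names(which.min(class_counts))
print(minority)
```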

2. Data Split

Next, let's split the dataset into two parts: a training set and a test set. As a rule of thumb, we stick to an 80/20 split: 80% as the training set and 20% as the test set. For time series data, we train models on 90% of the data and keep the remaining 10% as the test set.

# dplyr provides %>% and select()
library(dplyr)

# split the dataset into training and test sets randomly
set.seed(1)  # set seed so as to generate the same split each time we run the code

# size of the test set: 20% of the rows
index = round(nrow(banking)*0.2, digits=0)

# sample row indices randomly throughout the dataset; the number sampled equals index
test.indices = sample(1:nrow(banking), index)

# 80% training set
banking.train = banking[-test.indices,]
# 20% test set
banking.test = banking[test.indices,]

# select the training set except the DV
YTrain = banking.train$y
XTrain = banking.train %>% select(-y)
# select the test set except the DV
YTest = banking.test$y
XTest = banking.test %>% select(-y)
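With a rare outcome, a purely random split can leave the test set with slightly more or fewer minority cases than the full data. One common alternative, which is not the split used in this post, is to sample the 20% within each class. The sketch below shows the idea on simulated data; `toy`, `test_idx`, and the 90/10 class mix are assumptions for illustration.

```r
set.seed(1)

# Simulated data: 90 "no" and 10 "yes" rows (illustrative only)
toy <- data.frame(
  x = rnorm(100),
  y = factor(rep(c("no", "yes"), times = c(90, 10)))
)

# Group the row indices by class, then draw 20% of the indices inside each group
rows_by_class <- split(seq_len(nrow(toy)), toy$y)
test_idx <- unname(unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), round(length(idx) * 0.2))]
})))

toy_test <- toy[test_idx, ]
toy_train <- toy[-test_idx, ]

# Both sets keep the same 9:1 class ratio
print(table(toy_train$y))
print(table(toy_test$y))
```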

Here, let's create an empty tracking record.

records = matrix(NA, nrow=5, ncol=2)
colnames(records) <- c("train.error","test.error")
rownames(records) <- c("Logistic","Tree","KNN","Random Forests","SVM")

3. Train Models

In this section, we define a new function (calc_error_rate) and use it to calculate the training and test errors of each machine learning model.

calc_error_rate <- function(predicted.value, true.value)
{return(mean(true.value!=predicted.value))}

The function returns the share of observations whose predicted label does not match the true value.
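A small worked example may help. The function is repeated so the snippet runs on its own; the two label vectors are made up for illustration.

```r
calc_error_rate <- function(predicted.value, true.value) {
  # TRUE/FALSE mismatches are averaged, which gives the misclassification rate
  mean(true.value != predicted.value)
}

predicted <- c("no", "no", "yes", "no", "yes")
actual    <- c("no", "yes", "yes", "no", "no")

# Positions 2 and 5 disagree, so 2 of 5 labels are wrong
err <- calc_error_rate(predicted, actual)
print(err)
```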

#1 Logistic Regression Model

For a brief introduction to the logistic model, see these two posts: "Machine Learning 101" (https://towardsdatascience.com/machine-learning-101-predicting-drug-use-using-logistic-regression-in-r-769be90eb03d) and "Machine Learning 102" (https://towardsdatascience.com/machine-learning-102-logistic-regression-with-polynomial-features-98a208688c17).

Let's fit a logistic model that includes all variables other than the outcome variable. Since the outcome is binary, we set the model family to the binomial distribution (family=binomial).

glm.fit = glm(y ~ age+factor(job)+factor(marital)+factor(education)+factor(default)+factor(housing)+factor(loan)+factor(contact)+factor(month)+factor(day_of_week)+campaign+previous+factor(poutcome)+emp.var.rate+cons.price.idx+cons.conf.idx+euribor3m+nr.employed, data=banking.train, family=binomial)

The next step is to obtain the training error. Since we predict the class of the outcome and adopt a majority rule, we set type to "response", which returns predicted probabilities: if the predicted probability exceeds 0.5, we predict yes; otherwise, no.

prob.training = predict(glm.fit, type="response")

banking.train_glm = banking.train %>%  # select all rows of the training set
  mutate(predicted.value=as.factor(ifelse(prob.training<=0.5, "no", "yes")))  # create a new variable using mutate and set a majority rule using ifelse

# get the training error
logit_traing_error <- calc_error_rate(predicted.value=banking.train_glm$predicted.value, true.value=YTrain)

# get the test error of the logistic model
prob.test = predict(glm.fit, banking.test, type="response")
banking.test_glm = banking.test %>%  # select rows
  mutate(predicted.value2=as.factor(ifelse(prob.test<=0.5, "no", "yes")))  # set rules
logit_test_error <- calc_error_rate(predicted.value=banking.test_glm$predicted.value2, true.value=YTest)

# write down the training and test errors of the logistic model
records[1,] <- c(logit_traing_error, logit_test_error)  # write into the first row
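The fit, predict, and threshold sequence can be tried end to end on simulated data. This is a minimal sketch under stated assumptions: the data-generating coefficients (-2 and 1.5), the single predictor `x`, and all object names are invented for the example and have nothing to do with the bank dataset.

```r
set.seed(1)
n <- 200
x <- rnorm(n)

# Simulated rare-ish outcome: P(yes) follows a logistic curve in x
y <- factor(ifelse(runif(n) < plogis(-2 + 1.5 * x), "yes", "no"),
            levels = c("no", "yes"))
toy <- data.frame(x = x, y = y)

fit <- glm(y ~ x, data = toy, family = binomial)

# type = "response" returns P(y == "yes"), i.e. the second factor level
prob_yes <- predict(fit, type = "response")

# Majority rule: predict "yes" only when the probability exceeds 0.5
pred <- factor(ifelse(prob_yes > 0.5, "yes", "no"), levels = c("no", "yes"))

conf_mat <- table(predicted = pred, actual = toy$y)
print(conf_mat)

train_error <- mean(pred != toy$y)
print(train_error)
```

The confusion table is worth printing next to the error rate: with a rare outcome, a low error can coexist with very few correctly predicted "yes" cases.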

#2 Decision Tree

For the decision tree, we use cross-validation to identify the best tree size (number of terminal nodes). For a quick introduction to decision trees, see this post: https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052.

library(tree)  # provides tree(), tree.control(), cv.tree() and prune.misclass()

# finding the best nodes
# the total number of rows
nobs = nrow(banking.train)

# build a DT model;
# please refer to this document (https://www.datacamp.com/community/tutorials/decision-trees-R) for constructing a DT model
bank_tree = tree(y~., data=banking.train, na.action=na.pass,
  control=tree.control(nobs, mincut=2, minsize=10, mindev=1e-3))

# cross validation to prune the tree
set.seed(3)
cv = cv.tree(bank_tree, FUN=prune.misclass, K=10)
cv

# identify the best cv
best.size.cv = cv$size[which.min(cv$dev)]
best.size.cv  # best = 3

bank_tree.pruned <- prune.misclass(bank_tree, best=3)
summary(bank_tree.pruned)

The best size from cross-validation is 3.

# Training and test errors of bank_tree.pruned
pred_train = predict(bank_tree.pruned, banking.train, type="class")
pred_test = predict(bank_tree.pruned, banking.test, type="class")

# training error
DT_training_error <- calc_error_rate(predicted.value=pred_train, true.value=YTrain)

# test error
DT_test_error <- calc_error_rate(predicted.value=pred_test, true.value=YTest)

# write down the errors
records[2,] <- c(DT_training_error, DT_test_error)

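#3 K-Nearest Neighbors (KNN)

As a non-parametric method, KNN requires no prior knowledge of the data distribution. Put simply, KNN finds the k nearest neighbors of the unit of interest and assigns that unit the majority label among them.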
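To make the mechanics concrete, here is a from-scratch sketch of the vote for a single query point, using Euclidean distance. It is only an illustration: the helper `knn_predict_one` and the six toy points are assumptions for the example, and the post itself relies on the knn() function from the class package.

```r
knn_predict_one <- function(train_x, train_y, query, k) {
  # Euclidean distance from the query point to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # Indices of the k closest training rows
  nearest <- order(dists)[seq_len(k)]
  # Majority vote among the neighbors' labels
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Two well-separated toy clusters (illustrative values only)
train_x <- matrix(c(0, 0,
                    0, 1,
                    1, 0,
                    5, 5,
                    5, 6,
                    6, 5), ncol = 2, byrow = TRUE)
train_y <- c("no", "no", "no", "yes", "yes", "yes")

label_a <- knn_predict_one(train_x, train_y, query = c(0.5, 0.5), k = 3)
label_b <- knn_predict_one(train_x, train_y, query = c(5.5, 5.5), k = 3)
print(c(label_a, label_b))
```

Because the vote depends on raw distances, features on larger scales dominate; scaling the predictors first is a common precaution.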

For a quick start, see the post "Beginner's Guide to K-Nearest Neighbors in R: from Zero to Hero": https://towardsdatascience.com/beginners-guide-to-k-nearest-neighbors-in-r-from-zero-to-hero-d92cd4074bdb. The same post gives a detailed description of cross-validation and the do.chunk function.

Using cross-validation, we find that the cross-validation error is smallest when k = 20.

library(class)     # provides knn()
library(reshape2)  # provides melt()

nfold = 10
set.seed(1)

# cut() divides the row sequence into nfold intervals; sample() then shuffles the fold labels
folds = seq.int(nrow(banking.train)) %>%
  cut(breaks=nfold, labels=FALSE) %>%
  sample()

do.chunk <- function(chunkid, folddef, Xdat, Ydat, k){
  train = (folddef!=chunkid)  # training index
  Xtr = Xdat[train,]   # training set by the index
  Ytr = Ydat[train]    # true label in training set
  Xvl = Xdat[!train,]  # test set
  Yvl = Ydat[!train]   # true label in test set
  predYtr = knn(train=Xtr, test=Xtr, cl=Ytr, k=k)  # predict training labels
  predYvl = knn(train=Xtr, test=Xvl, cl=Ytr, k=k)  # predict test labels
  data.frame(fold=chunkid,  # k folds
    train.error = calc_error_rate(predYtr, Ytr),  # training error per fold
    val.error = calc_error_rate(predYvl, Yvl))    # test error per fold
}

# set error.folds to save validation errors
error.folds = NULL

# candidate numbers of neighbors: 1, then 10 to 50 in steps of 10
kvec = c(1, seq(10, 50, length.out=5))

set.seed(1)
for (j in kvec){
  # ldply() comes from plyr; calling it as plyr::ldply avoids attaching plyr on top of dplyr
  tmp = plyr::ldply(1:nfold, do.chunk,  # apply do.chunk to each fold
    folddef=folds, Xdat=XTrain, Ydat=YTrain, k=j)  # required arguments
  tmp$neighbors = j  # track each value of neighbors
  error.folds = rbind(error.folds, tmp)  # combine the results
}

# melt() in the package reshape2 melts wide-format data into long-format data
errors = melt(error.folds, id.vars=c("fold","neighbors"), value.name="error")
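The fold assignment at the top of this block is easy to inspect in isolation. The sketch below repeats the cut-then-shuffle idea on 25 rows and 5 folds (both numbers, and the names `val_rows` and `train_rows`, are assumptions chosen to keep the output small).

```r
set.seed(1)
n <- 25      # number of rows (illustrative)
nfold <- 5   # number of folds (illustrative)

# cut() labels consecutive blocks of rows 1..nfold; sample() shuffles the labels
folds <- sample(cut(seq_len(n), breaks = nfold, labels = FALSE))

# Every fold receives the same number of rows
print(table(folds))

# Rows held out for validation in fold 1, and rows used for training
val_rows <- which(folds == 1)
train_rows <- which(folds != 1)
print(val_rows)
```

Shuffling matters because the raw data may be ordered, for example by date, in which case unshuffled blocks would not be comparable.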

Next, let's find the number of neighbors that minimizes the validation error.

val.error.means = errors %>%
  filter(variable=="val.error") %>%
  group_by(neighbors, variable) %>%
  summarise(error=mean(error)) %>%  # replaces the deprecated summarise_each(funs(mean), error)
  ungroup() %>%
  filter(error==min(error))

# the best number of neighbors = 20
numneighbor = max(val.error.means$neighbors)
numneighbor  # 20
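The selection logic (average the validation error per k, keep the minimum, break ties by taking the largest k) can be sketched in base R. The error values below are invented for illustration and are not results from the bank dataset.

```r
# Made-up per-fold validation errors for three candidate values of k
error_folds <- data.frame(
  neighbors = rep(c(1, 10, 20), each = 3),
  fold = rep(1:3, times = 3),
  val.error = c(0.14, 0.15, 0.13,   # k = 1
                0.11, 0.10, 0.12,   # k = 10
                0.10, 0.09, 0.11)   # k = 20
)

# Mean validation error per candidate k
mean_error <- aggregate(val.error ~ neighbors, data = error_folds, FUN = mean)
print(mean_error)

# Keep the rows with the smallest mean error; max() breaks ties toward larger k
best_rows <- mean_error[mean_error$val.error == min(mean_error$val.error), ]
best_k <- max(best_rows$neighbors)
print(best_k)
```

Preferring the larger k on ties favors the smoother, less flexible model.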

Following the same steps, we obtain the training and test errors.

# training error
set.seed(20)
pred.YTtrain = knn(train=XTrain, test=XTrain, cl=YTrain, k=20)
knn_traing_error <- calc_error_rate(predicted.value=pred.YTtrain, true.value=YTrain)

# test error = 0.095
set.seed(20)
pred.YTest = knn(train=XTrain, test=XTest, cl=YTrain, k=20)
knn_test_error <- calc_error_rate(predicted.value=pred.YTest, true.value=YTest)

records[3,] <- c(knn_traing_error, knn_test_error)

#4 Random Forests

We follow the standard steps for building a Random Forests model. For a quick introduction to Random Forests, see this post: https://towardsdatascience.com/understanding-random-forest-58381e0602d2.

library(randomForest)  # provides randomForest()

# build a RF model with default settings
set.seed(1)
RF_banking_train = randomForest(y ~ ., data=banking.train, importance=TRUE)

# predicting outcome classes using training and test sets
pred_train_RF = predict(RF_banking_train, banking.train, type="class")
pred_test_RF = predict(RF_banking_train, banking.test, type="class")

# training error
RF_training_error <- calc_error_rate(predicted.value=pred_train_RF, true.value=YTrain)

# test error
RF_test_error <- calc_error_rate(predicted.value=pred_test_RF, true.value=YTest)

records[4,] <- c(RF_training_error, RF_test_error)

#5 Support Vector Machines

Likewise, we follow the standard steps for building an SVM. For a quick introduction to the method, see this post: https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47.

library(e1071)  # provides svm() and tune()

set.seed(1)
tune.out = tune(svm, y ~., data=banking.train,
  kernel="radial", ranges=list(cost=c(0.1,1,10)))

# find the best parameters
summary(tune.out)$best.parameters

# the best model
best_model = tune.out$best.model
svm_fit = svm(y~., data=banking.train, kernel="radial", gamma=0.05555556, cost=1, probability=TRUE)

# using training/test sets to predict outcome classes
svm_best_train = predict(svm_fit, banking.train, type="class")
svm_best_test = predict(svm_fit, banking.test, type="class")

# training error
svm_training_error <- calc_error_rate(predicted.value=svm_best_train, true.value=YTrain)

# test error
svm_test_error <- calc_error_rate(predicted.value=svm_best_test, true.value=YTest)

records[5,] <- c(svm_training_error, svm_test_error)

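4. Model Metrics

We have built all the machine learning models following the model selection procedure and obtained their training and test errors. In this section, we use several model metrics to select the best model.

4.1 Train/Test Errors

Can we find the best model using the train/test errors?

Now let's look at the results.

records

Figure 2

Here, Random Forests has the smallest training error, although its test error is similar to that of the other methods. As you may notice, the training and test errors are very close, and it is hard to say which model is the clear winner.

Besides, classification accuracy, whether measured as training error or test error, should not be the metric for a highly imbalanced dataset. Because the dataset is dominated by the majority cases, a model can reach high accuracy simply by predicting the majority class most of the time. Even worse, a highly accurate model may severely "penalize" the minority cases. So let's look at another metric: the ROC curve.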
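The point about accuracy is easy to demonstrate with a "model" that ignores the inputs entirely. The sketch below uses a made-up outcome with 95 "no" and 5 "yes" cases (an assumption for illustration, not the bank data) and a rule that always predicts "no".

```r
# Toy outcome: 95 majority cases and 5 minority cases (illustrative only)
actual <- factor(rep(c("no", "yes"), times = c(95, 5)), levels = c("no", "yes"))

# A trivial rule that always predicts the majority class
always_no <- factor(rep("no", length(actual)), levels = c("no", "yes"))

# Accuracy looks excellent...
accuracy <- mean(always_no == actual)

# ...but not a single minority case is found
recall_yes <- sum(always_no == "yes" & actual == "yes") / sum(actual == "yes")

print(c(accuracy = accuracy, recall_yes = recall_yes))
```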

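4.2 Receiver Operating Characteristic (ROC) Curve

ROC is a graphical representation of how a classification model performs at all classification thresholds. We prefer a classifier whose curve approaches a true positive rate of 1 more quickly than the others.

The ROC curve plots two parameters against each other at different thresholds in the same graph: the True Positive Rate and the False Positive Rate.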

TPR (Recall) = TP/(TP+FN)

FPR = FP/(TN+FP)
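As a quick numeric check of the two formulas, the sketch below counts TP, FN, FP, and TN from ten made-up labels (the vectors are assumptions for illustration) and plugs them in.

```r
# Toy labels: 3 actual positives and 7 actual negatives (illustrative only)
actual    <- c("yes", "yes", "yes", "no", "no", "no", "no", "no", "no", "no")
predicted <- c("yes", "yes", "no",  "yes", "no", "no", "no", "no", "no", "no")

TP <- sum(predicted == "yes" & actual == "yes")  # 2 hits
FN <- sum(predicted == "no"  & actual == "yes")  # 1 miss
FP <- sum(predicted == "yes" & actual == "no")   # 1 false alarm
TN <- sum(predicted == "no"  & actual == "no")   # 6 correct rejections

TPR <- TP / (TP + FN)  # 2/3
FPR <- FP / (TN + FP)  # 1/7

print(round(c(TPR = TPR, FPR = FPR), 3))
```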

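Figure 3

To a large extent, the ROC curve not only measures classification accuracy but also strikes a good balance between TPR and FPR. This is what rare events call for, since we also want to strike a balance between the majority and minority cases.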
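To see where the points of an ROC curve come from, one can sweep the threshold by hand. This is a minimal sketch, assuming six made-up scores and labels; the helper `roc_points` is hypothetical and is not part of ROCR, which the post uses for the real curves.

```r
roc_points <- function(score, actual_positive) {
  # Start above the highest score (nothing predicted positive), then lower the bar
  thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))
  t(sapply(thresholds, function(th) {
    pred_pos <- score >= th
    c(threshold = th,
      fpr = sum(pred_pos & !actual_positive) / sum(!actual_positive),
      tpr = sum(pred_pos & actual_positive) / sum(actual_positive))
  }))
}

# Toy scores and true labels (illustrative only)
score <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
actual_positive <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

roc <- roc_points(score, actual_positive)
print(roc)
```

Each row is one point of the curve: it starts at (0, 0) with the strictest threshold and ends at (1, 1) once every case is predicted positive.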

# load the library
library(ROCR)

# creating a tracking record
Area_Under_the_Curve = matrix(NA, nrow=5, ncol=1)
colnames(Area_Under_the_Curve) <- c("AUC")
rownames(Area_Under_the_Curve) <- c("Logistic","Tree","KNN","Random Forests","SVM")

########### logistic regression ###########
# ROC
prob_test <- predict(glm.fit, banking.test, type="response")
pred_logit <- prediction(prob_test, banking.test$y)
performance_logit <- performance(pred_logit, measure="tpr", x.measure="fpr")

########### Decision Tree ###########
# ROC
pred_DT <- predict(bank_tree.pruned, banking.test, type="vector")
pred_DT <- prediction(pred_DT[,2], banking.test$y)
performance_DT <- performance(pred_DT, measure="tpr", x.measure="fpr")

########### KNN ###########
# ROC
# note: as written, this block scores the training set (test=XTrain, labels YTrain),
# whereas the other four models are scored on banking.test
knn_model = knn(train=XTrain, test=XTrain, cl=YTrain, k=20, prob=TRUE)
prob <- attr(knn_model, "prob")
# note: the labels here are "no"/"yes", so the comparison with "-1" is never TRUE
prob <- 2*ifelse(knn_model == "-1", prob, 1-prob) - 1
pred_knn <- prediction(prob, YTrain)
performance_knn <- performance(pred_knn, "tpr", "fpr")

########### Random Forests ###########
# ROC
pred_RF <- predict(RF_banking_train, banking.test, type="prob")
pred_class_RF <- prediction(pred_RF[,2], banking.test$y)
performance_RF <- performance(pred_class_RF, measure="tpr", x.measure="fpr")

########### SVM ###########
# ROC
svm_fit_prob = predict(svm_fit, type="prob", newdata=banking.test, probability=TRUE)
svm_fit_prob_ROCR = prediction(attr(svm_fit_prob,"probabilities")[,2], banking.test$y=="yes")
performance_svm <- performance(svm_fit_prob_ROCR, "tpr", "fpr")

Let's plot the ROC curves.

We add a straight line to show the performance of random assignment. Our classifiers should do better than random guessing, right?

# logit
plot(performance_logit, col=2, lwd=2, main="ROC Curves for These Five Classification Methods")
legend(0.6, 0.6, c('logistic', 'Decision Tree', 'KNN', 'Random Forests', 'SVM'), 2:6)

# decision tree
plot(performance_DT, col=3, lwd=2, add=TRUE)

# knn
plot(performance_knn, col=4, lwd=2, add=TRUE)

# RF
plot(performance_RF, col=5, lwd=2, add=TRUE)

# SVM
plot(performance_svm, col=6, lwd=2, add=TRUE)

# diagonal reference line for random guessing
abline(0,1)

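Figure 4

Here we have a winner.

According to the ROC curves, KNN (the blue line) stands above all the other methods.

4.3 Area Under the Curve (AUC)

As the name suggests, AUC is the area under the ROC curve. It is a numerical summary of the visual ROC curve. AUC gives an aggregate measure of how a classifier performs across the possible classification thresholds.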
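A useful way to read the number: AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it from ranks on six made-up scores; the helper `auc_rank` is hypothetical and is shown only to illustrate the definition, not as the routine ROCR uses internally.

```r
auc_rank <- function(score, actual_positive) {
  r <- rank(score)  # average ranks, so tied scores count as half a win
  n_pos <- sum(actual_positive)
  n_neg <- sum(!actual_positive)
  # Rank-sum of the positives, shifted and scaled to the share of
  # positive/negative pairs in which the positive scores higher
  (sum(r[actual_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores and true labels (illustrative only)
score <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
actual_positive <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# 8 of the 9 positive/negative pairs are ordered correctly
auc <- auc_rank(score, actual_positive)
print(auc)
```

An AUC of 0.5 corresponds to random ordering and 1 to perfect separation of the two classes.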

########### Logit ###########
auc_logit = performance(pred_logit, "auc")@y.values
Area_Under_the_Curve[1,] <- c(as.numeric(auc_logit))

########### Decision Tree ###########
auc_dt = performance(pred_DT, "auc")@y.values
Area_Under_the_Curve[2,] <- c(as.numeric(auc_dt))

########### KNN ###########
auc_knn <- performance(pred_knn, "auc")@y.values
Area_Under_the_Curve[3,] <- c(as.numeric(auc_knn))

########### Random Forests ###########
auc_RF = performance(pred_class_RF, "auc")@y.values
Area_Under_the_Curve[4,] <- c(as.numeric(auc_RF))

########### SVM ###########
auc_svm <- performance(svm_fit_prob_ROCR, "auc")@y.values[[1]]
Area_Under_the_Curve[5,] <- c(as.numeric(auc_svm))

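Let's check the AUC values.

Area_Under_the_Curve

Figure 5

Also, KNN has the largest AUC value (0.847).

Conclusion

In this post, we find that KNN, a non-parametric classifier, performs better than the parametric classifiers. In terms of metrics, it is more reasonable to choose the ROC curve over classification accuracy for rare events.

Original title: Classify A Rare Event Using 5 Machine Learning Algorithms, by Leihua Ye

[51CTO translation. Partner sites reprinting this article should credit the original translator and cite 51CTO.com as the source.]

Editor: 龐桂玉. Source: 51CTO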
