自拍偷在线精品自拍偷,亚洲欧美中文日韩v在线观看不卡

^{<blockquote id="4m83j"></blockquote>}

AI.x社區(qū)

軟考社區(qū)

免費(fèi)課

企業(yè)培訓(xùn)

鴻蒙開發(fā)者社區(qū)

WOT技術(shù)大會(huì)

公眾號(hào)矩陣

移動(dòng)端

視頻課免費(fèi)課排行榜短視頻直播課軟考學(xué)堂

全部課程軟考華為認(rèn)證廠商認(rèn)證 IT技術(shù)PMP項(xiàng)目管理免費(fèi)題庫

在線學(xué)習(xí)

文章資源問答課堂專欄直播

51CTO

鴻蒙開發(fā)者社區(qū)

51CTO技術(shù)棧

51CTO官微

51CTO學(xué)堂

51CTO博客

CTO訓(xùn)練營

鴻蒙開發(fā)者社區(qū)訂閱號(hào)

51CTO軟考

51CTO學(xué)堂APP

51CTO學(xué)堂企業(yè)版APP

鴻蒙開發(fā)者社區(qū)視頻號(hào)

51CTO軟考題庫

賬號(hào)設(shè)置退出

用Scikit-Learn構(gòu)建K-近鄰算法，分類MNIST數(shù)據(jù)集

作者：乾樹、劉曉坤譯 2018-04-06 05:10:04

開發(fā) 開發(fā)工具算法

本篇教程將帶你使用 Scikit-Learn 構(gòu)建 K 近鄰算法，并應(yīng)用于 MNIST 數(shù)據(jù)集。然后，作者將帶你構(gòu)建自己的 K-NN 算法，開發(fā)出比 Scikit-Learn K-NN 更準(zhǔn)更快的算法。

K 近鄰算法，簡稱 K-NN。在如今深度學(xué)習(xí)盛行的時(shí)代，這個(gè)經(jīng)典的機(jī)器學(xué)習(xí)算法經(jīng)常被輕視。本篇教程將帶你使用 Scikit-Learn 構(gòu)建 K 近鄰算法，并應(yīng)用于 MNIST 數(shù)據(jù)集。然后，作者將帶你構(gòu)建自己的 K-NN 算法，開發(fā)出比 Scikit-Learn K-NN 更準(zhǔn)更快的算法。

一、K 近鄰分類模型

K 近鄰算法是一種容易實(shí)現(xiàn)的監(jiān)督機(jī)器學(xué)習(xí)算法，并且其分類性能的魯棒性還不錯(cuò)。K-NN ***的優(yōu)點(diǎn)之一就是它是一個(gè)惰性算法，即該模型無須訓(xùn)練就可以對(duì)數(shù)據(jù)進(jìn)行分類，而不像其他需要訓(xùn)練的 ML 算法，如 SVM、回歸和多層感知機(jī)。

K-NN 如何工作

為了對(duì)給定的數(shù)據(jù)點(diǎn) p 進(jìn)行分類，K-NN 模型首先使用某個(gè)距離度量將 p 與其數(shù)據(jù)庫中其它點(diǎn)進(jìn)行比較。

距離度量就是類似歐幾里得距離之類的標(biāo)準(zhǔn)，以兩個(gè)點(diǎn)為輸入并返回這兩個(gè)點(diǎn)之間距離的簡單函數(shù)。

因此，可以假設(shè)距離較小的兩個(gè)點(diǎn)比距離較大的兩個(gè)點(diǎn)相似度更高。這是 K-NN 的核心思想。

該過程將返回一個(gè)無序數(shù)組，其中數(shù)組中的每一項(xiàng)都表示 p 與模型數(shù)據(jù)庫中 n 個(gè)數(shù)據(jù)點(diǎn)之間的距離。所以返回?cái)?shù)組的大小為 n。

K 近鄰的 K 的含義是：k 是一個(gè)任意值(通常在 3-11 之間)，表示模型在對(duì) p 分類時(shí)應(yīng)該考慮多少個(gè)最相似的點(diǎn)。然后模型將記錄這 k 個(gè)最相似的值，并使用投票算法來決定 p 屬于哪一類，如下圖所示。

上圖中的 K-NN 模型的 k 值為 3，箭頭指向的中心點(diǎn)為 p，算法將對(duì)這個(gè)點(diǎn)進(jìn)行分類。

如你所見，圓圈中的三個(gè)點(diǎn)是與 p 最接近或最相似的三個(gè)點(diǎn)。因此，使用簡單的投票算法，p 將被歸為「白色」，因?yàn)榘咨?k 個(gè)最相似值中占大多數(shù)。

酷炫!但令人驚訝的是，這個(gè)簡單的算法可以在某些情況下實(shí)現(xiàn)不俗的結(jié)果，并且可以應(yīng)用于各種各樣的問題，我們將在下面介紹。

二、在 Scikit-Learn 中實(shí)現(xiàn) K-NN 算法用來分類 MNIST 圖像

1. 數(shù)據(jù)

對(duì)于這個(gè)例子，我們將使用常見的 MNIST 數(shù)據(jù)集。MNIST 數(shù)據(jù)集是機(jī)器學(xué)習(xí)中最常用的數(shù)據(jù)集之一，因?yàn)樗苋菀讓?shí)現(xiàn)，而且是驗(yàn)證我們模型的可靠方法。

MNIST 是一組包含 70,000 個(gè)手寫數(shù)字 0-9 的數(shù)據(jù)集。任意兩個(gè)手寫數(shù)字都不相同，有些可能很難正確分類。

2. 算法

我們從 Scikit-Learn 的 Python 庫的 KNeighborsClassifier() 函數(shù)入手。這個(gè)函數(shù)有很多參數(shù)，但在這個(gè)例子中我們只需用少量幾個(gè)參數(shù)。具體來說，我們只會(huì)傳遞 n_neighbors 參數(shù)的值(就是 k 值啦)。

weights 參數(shù)給出了模型使用的投票算法的類型，其中默認(rèn)值是 uniform。這意味著在對(duì) p 進(jìn)行分類時(shí)，k 個(gè)點(diǎn)中的每一個(gè)的權(quán)重都一樣。algorithm 參數(shù)也將使用默認(rèn)值 auto，因?yàn)槲覀兿Ｍ?Scikit-Learn 自動(dòng)找到對(duì) MNIST 數(shù)據(jù)進(jìn)行分類的***算法。

以下是一個(gè)用 Scikit-Learn 構(gòu)建 K-NN 分類器的 Jupyter Notebook：Scikit-Learn 實(shí)現(xiàn)的用于 MNIST 的 K 近鄰算法

Notebook 地址：https://gist.github.com/samgrassi01/82d0e5f89daac3e65531a6ef497cc129#file-skl-knn-ipynb

我們通過導(dǎo)入所需的庫直接開始。

In [1]: 
import numpy as np 
 
from sklearn import datasets, model_selection 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.metrics import classification_report 
 
mnist = datasets.fetch_mldata('MNIST original') 
data, target = mnist.data, mnist.target 
 
# make sure everything was correctly imported 
data.shape, target.shape 
Out[1]: 
((70000, 784), (70000,))

(1) 構(gòu)建數(shù)據(jù)集

我們通過制作不同的數(shù)據(jù)集來構(gòu)建 K-NN 模型。我們將創(chuàng)建一個(gè)可以獲取特定大小數(shù)據(jù)集、返回?cái)?shù)據(jù)集大小的函數(shù)。

In [2]: 
# make an array of indices the size of MNIST to use for making the data sets. 
# This array is in random order, so we can use it to scramble up the MNIST data 
indx = np.random.choice(len(target), 70000, replace=False) 
 
# method for building datasets to test with 
def mk_dataset(size): 
    """makes a dataset of size "size", and returns that datasets images and targets 
    This is used to make the dataset that will be stored by a model and used in  
    experimenting with different stored dataset sizes 
    """ 
    train_img = [data[i] for i in indx[:size]] 
    train_img = np.array(train_img) 
    train_target = [target[i] for i in indx[:size]] 
    train_target = np.array(train_target)

不錯(cuò)?，F(xiàn)在我們將使用這個(gè)函數(shù)來構(gòu)建兩個(gè)不同大小的數(shù)據(jù)集，來看看模型在不同數(shù)據(jù)量上的分類性能怎么樣。

提示：制作較小的數(shù)據(jù)集時(shí)，你仍然可以進(jìn)行分類，但模型畢竟少了一些數(shù)據(jù)，這可能會(huì)導(dǎo)致分類錯(cuò)誤。

In [3]: 
# lets make a dataset of size 50,000, meaning the model will have 50,000 data points to compare each  
# new point it is to classify to 
fifty_x, fifty_y = mk_dataset(50000) 
fifty_x.shape, fifty_y.shape 
Out[3]: 
((50000, 784), (50000,)) 
In [4]: 
# lets make one more of size 20,000 and see how classification accuracy decreases when we use that one 
twenty_x, twenty_y = mk_dataset(20000) 
twenty_x.shape, twenty_y.shape 
Out[4]: 
((20000, 784), (20000,))

注意這些數(shù)據(jù)是如何為模型匹配標(biāo)簽的。模型需要這些標(biāo)簽來理解每一個(gè)點(diǎn)代表什么，因此可以把我們要分類的點(diǎn)放在一個(gè)特定的類中，而不是說「這是與待分類點(diǎn)最相似的類」。

現(xiàn)在我們將構(gòu)建一個(gè)大小為 10000 的測(cè)試集。

In [5]: 
# build model testing dataset 
test_img = [data[i] for i in indx[60000:70000]] 
test_img1 = np.array(test_img) 
test_target = [target[i] for i in indx[60000:70000]] 
test_target1 = np.array(test_target) 
test_img1.shape, test_target1.shape 
Out[5]: 
((10000, 784), (10000,))

不錯(cuò)!現(xiàn)在我們已經(jīng)完成了所有的數(shù)據(jù)處理，可以開始搭建 K-NN 模型了!

(2) 構(gòu)建模型

我們首先將 Scikit-Learn K-NN 模型放在函數(shù)中，以便可以輕松調(diào)用它并對(duì)其進(jìn)行調(diào)整。

In [6]: 
def skl_knn(k, test_data, test_target, stored_data, stored_target): 
    """k: number of neighbors to use in classication 
    test_data: the data/targets used to test the classifier 
    stored_data: the data/targets used to classify the test_data 
    """ 
 
    classifier = KNeighborsClassifier(n_neighbors=k)   
    classifier.fit(stored_data, stored_target) 
 
    y_pred = classifier.predict(test_data)  
 
    print(classification_report(test_target, y_pred))

(3) 測(cè)試

現(xiàn)在我們看看這個(gè)模型在兩個(gè)不同的測(cè)試集上的運(yùn)行效果。

In [7]: 
%%time 
# stored data set size of 50,000 
skl_knn(5, test_img1, test_target1, fifty_x, fifty_y)

In [8]: 
%%time 
# stored data set size of 20,000 
skl_knn(5, test_img1, test_target1, twenty_x, twenty_y)

可以的!我們的模型與人眼識(shí)別差不多!如你所見，當(dāng)模型有更多的數(shù)據(jù)可以使用時(shí)(50,000 而不是 20,000 個(gè)點(diǎn))，它的性能會(huì)更好。更加引人注目的是，它非常簡單，并且能以人類水平來獲取不同圖像之間的復(fù)雜關(guān)系。更多的細(xì)節(jié)分析請(qǐng)?jiān)L問這個(gè) GitHub repo：

https://github.com/samgrassi01/Cosine-Similarity-Classifier。

厲害了!我們使用 Scikit-Learn 構(gòu)建了一個(gè)非常簡單的 K 近鄰模型，該模型在 MNIST 數(shù)據(jù)集上表現(xiàn)非凡。

不足之處?分類這些點(diǎn)需要很長時(shí)間(兩個(gè)數(shù)據(jù)集分別耗時(shí) 8 分鐘和 4 分鐘)，諷刺的是，K-NN 仍然是最快的分類方法之一。我們必須有一個(gè)更快的方法。

三、構(gòu)建一個(gè)更快的模型

大多數(shù) K-NN 模型使用歐幾里德距離或曼哈頓距離作為 go-to 距離度量。這些指標(biāo)非常簡單，而且在各種各樣的情況中都表現(xiàn)不錯(cuò)。

還有一個(gè)很少用到的距離標(biāo)準(zhǔn)度量是余弦相似度。余弦相似度通常不是 go-to 距離度量標(biāo)準(zhǔn)，這是因?yàn)樗`反了三角不等式，而且對(duì)負(fù)數(shù)無效。但是，余弦相似度對(duì)于 MNIST 來說很***。它速度快、算法簡單，而且比 MNIST 中應(yīng)用其他距離度量的準(zhǔn)確率稍高一些。

但是，為了得到***性能，我們需要自己編寫 K-NN 模型。然后，我們應(yīng)該能得到比 Scikit-Learn 模型更高的性能，甚至能得到更高的準(zhǔn)確度。讓我們看看以下建立的 K-NN 模型的 Notebook 吧：

構(gòu)建一個(gè)更快的 KNN 分類器

Notebook 地址：

https://gist.github.com/samgrassi01/15a1fe53dcde8813eed9367b103676b2#file-cos-knn-ipynb

在這個(gè) notebook 中，我們將構(gòu)建一個(gè)簡單的 K-NN 模型，該模型使用余弦相似度作為距離度量對(duì) MNIST 圖像進(jìn)行分類，試圖找到更快或更加準(zhǔn)確的模型。

首先，需要導(dǎo)入所需的庫，然后構(gòu)建與 Scikit-Learn K-NN notebook 相同的數(shù)據(jù)集。

In [1]: 
import numpy as np 
import heapq 
from collections import Counter 
from sklearn.metrics.pairwise import cosine_similarity 
from sklearn import datasets, model_selection 
from sklearn.metrics import classification_report 
 
mnist = datasets.fetch_mldata('MNIST original') 
data, target = mnist.data, mnist.target 
 
# make sure everything was correctly imported 
data.shape, target.shape 
Out[1]: 
((70000, 784), (70000,))

使用與 Scikit-Learn K-NN notebook 相同的方法，設(shè)置完全相同的數(shù)據(jù)集。

In [2]: 
# make an array of indices the size of MNIST to use for making the data sets. 
# This array is in random order, so we can use it to scramble up the MNIST data 
indx = np.random.choice(len(target), 70000, replace=False) 
 
# method for building datasets to test with 
def mk_dataset(size): 
    """makes a dataset of size "size", and returns that datasets images and targets 
    This is used to make the dataset that will be stored by a model and used in  
    experimenting with different stored dataset sizes 
    """ 
    train_img = [data[i] for i in indx[:size]] 
    train_img = np.array(train_img) 
    train_target = [target[i] for i in indx[:size]] 
    train_target = np.array(train_target) 
 
    return train_img, train_target 
In [3]: 
# lets make a dataset of size 50,000, meaning the model will have 50,000 data points to compare each  
# new point it is to classify to 
fifty_x, fifty_y = mk_dataset(50000) 
fifty_x.shape, fifty_y.shape 
Out[3]: 
((50000, 784), (50000,)) 
In [4]: 
# lets make one more of size 20,000 and see how classification accuracy decreases when we use that one 
twenty_x, twenty_y = mk_dataset(20000) 
twenty_x.shape, twenty_y.shape 
Out[4]: 
((20000, 784), (20000,)) 
In [5]: 
# build model testing dataset 
test_img = [data[i] for i in indx[60000:70000]] 
test_img1 = np.array(test_img) 
test_target = [target[i] for i in indx[60000:70000]] 
test_target1 = np.array(test_target) 
test_img1.shape, test_target1.shape 
Out[5]: 
((10000, 784), (10000,))

1. . 構(gòu)建模型

下面，我們創(chuàng)建函數(shù) cos_knn()，作為用于 MNIST 數(shù)據(jù)集的分類器。你可以利用函數(shù)的注釋了解其工作原理。

In [6]: 
def cos_knn(k, test_data, test_target, stored_data, stored_target): 
    """k: number of neighbors to use for voting 
    test_data: a set of unobserved images to classify 
    test_target: the labels for the test_data (for calculating accuracy) 
    stored_data: the images already observed and available to the model 
    stored_target: labels for stored_data 
    """ 
 
    # find cosine similarity for every point in test_data between every other point in stored_data 
    cosim = cosine_similarity(test_data, stored_data) 
 
    # get top k indices of images in stored_data that are most similar to any given test_data point 
    top = [(heapq.nlargest((k), range(len(i)), i.take)) for i in cosim] 
    # convert indices to numbers using stored target values 
    top = [[stored_target[j] for j in i[:k]] for i in top] 
 
    # vote, and return prediction for every image in test_data 
    pred = [max(set(i), key=i.count) for i in top] 
    pred = np.array(pred) 
 
    # print table giving classifier accuracy using test_target 
    print(classification_report(test_target, pred))

2. 測(cè)試模型

現(xiàn)在，就像 Scikit-Learn K-NN 模型一樣，我們對(duì) cos_knn() 模型在兩個(gè)數(shù)據(jù)集上分別測(cè)試，并看看它的性能如何。

In [7]: 
%%time 
# stored data set size of 50,000 
cos_knn(5, test_img1, test_target1, fifty_x, fifty_y)

In [8]: 
%%time 
# stored data set size of 20,000 
cos_knn(5, test_img1, test_target1, twenty_x, twenty_y)

太棒了!余弦相似度模型性能超過了 Scikit-Learn K-NN!值得一提的是，該模型的分類速度和準(zhǔn)確率都優(yōu)于 Scikit-Learn K-NN(其中速度獲得了很大提升)，而模型卻非常簡單!

為了進(jìn)一步分析模型的工作原理，同時(shí)了解該模型為何在許多不同情況下比 Scikit-Learn K-NN 模型要性能更優(yōu)，請(qǐng)參閱這個(gè) GitHub repo：

https://github.com/samgrassi01/Cosine-Similarity-Classifier。

正如 notebook 所示，該 K-NN 模型在分類速度和準(zhǔn)確率方面都勝過了 Scikit-Learn K-NN，其中速度獲得了大幅提升，而在一個(gè)數(shù)據(jù)集上的準(zhǔn)確率提高了 1%。既然如此，我們可以在實(shí)踐中繼續(xù)使用這個(gè)模型了。

四、結(jié)論

首先，我們知道了 K-NN 的工作機(jī)制，以及如何輕松地實(shí)現(xiàn)它。但最重要的是，我們發(fā)現(xiàn)，始終考慮需要解決的問題以及解決問題的工具非常重要。有時(shí)候，在解決問題的過程中，***花一些時(shí)間來實(shí)踐——當(dāng)然，也需要建立自己的模型。正如 notebook 中所展示的那樣，它可以帶來極大的益處：我們第二個(gè)專有模型獲得了 1.5 - 2 倍的加速，節(jié)省了很多時(shí)間。

原文鏈接：

https://towardsdatascience.com/building-improving-a-k-nearest-neighbors-algorithm-in-python-3b6b5320d2f8

【本文是51CTO專欄機(jī)構(gòu)“機(jī)器之心”的原創(chuàng)譯文，微信公眾號(hào)“機(jī)器之心( id: almosthuman2014)”】

戳這里，看該作者更多好文

責(zé)任編輯：趙寧寧來源： 51CTO專欄

K-NN 數(shù)據(jù)集算法

點(diǎn)贊

51CTO技術(shù)棧公眾號(hào)

業(yè)務(wù)
速覽

媒體

51CTO CIOAge HC3i

社區(qū)

51CTO博客鴻蒙開發(fā)者社區(qū) AI.x社區(qū)

教育

51CTO學(xué)堂精培企業(yè)培訓(xùn) CTO訓(xùn)練營

^{<sub id="aptcj"><big id="aptcj"></big></sub>}

<p id="aptcj"></p>