自拍偷在线精品自拍偷,亚洲欧美中文日韩v在线观看不卡

<sub id="svm0z"><i id="svm0z"></i></sub>

<sub id="svm0z"></sub>

51CTO首頁(yè)

AI.x社區(qū)

軟考社區(qū)

免費(fèi)課

企業(yè)培訓(xùn)

鴻蒙開(kāi)發(fā)者社區(qū)

WOT技術(shù)大會(huì)

公眾號(hào)矩陣

移動(dòng)端

視頻課免費(fèi)課排行榜短視頻直播課軟考學(xué)堂

全部課程軟考華為認(rèn)證廠商認(rèn)證 IT技術(shù)PMP項(xiàng)目管理免費(fèi)題庫(kù)

在線學(xué)習(xí)

文章資源問(wèn)答課堂專(zhuān)欄直播

51CTO

鴻蒙開(kāi)發(fā)者社區(qū)

51CTO技術(shù)棧

51CTO官微

51CTO學(xué)堂

51CTO博客

CTO訓(xùn)練營(yíng)

鴻蒙開(kāi)發(fā)者社區(qū)訂閱號(hào)

51CTO軟考

51CTO學(xué)堂APP

51CTO學(xué)堂企業(yè)版APP

鴻蒙開(kāi)發(fā)者社區(qū)視頻號(hào)

51CTO軟考題庫(kù)

賬號(hào)設(shè)置退出

Three Metrics for Choosing the Number of Clusters in K-Means and Hierarchical Clustering, with Python Example Code

Author: 新語數據故事匯, 2024-07-16 10:35:42

In data analysis and machine learning, clustering is a key technique for finding patterns and insights in unlabeled data. Choosing the number of clusters is one of the main challenges of the process and directly affects the quality of the analysis. This article covers several cluster validation techniques, namely the Gap statistic, the Calinski-Harabasz index, the Davies-Bouldin index, and the silhouette score, which help select the number of clusters and make clustering results more valid and reliable.

In data analysis and machine learning, clustering is a core technique for discovering patterns and insights in unlabeled data. Clustering groups data points so that points in the same group are more similar to each other than to points in other groups, which matters in applications ranging from market segmentation to social network analysis. One of the hardest parts of clustering, however, is choosing the number of clusters, a decision that strongly affects the quality of the analysis.

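Most data scientists rely on elbow plots and dendrograms to choose the number of clusters for K-means and hierarchical clustering, but there is a further set of cluster validation techniques that can be used to pick the number of groups (clusters). We implement a set of these validation metrics using K-means and hierarchical clustering on the sklearn.datasets.load_wine dataset. Most of the code snippets below are reusable and can be applied to any dataset in Python.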

We cover the following metrics:

Gap statistic (!pip install --upgrade gap-stat[rust])
Calinski-Harabasz index (!pip install yellowbrick)
Davies-Bouldin score (provided as part of Scikit-Learn)
Silhouette score (!pip install yellowbrick)

Importing packages and loading the data

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
sns.set(color_codes=True)


from sklearn.datasets import load_iris, load_wine, load_digits, make_blobs
wine = load_wine()
X_wine = wine.data
X_wine

Standardize the data:

scaler=StandardScaler()
X_wine_int=X_wine.copy()
X_wine_interim=scaler.fit_transform(X_wine_int)
X_wine_scaled=pd.DataFrame(X_wine_interim)
X_wine_scaled.head(10)
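
As a quick sanity check on the scaling step, the sketch below recomputes the z-scores by hand with NumPy and compares them with the output of StandardScaler. The variable names (X, X_manual, X_sklearn) are chosen for this illustration only and are not part of the original code.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X = load_wine().data

# Z-score by hand: subtract each column's mean and divide by its standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), NumPy's default.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

X_sklearn = StandardScaler().fit_transform(X)

# Both routes should agree up to floating-point error.
print(np.allclose(X_manual, X_sklearn))
```

After scaling, every feature has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances used by the metrics that follow.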

Gap Statistic

from gap_statistic import OptimalK
from sklearn.cluster import KMeans
def KMeans_clustering_func(X, k):
    """ 
    K Means Clustering function, which uses the K Means model from sklearn.
    These user-defined functions *must* take the X (input features) and a k 
    when initializing OptimalK
    """
    
    # Include any clustering Algorithm that can return cluster centers
    
    m = KMeans(random_state=11, n_clusters=k)
    m.fit(X)
    return m.cluster_centers_, m.predict(X)
# Wrap the clustering function in OptimalK, which uses its cluster centers and labels
optimalK = OptimalK(clusterer=KMeans_clustering_func)
# Run OptimalK on the scaled data (X_wine_scaled) for K = 1 to 14
n_clusters = optimalK(X_wine_scaled, cluster_array=np.arange(1, 15))
print('Optimal clusters: ', n_clusters)
# Gap statistic values per cluster count
optimalK.gap_df[['n_clusters', 'gap_value']]

plt.figure(figsize=(10,6))
# Highlight K=3, the value discussed below; this overrides the value returned by OptimalK above
n_clusters=3
plt.plot(optimalK.gap_df.n_clusters.values, optimalK.gap_df.gap_value.values, linewidth=2)
plt.scatter(optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].n_clusters,
            optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].gap_value, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.axvline(n_clusters, linestyle="--")
plt.show()

The plot above shows the Gap statistic for K from 1 to 14. In this example K=3 can be taken as the best number of clusters: it is the elbow (inflection point) of the Gap statistic curve in the plot.
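
For readers who want to see what the Gap statistic measures, here is a minimal sketch that does not use the gap-stat package. It assumes the usual definition: the gap at K is the mean log dispersion of reference datasets minus the log dispersion of the real data, where dispersion is the K-means within-cluster sum of squares and each reference dataset is drawn uniformly over the bounding box of the data. The function names (log_dispersion, gap_statistic) and the choice of 5 reference datasets are assumptions of this sketch, and its numbers need not match those of OptimalK.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler


def log_dispersion(X, k, seed=0):
    """Log of the within-cluster sum of squares (K-means inertia) for k clusters."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return np.log(km.inertia_)


def gap_statistic(X, k_values, n_refs=5, seed=0):
    """Return (gaps, s_k): the gap value and its error term for each k."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, s_k = [], []
    for k in k_values:
        # Reference dispersions: uniform samples over the bounding box of X.
        ref_logs = np.array([
            log_dispersion(rng.uniform(lo, hi, size=X.shape), k, seed)
            for _ in range(n_refs)
        ])
        gaps.append(ref_logs.mean() - log_dispersion(X, k, seed))
        # Spread of the reference values, inflated for the finite number of references.
        s_k.append(ref_logs.std() * np.sqrt(1 + 1 / n_refs))
    return np.array(gaps), np.array(s_k)


X = StandardScaler().fit_transform(load_wine().data)
k_values = list(range(1, 7))
gaps, s_k = gap_statistic(X, k_values)
for k, g, s in zip(k_values, gaps, s_k):
    print(k, round(g, 3), round(s, 3))
```

A common selection rule is to take the smallest K whose gap is at least the gap at K+1 minus the error term at K+1. Reading the elbow off the plot, as done above, is an informal version of that rule.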

Calinski-Harabasz Index

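The Calinski-Harabasz index, also known as the variance ratio criterion, is the ratio of between-cluster dispersion to within-cluster dispersion, each summed over all clusters. A higher score indicates denser, better-separated clusters. It can be computed with the KElbowVisualizer from Python's Yellowbrick library.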

plt.figure(figsize=(10,6))
model = KMeans(random_state=1)
# k is a range of the number of clusters.
visualizer = KElbowVisualizer(
    model, k=(2, 10), metric="calinski_harabasz", timings=True
)
visualizer.fit(X_wine_scaled)  # Fit the data to the visualizer
visualizer.show()  # Finalize and generate the plot

The plot above shows the Calinski-Harabasz index for K from 2 to 9. In this example K=2 can be taken as the best number of clusters, since that is where the index reaches its maximum in the plot.
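
The definition of the Calinski-Harabasz index can be checked by computing it by hand. The sketch below assumes the standard form of the index, with between-cluster dispersion divided by K-1 and within-cluster dispersion divided by n-K, and compares the result with scikit-learn's calinski_harabasz_score. The helper calinski_harabasz_manual is a name made up for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import calinski_harabasz_score
from sklearn.preprocessing import StandardScaler


def calinski_harabasz_manual(X, labels):
    """Between-cluster over within-cluster dispersion, each scaled by its degrees of freedom."""
    n = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # Weighted squared distance of the cluster centroid from the global mean.
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # Squared distances of the members from their own centroid.
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))


X = StandardScaler().fit_transform(load_wine().data)
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    manual = calinski_harabasz_manual(X, labels)
    reference = calinski_harabasz_score(X, labels)
    print(k, round(manual, 2), round(reference, 2))
```

The two columns should agree for every K, which also shows that the index can be computed for any clustering algorithm that returns labels, not only for K-means.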

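The "metric" hyperparameter selects the scoring metric used to evaluate the clusters. The default is distortion, which is based on the squared distance from each point to its nearest centroid (that is, its cluster center). The available metrics include:

distortion: the mean of the sum of squared distances from points to their cluster centers
silhouette: the silhouette coefficient, which compares a point's mean intra-cluster distance with its mean distance to the nearest other cluster, averaged over all data points
calinski_harabasz: the ratio of between-cluster to within-cluster dispersion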

Davies-Bouldin Index

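The Davies-Bouldin index is computed as the average similarity between each cluster (say Ci) and its most similar cluster (say Cj). It expresses the average "similarity" of the clusters, where similarity is a measure that relates the distance between clusters to the size of the clusters. A model with a lower Davies-Bouldin index has better separation between clusters. The similarity R between cluster i and cluster j is defined as (Si + Sj) / Dij, where Si is the mean distance from each point in cluster i to its centroid and Dij is the distance between the centroids of clusters i and j. For each cluster i (i = 1, 2, 3, ..., k) we compute R against every other cluster j, take the maximum, and then average these maxima over the k clusters.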
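
The formula above can be written out directly. The following sketch implements it with NumPy under the assumption that all distances are Euclidean, and compares the result with scikit-learn's davies_bouldin_score on an average-linkage hierarchical clustering. The helper davies_bouldin_manual is a hypothetical name used only here.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_wine
from sklearn.metrics import davies_bouldin_score
from sklearn.preprocessing import StandardScaler


def davies_bouldin_manual(X, labels):
    """Mean over clusters of the largest (Si + Sj) / Dij ratio."""
    clusters = np.unique(labels)
    k = len(clusters)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # Si: mean distance of the points in cluster i to its centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(clusters)
    ])
    worst = []
    for i in range(k):
        # Rij for every other cluster j; keep the most similar (largest) one.
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))


X = StandardScaler().fit_transform(load_wine().data)
labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)
print(davies_bouldin_manual(X, labels))
print(davies_bouldin_score(X, labels))
```

Both printed values should match. Because each cluster contributes its worst-case ratio, a single pair of overlapping clusters is enough to raise the index.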

from sklearn.metrics import davies_bouldin_score
def get_Hmeans_score(data, distance, link, center):
    """
    Returns the Davies-Bouldin score of a hierarchical (agglomerative) clustering
    INPUT:
        data - the dataset you want to fit AgglomerativeClustering to
        distance - the distance metric for AgglomerativeClustering
        link - the linkage method for AgglomerativeClustering
        center - the number of clusters you want (the k value)
    OUTPUT:
        score - the Davies-Bouldin score for the hierarchical model fit to the data
    """
    # `metric` is the parameter name in scikit-learn >= 1.2; older releases call it `affinity`
    hmeans = AgglomerativeClustering(n_clusters=center, metric=distance, linkage=link)
    labels = hmeans.fit_predict(data)
    score = davies_bouldin_score(data, labels)
    return score


centers = list(range(2, 10))  # Numbers of clusters to evaluate (K = 2 to 9)
avg_scores = []
for center in centers:
    avg_scores.append(get_Hmeans_score(X_wine_scaled, "euclidean", "average", center))
plt.figure(figsize=(15,6))
plt.plot(centers, avg_scores, linestyle="-", marker="o", color="b")
plt.xlabel("K")
plt.ylabel("Davies Bouldin score")
plt.title("Davies Bouldin score vs. K")
plt.show()

The plot above shows the Davies-Bouldin index for K from 2 to 9. In this example K=2 can be taken as the best number of clusters: the minimum of the Davies-Bouldin index in the plot corresponds to the best number of clusters.

Silhouette Score

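The silhouette score measures how distinct the clusters are, taking both within-cluster and between-cluster distances into account. For a point i it is defined as s_i = (b_i - a_i) / max(a_i, b_i), where a_i is the mean distance from point i to all other points in its own cluster, and b_i is the mean distance from point i to the points of the nearest other cluster (the smallest such mean over all clusters that i does not belong to). If b_i is greater than a_i, the point is well separated from its neighboring cluster and closer to the points of its own cluster. The overall score is the mean of s_i over all points and ranges from -1 to 1.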
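
To make the roles of a_i and b_i concrete, the sketch below computes the silhouette score by hand from the pairwise distance matrix and compares it with scikit-learn's silhouette_score. It assumes Euclidean distances and assigns a score of 0 to points in single-member clusters. The helper silhouette_manual is a name introduced for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import pairwise_distances, silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_manual(X, labels):
    """Mean of (b_i - a_i) / max(a_i, b_i) over all points."""
    D = pairwise_distances(X)  # Euclidean by default
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # a single-member cluster scores 0
        # a_i: mean distance to the other members of the same cluster
        # (D[i, i] is 0, so dividing by n_own - 1 excludes the point itself).
        a = D[i, own].sum() / (n_own - 1)
        # b_i: smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())


X = StandardScaler().fit_transform(load_wine().data)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
print(silhouette_manual(X, labels))
print(silhouette_score(X, labels))
```

Both printed values should match. Unlike the Davies-Bouldin index, the silhouette score is built from point-to-point distances and does not use the cluster centroids.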

plt.figure(figsize=(10,6))
model = KMeans(random_state=1)
# k is a range of the number of clusters.
visualizer = KElbowVisualizer(
    model, k=(2, 10), metric="silhouette", timings=True
)
visualizer.fit(X_wine_scaled)  # Fit the data to the visualizer
visualizer.show()  # Finalize and generate the plot

The plot above shows the silhouette score for K from 2 to 9. In this example K=2 can be taken as the best number of clusters: the maximum of the silhouette score in the plot corresponds to the best number of clusters.


Editor: 武曉燕. Source: 今日頭條
