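Identifying Overfitting and Underfitting with Learning Curves

Author: Anonymous 2024-04-29 14:54:36

This article explains how to use learning curves to identify overfitting and underfitting in machine learning models.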

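Underfitting and Overfitting

1. Overfitting

A model is said to be overfit when it has been trained on the data so thoroughly that it learns the noise in it. An overfit model learns every training example almost perfectly, so it tends to misclassify unseen/new examples. For an overfit model, we get a perfect or near-perfect training score and a poor test/validation score.

Causes of overfitting: using a complex model to solve a simple problem, so that the model extracts noise from the data; and training on a small dataset, which may not be representative of the full data.

2. Underfitting

A model is said to be underfit when it cannot correctly learn the patterns in the data. An underfit model does not fully learn every example in the dataset. In this case, the scores on both the training set and the test/validation set are low.

Causes of underfitting: using a simple model to solve a complex problem, so that the model cannot learn all the patterns in the data, or the model learns the wrong patterns from the underlying data.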
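The two definitions above can be illustrated with a small sketch. It is not part of the original walkthrough: the synthetic `make_moons` data, the decision trees, and the depth settings are all assumptions chosen only to make the training/test gap easy to see.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: a noisy synthetic two-class dataset stands in for real data
X_demo, y_demo = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

results = {}
# depth=1 is too simple for this data; unlimited depth memorizes the noise
for name, depth in [("underfit", 1), ("overfit", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[name] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"{name}: train accuracy={results[name][0]:.3f}, "
          f"test accuracy={results[name][1]:.3f}")
```

The fully grown tree fits the training split perfectly but scores lower on the held-out split, while the depth-1 tree scores comparatively low on the training split itself.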

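Learning Curves

A learning curve plots the training and validation loss against the number of training examples, as new training examples are added incrementally. It helps us determine whether adding more training examples would improve the validation score (the score on unseen data). If a model is overfit, adding more training examples may improve its performance on unseen data. Conversely, if a model is underfit, adding training examples will not help much. The 'learning_curve' function can be imported from Scikit-Learn's 'model_selection' module.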

from sklearn.model_selection import learning_curve
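Before wrapping it in a helper, it may help to see what 'learning_curve' returns on its own. The following is a minimal sketch, assuming the Iris data is loaded through `load_iris` (the original does not show its loading step) and using default training sizes; `shuffle=True` is passed because Iris is ordered by class, so the smallest unshuffled subsets could contain a single class.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: Iris loaded from scikit-learn's bundled copy
X_iris, y_iris = load_iris(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

# Returns the absolute training sizes plus one row of per-fold scores per size
sizes, train_scores, val_scores = learning_curve(
    model, X_iris, y_iris, cv=5, scoring="accuracy",
    shuffle=True, random_state=0,
)
print(sizes)               # number of training examples used at each step
print(train_scores.shape)  # (number of sizes, number of CV folds)
print(val_scores.shape)
```

Each row of the score arrays corresponds to one training size and each column to one cross-validation fold, which is why the scores are averaged along `axis=1` before plotting.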

We will use logistic regression and the Iris data for the demonstration. We create a function named "learn_curve" that fits a logistic regression model and returns the cross-validation scores, the training score, and the learning curve data.

# Imports needed by learn_curve
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# The function below builds the model and returns cross-validation scores, train score and learning curve data
def learn_curve(X, y, c):
    '''param X: Matrix of input features
       param y: Vector of target/labels
       param c: Inverse regularization strength to control overfitting
                (a high value causes overfitting, a low value causes underfitting)

       We aren't splitting the data into train and test because we will use StratifiedKFold CV.
       KFold CV is preferred over hold-out CV, since the model is tested on all the examples.
       Hold-out CV is preferred when the model takes too long to train and we have a huge test set
       that truly represents the universe.
    '''
    le = LabelEncoder()      # Label encoder for the target
    sc = StandardScaler()    # Scaler for the input features
    y = le.fit_transform(y)  # Label encoding the target
    log_reg = LogisticRegression(max_iter=200, random_state=11, C=c)  # LogisticRegression model
    # Pipeline with scaling and classification as steps; a pipeline is needed with KFold CV
    # so that the scaler is fitted on the training folds only
    lr = Pipeline(steps=[('scaler', sc),
                         ('classifier', log_reg)])

    cv = StratifiedKFold(n_splits=5, random_state=11, shuffle=True)  # StratifiedKFold object with 5 folds
    cv_scores = cross_val_score(lr, X, y, scoring="accuracy", cv=cv)  # CV scores (accuracy) of each fold

    lr.fit(X, y)                   # Fitting the model on all the data
    train_score = lr.score(X, y)   # Scoring the model on the train set

    # Building the learning curve; random_state only takes effect when shuffle=True
    train_size, train_scores, test_scores = learning_curve(
        estimator=lr, X=X, y=y, cv=cv, scoring="accuracy", shuffle=True, random_state=11)
    train_scores = 1 - np.mean(train_scores, axis=1)  # converting accuracy to misclassification rate
    test_scores = 1 - np.mean(test_scores, axis=1)    # converting accuracy to misclassification rate
    lc = pd.DataFrame({"Training_size": train_size,
                       "Training_loss": train_scores,
                       "Validation_loss": test_scores}).melt(id_vars="Training_size")
    return {"cv_scores": cv_scores,
            "train_score": train_score,
            "learning_curve": lc}

The code above is straightforward: it is simply our usual training routine. Next, we look at what learning curves can tell us.
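One step in the function that may not be obvious is the final `melt` call, which reshapes the learning curve data from one column per loss into a long format with a single "value" column and a "variable" column naming the loss. That long format is what allows a single seaborn `lineplot` call with `hue="variable"` to draw both curves. A small sketch, using made-up numbers purely for illustration:

```python
import pandas as pd

# Hypothetical loss values, for illustration only (not results from the article)
wide = pd.DataFrame({
    "Training_size": [12, 66, 120],
    "Training_loss": [0.00, 0.03, 0.04],
    "Validation_loss": [0.20, 0.08, 0.05],
})

# One row per (training size, loss type) pair
long_df = wide.melt(id_vars="Training_size")
print(long_df)
```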

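1. Learning curve of a good-fit model

We use the 'learn_curve' function with the inverse regularization parameter 'c' set to 1 to obtain a good-fit model. Note that 1 is scikit-learn's default value for C, which applies a moderate L2 penalty rather than no regularization at all.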

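import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.datasets import load_iris

# Assumption: the original does not show how X and y are loaded; scikit-learn's bundled Iris is used here
X, y = load_iris(return_X_y=True)

lc = learn_curve(X, y, 1)
print(f'Cross Validation Accuracies:\n{"-"*25}\n{list(lc["cv_scores"])}\n\n'
      f'Mean Cross Validation Accuracy:\n{"-"*25}\n{np.mean(lc["cv_scores"])}\n\n'
      f'Standard Deviation of Cross Validation Accuracy:\n{"-"*25}\n{np.std(lc["cv_scores"])}\n\n'
      f'Training Accuracy:\n{"-"*15}\n{lc["train_score"]}\n\n')
sns.lineplot(data=lc["learning_curve"], x="Training_size", y="value", hue="variable")
plt.title("Learning Curve of Good Fit Model")
plt.ylabel("Misclassification Rate/Loss")
plt.show()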

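In the results above, the cross-validation accuracy is close to the training accuracy.

Training loss (blue): for a good-fit model, the training loss gradually flattens out as training examples are added, which indicates that adding more training examples does not improve the model's performance on the training data.

Validation loss (yellow): for a good-fit model, the validation loss is high at first, then gradually decreases and flattens out as training examples are added. With more samples the model can learn more patterns, and these patterns help on unseen data.

Finally, we can also see that after a reasonable number of training examples has been added, the training loss and validation loss are close to each other.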

2. Learning curve of an overfit model

We use the 'learn_curve' function with the inverse regularization parameter 'c' set to 10000 to obtain an overfit model (a high value of 'c' leads to overfitting).

lc = learn_curve(X, y, 10000)
print(f'Cross Validation Accuracies:\n{"-"*25}\n{list(lc["cv_scores"])}\n\n'
      f'Mean Cross Validation Accuracy:\n{"-"*25}\n{np.mean(lc["cv_scores"])}\n\n'
      f'Standard Deviation of Cross Validation Accuracy:\n{"-"*25}\n{np.std(lc["cv_scores"])} (High Variance)\n\n'
      f'Training Accuracy:\n{"-"*15}\n{lc["train_score"]}\n\n')
sns.lineplot(data=lc["learning_curve"], x="Training_size", y="value", hue="variable")
plt.title("Learning Curve of an Overfit Model")
plt.ylabel("Misclassification Rate/Loss")
plt.show()

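Compared with the good-fit model, the standard deviation of the cross-validation accuracy is higher.

The learning curve of an overfit model starts with a very low training loss, which rises gradually as training examples are added but does not flatten out. The validation loss is high at first and decreases gradually as training examples are added, also without flattening out, which indicates that adding more training examples can improve the model's performance on unseen data. We can also see that the training loss and validation loss are far apart; they may move closer to each other as more training data is added.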

3. Learning curve of an underfit model

We set the inverse regularization parameter 'c' to 1/10000 to obtain an underfit model (a low value of 'c' leads to underfitting).

lc = learn_curve(X, y, 1/10000)
print(f'Cross Validation Accuracies:\n{"-"*25}\n{list(lc["cv_scores"])}\n\n'
      f'Mean Cross Validation Accuracy:\n{"-"*25}\n{np.mean(lc["cv_scores"])}\n\n'
      f'Standard Deviation of Cross Validation Accuracy:\n{"-"*25}\n{np.std(lc["cv_scores"])} (Low variance)\n\n'
      f'Training Accuracy:\n{"-"*15}\n{lc["train_score"]}\n\n')
sns.lineplot(data=lc["learning_curve"], x="Training_size", y="value", hue="variable")
plt.title("Learning Curve of an Underfit Model")
plt.ylabel("Misclassification Rate/Loss")
plt.show()

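Compared with the overfit and good-fit models, the standard deviation of the cross-validation accuracy is lower.

The learning curve of an underfit model starts with a low training loss, which rises gradually as training examples are added and then drops suddenly to an arbitrary minimum at the end (minimum does not mean zero loss). This final sudden drop does not always occur. The curve indicates that adding more training examples does not improve the model's performance on unseen data.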
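The three settings of 'c' can also be compared side by side on the full dataset. The sketch below is an alternative to calling the helper three times, not the article's own method: it uses scikit-learn's `validation_curve`, assumes Iris is loaded with `load_iris`, and raises `max_iter` to 1000 to reduce convergence warnings at large C.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: Iris loaded from scikit-learn's bundled copy
X_iris, y_iris = load_iris(return_X_y=True)
pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("classifier", LogisticRegression(max_iter=1000))])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)

# The same three values of C used in the walkthrough above
c_values = [1 / 10000, 1, 10000]
train_acc, val_acc = validation_curve(
    pipe, X_iris, y_iris,
    param_name="classifier__C", param_range=c_values,
    cv=cv, scoring="accuracy",
)

# One row per C value, one column per fold
for c, tr, va in zip(c_values, train_acc, val_acc):
    print(f"C={c:g}: train loss={1 - tr.mean():.3f}, "
          f"validation loss={1 - va.mean():.3f}, validation std={va.std():.3f}")
```

This shows only the endpoint of each learning curve (all available training data), so it complements the plots rather than replacing them.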

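Summary

In machine learning and statistical modeling, overfitting and underfitting are two common problems. They describe how the degree to which a model fits the training data affects its performance on new data.

When analyzing the learning curves you generate, pay attention to the following:

Underfitting: if the learning curve shows that performance on both the training set and the validation set is low, or that both improve only slowly as the number of training samples grows, the model is usually underfit. In this case, the model may be too simple to capture the basic patterns in the data.
Overfitting: if performance on the training set improves as the number of samples grows while performance on the validation set starts to decline or stagnates after a certain point, the model is usually overfit. In this case, the model may be too complex and has adapted to the noise in the training data rather than to the underlying patterns.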
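The two rules of thumb above can be turned into a rough check on the final point of a learning curve. This is only a sketch: the function name `diagnose` and both thresholds are hypothetical, and sensible values depend on the problem and the metric.

```python
def diagnose(train_loss, val_loss, high_loss=0.2, max_gap=0.1):
    """Rough label for the last point of a learning curve.

    high_loss and max_gap are hypothetical thresholds chosen for illustration.
    """
    # Both losses high: the model fails even on the data it was trained on
    if train_loss > high_loss and val_loss > high_loss:
        return "underfit"
    # Low training loss but a wide gap to validation loss
    if val_loss - train_loss > max_gap:
        return "overfit"
    return "good fit"


print(diagnose(0.35, 0.38))  # both losses high
print(diagnose(0.01, 0.25))  # large gap between the two losses
print(diagnose(0.03, 0.05))  # both low and close together
```

A check like this looks at a single point, so it should be read together with the shape of the whole curve.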

Based on your analysis of the learning curves, you can adopt the following strategies to make adjustments: