How to Prevent Data Leakage When Evaluating Machine Learning Models
This article discusses the problem of data leakage when estimating model performance, and ways to avoid it.

During model evaluation, data leakage occurs when information from the training set finds its way into the validation/test set. This biases the estimate of the model's performance on the validation/test set. Let's understand it with an example using Scikit-Learn's Boston housing dataset. The dataset has no missing values, so we randomly introduce 100 missing values to better demonstrate the leakage.
```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.metrics import mean_squared_error

# Importing the dataset (note: load_boston was removed in scikit-learn 1.2,
# so this example requires an older version)
boston = load_boston()
data = pd.DataFrame(boston['data'], columns=boston['feature_names'])
data['target'] = boston['target']

# Splitting the input and target features
X = data.iloc[:, :-1].copy()
y = data.iloc[:, -1].copy()

# Adding 100 random missing values
np.random.seed(11)
rand_cols = np.random.randint(0, X.shape[1], 100)
rand_rows = np.random.randint(0, X.shape[0], 100)
for i, j in zip(rand_rows, rand_cols):
    X.iloc[i, j] = np.nan

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)

# Initializing the KNN regressor
knn = KNeighborsRegressor()

# Initializing the mode imputer
imp = SimpleImputer(strategy='most_frequent')

# Initializing the StandardScaler
standard_scaler = StandardScaler()

# Imputing and scaling X_train before cross-validation (this causes the leakage)
X_train_impute = imp.fit_transform(X_train)
X_train_scaled = standard_scaler.fit_transform(X_train_impute)

# Running 5-fold cross-validation
cv = cross_validate(estimator=knn, X=X_train_scaled, y=y_train, cv=5,
                    scoring="neg_root_mean_squared_error", return_train_score=True)

# Mean of the training scores across the cross-validation folds
print(f'Training RMSE (with data leakage): {-1 * np.mean(cv["train_score"])}')

# Mean of the validation scores across the cross-validation folds
print(f'Validation RMSE (with data leakage): {-1 * np.mean(cv["test_score"])}')

# Fitting the model to the training data
knn.fit(X_train_scaled, y_train)

# Preprocessing the test data with statistics learned on X_train
X_test_impute = imp.transform(X_test)
X_test_scaled = standard_scaler.transform(X_test_impute)

# Predictions and model evaluation on unseen data
pred = knn.predict(X_test_scaled)
print(f'RMSE on unseen data: {np.sqrt(mean_squared_error(y_test, pred))}')
```

In the code above, `X_train` is the training set (used for k-fold cross-validation) and `X_test` is used for model evaluation on unseen data. This is an example of model evaluation with data leakage: the mode used to impute the missing values (`strategy='most_frequent'`) is computed over the entire `X_train`. Likewise, the mean and standard deviation used to scale the data are computed over the entire `X_train`. The missing values of `X_train` are imputed and `X_train` is scaled before k-fold cross-validation is run.
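A quick way to see the leak concretely is to inspect the statistics the fitted transformers hold (a minimal sketch; `statistics_`, `mean_`, and `scale_` are the fitted attributes that `SimpleImputer` and `StandardScaler` expose):

```python
# The transformers fitted above hold statistics computed from ALL of
# X_train, including rows that will later fall into validation folds.
print(imp.statistics_)         # per-column modes learned from the full X_train
print(standard_scaler.mean_)   # per-column means learned from the full X_train
print(standard_scaler.scale_)  # per-column standard deviations, likewise
```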
In k-fold cross-validation, `X_train` is split into k folds. In each iteration, one fold is used for validation (call it the validation part) and the remaining folds are used for training (call them the training part). But the training and validation parts in every iteration have already had their missing values imputed with a mode computed over all of `X_train`, and have already been scaled with a mean and standard deviation computed over all of `X_train`. These imputation and scaling operations leak information from the whole of `X_train` into both the training and validation parts of every fold, which can bias the estimate of the model's performance on the validation part. The code below shows one way to avoid this by using a pipeline.
```python
# Preprocessing-and-regressor pipeline
pipeline = Pipeline(steps=[('imputer', imp),
                           ('scaler', standard_scaler),
                           ('regressor', knn)])

# Running 5-fold cross-validation with the pipeline as the estimator
cv = cross_validate(estimator=pipeline, X=X_train, y=y_train, cv=5,
                    scoring="neg_root_mean_squared_error", return_train_score=True)

# Mean of the training scores across the cross-validation folds
print(f'Training RMSE (without data leakage): {-1 * np.mean(cv["train_score"])}')

# Mean of the validation scores across the cross-validation folds
print(f'Validation RMSE (without data leakage): {-1 * np.mean(cv["test_score"])}')

# Fitting the pipeline to the training data
pipeline.fit(X_train, y_train)

# Predictions and model evaluation on unseen data
pred = pipeline.predict(X_test)
print(f'RMSE on unseen data: {np.sqrt(mean_squared_error(y_test, pred))}')
```
In the code above, we include the imputer, the scaler, and the regressor in a single pipeline. `X_train` is again split into 5 folds, but now, in each iteration, the pipeline computes the imputation mode from the training part only and uses it to fill the missing values in both the training and validation parts. Likewise, the mean and standard deviation used to scale both parts are computed on the training part only. This eliminates the data leakage: in every cross-validation iteration, the imputation and scaling statistics are learned from the training part alone and then merely applied to transform the training and validation parts.
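For intuition, here is a minimal sketch of what `cross_validate` effectively does in each fold when handed the pipeline (the explicit loop and variable names are illustrative, not Scikit-Learn internals):

```python
from sklearn.base import clone
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
for train_idx, val_idx in kf.split(X_train):
    fold_pipeline = clone(pipeline)  # fresh, unfitted copy for each fold
    # The imputation mode and scaling statistics are learned from the
    # training part only...
    fold_pipeline.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    # ...and merely applied to the validation part, so nothing leaks.
    fold_pred = fold_pipeline.predict(X_train.iloc[val_idx])
    fold_rmse = np.sqrt(mean_squared_error(y_train.iloc[val_idx], fold_pred))
    print(f'Fold RMSE: {fold_rmse}')
```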
We can see a difference between the training and validation RMSEs computed with and without data leakage. Because the dataset is small, the difference is minor here; on a large dataset it could be substantial. That the validation RMSE (with data leakage) happens to be close to the RMSE on unseen data is merely coincidental.
Therefore, running k-fold cross-validation with a pipeline prevents data leakage and gives a better estimate of the model's performance on unseen data.