Five Methods for Handling Imbalanced Datasets in Machine Learning
Hi everyone, I'm 小寒.
Today I'd like to share some common methods for handling imbalanced datasets.
Before we start, let's first look at what an imbalanced dataset is.
An imbalanced dataset is one in which, for a classification task, the number of samples differs significantly across classes, typically with far fewer minority-class samples than majority-class ones. Such datasets are common in practice, for example in fraud detection, medical diagnosis, and failure prediction.
For example, in a dataset of 10,000 instances where 95% belong to one class (class 0) and only 5% to the other (class 1), the model is likely to focus heavily on the majority class and often ignore the minority class entirely.
The Problem with Imbalanced Data
In an imbalanced dataset, the majority class dominates the model's predictions, leading to poor predictive performance on the minority class.
For example, if 95% of the data is labeled class 0, predicting every instance as class 0 achieves 95% accuracy, even though every class 1 prediction is wrong.
Example:
Consider a fraud detection system in which 99% of transactions are legitimate and only 1% are fraudulent. A model that predicts every transaction as legitimate reaches 99% accuracy yet detects no fraud at all, defeating its purpose.
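We can see this accuracy paradox in a few lines with scikit-learn's DummyClassifier; the 9,900/100 split below is a made-up stand-in for the fraud scenario.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
import numpy as np
# 9,900 legitimate (0) and 100 fraudulent (1) transactions, with a single dummy feature
y = np.array([0] * 9900 + [1] * 100)
X = np.zeros((10000, 1))
# Always predict the most frequent class, i.e. "legitimate"
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X, y)
y_pred = dummy.predict(X)
print(f"Accuracy: {accuracy_score(y, y_pred):.2f}")      # 0.99
print(f"Recall (fraud): {recall_score(y, y_pred):.2f}")  # 0.00 -- not a single fraud caught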
Let's visualize imbalanced data with an example.
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import plotly.figure_factory as ff
# Generate imbalanced data
np.random.seed(42)  # fix the random seed so the generated data is reproducible
n_samples = 10000
class_0_ratio = 0.95
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0
X_class_0 = np.random.randn(n_class_0, 2)
X_class_1 = np.random.randn(n_class_1, 2) + 2 # Shift class 1 data
y_class_0 = np.zeros(n_class_0)
y_class_1 = np.ones(n_class_1)
X = np.concatenate((X_class_0, X_class_1), axis=0)
y = np.concatenate((y_class_0, y_class_1), axis=0)
# Create a Pandas DataFrame for easier handling
df = pd.DataFrame(X, columns=['Feature 1', 'Feature 2'])
df['Target'] = y
# Visualize class distribution
fig = px.histogram(df, x='Target', title='Class Distribution', width=800, height=600)
fig.update_layout(title_x=0.5)
fig.update_xaxes(tickvals=[0, 1], ticktext=['Class 0', 'Class 1'])
fig.show()
# Split data into training and testing sets; stratify keeps the 95/5 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Train a Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
fig = ff.create_annotated_heatmap(cm, x=['Predicted 0', 'Predicted 1'], y=['True 0', 'True 1'])
fig.update_layout(title_text='Confusion Matrix', width=800, height=600, title_x=0.5)
fig.show()
This code generates an imbalanced dataset in which 95% of the instances are labeled class 0 and only 5% class 1.
When we visualize the class distribution, we see a clear imbalance between the two classes.
Accuracy: 0.9815
Precision: 0.8451
Recall: 0.6977
F1-score: 0.7643
The confusion matrix shows that although overall accuracy is high, precision and recall for the minority class (class 1) are much lower. The model is biased toward the majority class.
import plotly.express as px
df["Target"] = df["Target"].astype(str)
fig = px.scatter(df, x='Feature 1', y='Feature 2', color='Target', title='Original Dataset')
fig.update_layout(title_x=0.5, width=800, height=600)
fig.show()
Techniques for Handling Imbalanced Data
1. Random Undersampling
Random undersampling balances the class distribution by reducing the number of majority-class samples.
Concretely, a subset of the majority-class samples is chosen at random and removed, bringing the majority and minority classes close to balance.
Pros
- Simple to implement; no complex algorithms required.
- Shrinks the dataset, lowering computational cost.
Cons
- Important majority-class information may be discarded, degrading model performance.
- The reduced dataset may weaken the model's ability to generalize on the majority class.
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
# Use RandomUnderSampler to balance the dataset
undersampler = RandomUnderSampler(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = undersampler.fit_resample(X, y)
# Check the original class distribution
print("Original class distribution:", Counter(y))
# Check the new class distribution after undersampling
print("New class distribution after undersampling:", Counter(y_resampled))
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)
# Train a simple logistic regression model
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
# Distribution of undersampled data
df_resampled = pd.DataFrame(X_resampled, columns=['Feature 1', 'Feature 2'])
df_resampled['Target'] = y_resampled
df_resampled["Target"] = df_resampled["Target"].astype(str)
fig = px.scatter(df_resampled, x='Feature 1', y='Feature 2', color='Target', title='Undersampled Dataset')
fig.update_layout(title_x=0.5, width=800, height=600)
fig.show()
2. Random Oversampling
Random oversampling balances the class distribution by increasing the number of minority-class samples.
A common approach is to randomly duplicate minority-class samples until the minority class matches the majority class in size.
Pros
- No data is lost, unlike undersampling, which discards majority-class samples.
- When data is scarce, adding samples can improve what the model is able to learn.
Cons
- Because samples are duplicated verbatim, the model may overfit the minority class.
from imblearn.over_sampling import RandomOverSampler
# Check the original class distribution
original_class_distribution = Counter(y)
print("Original class distribution:", original_class_distribution)
# Initialize RandomOverSampler
oversampler = RandomOverSampler(sampling_strategy='auto', random_state=42)
# Apply random oversampling to balance the dataset
X_oversampled, y_oversampled = oversampler.fit_resample(X, y)
# Check the new class distribution after oversampling
new_class_distribution = Counter(y_oversampled)
print("New class distribution after oversampling:", new_class_distribution)
X_train, X_test, y_train, y_test = train_test_split(X_oversampled, y_oversampled, test_size=0.2, random_state=42)
# Train a simple logistic regression model
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
df_resampled = pd.DataFrame(X_oversampled, columns=['Feature 1', 'Feature 2'])
df_resampled['Target'] = y_oversampled
df_resampled["Target"] = df_resampled["Target"].astype(str)
fig = px.scatter(df_resampled, x='Feature 1', y='Feature 2', color='Target', title='Oversampled Dataset')
fig.update_layout(title_x=0.5, width=800, height=600)
fig.show()
3. SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) is a synthetic oversampling method that balances the dataset by generating new minority-class samples.
Rather than simply duplicating existing minority samples, it creates new ones by interpolating between the features of existing minority samples.
Specifically, SMOTE picks a minority-class sample and one of its nearest minority-class neighbors, then generates a new synthetic sample on the line segment between them.
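The interpolation itself is simple; here is a minimal NumPy sketch of that one step (a toy illustration, not imblearn's implementation):
import numpy as np
np.random.seed(0)
x_i = np.array([1.0, 2.0])    # a minority-class sample
x_nn = np.array([2.0, 3.0])   # one of its nearest minority-class neighbors
# The synthetic sample lies at a random point on the segment between them:
# x_new = x_i + lam * (x_nn - x_i), with lam drawn uniformly from [0, 1]
lam = np.random.rand()
x_new = x_i + lam * (x_nn - x_i)
print(x_new)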
Pros
- Generating new samples instead of simply copying existing ones mitigates overfitting.
- Interpolation produces diverse minority samples, expanding the minority class's coverage of the feature space.
Cons
- Synthetic samples may fall on the wrong side of the decision boundary, especially when the classes overlap or the distribution is unclear.
- It works less well on high-dimensional data, where samples are typically sparse and interpolated points may not be representative.
from imblearn.over_sampling import SMOTE
# Check the original class distribution
original_class_distribution = Counter(y)
print("Original class distribution:", original_class_distribution)
# Initialize SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)
# Apply SMOTE to balance the dataset
X_smote, y_smote = smote.fit_resample(X, y)
# Check the new class distribution after SMOTE
new_class_distribution = Counter(y_smote)
print("New class distribution after SMOTE:", new_class_distribution)
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.2, random_state=42)
# Train a simple logistic regression model
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
df_resampled = pd.DataFrame(X_smote, columns=['Feature 1', 'Feature 2'])
df_resampled['Target'] = y_smote
df_resampled["Target"] = df_resampled["Target"].astype(str)
fig = px.scatter(df_resampled, x='Feature 1', y='Feature 2', color='Target', title='SMOTE Dataset')
fig.update_layout(title_x=0.5, width=800, height=600)
fig.show()
df_resampled["Target"].value_counts()
4. Cost-Sensitive Learning
Cost-sensitive learning addresses data imbalance by assigning different costs to different classification errors.
In an imbalanced dataset, misclassifying the minority class is usually more costly than misclassifying the majority class. Cost-sensitive learning introduces a cost matrix into the loss function, so that minority-class misclassifications incur a larger loss and the model is steered toward the minority class.
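In scikit-learn this idea is most easily applied through the class_weight parameter, which scales each class's contribution to the loss. As a minimal sketch (the 95/5 toy data below is illustrative), class_weight='balanced' sets each weight to n_samples / (n_classes * np.bincount(y)), so the rarer class automatically gets the larger weight:
import numpy as np
from sklearn.linear_model import LogisticRegression
np.random.seed(42)
y = np.array([0] * 95 + [1] * 5)
X = np.random.randn(100, 2) + y.reshape(-1, 1)  # shift class 1 so there is something to learn
# 'balanced' gives class 0 a weight of 100/(2*95) ≈ 0.53 and class 1 a weight of 100/(2*5) = 10,
# so each minority-class error costs roughly 19x more in the loss
model = LogisticRegression(class_weight='balanced')
model.fit(X, y)
# An explicit cost assignment works too, e.g. class_weight={0: 1, 1: 19}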
Pros
- No resampling is needed; the imbalance is handled directly during model training.
- The cost of each type of misclassification can be tuned flexibly to fit the scenario.
Cons
- Choosing the cost matrix requires problem-specific tuning, which can be challenging.
- With severely imbalanced data, the minority class may still simply have too few samples.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from collections import Counter
# Create a mock imbalanced dataset
X, y = make_classification(n_classes=2, weights=[0.99, 0.01], n_samples=1000, random_state=42)
print('Original class distribution:', Counter(y))
# Hold out a test set so the model is not evaluated on its own training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# Train a cost-sensitive decision tree: a class 1 error costs 10x more than a class 0 error
model = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, random_state=42)
model.fit(X_train, y_train)
# Evaluate the model on the held-out test set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
5. Balanced Random Forest
Balanced random forest is a variant of random forest optimized for imbalanced datasets.
When building each decision tree, it randomly undersamples the majority class so that every tree is trained on a balanced sample. It also keeps random forest's ensemble of many weak classifiers, which boosts overall predictive power.
Pros
- Retains the strengths of random forests, such as high accuracy and robustness.
- Undersampling the majority class reduces the model's bias toward it and improves minority-class prediction.
- As an ensemble of many decision trees, it generalizes well and reduces the bias of any single model.
Cons
- Higher computational cost than a standard random forest, because the majority class is undersampled repeatedly.
- The undersampling step may discard important majority-class information, hurting overall performance.
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train a Balanced Random Forest model
brf = BalancedRandomForestClassifier(random_state=42)
brf.fit(X_train, y_train)
# Evaluate; report per-class metrics too, since accuracy alone is misleading on imbalanced data
y_pred = brf.predict(X_test)
print('Balanced Random Forest Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))