Python Time Series Analysis: Automated Feature Extraction with TSFresh

Author: Kyle Jones 2025-01-16 16:24:07


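TSFresh (Time Series Feature extraction based on Scalable Hypothesis tests) is a framework dedicated to automatically extracting features from time series data. The extracted features can be used directly in machine learning tasks such as classification, regression, and anomaly detection. By automating the feature engineering workflow, TSFresh removes much of the manual effort in time series analysis.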


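Automated feature extraction computes hundreds of statistical features, including the mean, variance, skewness, and autocorrelation. Statistical hypothesis tests then keep the features that are significantly related to the target and discard the rest. The framework supports both univariate and multivariate time series.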
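To make the kind of features named above concrete, here is a minimal sketch that computes four of them by hand with pandas and NumPy on a toy dataset. The helper name `basic_features` and the toy values are assumptions for illustration, and the estimators used here (for example pandas' Pearson-based `autocorr`) may differ slightly from the definitions in TSFresh's own calculators.

```python
import numpy as np
import pandas as pd


def basic_features(values: pd.Series) -> pd.Series:
    """Hand-rolled summary features for one series (illustrative only)."""
    return pd.Series({
        "mean": values.mean(),
        "variance": np.var(values.to_numpy()),      # population variance (ddof=0)
        "skewness": values.skew(),                  # pandas' bias-corrected skewness
        "autocorr_lag_1": values.autocorr(lag=1),   # Pearson correlation with the lag-1 shift
    })


# Two short toy series in long format: one row per (id, time) observation
toy = pd.DataFrame({
    "id": [0] * 5 + [1] * 5,
    "time": list(range(5)) * 2,
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 2.0, 8.0, 2.0, 2.0],
})

# One row of features per series id
summary = pd.DataFrame(
    {sid: basic_features(group["value"]) for sid, group in toy.groupby("id")}
).T
summary.index.name = "id"
print(summary)
```

Each series collapses into a single row of numbers, which is the shape a standard tabular model expects.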

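TSFresh Workflow

The basic TSFresh workflow has three steps: first convert the data into the expected format, then extract features with the extract_features function, and finally, if desired, select features with the select_features function.

TSFresh expects the input as a long-format DataFrame in which an id column identifies the time series each row belongs to.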
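If the raw data arrives in wide form (one column per series), it can be reshaped into this long layout with pandas. The sketch below assumes hypothetical column names `sensor_a` and `sensor_b`; only `DataFrame.melt` and basic sorting are used.

```python
import pandas as pd

# Wide layout: one column per series, one row per time step (hypothetical data)
wide = pd.DataFrame({
    "time": [0, 1, 2],
    "sensor_a": [0.1, 0.2, 0.3],
    "sensor_b": [1.0, 0.9, 0.8],
})

# Long layout: one row per observation, with an id column naming the series
long = wide.melt(id_vars="time", var_name="id", value_name="value")
long = long.sort_values(["id", "time"]).reset_index(drop=True)
long = long[["id", "time", "value"]]
print(long)
```

The resulting frame has the same id, time, and value columns as the synthetic dataset built in this article.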

Example setup: generate 100 time series with 100 observations each

import pandas as pd
import numpy as np
from tsfresh import extract_features
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute
from tsfresh.feature_extraction import EfficientFCParameters
from tsfresh.feature_extraction.feature_calculators import mean
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Build the sample dataset: noisy sine waves with random frequency and phase
np.random.seed(42)
n_series = 100
n_timepoints = 100
time_series_list = []
for i in range(n_series):
    frequency = np.random.uniform(0.5, 2)
    phase = np.random.uniform(0, 2*np.pi)
    noise_level = np.random.uniform(0.05, 0.2)

    values = np.sin(frequency * np.linspace(0, 10, n_timepoints) + phase) + np.random.normal(0, noise_level, n_timepoints)

    # One long-format frame per series; 'id' is broadcast to every row
    df = pd.DataFrame({
        'id': i,
        'time': range(n_timepoints),
        'value': values
    })
    time_series_list.append(df)
time_series = pd.concat(time_series_list, ignore_index=True)
print("Original time series data:")
print(time_series.head())
print(f"Number of time series: {n_series}")
print(f"Number of timepoints per series: {n_timepoints}")

Next, visualize the generated data:

# Plot a subset of the time series
plt.figure(figsize=(12, 6))
for i in range(5):  # plot the first 5 series
    plt.plot(time_series[time_series['id'] == i]['time'],
             time_series[time_series['id'] == i]['value'],
             label=f'Series {i}')
plt.title('Sample of Time Series')
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend()
plt.savefig("sample_TS.png")
plt.show()

The plotted series show the expected noisy, irregular behavior, which resembles real-world time series data.

Feature Extraction

With the noisy, fluctuating series in place, the next step is to run the tsfresh.extract_features function.

# Run feature extraction (n_jobs=0 disables multiprocessing)
features = extract_features(time_series, column_id="id", column_sort="time", n_jobs=0)
print("\nExtracted features:")
print(features.head())
# Impute missing and infinite values
features_imputed = impute(features)

Sample output (selected features):

    value__mean  value__variance  value__autocorrelation__lag_1
id
1      0.465421         0.024392                       0.856201
2      0.462104         0.023145                       0.845318

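Feature Selection

To keep the model efficient, the extracted features need to be filtered. The select_features function selects features based on their statistical significance with respect to the target.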

# Build the target variable: a binary label from the parity of each series id
# (the label is arbitrary here and is not derived from the sine frequency)
target = pd.Series(
    features_imputed.index.to_numpy() % 2,  # 0 for even ids, 1 for odd ids
    index=features_imputed.index,
    dtype=int,  # integer labels so the task is treated as classification
)
# Run feature selection
selected_features = select_features(features_imputed, target)
# Handle the selection result
if selected_features.empty:
    print("\nNo features were selected. Using all features.")
    selected_features = features_imputed
else:
    print("\nSelected features:")
    print(selected_features.head())
print(f"\nNumber of features: {selected_features.shape[1]}")
print("\nNames of features (first 10):")
print(selected_features.columns.tolist()[:10])

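This step keeps the features that are significantly related to the target variable. Because the labels in this example are assigned by id parity and are unrelated to the signals, few or no features may pass the test, which is why the code falls back to the full feature set.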

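Using the Features for Supervised Learning

The main purpose of feature engineering is to give machine learning models useful input variables. TSFresh returns a pandas DataFrame, so its output plugs directly into mainstream libraries such as scikit-learn.

The following example applies the features to a classification task: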

# Build the classification model
# Split the dataset
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
    selected_features, target, test_size=0.2, random_state=42
)
# Train a random forest classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_clf, y_train_clf)
# Evaluate the model
y_pred_clf = clf.predict(X_test_clf)
print("\nClassification Model Performance:")
print(f"Accuracy: {accuracy_score(y_test_clf, y_pred_clf):.2f}")
print("\nClassification Report:")
print(classification_report(y_test_clf, y_pred_clf))
# Plot the confusion matrix
cm = confusion_matrix(y_test_clf, y_pred_clf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig("confusion_matrix.png")
plt.show()

# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': X_train_clf.columns,
    'importance': clf.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))
# Plot feature importances
plt.figure(figsize=(12, 6))
sns.barplot(x='importance', y='feature', data=feature_importance.head(20))
plt.title('Top 20 Most Important Features')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.savefig("feature_importance.png")
plt.show()

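Multivariate Time Series Processing

TSFresh can extract features from several variables in a dataset at the same time: every value column other than the id and sort columns is treated as a separate variable.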

# Multivariate feature extraction example
# Add a second time series variable
time_series["value2"] = time_series["value"] * 0.5 + np.random.normal(0, 0.05, len(time_series))
# Extract features for both variables
features_multivariate = extract_features(
    time_series,
    column_id="id",
    column_sort="time",
    default_fc_parameters=EfficientFCParameters(),
    n_jobs=0
)
print("\nMultivariate features:")
print(features_multivariate.head())

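Custom Feature Extraction

The calculators in the tsfresh.feature_extraction.feature_calculators module can also be called directly, which makes it possible to build a custom feature set by hand.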
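A user-defined feature can follow the same per-series pattern. The sketch below defines a hypothetical `peak_to_peak` function, applies it to each series with a pandas groupby, and joins the result onto a small feature matrix. It deliberately stays outside TSFresh's own registration mechanism, and the column name `value__peak_to_peak` simply imitates TSFresh's naming style.

```python
import numpy as np
import pandas as pd


def peak_to_peak(x) -> float:
    """Hypothetical custom feature: distance between the maximum and minimum."""
    x = np.asarray(x, dtype=float)
    return float(x.max() - x.min())


# Toy long-format data (hypothetical values)
toy = pd.DataFrame({
    "id": [0, 0, 0, 1, 1, 1],
    "time": [0, 1, 2, 0, 1, 2],
    "value": [1.0, 4.0, 2.0, 5.0, 5.5, 5.0],
})

grouped = toy.groupby("id")["value"]

# One value per series id, named in the "<column>__<feature>" style
custom = grouped.apply(peak_to_peak).rename("value__peak_to_peak")

# Join the custom feature onto an existing per-series feature matrix
feature_matrix = pd.DataFrame({"value__mean": grouped.mean()}).join(custom)
print(feature_matrix)
```

Because the result is indexed by series id, it lines up row for row with the matrix returned by extract_features.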

The following example applies the mean calculator to each series and plots the distribution of the results with matplotlib and seaborn:

# Compute the mean of each time series with the tsfresh calculator
custom_features = time_series.groupby("id")["value"].apply(mean)
print("\nCustom features (mean of each time series, first 5):")
print(custom_features.head())
# Plot the feature distribution
plt.figure(figsize=(10, 6))
sns.histplot(custom_features, kde=True)
plt.title('Distribution of Mean Values for Each Time Series')
plt.xlabel('Mean Value')
plt.ylabel('Count')
plt.savefig("dist_of_means_TS.png")
plt.show()

# Plot the relationship between the feature and the target variable
plt.figure(figsize=(10, 6))
sns.scatterplot(x=custom_features, y=target)
plt.title('Relationship between Mean Values and Target')
plt.xlabel('Mean Value')
plt.ylabel('Target')
plt.savefig("means_v_target_TS.png")
plt.show()

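Summary

TSFresh offers clear advantages for time series feature engineering. By generating features automatically, it supplies a rich set of inputs for downstream machine learning tasks. Note, however, that a large number of automatically generated features can lead to overfitting, an issue that still needs further empirical study.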

Editor: Hua Xuan  Source: DeepHub IMBA
