
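Python Data Analysis Really Isn't Hard to Learn. Hands-On: Expert-Level Data Preprocessing

Author: 野貓談Python 2020-07-08 13:46:25

Big data, data analysis

This time we deliberately picked Titanic, a dataset that has been done to death, and wrote up just the data preprocessing part, but in a coding style that looks expert-level (or, to be honest, like showing off). Obviously I am not an expert; I was simply lucky enough to be trained by some.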

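When it comes to preprocessing, the usual tasks are:

Handling missing values in numeric columns
Handling missing values in categorical columns
Standardizing numeric columns
Turning categorical features into dummy variables

The Pipeline idea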

When doing data processing and machine learning, you eventually notice that every project follows the same routine:

Preprocessing
Modeling
Training
Prediction
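
The four steps above can be sketched with scikit-learn's Pipeline class. This is only a minimal illustration under stated assumptions: the toy arrays, the step names and the choice of LogisticRegression are hypothetical and do not come from the article.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric features, each with one missing value.
X_train = np.array([[22.0, 7.25], [38.0, 71.28], [np.nan, 7.92],
                    [35.0, 53.10], [54.0, np.nan], [2.0, 21.08]])
y_train = np.array([0, 1, 1, 1, 0, 0])

# Steps 1 and 2: preprocessing and the model, declared as named steps.
clf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# Step 3: training fits the imputer, the scaler and the model in order.
clf.fit(X_train, y_train)

# Step 4: prediction reuses the fitted preprocessing on unseen rows.
X_new = np.array([[30.0, np.nan], [np.nan, 80.0]])
pred = clf.predict(X_new)
print(pred)
```

Because the preprocessing lives inside the pipeline, the medians and scaling parameters learned during training are applied unchanged at prediction time.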

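Preprocessing has a routine of its own. Here, though, we do not use the Pipeline class for it but another class, FeatureUnion.

Of course, no single class solves every problem. Let's work through a hands-on example to see which tools and coding habits keep the code well organized and give it that "expert" look.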

Loading the data and getting started

Today we analyze the Titanic data. I have already downloaded it and placed it in the data folder under the project path.

import pandas as pd 
file = 'data/titanic_train.csv' 
raw_df = pd.read_csv(file)

Next comes the standard routine: preview info and preview head.

raw_df.info()  # info() prints its report itself and returns None
print(raw_df.head())

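Let's briefly review the columns of the dataset:

RangeIndex: 891 entries, 0 to 890: 891 samples
columns: 12 columns in total

Grouped by data type:

int64:

PassengerId: passenger ID
Survived: whether the passenger survived; 1 means survived
Pclass: passenger class
SibSp: siblings and spouses (number of siblings and spouses aboard)
Parch: parents and children (number of parents and children aboard)

object:

Name: name
Sex: sex
Ticket: ticket number
Cabin: cabin number
Embarked: port of embarkation

float64:

Age: age
Fare: ticket fare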

RangeIndex: 891 entries, 0 to 890 
Data columns (total 12 columns): 
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  891 non-null    int64   
 1   Survived     891 non-null    int64   
 2   Pclass       891 non-null    int64   
 3   Name         891 non-null    object  
 4   Sex          891 non-null    object  
 5   Age          714 non-null    float64 
 6   SibSp        891 non-null    int64   
 7   Parch        891 non-null    int64   
 8   Ticket       891 non-null    object  
 9   Fare         891 non-null    float64 
 10  Cabin        204 non-null    object  
 11  Embarked     889 non-null    object  
dtypes: float64(2), int64(5), object(5) 
memory usage: 83.7+ KB

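Most machine learning algorithms do not handle missing values or categorical data on their own, so we need to preprocess at least these two cases.

First we look at the missing values. The info output above already contains this information; here we display it more explicitly.

# count the missing values in each column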
nulls_per_column = raw_df.isnull().sum() 
print(nulls_per_column)

The result:

PassengerId       0 
Survived             0 
Pclass                 0 
Name                 0 
Sex                     0 
Age                177 
SibSp                  0 
Parch                  0 
Ticket                 0 
Fare                    0 
Cabin             687 
Embarked          2 
dtype: int64

Age has missing values and is float64; Cabin has missing values and is object; Embarked has missing values and is also object.
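
When deciding how to treat each column, it can help to see the share of missing values next to the raw count. The following is a small sketch on a hypothetical toy frame (the values and the names toy_df and null_summary are illustrative, not taken from the Titanic file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the three kinds of columns above.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Count and percentage of missing values per column, worst column first.
null_summary = pd.DataFrame({
    "n_missing": toy_df.isnull().sum(),
    "pct_missing": toy_df.isnull().mean() * 100,
}).sort_values("n_missing", ascending=False)
print(null_summary)
```

A column that is mostly empty, as Cabin is in the real data, may deserve a different treatment from one with only a couple of gaps.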

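Enter the main characters (strategies and functions)

Above we saw which columns have missing values. In some situations, such as a quick data cleanup, we simply adopt the following strategy:

For float columns, we usually fill with the mean or the median.
For object columns with a small, fixed set of categories, such as (male, female) or (high, medium, low), we usually fill with the mode.
For object columns that behave like identifiers, where almost every value is distinct (an ID, for example), we fill with a constant.

This article uses sklearn's preprocessing, pipeline and impute modules, plus a third-party "newcomer", the sklearn_pandas library.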
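
The three strategies can be tried in isolation before wiring them into mappers. The sketch below uses only scikit-learn's SimpleImputer on a hypothetical toy frame; it is an illustration of the strategies, not the article's own code, which relies on sklearn_pandas for the categorical columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; object dtype is set explicitly for the text columns.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
    "Cabin": pd.Series(["C85", np.nan, np.nan, "E46"], dtype=object),
})

# Numeric column: fill with the median of the observed values.
age_filled = SimpleImputer(strategy="median").fit_transform(toy_df[["Age"]])

# Column with few categories: fill with the most frequent value (the mode).
embarked_filled = SimpleImputer(strategy="most_frequent").fit_transform(
    toy_df[["Embarked"]])

# ID-like column: fill with a constant placeholder.
cabin_filled = SimpleImputer(strategy="constant", fill_value="unknown").fit_transform(
    toy_df[["Cabin"]])

print(age_filled.ravel())       # [22. 26. 26. 35.]
print(embarked_filled.ravel())  # ['S' 'C' 'S' 'S']
print(cabin_filled.ravel())     # ['C85' 'unknown' 'unknown' 'E46']
```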

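Here is a brief description of what each of these classes does.

StandardScaler: standardizes numeric columns.
LabelBinarizer: as the name suggests, it first labels the categories (turns them into numbers) and then binarizes them (turns them into 0/1 indicator columns). This is equivalent to one-hot encoding, except that LabelBinarizer processes only one column at a time.
FeatureUnion: combines the results of several feature preprocessing steps into one. Note that its input is not data but transformers, that is, the preprocessing methods themselves.
SimpleImputer: sklearn's built-in imputer, similar to fillna.
CategoricalImputer: an imputer for categorical columns supplied by sklearn_pandas, used here to fill missing values in object columns. SimpleImputer with strategy="most_frequent" or strategy="constant" can also fill string columns.
DataFrameMapper: builds different transformers for different columns of a DataFrame.
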
from sklearn.preprocessing import StandardScaler 
from sklearn.preprocessing import LabelBinarizer 
from sklearn.pipeline import FeatureUnion 
from sklearn_pandas import CategoricalImputer 
from sklearn_pandas import DataFrameMapper 
from sklearn.impute import SimpleImputer

Following our strategy, we need to split the columns into numeric and categorical ones. The idea is to check whether a column's dtype is object.

# split categorical columns and numerical columns 
categorical_mask = (raw_df.dtypes == object) 
categorical_cols = raw_df.columns[categorical_mask].tolist() 
numeric_cols = raw_df.columns[~categorical_mask].tolist() 
numeric_cols.remove('Survived') 
print(f'categorical_cols are {categorical_cols}' ) 
print(f'numeric_cols are {numeric_cols}' )

Output:

categorical_cols are ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'] 
numeric_cols are ['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
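
The same split can also be written with pandas' select_dtypes, which some readers find easier to scan than a boolean mask. A minimal sketch on a hypothetical toy frame (toy_df and target are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one target, two numeric and one text column.
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": pd.Series(["male", "female", "female"], dtype=object),
    "Age": [22.0, 38.0, np.nan],
})
target = "Survived"

# Object columns are treated as categorical, numeric columns as numeric;
# the target is dropped so that it is not preprocessed as a feature.
categorical_cols = toy_df.select_dtypes(include="object").columns.tolist()
numeric_cols = toy_df.select_dtypes(include="number").columns.drop(target).tolist()

print(categorical_cols)  # ['Sex']
print(numeric_cols)      # ['Pclass', 'Age']
```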

Preprocessing numeric data

To preprocess the numeric data, we use DataFrameMapper to create the transformer object, which fills every column in numeric_cols with its median.

numeric_fillna_mapper=DataFrameMapper([([col], SimpleImputer(strategy="median")) for col in numeric_cols], 
                                            input_df=True, 
                                            df_out=True 
                                           )

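We can test the code to see what the transformed data looks like. This requires calling the fit_transform method.

transformed = numeric_fillna_mapper.fit_transform(raw_df) 
transformed.info()  # info() prints directly; wrapping it in print() would also print None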

The result is shown below. The transformed data contains only the columns we processed, and the non-null count of every column is now 891, so nothing is missing.

#   Column       Non-Null Count  Dtype   
--  ------       --------------  -----   
0   PassengerId  891 non-null    float64 
1   Pclass       891 non-null    float64 
2   Age          891 non-null    float64 
3   SibSp        891 non-null    float64 
4   Parch        891 non-null    float64 
5   Fare         891 non-null    float64

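Suppose we want to fill the missing values of the numeric features first and then standardize them. We only need to modify the mapper above by giving each column a list of transformers. This list contains two steps, SimpleImputer and StandardScaler.

# fill NaN with the median 
# and then standardize the columns 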
numeric_fillna_standardize_mapper=DataFrameMapper([([col], [SimpleImputer(strategy="median"), 
                                                StandardScaler()]) for col in numeric_cols], 
                                            input_df=True, 
                                            df_out=True 
                                           ) 
fillna_standardized = numeric_fillna_standardize_mapper.fit_transform(raw_df) 
 
print(fillna_standardized.head())

A preview of the transformed result:

   PassengerId       Pclass          Age        SibSp        Parch          Fare 
0    -1.730108  0.827377 -0.565736  0.432793 -0.473674 -0.502445 
1    -1.726220 -1.566107  0.663861  0.432793 -0.473674  0.786845 
2    -1.722332  0.827377 -0.258337 -0.474545 -0.473674 -0.488854 
3    -1.718444 -1.566107  0.433312  0.432793 -0.473674  0.420730 
4    -1.714556  0.827377  0.433312 -0.474545 -0.473674 -0.486337
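
The values above are z-scores: after imputation, each column is shifted to mean 0 and scaled to unit standard deviation. The sketch below checks this property with plain scikit-learn on a hypothetical toy frame. It uses make_pipeline in place of the article's DataFrameMapper, so treat it as an equivalent illustration and not as the same code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame with one missing Age.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
})

# Same two steps as the mapper: median imputation, then standardization.
numeric_steps = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
scaled = pd.DataFrame(numeric_steps.fit_transform(toy_df), columns=toy_df.columns)

# Each column should now have mean ~0 and population standard deviation ~1.
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Note that StandardScaler uses the population standard deviation, which is why the check passes ddof=0 to pandas.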

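That completes the preprocessing of the numeric data. We can preprocess the categorical data in a similar way.

Preprocessing categorical data

In this example, Cabin and Embarked have missing values. A column such as Embarked has a limited number of categories, so we can fill it with the most frequent value. But what if Name were missing? Names are generally unique, and even if a few are repeated, filling with the most frequent one is meaningless. In that case we fill with a constant such as "unknown".

As a template, our approach here needs to cover both cases.

['Name','Cabin','Ticket'] all behave like IDs with hardly any repeated values, so we fill them with a constant and then turn them into dummy variables with LabelBinarizer.
For the other columns, we fill with the most frequent category and then turn them into dummy variables with LabelBinarizer.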

# Apply categorical imputer 
 
constant_cols = ['Name','Cabin','Ticket'] 
frequency_cols = [col for col in categorical_cols if col not in constant_cols] 
 
categorical_fillna_freq_mapper = DataFrameMapper( 
                                                [(col, [CategoricalImputer(),LabelBinarizer()]) for col in frequency_cols], 
                                                input_df=True, 
                                                df_out=True 
                                               ) 
 
categorical_fillna_constant_mapper = DataFrameMapper( 
                                                [(col, [CategoricalImputer(strategy='constant',fill_value='unknown'),LabelBinarizer()]) for col in constant_cols], 
                                                input_df=True, 
                                                df_out=True 
                                               )

We test the code in the same way:

transformed = categorical_fillna_freq_mapper.fit_transform(raw_df) 
transformed.info()  # info() prints directly and returns None 
transformed = categorical_fillna_constant_mapper.fit_transform(raw_df) 
print(transformed.shape)

The result:

Data columns (total 4 columns): 
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Sex         891 non-null    int32 
 1   Embarked_C  891 non-null    int32 
 2   Embarked_Q  891 non-null    int32 
 3   Embarked_S  891 non-null    int32 
dtypes: int32(4)

and:

(891, 1720)
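
The column counts above follow from how LabelBinarizer encodes a single column: a two-class column such as Sex becomes one 0/1 column, while a column with three or more classes gets one indicator column per distinct value. That is also why the ID-like columns expand into so many columns. A small sketch on hypothetical toy arrays:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy columns: one with two classes, one with three.
sex = np.array(["male", "female", "female", "male"])
embarked = np.array(["S", "C", "Q", "S"])

# Two classes collapse into a single 0/1 column.
sex_encoded = LabelBinarizer().fit_transform(sex)

# Three classes produce one indicator column per class, in sorted order.
embarked_lb = LabelBinarizer()
embarked_encoded = embarked_lb.fit_transform(embarked)

print(sex_encoded.shape)       # (4, 1)
print(embarked_encoded.shape)  # (4, 3)
print(embarked_lb.classes_)    # ['C' 'Q' 'S']
```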

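Combining all the preprocessing steps with FeatureUnion

We have now tested each preprocessing step (a transformer, also called a mapper) and seen that each result contains only the columns that step handles.

In practice, we can use FeatureUnion to combine all the required steps (transformers or mappers) into a single pipeline-like object, which makes the whole flow clear at a glance.

We then call fit_transform on the raw data, which makes the preprocessing look much better organized.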

feature_union_1 = FeatureUnion([("numeric_fillna_standardize", numeric_fillna_standardize_mapper), 
                                ("cat_freq", categorical_fillna_freq_mapper), 
                                ("cat_constant", categorical_fillna_constant_mapper)]) 
 
df_1 = feature_union_1.fit_transform(raw_df) 
 
print(df_1.shape) 
print(raw_df.shape)
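
For readers who prefer to stay within scikit-learn, a similar "different steps for different columns, then combine" layout can be sketched with ColumnTransformer. This is an alternative under stated assumptions, not the article's method: the toy frame and step names are hypothetical, and OneHotEncoder is swapped in for LabelBinarizer, so a two-class column yields two indicator columns here instead of one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with numeric and categorical gaps.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "Sex": pd.Series(["male", "female", "female", "male", "male"], dtype=object),
    "Embarked": pd.Series(["S", "C", np.nan, "S", "Q"], dtype=object),
})

numeric_steps = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
categorical_steps = make_pipeline(SimpleImputer(strategy="most_frequent"),
                                  OneHotEncoder(handle_unknown="ignore"))

# Each branch sees only its own columns; the outputs are stacked side by side.
preprocess = ColumnTransformer(
    [("numeric", numeric_steps, ["Age", "Fare"]),
     ("categorical", categorical_steps, ["Sex", "Embarked"])],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(toy_df)
print(features.shape)  # (5, 7): 2 numeric + 2 Sex + 3 Embarked columns
```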

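Summary

This article presented an "expert-level" way of preprocessing data and demonstrated it hands-on.

From this article you can learn:

Numeric preprocessing: applying several transformations in sequence to numeric columns with DataFrameMapper
Categorical preprocessing: applying several transformations in sequence to categorical columns with DataFrameMapper
Missing categorical values can be filled in at least two ways (the most frequent value or a constant)
Transforming data with StandardScaler, SimpleImputer, CategoricalImputer and LabelBinarizer
Using FeatureUnion to chain the preprocessing steps into a pipeline

Handling data this way makes everything clear at a glance.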

Editor: 未麗燕. Source: 今日頭條 (Jinri Toutiao)
