
A Hands-On Guide to Handling Missing Values with pandas

Author: Wes McKinney, 2021-02-06 14:55:05

Big Data, Data Analysis

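When doing data analysis and modeling, a large share of the time goes into data preparation: loading, cleaning, transforming, and rearranging. This article covers the tools pandas provides for handling missing values.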

By default, all descriptive statistics on pandas objects exclude missing values.
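As a minimal sketch of that default, the snippet below compares a mean that skips missing entries with one that does not. The variable names and sample values are made up for illustration; `skipna` is the standard keyword on pandas reductions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; two of the five entries are missing.
scores = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

mean_skipping_na = scores.mean()          # missing entries are ignored
mean_with_na = scores.mean(skipna=False)  # a missing entry makes the result NaN
non_missing_count = scores.count()        # counts only non-missing entries

print(mean_skipping_na, mean_with_na, non_missing_count)
```

Here the default mean is computed over the three observed values only, which is why it differs from a mean that treats the series as having five entries.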

The way pandas objects represent missing values is not perfect, but it works well for most users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing values. We call NaN a sentinel value that is easy to detect:

In:

import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

string_data

Out:

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In:

string_data.isnull()

Out:

0     False  
1     False  
2      True  
3     False  
dtype: bool

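In pandas, we have adopted a convention from the R programming language and refer to missing values as NA, meaning not available. In statistical applications, NA data may be data that does not exist, or data that exists but was not observed (for example, because of a problem during data collection). When cleaning data for analysis, it is often important to analyze the missing data itself to identify data collection problems or potential bias caused by the missing data.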

Python's built-in None value is also treated as NA in object arrays:

In:

string_data[0] = None

string_data.isnull()

Out:

0      True  
1     False  
2      True  
3     False  
dtype: bool
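The short sketch below illustrates the same point on fresh data: `None` and `np.nan` are both reported as missing, and in a numeric Series `None` is converted to NaN on construction. The variable names and values are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# A Series of strings containing both kinds of missing marker.
labels = pd.Series(["a", None, np.nan])
labels_missing = labels.isnull().tolist()

# In numeric data, None is stored as the floating-point NaN.
numbers = pd.Series([1.0, None, 3.0])

print(labels_missing)
print(numbers.dtype, numbers.isnull().tolist())
```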

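The pandas project continues to improve the internal details of how missing values are handled, but user-facing API functions such as pandas.isnull abstract away many of the annoying details. The functions related to missing-value handling are listed below: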

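dropna: filters axis labels based on whether the values for each label contain missing data, with a threshold controlling how much missing data to tolerate
fillna: fills in missing data with a given value or with an interpolation method (such as 'ffill' or 'bfill')
isnull: returns boolean values indicating which values are missing
notnull: the negation of isnull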
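The four functions listed above can be tried side by side on a tiny frame. This is only a sketch on made-up data (the column names `a` and `b` are hypothetical); it also shows that summing the boolean mask is a quick way to count missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame, for illustration only.
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                      "b": [np.nan, np.nan, 6.0]})

missing_mask = frame.isnull()            # True where a value is missing
present_mask = frame.notnull()           # element-wise negation of isnull
missing_per_column = missing_mask.sum()  # True counts as 1 when summed

complete_rows = frame.dropna()           # keeps only rows with no missing value
zero_filled = frame.fillna(0)            # replaces every missing value with 0

print(missing_per_column)
print(complete_rows)
print(zero_filled)
```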

01 Filtering Out Missing Values

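There are several ways to filter out missing values. You can always do it by hand with pandas.isnull and boolean indexing, but dropna is often more convenient. Called on a Series, dropna returns the Series with only the non-null data and their index values: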

In:

from numpy import nan as NA

data = pd.Series([1, NA, 3.5, NA, 7])

data.dropna()

Out:

0     1.0  
2     3.5  
4     7.0  
dtype: float64

This is equivalent to:

In:

data[data.notnull()]

Out:

0     1.0  
2     3.5  
4     7.0  
dtype: float64

With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA, or only those containing any NA. By default, dropna drops any row containing a missing value:

In:

data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])

cleaned = data.dropna()

data

Out:

     0    1    2
0  1.0  6.5  3.0  
1  1.0  NaN  NaN  
2  NaN  NaN  NaN  
3  NaN  6.5  3.0

In:

cleaned

Out:

     0    1    2
0  1.0  6.5  3.0

Passing how='all' drops only rows whose values are all NA:

In:

data.dropna(how='all')

Out:

     0    1    2  
0  1.0  6.5  3.0  
1  1.0  NaN  NaN  
3  NaN  6.5  3.0

To drop columns in the same way, pass axis=1:

In:

data[4] = NA

data

Out:

     0    1    2   4  
0  1.0  6.5  3.0 NaN  
1  1.0  NaN  NaN NaN  
2  NaN  NaN  NaN NaN  
3  NaN  6.5  3.0 NaN

In:

data.dropna(axis=1, how='all')

Out:

     0    1    2  
0  1.0  6.5  3.0  
1  1.0  NaN  NaN  
2  NaN  NaN  NaN  
3  NaN  6.5  3.0
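To make the interaction between how and axis easier to see, here is a small sketch on a deliberately awkward frame in which every row and every column contains at least one NA. The data and variable names are invented for the example.

```python
import numpy as np
import pandas as pd

# Row 1 is entirely missing and column 2 is entirely missing.
frame = pd.DataFrame([[1.0, 6.5, np.nan],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 6.5, np.nan]])

rows_any = frame.dropna()                    # default how='any': every row has an NA
rows_all = frame.dropna(how="all")           # only the all-NA row 1 is dropped
cols_all = frame.dropna(axis=1, how="all")   # only the all-NA column 2 is dropped
cols_any = frame.dropna(axis=1)              # every column has an NA

print(rows_any.shape, rows_all.index.tolist())
print(cols_all.columns.tolist(), cols_any.shape)
```

On data like this the default how='any' throws everything away, which is one reason how='all' or thresh is often the safer choice.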

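A related way of filtering DataFrame rows tends to come up with time series data. Suppose you want to keep only rows containing at least a certain number of observations. You can express this with the thresh argument: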

In:

df = pd.DataFrame(np.random.randn(7, 3))

df.iloc[:4, 1] = NA

df.iloc[:2, 2] = NA

df

Out:

          0         1         2  
0 -0.204708       NaN       NaN  
1 -0.555730       NaN       NaN  
2  0.092908       NaN  0.769023  
3  1.246435       NaN -1.296221  
4  0.274992  0.228913  1.352917  
5  0.886429 -2.001637 -0.371843  
6  1.669025 -0.438570 -0.539741

In:

df.dropna()

Out:

          0         1         2
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

In:

df.dropna(thresh=2)

Out:

          0         1         2
2  0.092908       NaN  0.769023
3  1.246435       NaN -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
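One way to read thresh=2 is "keep rows that have at least two non-missing values". The sketch below checks that reading against a hand-built boolean mask, using fixed values instead of random ones so the result is reproducible; the names are illustrative.

```python
import numpy as np
import pandas as pd

# Rows 0, 1, and 2 hold one, two, and three observed values respectively.
frame = pd.DataFrame([[1.0, np.nan, np.nan],
                      [2.0, np.nan, 5.0],
                      [3.0, 4.0, 6.0]])

non_missing_per_row = frame.notnull().sum(axis=1)
kept_by_thresh = frame.dropna(thresh=2)
kept_by_mask = frame[non_missing_per_row >= 2]

print(non_missing_per_row.tolist())
print(kept_by_thresh)
```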

02 Filling In Missing Values

Rather than filtering out missing values (and possibly discarding other data along with them), you may sometimes want to fill in the "holes" in some way.

In most cases, fillna is the method to use. Calling fillna with a constant replaces missing values with that value:

In:

df.fillna(0)

Out:

          0         1         2  
0 -0.204708  0.000000  0.000000  
1 -0.555730  0.000000  0.000000  
2  0.092908  0.000000  0.769023  
3  1.246435  0.000000 -1.296221  
4  0.274992  0.228913  1.352917  
5  0.886429 -2.001637 -0.371843  
6  1.669025 -0.438570 -0.539741

Calling fillna with a dict lets you use a different fill value for each column:

In:

df.fillna({1: 0.5, 2: 0})

Out:

          0         1         2
0 -0.204708  0.500000  0.000000  
1 -0.555730  0.500000  0.000000  
2  0.092908  0.500000  0.769023  
3  1.246435  0.500000 -1.296221  
4  0.274992  0.228913  1.352917  
5  0.886429 -2.001637 -0.371843  
6  1.669025 -0.438570 -0.539741

fillna returns a new object, but you can also modify the existing object in place:

In:

_ = df.fillna(0, inplace=True)

df

Out:

          0         1         2
0 -0.204708  0.000000  0.000000  
1 -0.555730  0.000000  0.000000  
2  0.092908  0.000000  0.769023  
3  1.246435  0.000000 -1.296221  
4  0.274992  0.228913  1.352917  
5  0.886429 -2.001637 -0.371843  
6  1.669025 -0.438570 -0.539741
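The difference between the copying and in-place forms is easy to miss when the data is random, so here is a small deterministic sketch. The column names `x` and `y` and the helper variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one hole in each column.
frame = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, 4.0]})

# A dict maps column labels to fill values; a new object is returned.
filled_copy = frame.fillna({"x": 0.5, "y": 0})
still_missing = int(frame.isnull().sum().sum())  # the original keeps its holes

# inplace=True modifies the object it is called on instead.
working = frame.copy()
working.fillna(0, inplace=True)
remaining = int(working.isnull().sum().sum())

print(filled_copy)
print(still_missing, remaining)
```

Reassigning the result (`frame = frame.fillna(0)`) is a common alternative to inplace=True and keeps the code free of side effects on shared objects.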

The same interpolation methods available for reindexing can be used with fillna:

In:

df = pd.DataFrame(np.random.randn(6, 3))

df.iloc[2:, 1] = NA

df.iloc[4:, 2] = NA

df

Out:

          0         1         2
0  0.476985  3.248944 -1.021228  
1 -0.577087  0.124121  0.302614  
2  0.523772       NaN  1.343810  
3 -0.713544       NaN -2.370232  
4 -1.860761       NaN       NaN  
5 -1.265934       NaN       NaN

In:

df.fillna(method='ffill')  # the method keyword is deprecated since pandas 2.1; df.ffill() is the newer spelling

Out:

          0         1         2  
0  0.476985  3.248944 -1.021228  
1 -0.577087  0.124121  0.302614  
2  0.523772  0.124121  1.343810  
3 -0.713544  0.124121 -2.370232  
4 -1.860761  0.124121 -2.370232  
5 -1.265934  0.124121 -2.370232

In:

df.fillna(method='ffill', limit=2)  # newer spelling: df.ffill(limit=2)

Out:

          0         1         2
0  0.476985  3.248944 -1.021228  
1 -0.577087  0.124121  0.302614  
2  0.523772  0.124121  1.343810  
3 -0.713544  0.124121 -2.370232  
4 -1.860761       NaN -2.370232  
5 -1.265934       NaN -2.370232
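For readers on recent pandas versions, the sketch below shows forward and backward filling through the dedicated ffill and bfill methods, which accept the same limit argument. The frame and its column names are made up so that the effect of limit is visible at a glance.

```python
import numpy as np
import pandas as pd

# Column "a" has a run of three missing values after its first entry.
frame = pd.DataFrame({"a": [1.0, np.nan, np.nan, np.nan],
                      "b": [np.nan, 2.0, np.nan, 4.0]})

forward = frame.ffill()                 # carry the last observed value forward
forward_limited = frame.ffill(limit=2)  # fill at most two consecutive holes
backward = frame.bfill()                # pull the next observed value backward

print(forward)
print(forward_limited)
print(backward)
```

Note that a leading hole (the first entry of "b") has nothing before it to carry forward, so forward filling leaves it missing.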

With fillna you can do plenty of other things with a little creativity. For example, you can fill missing values with the mean or median of a Series:

In:

data = pd.Series([1., NA, 3.5, NA, 7])

data.fillna(data.mean())

Out:

0     1.000000  
1     3.833333  
2     3.500000  
3     3.833333  
4     7.000000  
dtype: float64
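The same idea extends to a DataFrame: passing the Series returned by mean() or median() fills each column with its own statistic, because fillna matches the Series index against the column labels. The sketch below uses hypothetical columns `height` and `weight` with values chosen so that the mean and median differ.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one hole per column.
frame = pd.DataFrame({"height": [1.0, np.nan, 3.0, 8.0],
                      "weight": [10.0, 20.0, np.nan, 60.0]})

filled_mean = frame.fillna(frame.mean())      # per-column means: 4.0 and 30.0
filled_median = frame.fillna(frame.median())  # per-column medians: 3.0 and 20.0

print(filled_mean)
print(filled_median)
```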

The fillna function arguments are as follows.

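value: scalar value or dict-like object used to fill missing values
method: interpolation method, such as 'ffill' or 'bfill'; it defaults to None, so either value or method has to be supplied
axis: axis to fill along, default axis=0
inplace: modifies the calling object instead of producing a copy
limit: for forward and backward filling, the maximum number of consecutive values to fill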

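About the author: Wes McKinney is the creator of pandas, the popular open source Python data analysis library. He is an active speaker and a Python/C++ open source developer in the Python data community and the Apache Software Foundation. He currently works as a software architect in New York.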

This article is excerpted from the Chinese edition of Python for Data Analysis, 2nd Edition, and is published with the publisher's permission.

Editor: 龐桂玉. Source: 大数据DT
