
Python Data Analysis: Read This Article and You Will Have Data Cleaning Mastered

Author: 嘩啦圈的夢 2019-09-11 14:40:44



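Every data analysis starts from the same premise: you have data, and it has already been cleaned and organized into the format you need.

Wherever your data comes from, you need to look it over carefully and clean up anything that does not conform. The step is not strictly mandatory, but it is a good habit: otherwise you may discover during later analysis that your statistics are wrong because the data was never tidied. Today I will walk through the data cleaning techniques I use day to day. You may already know many of them, in which case treat this as a refresher!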

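Article outline:

How to import your data more efficiently
Inspecting the data thoroughly
Setting the index
Setting labels
Handling missing values
Removing duplicates
Converting data types
Filtering data
Sorting data
Processing text
Merging & matching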

Importing data:

import pandas as pd

pd.read_excel("aa.xlsx")
pd.read_csv("aa.csv")
pd.read_clipboard()

How to import data efficiently:

1. Limit the number of rows imported. If the data set is large and you only want a first look, import a small part of it:

pd.read_csv("aaa.csv", nrows=1000)
pd.read_excel("aa.xlsx", nrows=1000)

2. If you know which columns you need and what their labels are, import only those columns:

pd.read_csv("aaa.csv", usecols=["A", "B"])
pd.read_excel("aa.xlsx", usecols=["A", "B"])

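3. Column labels, if the file has none or you need to set them differently:

pd.read_excel("aa.xlsx", header=None)  # the file has no header row; columns get the default labels 0, 1, 2
pd.read_excel("aa.xlsx", header=1)  # use the second row as the column labels
pd.read_excel("aa.xlsx", header=[1, 2])  # multi-level column labels
pd.read_csv("aaa.csv", header=None)
pd.read_csv("aaa.csv", header=1)
pd.read_csv("aaa.csv", header=[1, 2])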

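4. Set the index column. Do this if one of your columns makes a better index for the analysis; otherwise the default 0, 1, 2 is assigned:

pd.read_csv("aaa.csv", index_col=1)
pd.read_excel("aa.xlsx", index_col=2)

5. Set the value types. This step matters because it affects later calculations, although the types can also be set afterwards:

pd.read_csv("aaa.csv", converters={'排名': str, '場次': float})
data = pd.read_excel(io, sheet_name='converters', converters={'排名': str, '場次': float})  # io is the path or file object of the workbook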
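The import options above can be combined in a single call. The sketch below is only an illustration: the CSV text and its column names (id, city, price, note) are made up, and it reads from an in-memory string so that it runs without a file on disk. It also uses the dtype parameter where the article uses converters; both are accepted by read_csv.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for "aaa.csv".
csv_text = """id,city,price,note
1,beijing,10.5,a
2,shanghai,20.0,b
3,beijing,7.25,c
4,shenzhen,3.0,d
"""

preview = pd.read_csv(
    io.StringIO(csv_text),
    nrows=3,                          # read only the first three data rows
    usecols=["id", "city", "price"],  # skip the "note" column
    index_col="id",                   # use "id" as the row index
    dtype={"price": float},           # fix the column type at load time
)

print(preview)
print(preview.shape)
```

Loading a small preview like this first is a cheap way to check labels and types before reading the full file.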

Inspecting the data thoroughly:

View the first few rows:

data.head()

View the last few rows:

data.tail()

View the dimensions of the data:

data.shape  # e.g. (16281, 7)

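View the data types of the DataFrame:

df.dtypes

View the row index of the DataFrame:

df.index

View the column labels of the DataFrame:

df.columns

View the values of the DataFrame:

df.values

View summary statistics for the DataFrame:

df.describe()

The data type of a single column:

df['B'].dtype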
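To make the inspection calls concrete, here is a minimal sketch that gathers them into one report. The overview helper and the sample DataFrame are hypothetical and not part of pandas or of the original article.

```python
import pandas as pd

# Small made-up DataFrame with one integer, one text, and one float column.
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [0.5, 1.5, 2.5]})

def overview(frame):
    """Collect the basic facts worth checking before any cleaning."""
    return {
        "shape": frame.shape,                          # (rows, columns)
        "columns": list(frame.columns),
        "dtypes": frame.dtypes.astype(str).to_dict(),  # dtype name per column
        "numeric_summary": frame.describe(),           # numeric columns only by default
    }

report = overview(df)
print(report["shape"])
print(report["dtypes"])
print(report["numeric_summary"])
```

Note that describe() skips the text column B here, since by default it summarizes numeric columns only.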

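Setting the index and labels:

We often need to reset the index column or rename the column labels.

Rename column labels (first line) or index labels (second line):

df.rename(columns={"A": "a", "B": "c"})
df.rename(index={0: "x", 1: "y", 2: "z"})

Set a new index:

df.set_index('month')

Change the set of rows or columns:

df.reindex(['http_status', 'user_agent'], axis="columns")
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']
df.reindex(new_index)

Discard the current index and restore the default integer index (the old index becomes a column):

df.reset_index()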
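One detail that is easy to miss: rename, set_index, reindex, and reset_index all return a new DataFrame and leave the original untouched unless you assign the result. A small sketch with made-up data (the month, A, and B columns are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({"month": ["jan", "feb", "mar"], "A": [1, 2, 3], "B": [4, 5, 6]})

renamed = df.rename(columns={"A": "a", "B": "c"})  # new column labels
indexed = renamed.set_index("month")               # "month" becomes the index
subset = indexed.reindex(["jan", "mar", "apr"])    # "apr" is not in the index, so its row is NaN
restored = indexed.reset_index()                   # "month" is a column again

print(list(df.columns))   # the original DataFrame is unchanged
print(subset)
print(restored)
```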

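Handling missing values and duplicates:

Check whether each column contains NA:

df.isnull().any()

Fill NA:

df.fillna(0)

Drop rows that contain NA:

rs = df.dropna(axis=0)

Drop columns that contain NA:

rs = df.dropna(axis=1)

Drop rows with duplicate values in a given column:

a = frame.drop_duplicates(subset=['pop'], keep='last')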
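The sketch below runs the missing-value and duplicate calls on a tiny made-up frame so the effect of each is visible. The column names pop and state and their values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "pop": [1.5, 1.5, np.nan, 2.9],
    "state": ["Ohio", "Ohio", "Nevada", None],
})

has_na = frame.isnull().any()        # one True/False flag per column
filled = frame.fillna({"pop": 0})    # fill only the "pop" column
rows_kept = frame.dropna(axis=0)     # keep rows that have no NA at all
deduped = frame.drop_duplicates(subset=["pop"], keep="last")  # last row wins per "pop" value

print(has_na)
print(filled)
print(rows_kept)
print(deduped)
```

Passing a dict to fillna keeps the fill value specific to each column, which avoids writing 0 into a text column by accident.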

Converting data types:

df.dtypes: view the data types

astype() forces a conversion to the given type
a custom function can do the conversion
pandas provides to_numeric() and to_datetime()

df["Active"].astype("bool")
df['2016'].astype('float')
df["2016"].apply(lambda x: x.replace(",", "").replace("$", "")).astype("float64")
df["Percent Growth"].apply(lambda x: x.replace("%", "")).astype("float") / 100
pd.to_numeric(df["Jan Units"], errors='coerce').fillna(0)
pd.to_datetime(df[['Month', 'Day', 'Year']])
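Here is a rough end-to-end sketch of the three conversion approaches on made-up data. The column names follow the snippets above; the values, the to_money helper, and the new Start column are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "2016": ["$125,000.00", "$920,000.00"],
    "Percent Growth": ["30.00%", "10.00%"],
    "Jan Units": ["500", "Closed"],
    "Month": [1, 6], "Day": [10, 15], "Year": [2015, 2014],
    "Active": ["Y", "N"],
})

def to_money(text):
    """Strip the currency symbol and thousands separators, then convert."""
    return float(text.replace("$", "").replace(",", ""))

df["2016"] = df["2016"].apply(to_money)                                        # custom function
df["Percent Growth"] = df["Percent Growth"].str.rstrip("%").astype(float) / 100
df["Jan Units"] = pd.to_numeric(df["Jan Units"], errors="coerce").fillna(0)    # "Closed" -> NaN -> 0
df["Start"] = pd.to_datetime(df[["Month", "Day", "Year"]])                     # assemble dates from parts
df["Active"] = df["Active"] == "Y"   # compare explicitly; any non-empty string is truthy

print(df.dtypes)
print(df)
```

The explicit comparison for Active is a deliberate choice: a flag column holding "Y" and "N" needs a rule for which text means True.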

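Filtering data:

1. Extract a single row by index label

df_inner.loc[3]

2. Extract a range of rows by position

df_inner.iloc[0:5]

3. Extract all rows up to and including the 4th (this needs a date index)

df_inner[:'2013-01-04']

4. Extract a region by position with iloc

df_inner.iloc[:3, :2]  # the numbers around the colons are positions, not index labels; counting starts at 0, so this is the first three rows and the first two columns

5. Extract individual positions with iloc

df_inner.iloc[[0, 2, 5], [4, 5]]  # rows at positions 0, 2, 5 and columns at positions 4, 5

6. Mix index labels and positions (ix was removed in pandas 1.0, so chain loc and iloc instead)

df_inner.loc[:'2013-01-03'].iloc[:, :4]  # rows up to 2013-01-03, first four columns

7. Extract rows and columns with loc

df_inner.loc[2:10, "A":"Z"]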

8. Check whether the city column contains beijing or shanghai, then extract the matching rows

df_inner['city'].isin(['beijing'])
df_inner.loc[df_inner['city'].isin(['beijing', 'shanghai'])]

9. Extract the first three characters and build a DataFrame from them

pd.DataFrame(df_inner['category'].str[:3])

10. Filter with "and"

df_inner.loc[(df_inner['age'] > 25) & (df_inner['city'] == 'beijing'), ['id', 'city', 'age', 'category', 'gender']]

11. Filter with "or"

df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['age'])

12. Filter with "not"

df_inner.loc[(df_inner['city'] != 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['id'])

13. Count the filtered rows by the city column

df_inner.loc[(df_inner['city'] != 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['id']).city.count()

14. Filter with the query function

df_inner.query('city == ["beijing", "shanghai"]')

15. Sum the price column of the filtered result

df_inner.query('city == ["beijing", "shanghai"]').price.sum()
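The filtering calls are easier to follow on a concrete table. The sketch below builds a small stand-in for df_inner with a date index; the data is made up, and only the id, city, age, and price columns are included.

```python
import pandas as pd

df_inner = pd.DataFrame(
    {
        "id": [1001, 1002, 1003, 1004, 1005],
        "city": ["beijing", "shanghai", "guangzhou", "beijing", "shenzhen"],
        "age": [23, 44, 54, 32, 34],
        "price": [1200.0, 2133.0, 5433.0, 4432.0, 1000.0],
    },
    index=pd.date_range("2013-01-01", periods=5),
)

by_label = df_inner.loc[:"2013-01-03"]   # label slice, the end label is included
by_position = df_inner.iloc[:3, :2]      # first three rows, first two columns
in_cities = df_inner.loc[df_inner["city"].isin(["beijing", "shanghai"])]
older_in_beijing = df_inner.loc[
    (df_inner["age"] > 25) & (df_inner["city"] == "beijing"), ["id", "city", "age"]
]
total = df_inner.query('city == ["beijing", "shanghai"]').price.sum()

print(by_label)
print(in_cities)
print(older_in_beijing)
print(total)
```

Each condition in a combined mask needs its own parentheses, because & and | bind more tightly than the comparison operators.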

Sorting data:

Sort by the values of a specific column:

df_inner.sort_values(by=['age'])

Sort by the index:

df_inner.sort_index()

Ascending:

df_inner.sort_values(by=['age'], ascending=True)

Descending:

df_inner.sort_values(by=['age'], ascending=False)
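As a small extension of the sorting calls above, sort_values also accepts several keys with a separate direction for each. The data in this sketch is made up for illustration.

```python
import pandas as pd

df_inner = pd.DataFrame(
    {"city": ["beijing", "shanghai", "beijing", "shanghai"], "age": [32, 44, 23, 28]},
    index=[3, 1, 2, 0],
)

by_age_desc = df_inner.sort_values(by=["age"], ascending=False)
# Sort by city A to Z, then by age from oldest to youngest within each city.
by_city_then_age = df_inner.sort_values(by=["city", "age"], ascending=[True, False])
by_index = df_inner.sort_index()

print(by_age_desc)
print(by_city_then_age)
print(by_index)
```

All three calls return sorted copies; df_inner itself keeps its original row order.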

Merging and matching:

1. merge

result = pd.merge(left, right, on='key')
result = pd.merge(left, right, on=['key1', 'key2'])
result = pd.merge(left, right, how='left', on=['key1', 'key2'])
result = pd.merge(left, right, how='right', on=['key1', 'key2'])
result = pd.merge(left, right, how='outer', on=['key1', 'key2'])

2. append (DataFrame.append was removed in pandas 2.0; pd.concat does the same job)

result = pd.concat([df1, df2])
result = pd.concat([df1, df4])
result = pd.concat([df1, df2, df3])
result = pd.concat([df1, df4], ignore_index=True)

3. join

left.join(right, on=key_or_keys)

result = left.join(right, on='key')
result = left.join(right, on=['key1', 'key2'])
result = left.join(right, on=['key1', 'key2'], how='inner')

4. concat

result = pd.concat([df1, df4], axis=1)
result = pd.concat([df1, df4], axis=1, join='inner')
result = pd.concat([df1, df4], axis=1).reindex(df1.index)  # join_axes was removed in pandas 1.0
result = pd.concat([df1, df4], ignore_index=True)
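To show how the how argument changes a merge, here is a minimal sketch with two made-up tables that share a key column. The frames left and right and their values are assumptions for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K0", "K1", "K3"], "B": ["B0", "B1", "B3"]})

inner = pd.merge(left, right, on="key")                  # keys present in both: K0, K1
left_join = pd.merge(left, right, how="left", on="key")  # every left key; B is NaN for K2
outer = pd.merge(left, right, how="outer", on="key")     # union of keys: K0, K1, K2, K3

# Stack the rows of right under the rows of left and renumber the index.
stacked = pd.concat([left, right], ignore_index=True)

print(inner)
print(left_join)
print(outer)
print(stacked)
```

In short, merge matches rows on key values, while concat simply places tables next to or under each other.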

Text processing:

1. lower() example

import numpy as np  # for np.nan
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveMinsu'])
s.str.lower()

2. upper() example

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveMinsu'])
s.str.upper()

3. len() counts characters

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveMinsu'])
s.str.len()

4. strip() removes leading and trailing whitespace

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
s.str.strip()

5. split(pattern) splits each string

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
s.str.split(' ')

6. cat(sep=pattern) joins the strings

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
s.str.cat(sep=' <=> ')

Running the code above gives the following result (the doubled spaces come from the untrimmed input):

Tom  <=>  William Rick <=> John <=> Alber@t

7. get_dummies() splits each string by sep and returns a DataFrame of dummy/indicator variables

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
s.str.get_dummies()

8. contains() returns True for each string that contains the substring; pat is a string or a regular expression

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
s.str.contains(' ')

9. replace(a, b) replaces the value a with the value b

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
s.str.replace('@', '$', regex=False)

10. repeat(value) repeats each element the given number of times

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
s.str.repeat(2)

Running the code above gives the following result:

0 Tom Tom
1 William Rick William Rick
2 JohnJohn
3 Alber@tAlber@t

11. count(pattern) counts occurrences of the pattern in each string

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print("The number of 'm's in each string:")
print(s.str.count('m'))

Running the code above gives the following result:

The number of 'm's in each string:

0 1
1 1
2 0
3 0

12. startswith(pattern) returns True if the string starts with the substring

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print("Strings that start with 'T':")
print(s.str.startswith('T'))

Running the code above gives the following result:

Strings that start with 'T':

0 True
1 False
2 False
3 False

13. endswith(pattern) returns True if the string ends with the substring

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print("Strings that end with 't':")
print(s.str.endswith('t'))

Running the code above gives the following result:

Strings that end with 't':

0 False
1 False
2 False
3 True

14. find(pattern) returns the lowest index at which the substring occurs, searching within [start:end]; it returns -1 if the substring is not found

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s.str.find('e'))

Running the code above gives the following result:

0 -1
1 -1
2 -1
3 3

Note: -1 means the element does not contain the pattern.

15. findall(pattern) finds every match of the regular expression and returns the matches as a list

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s.str.findall('e'))

Running the code above gives the following result:

0 []
1 []
2 []
3 [e]

An empty list ([]) means the element does not contain the pattern.

16. swapcase() swaps the case of each letter: upper becomes lower, lower becomes upper

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
s.str.swapcase()

Running the code above gives the following result:

0 tOM
1 wILLIAM rICK
2 jOHN
3 aLBER@T

17. islower() checks whether all letters are lowercase

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
s.str.islower()

18. isupper() checks whether all letters are uppercase

s = pd.Series(['TOM', 'William Rick', 'John', 'Alber@t'])
s.str.isupper()

19. isnumeric() checks whether all characters are numeric

s = pd.Series(['Tom', '1199', 'William Rick', 'John', 'Alber@t'])
s.str.isnumeric()
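In practice these string methods are usually chained. The sketch below is one possible cleanup of the sample names, assuming we want trimmed, lowercase text without the stray "@"; the choice of steps is illustrative, not a rule.

```python
import numpy as np
import pandas as pd

s = pd.Series(["Tom ", " William Rick", "John", "Alber@t", np.nan])

# Each .str call returns a new Series, so the steps can be chained.
cleaned = s.str.strip().str.replace("@", "", regex=False).str.lower()
has_space = cleaned.str.contains(" ", na=False)  # treat missing values as "no match"
lengths = cleaned.str.len()                      # the missing value stays missing

print(cleaned)
print(has_space)
print(lengths)
```

Missing values pass through the .str methods as missing rather than raising an error, which is why the na argument of contains is worth setting when the result is used as a filter.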

Editor: 未麗燕 Source: 今日頭條
