Taking the Pain Out of Data Cleaning with Pyjanitor

Author: 新語數據故事匯, 2024-05-10 08:31:57

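Data cleaning is usually regarded as a key preparatory step for data-driven decision making: its goal is to find and correct errors and inconsistencies in the data in order to improve data quality. As datasets grow, keeping data clean and complete becomes increasingly difficult, so it pays to understand both why data cleaning matters and how to carry it out.

From data analysis to exploratory data analysis (EDA) to machine learning models, the quality and completeness of the dataset determine whether the analysis and modeling are sound. A high-quality, complete dataset yields more reliable and accurate results and supports decisions grounded in data.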

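For more on why data cleaning matters, see "Understanding the Importance of Data Cleaning: A Key Step in Data-Driven Decision Making" or "Handling Missing Values in Data Science and Machine Learning Projects: Strategies and Practice".

This article introduces and demonstrates pyjanitor, a data cleaning toolkit that speeds up and simplifies the cleaning process.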

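What is pyjanitor?

Pyjanitor is a Python library designed to simplify data cleaning. It extends the popular pandas library with additional functionality for data scientists and analysts, making cleaning more efficient and convenient. It is easy to use and customizable enough for a wide range of cleaning tasks: adding and removing columns, renaming columns, handling missing values, filtering, grouping, reshaping, and processing string and text data. That makes it a good fit for preprocessing work, whether in a data science project or in day-to-day analysis.

Key features of Pyjanitor include:

Adding and removing columns
Renaming columns
Handling missing values
Filtering data
Grouping data
Reshaping data
Processing string and text data

Key advantages of using Pyjanitor for data cleaning include:

A simpler data cleaning workflow
Less time and effort spent on cleaning
A rich set of data cleaning and preparation functions
A high degree of customization and flexibility
Compatibility with pandas and other popular Python libraries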

Installing pyjanitor

pip install pyjanitor

A simple pyjanitor example

import pandas as pd
import janitor  # registers pyjanitor's methods on pandas DataFrames

# Read the dataset
df = pd.read_csv('heart_disease_uci.csv')

# Standardize the column names
df = df.clean_names()

# Drop the unnecessary columns
df = df.remove_columns(['ca', 'thal'])

# Convert the trestbps column to float
df['trestbps'] = df['trestbps'].astype(float)

# Sort the dataframe by the trestbps column in descending order
df = df.sort_values(by='trestbps', ascending=False)

# Save the cleaned dataframe to a new CSV file
df.to_csv('cleaned_heart_disease_uci.csv', index=False)

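This example first imports the required libraries, including Pyjanitor, and reads the dataset with pandas' read_csv function. It then standardizes the column names with Pyjanitor's clean_names and drops the unneeded columns with remove_columns. Next, astype converts the trestbps column (resting blood pressure) to float. Finally, sort_values sorts the DataFrame by trestbps in descending order, and to_csv saves the cleaned DataFrame to a new CSV file.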
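For readers who do not have the CSV at hand, the same sequence of steps can be sketched with pandas alone. This is only an approximation under stated assumptions: the rows below are made-up stand-ins for heart_disease_uci.csv, the column name "Trest BPS" is hypothetical, and the rename step mimics only the lowercase-and-underscore part of what clean_names does.

```python
import pandas as pd

# Hypothetical rows standing in for heart_disease_uci.csv (values are made up).
raw = pd.DataFrame(
    {
        "Age": [63, 41, 57],
        "Trest BPS": ["145", "130", "150"],
        "ca": [0, 0, 1],
        "thal": ["fixed", "normal", "normal"],
    }
)

cleaned = (
    raw
    # Rough stand-in for clean_names: lowercase, spaces become underscores.
    .rename(columns=lambda name: name.strip().lower().replace(" ", "_"))
    # Same effect as remove_columns(['ca', 'thal']).
    .drop(columns=["ca", "thal"])
    # Convert the blood-pressure readings from strings to floats.
    .astype({"trest_bps": float})
    # Highest resting blood pressure first.
    .sort_values(by="trest_bps", ascending=False)
    .reset_index(drop=True)
)

print(cleaned)
```

Writing the steps as one chain avoids reassigning df after every step, which is the style pyjanitor's methods are meant to support.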

API styles

There are three ways to use the API. The first, and the most strongly recommended, is to call pyjanitor's functions as if they were native pandas methods.

import numpy as np
import pandas as pd
import janitor  # upon import, functions are registered as part of pandas.

company_sales = {
    'SalesMonth': ['Jan', 'Feb', 'Mar', 'April'],
    'Company1': [150.0, 200.0, 300.0, 400.0],
    'Company2': [180.0, 250.0, np.nan, 500.0],
    'Company3': [400.0, 500.0, 600.0, 675.0]
}

# This cleans the column names and removes any rows and columns that are entirely empty
df = pd.DataFrame.from_dict(company_sales).clean_names().remove_empty()
df
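The company_sales table contains no fully empty row or column, so remove_empty leaves its shape unchanged. To see what that kind of step does, here is a pandas-only approximation on a hypothetical table (the column names and values are invented for illustration); it is a sketch of the idea, not pyjanitor's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical sales table with one all-empty row and one all-empty column.
sales = pd.DataFrame(
    {
        "company1": [150.0, np.nan, 300.0],
        "company2": [180.0, np.nan, np.nan],
        "notes": [np.nan, np.nan, np.nan],
    }
)

trimmed = (
    sales
    .dropna(axis="index", how="all")    # drop rows where every value is missing
    .dropna(axis="columns", how="all")  # drop columns where every value is missing
    .reset_index(drop=True)
)

print(trimmed)
```

The last row survives because it is only partially empty; rows and columns are dropped only when every value in them is missing.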

The second way is the functional API.

import pandas as pd
import numpy as np
company_sales = {
    'SalesMonth': ['Jan', 'Feb', 'Mar', 'April'],
    'Company1': [150.0, 200.0, 300.0, 400.0],
    'Company2': [180.0, 250.0, np.nan, 500.0],
    'Company3': [400.0, 500.0, 600.0, 675.0]
}
from janitor import clean_names, remove_empty
df = pd.DataFrame.from_dict(company_sales)
df = clean_names(df)
df = remove_empty(df)
df

The last way is to use the pipe() method:

import pandas as pd
import numpy as np
company_sales = {
    'SalesMonth': ['Jan', 'Feb', 'Mar', 'April'],
    'Company1': [150.0, 200.0, 300.0, 400.0],
    'Company2': [180.0, 250.0, np.nan, 500.0],
    'Company3': [400.0, 500.0, 600.0, 675.0]
}
from janitor import clean_names, remove_empty
df = (
    pd.DataFrame.from_dict(company_sales)
    .pipe(clean_names)
    .pipe(remove_empty)
)
df

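A fill function example: fill_direction(df, **kwargs)

fill_direction provides a chainable method for filling missing values in selected columns.

It is a wrapper around pd.Series.ffill and pd.Series.bfill; each column name is paired with one of up, down, updown, or downup.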

import pandas as pd
import janitor as jn

df = pd.DataFrame(
    {
        'col1': [1, 2, 3, 4],
        'col2': [None, 5, 6, 7],
        'col3': [8, 9, 10, None],
        'col4': [None, None, 11, None],
        'col5': [None, 12, 13, None]
    }
)
print(df)

df1 = df.fill_direction(
    col2='up',
    col3='down',
    col4='downup',
    col5='updown'
)
print(df1)
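To make the four directions concrete, the same fills can be written with pandas' own ffill and bfill. The mapping used here is an assumption based on the description above: 'down' as a forward fill, 'up' as a backward fill, and the two-word options as one fill followed by the other.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4],
        "col2": [None, 5, 6, 7],
        "col3": [8, 9, 10, None],
        "col4": [None, None, 11, None],
        "col5": [None, 12, 13, None],
    }
)

filled = df.assign(
    col2=df["col2"].bfill(),          # 'up': the next valid value flows upward
    col3=df["col3"].ffill(),          # 'down': the last valid value flows downward
    col4=df["col4"].ffill().bfill(),  # 'downup': fill down first, then up
    col5=df["col5"].bfill().ffill(),  # 'updown': fill up first, then down
)

print(filled)
```

In this sketch col4 ends up as 11 in every row, while col5 becomes 12, 12, 13, 13: the order of the two fills only matters for gaps that sit between two valid values, but both passes are needed to cover gaps at the start and the end of a column.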

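Adding new functionality

To add your own functionality, define a function that expresses the data processing or cleaning routine. The function should accept a DataFrame as its first argument and return a modified DataFrame.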

import pandas_flavor as pf


@pf.register_dataframe_method
def my_data_cleaning_function(df, arg1, arg2, ...):
    # Put data processing function here.
    return df
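As a concrete instance of that DataFrame-in, DataFrame-out contract, here is a small sketch. The function drop_sparse_columns and its max_missing parameter are hypothetical names invented for this illustration; the example calls the function through pandas' pipe so that it runs with pandas alone.

```python
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Return a copy of df without columns whose share of missing values exceeds max_missing."""
    missing_share = df.isna().mean()  # fraction of missing values per column
    keep = missing_share[missing_share <= max_missing].index
    return df.loc[:, keep].copy()


# Hypothetical data: column 'b' is 75% missing, column 'c' is 25% missing.
data = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [None, None, None, 1.0],
        "c": [1.0, None, 3.0, 4.0],
    }
)

result = data.pipe(drop_sparse_columns, max_missing=0.5)
print(result.columns.tolist())
```

Decorating the same function with @pf.register_dataframe_method, as in the template above, would additionally make it callable as data.drop_sparse_columns(max_missing=0.5). Returning a copy rather than modifying the input keeps the step safe to use in a method chain.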

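Pyjanitor simplifies and automates much of the data cleaning process, making it faster and more efficient. Adding this versatile package to your workflow can save time and let you put more of your effort into analyzing and interpreting the data.

Editor: 武曉燕. Source: 今日頭條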
