30 Essential Python Packages for Data Engineering

Author: deephub, 2022-07-30 23:27:36

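This article introduces some distinctive and handy Python packages that can help you build data workflows in many ways.

Python is arguably the easiest programming language to get started with. With foundational packages such as numpy and scipy, it is also arguably the best language today for data processing and machine learning. Thanks to experts and enthusiastic contributors, Python has a large community that drives its development and has produced all kinds of packages to help people who work with data.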

1. Knockknock

Knockknock is a simple Python package that notifies you when a machine-learning model finishes training or crashes. Notifications can be sent through several channels, such as email, Slack, and Microsoft Teams.

Install the package with the following command.

pip install knockknock

For example, the following code sends the training status of a model to the specified email addresses.

from knockknock import email_sender
from sklearn.linear_model import LinearRegression
import numpy as np

@email_sender(recipient_emails=["<your_email@address.com>", "<your_second_email@address.com>"], sender_email="<sender_email@gmail.com>")
def train_linear_model(your_nicest_parameters):
    x = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
    y = np.dot(x, np.array([1, 2])) + 3
    regression = LinearRegression().fit(x, y)
    return regression.score(x, y)

You are then notified when the function fails or completes.
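To make the mechanism concrete, here is a minimal sketch of a decorator that reports when a function finishes or crashes. It is not Knockknock's implementation: notify_on_finish and the send callback are hypothetical names, and messages are collected in a list instead of being emailed.

```python
import functools
import traceback


def notify_on_finish(send):
    """Wrap a function so `send` is called when it finishes or crashes.

    `send` is a hypothetical callback standing in for an email or Slack client.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception:
                # Report the crash, then re-raise so the caller still sees it.
                send(f"{func.__name__} crashed:\n{traceback.format_exc()}")
                raise
            send(f"{func.__name__} finished with result {result!r}")
            return result
        return wrapper
    return decorator


messages = []  # collected here instead of being emailed


@notify_on_finish(messages.append)
def train(scale):
    return 10 / scale


train(2)          # finishes normally
try:
    train(0)      # raises ZeroDivisionError
except ZeroDivisionError:
    pass

print(messages[0])
```

Re-raising after the notification keeps the wrapped function's behavior unchanged for its caller, which is usually what you want from a monitoring decorator.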

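2. tqdm

If you need a progress bar while iterating or looping, tqdm is the package you want. It provides a simple progress meter in your notebook or command prompt.

Let's start by installing the package.

pip install tqdm

The following code then displays a progress bar while the loop runs.

from tqdm import tqdm

q = 0
for i in tqdm(range(10000000)):
    q = i + 1

It shows a neat progress bar in the notebook, which is very useful when you have a complex iteration and want to track its progress.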

3. Pandas-log

Pandas-log gives feedback on basic pandas operations such as .query, .drop, and .merge. It is modeled on R's Tidyverse, and you can use it to understand every step of a data analysis.

Install the package.

pip install pandas-log

After installing it, look at the following example.

import pandas as pd
import numpy as np
import pandas_log

df = pd.DataFrame({
    "name": ['Alfred', 'Batman', 'Catwoman'],
    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
    "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT],
})

Then let's log a simple pandas operation with the code below.

with pandas_log.enable():
    res = (df.drop("born", axis = 1).groupby('name'))

With pandas-log we get information about every step that was executed.

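4. Emoji

As the name suggests, Emoji is a Python package for working with emoji in text. Emoji are usually awkward to handle in Python, but the Emoji package helps with the conversion.

Install the Emoji package with the following command.

pip install emoji

Look at the following code:

import emoji

print(emoji.emojize('Python is :thumbs_up:'))

With this package, printing emoji is easy.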

5. TheFuzz

TheFuzz matches text using the Levenshtein distance to compute a similarity score.

pip install thefuzz

The following code shows how to use TheFuzz for similarity-based text matching.

from thefuzz import fuzz, process

# Testing the score between two sentences
fuzz.ratio("Test the word", "test the Word!")

TheFuzz can also extract similarity scores against several candidates at once.

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
process.extract("new york jets", choices, limit=2)

TheFuzz works for any kind of text-similarity detection, a task that matters a great deal in NLP.
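For readers unfamiliar with the Levenshtein distance, the following sketch computes it with the standard dynamic-programming recurrence and scales it to a 0-100 score. The similarity function and its normalization are illustrative assumptions; TheFuzz's own scoring may differ in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to a 0-100 score (an illustrative normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"))
print(round(similarity("new york jets", "new york giants"), 1))
```

Keeping only the previous row of the table brings memory use down to the length of the second string, which matters when comparing long texts.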

6. Numerizer

Numerizer converts numbers written out in words into the corresponding integers or floats.

pip install numerizer

Then let's try converting a few inputs.

from numerizer import numerize

numerize('forty two')

It also works with a different writing style.

numerize('forty-two')

numerize('nine and three quarters')

Any part of the input that is not a numeric expression is left unchanged:

numerize('maybe around nine and three quarters')

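7. PyAutoGUI

PyAutoGUI controls the mouse and keyboard automatically.

pip install pyautogui

We can then test it with the following code.

import pyautogui

pyautogui.moveTo(10, 15)
pyautogui.click()
pyautogui.doubleClick()
pyautogui.press('enter')

The code above moves the mouse to a position, clicks, double-clicks, and presses Enter. This is very useful for repetitive actions such as downloading files or collecting data.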

8. Weightedcalcs

Weightedcalcs is used for weighted statistical calculations, ranging from simple statistics (weighted mean, median, and standard deviation) to weighted counts and distributions.

pip install weightedcalcs

Compute a weighted distribution using an available dataset.

import seaborn as sns
import weightedcalcs as wc

df = sns.load_dataset('mpg')
calc = wc.Calculator("mpg")

Then we run the weighted calculation by passing in the dataset and the variable of interest.

calc.distribution(df, "origin")
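The arithmetic behind a weighted mean and a weighted distribution is short enough to write by hand. The sketch below uses made-up rows and hypothetical function names; it illustrates the idea and is not Weightedcalcs' code.

```python
from collections import defaultdict


def weighted_mean(values, weights):
    """Sum of value * weight divided by the total weight."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


def weighted_distribution(categories, weights):
    """Share of the total weight that falls in each category."""
    totals = defaultdict(float)
    for category, weight in zip(categories, weights):
        totals[category] += weight
    grand_total = sum(totals.values())
    return {category: t / grand_total for category, t in totals.items()}


# Made-up rows: one category and one weight per row
origins = ["usa", "usa", "japan", "europe"]
weights = [1.0, 3.0, 2.0, 2.0]

print(weighted_mean([10, 20, 30, 40], weights))
print(weighted_distribution(origins, weights))
```

In the article's example the weight column is "mpg", so each car contributes to its origin's share in proportion to its fuel economy rather than counting once.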

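9. scikit-posthocs

scikit-posthocs is a Python package for post hoc test analysis, typically used for pairwise comparisons in statistical analysis. It provides a simple, scikit-learn-like API.

pip install scikit-posthocs

Let's start with a simple dataset and run an ANOVA test.

import statsmodels.api as sa
import statsmodels.formula.api as sfa
import scikit_posthocs as sp

df = sa.datasets.get_rdataset('iris').data
# regex=False removes literal dots; as a regex, '.' would match every character
df.columns = df.columns.str.replace('.', '', regex=False)
lm = sfa.ols('SepalWidth ~ C(Species)', data=df).fit()
anova = sa.stats.anova_lm(lm)
print(anova)

The ANOVA result tells us that the groups differ, but not which classes differ from one another. The following code runs the pairwise comparison.

sp.posthoc_ttest(df, val_col='SepalWidth', group_col='Species', p_adjust='holm')

With scikit-posthocs, the pairwise post hoc analysis is simplified and we obtain the p-values.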
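The p_adjust='holm' argument refers to the Holm step-down correction for multiple comparisons. As a rough sketch of what that correction does (the function name and the sample p-values are made up, and this is not the library's code):

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[index])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted


print(holm_adjust([0.01, 0.04, 0.03]))
```

The smallest p-value gets the full Bonferroni multiplier and each later one a smaller multiplier, which makes Holm less conservative than plain Bonferroni while still controlling the family-wise error rate.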

10. Cerberus

Cerberus is a lightweight Python package for data validation.

pip install cerberus

The basic use of Cerberus is to validate the structure of a document (a dictionary) against a schema.

from cerberus import Validator

schema = {'name': {'type': 'string'}, 'gender': {'type': 'string'}, 'age': {'type': 'integer'}}
v = Validator(schema)

Once the schema is defined, we can validate a document against it.

document = {'name': 'john doe', 'gender': 'male', 'age': 15}
v.validate(document)

If the document matches, validate returns True. This lets us make sure the data structure is correct.
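To show what type-based schema validation involves, here is a small hand-written validator for the same kind of schema. It is a simplified sketch with hypothetical names (TYPE_NAMES, validate) and made-up error messages, covering only the 'string' and 'integer' types; Cerberus itself supports many more rules.

```python
TYPE_NAMES = {"string": str, "integer": int}


def validate(document, schema):
    """Return a dict of field -> error message; empty means the document is valid."""
    errors = {}
    for field, value in document.items():
        if field not in schema:
            errors[field] = "unknown field"
            continue
        expected = TYPE_NAMES[schema[field]["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            errors[field] = f"must be of {schema[field]['type']} type"
    return errors


schema = {"name": {"type": "string"}, "gender": {"type": "string"}, "age": {"type": "integer"}}

print(validate({"name": "john doe", "gender": "male", "age": 15}, schema))
print(validate({"name": "john doe", "age": "15"}, schema))
```

Returning the collected errors instead of a bare boolean makes it easier to report every problem in a document at once.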

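11. ppscore

ppscore computes the predictive power of a variable with respect to a target variable. The score can detect both linear and non-linear relationships between two variables and ranges from 0 (no predictive power) to 1 (perfect predictive power).

pip install ppscore

Use ppscore to compute scores against a target.

import seaborn as sns
import ppscore as pps

df = sns.load_dataset('mpg')
pps.predictors(df, 'mpg')

The results are sorted; the lower a variable ranks, the lower its predictive power for the target.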

12. Maya

Maya is used to parse DateTime data as easily as possible.

pip install maya

We can then easily get the current date with the following code.

import maya

now = maya.now()
print(now)

It can also work out tomorrow's date.

tomorrow = maya.when('tomorrow')
tomorrow.datetime()

13. Pendulum

Pendulum is another Python package for DateTime data. It is used to simplify any DateTime analysis.

pip install pendulum

We can perform all kinds of operations on times.

import pendulum

now = pendulum.now("Europe/Berlin")
now.in_timezone("Asia/Tokyo")
now.to_iso8601_string()
now.add(days=2)

14. category_encoders

category_encoders is a Python package for encoding categorical data (converting it to numeric data). It is a collection of encoding methods that can be applied to all kinds of categorical data as needed.

pip install category_encoders

The transformation can be applied as in the following example.

from category_encoders import BinaryEncoder
import pandas as pd

# df is the seaborn 'mpg' dataset loaded in the earlier examples
enc = BinaryEncoder(cols=['origin']).fit(df)
numeric_dataset = enc.transform(df)
numeric_dataset.head()
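Binary encoding can be pictured as two steps: give each category an integer, then spread that integer's binary digits over several columns. The sketch below follows that idea with a hypothetical binary_encode function and made-up data; the library's exact column layout and ordinal assignment may differ.

```python
def binary_encode(values):
    """Map each distinct category to an integer, then split it into binary digits."""
    # Assign ordinals in order of first appearance, starting at 1.
    ordinals = {}
    for value in values:
        ordinals.setdefault(value, len(ordinals) + 1)
    width = max(ordinals.values()).bit_length()  # number of columns needed
    rows = []
    for value in values:
        bits = format(ordinals[value], f"0{width}b")
        rows.append([int(bit) for bit in bits])
    return rows


origins = ["usa", "japan", "europe", "usa"]
for origin, row in zip(origins, binary_encode(origins)):
    print(origin, row)
```

Compared with one-hot encoding, which needs one column per category, this needs only about log2 of the number of categories, at the cost of columns that are harder to interpret individually.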

15. scikit-multilearn

scikit-multilearn is a machine-learning package dedicated to multi-label classification. It provides an API for training models on datasets in which each sample can carry several target labels at once.

pip install scikit-multilearn

Use a sample dataset to train a multi-label KNN classifier and measure its performance.

from skmultilearn.dataset import load_dataset
from skmultilearn.adapt import MLkNN
import sklearn.metrics as metrics

X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')
X_test, y_test, _, _ = load_dataset('emotions', 'test')

classifier = MLkNN(k=3)
prediction = classifier.fit(X_train, y_train).predict(X_test)
metrics.hamming_loss(y_test, prediction)
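The Hamming loss used above is the fraction of individual label slots that are predicted wrongly. A minimal pure-Python sketch on made-up dense label matrices (not scikit-learn's implementation, which also handles sparse input):

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            wrong += t != p
            total += 1
    return wrong / total


# Two samples with three labels each (made-up values)
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 1, 1]]
print(hamming_loss(y_true, y_pred))
```

Unlike exact-match accuracy, a prediction that gets most labels of a sample right is only penalized for the labels it misses, so lower values are better and 0 means a perfect prediction.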

16. Multiset

Multiset is similar to the built-in set type, but it allows the same element to occur multiple times.

pip install multiset

The Multiset class can be used as follows.

from multiset import Multiset

set1 = Multiset('aab')
set1
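For comparison, the standard library's collections.Counter already behaves like a basic multiset. The sketch below is offered as an analogy, not as a drop-in replacement for the Multiset class.

```python
from collections import Counter

bag = Counter("aab")
other = Counter("abc")

print(bag)          # counts per element
print(bag["a"])     # multiplicity of 'a'
print(bag + other)  # sum: multiplicities add
print(bag & other)  # intersection: minimum multiplicity
print(bag | other)  # union: maximum multiplicity
```

If these operations are all you need, Counter avoids an extra dependency; a dedicated multiset type is more useful when you want a set-like interface throughout.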

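17. Jazzit

Jazzit plays music when our code fails or while we wait for it to run.

pip install jazzit

Try the sample music on an error with the following code.

from jazzit import error_track

@error_track("curb_your_enthusiasm.mp3", wait=5)
def run():
    # raises ZeroDivisionError when num reaches 0
    for num in reversed(range(10)):
        print(10/num)

The package is not particularly useful, but it is fun.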

18. handcalcs

handcalcs simplifies writing out mathematical formulas in a notebook. It renders calculations in their equation form.

pip install handcalcs

Test handcalcs with the following code. The %%render magic command renders the cell as LaTeX.

import handcalcs.render
from math import sqrt

%%render
a = 4
b = 6
c = sqrt(3*a + b/7)

19. NeatText

NeatText simplifies text cleaning and preprocessing. It is useful for any NLP project or machine-learning project that works with text.

pip install neattext

Create some test data with the code below.

import neattext as nt

mytext = "This is the word sample but ,our WEBSITE is https://exaempleeele.com ?."
docx = nt.TextFrame(text=mytext)

TextFrame creates a NeatText object, after which various functions can be used to inspect and clean the data.

docx.describe()

The describe function displays statistics about the text. To clean the data further, use the following code.

docx.normalize()

20. Combo

Combo is a Python package for combining machine-learning models and scores. It provides a toolbox for combining several trained models into one; in other words, it builds ensembles.

pip install combo

We use the breast cancer dataset and several classifiers from scikit-learn to build the combination.

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from combo.models.classifier_stacking import Stacking
from combo.utils.data import evaluate_print

Next, look at the individual classifiers used to predict the target.

# Define data file and read X and y
random_state = 42
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=random_state)

# initialize a group of clfs
classifiers = [DecisionTreeClassifier(random_state=random_state),
               LogisticRegression(random_state=random_state),
               KNeighborsClassifier(),
               RandomForestClassifier(random_state=random_state),
               GradientBoostingClassifier(random_state=random_state)]
clf_names = ['DT', 'LR', 'KNN', 'RF', 'GBDT']

for i, clf in enumerate(classifiers):
    clf.fit(X_train, y_train)
    y_test_predict = clf.predict(X_test)
    evaluate_print(clf_names[i] + ' | ', y_test, y_test_predict)
print()

Then use the Stacking model from the Combo package.

clf = Stacking(classifiers, n_folds=4, shuffle_data=False,
               keep_original=True, use_proba=False,
               random_state=random_state)
clf.fit(X_train, y_train)
y_test_predict = clf.predict(X_test)
evaluate_print('Stacking | ', y_test, y_test_predict)

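21. PyAztro

Do you need horoscope data, or are you just curious about your luck today? PyAztro provides this information, including lucky numbers, lucky signs, moods, and more. It could serve as the base data for an AI fortune-teller.

pip install pyaztro

Access today's horoscope with the following code.

import pyaztro

pyaztro.Aztro(sign='gemini').description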

22. Faker

Faker simplifies generating synthetic data. Many developers use it to create test data.

pip install Faker

To generate synthetic data with Faker, first create a Faker object.

from faker import Faker

fake = Faker()

Generate a name.

fake.name()

Each call to .name() on the Faker object generates a new random value.

23. Fairlearn

Fairlearn is used to assess and mitigate unfairness in machine-learning models. It provides many of the APIs needed to inspect bias.

pip install fairlearn

We can then use a dataset bundled with Fairlearn to see how much bias there is.

from fairlearn.metrics import MetricFrame, selection_rate
from fairlearn.datasets import fetch_adult

data = fetch_adult(as_frame=True)
X = data.data
y_true = (data.target == '>50K') * 1
sex = X['sex']

selection_rates = MetricFrame(metrics=selection_rate,
                              y_true=y_true,
                              y_pred=y_true,
                              sensitive_features=sex)

fig = selection_rates.by_group.plot.bar(legend=False, rot=0,
                                        title='Fraction earning over $50,000')

Fairlearn's selection_rate function gives the fraction of positive predictions. Computed per group, it exposes differences between groups and therefore bias in the results.
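The per-group selection rate is simple to compute by hand, which helps when reading the chart. The following sketch uses made-up predictions and a hypothetical selection_rate_by_group helper; it only illustrates the metric and is not Fairlearn's code.

```python
from collections import defaultdict


def selection_rate_by_group(y_pred, groups):
    """Fraction of positive (1) predictions within each group."""
    positives = defaultdict(int)
    counts = defaultdict(int)
    for prediction, group in zip(y_pred, groups):
        positives[group] += prediction
        counts[group] += 1
    return {group: positives[group] / counts[group] for group in counts}


# Made-up predictions and a made-up sensitive feature
y_pred = [1, 0, 1, 1, 0, 0]
sex = ["Male", "Male", "Male", "Female", "Female", "Female"]

rates = selection_rate_by_group(y_pred, sex)
print(rates)
print(max(rates.values()) - min(rates.values()))  # gap between groups
```

A large gap between the highest and lowest rate is one common signal that a model treats groups differently, although it does not by itself say why.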

24. tiobeindexpy

tiobeindexpy retrieves TIOBE index data. The TIOBE index ranks programming languages, which matters to developers who do not want to miss the next big thing in programming.

pip install tiobeindexpy

The following code gets this month's top 20 programming languages.

from tiobeindexpy import tiobeindexpy as tb

df = tb.top_20()

25. pytrends

pytrends retrieves keyword trend data from Google Trends. It is useful when we want to know about current web trends or trends related to our keywords. It needs access to Google, so it will not work where Google is unreachable.

pip install pytrends

Suppose I want to know the current trends related to the keyword "Present Gift":

from pytrends.request import TrendReq
import pandas as pd

pytrend = TrendReq()
keywords = pytrend.suggestions(keyword='Present Gift')
df = pd.DataFrame(keywords)
df

The package returns the top 5 trends related to the keyword.

26. visions

visions is a Python package for semantic data analysis. It can detect data types and infer what the type of a column should be.

pip install visions

The following code detects the data types of the columns in a dataset. Here we use the Titanic dataset from seaborn.

import seaborn as sns
from visions.functional import detect_type, infer_type
from visions.typesets import CompleteSet

df = sns.load_dataset('titanic')
typeset = CompleteSet()

print(detect_type(df, typeset))

27. Schedule

Schedule adds job scheduling to any code.

pip install schedule

For example, to run a job every 10 seconds:

import schedule
import time

def job():
    print("I'm working...")

schedule.every(10).seconds.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)

28. autocorrect

autocorrect is a Python package for spelling correction that supports multiple languages. It is simple to use and very helpful during data cleaning.

pip install autocorrect

Autocorrection can be done with code like the following.

from autocorrect import Speller

spell = Speller()
spell("I'm not sleaspy and tehre is no place I'm giong to.")

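29. funcy

funcy contains a collection of handy utility functions for everyday data analysis. There are too many to show them all here; see its documentation if you are interested.

pip install funcy

Here is just one example, which selects the even numbers from an iterable.

from funcy import select, even

select(even, {i for i in range(20)})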
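As a point of reference, the same selection can be written with the standard library alone. The select_like helper below is a hypothetical stand-in that filters a collection and returns the same collection type; it is a sketch of the idea, not funcy's implementation.

```python
def is_even(n):
    return n % 2 == 0


def select_like(pred, coll):
    """Keep the items of `coll` that satisfy `pred`, returning the same collection type."""
    return type(coll)(item for item in coll if pred(item))


numbers = {i for i in range(20)}
evens = select_like(is_even, numbers)

print(sorted(evens))
print(select_like(is_even, [1, 2, 3, 4]))
```

This only works for types whose constructor accepts an iterable, such as list, set, and tuple.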

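30. IceCream

IceCream makes debugging easier. It produces more detailed output when printing or logging.

pip install icecream

It can be used as follows.

from icecream import ic

def some_function(i):
    i = 4 + (1 * 2)/ 10
    return i + 35

ic(some_function(121))

It can also be used to inspect which parts of a function are executed.

def foo():
    ic()
    if some_function(12):
        ic()
    else:
        ic()

foo()

The level of detail in the output is well suited to analysis.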

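Summary

This article covered 30 distinctive Python packages that are useful in data work. Most of them are simple and straightforward to use, but some have more features and call for further reading of their documentation. If you are interested, search for the package on PyPI and check its home page and documentation. I hope this article helps.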

Editor: Hua Xuan. Source: Toutiao
