
I Discovered the Secret to Writing Clean Code in Python!

Writing clean code is not just good programming practice; it is key to keeping code maintainable and extensible. Whether in development or in production, code quality matters.

As data scientists, we often use Jupyter Notebooks for data exploration and model development. At that stage, the focus is on quickly validating ideas and proving concepts. Once a model is ready, however, it needs to be deployed to production, and that is when code quality becomes critical.

Production code must be robust, readable, and easy to maintain. Unfortunately, the prototype code data scientists write rarely meets these requirements. As a machine learning engineer, my job is to make sure code moves smoothly from proof of concept to production.

Writing clean code is therefore essential for improving development efficiency and reducing maintenance costs. In this article, I will share Python tips and best practices, along with concise code examples, to show you how to improve the readability and maintainability of your code.

I sincerely hope this article offers Python enthusiasts something valuable, and in particular that it encourages more data scientists to care about code quality, because high-quality code not only benefits development but also gets models successfully into production.

Meaningful Names

Many developers do not follow the best practice of giving variables and functions meaningful names, and the readability and maintainability of their code suffer greatly as a result.

Naming is crucial to code quality. A good name intuitively expresses what the code does, avoids the need for excessive comments and explanations, and keeps the code clean. A descriptive name alone can make a function's purpose obvious at a glance.

Take the common task of loading a dataset and splitting it into training and test sets: with meaningful function names such as load_dataset() and split_into_train_test(), the purpose of each function is immediately clear, with no need to consult comments.

Readable code not only helps other developers understand it faster; it also pays off when you maintain it yourself later. So we should build good naming habits and write clean, straightforward code.

Consider a typical machine learning example: loading a dataset and splitting it into training and test sets.

import pandas as pd
from sklearn.model_selection import train_test_split

def load_and_split(d):
    df = pd.read_csv(d)
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]
    X_train, X_test, y_train, y_test = train_test_split(X, 
                                                        y, 
                                                        test_size=0.2, 
                                                        random_state=42)
    return X_train, X_test, y_train, y_test

Most people in data science are familiar with its concepts and terminology, such as X and y. But for someone new to the field, is d a good name for the path to a CSV file? And is naming the features X and the target y good practice? Perhaps a more meaningful version will make this clearer:

import pandas as pd
from sklearn.model_selection import train_test_split

def load_data_and_split_into_train_test(dataset_path):
    data_frame = pd.read_csv(dataset_path)
    features = data_frame.iloc[:, :-1]
    target = data_frame.iloc[:, -1]
    features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=42)
    return features_train, features_test, target_train, target_test

This is much easier to understand. Even without prior experience with pandas or train_test_split, it is now clear that this function loads data from a CSV file (at the path stored in dataset_path), extracts the features and the target from the data frame, and then computes the train and test splits of both.

These changes make the code easier to read and understand, especially for people who may not be familiar with the conventions of machine learning code, where features are usually called X and the target y.

But do not overdo the naming either; overly long names add no extra information.

Consider another snippet:

import pandas as pd
from sklearn.model_selection import train_test_split

def load_data_from_csv_and_split_into_training_and_testing_sets(dataset_path_csv):
    data_frame_from_csv = pd.read_csv(dataset_path_csv)
    features_columns_data_frame = data_frame_from_csv.iloc[:, :-1]
    target_column_data_frame = data_frame_from_csv.iloc[:, -1]
    features_columns_data_frame_for_training, features_columns_data_frame_for_testing, target_column_data_frame_for_training, target_column_data_frame_for_testing = train_test_split(features_columns_data_frame, target_column_data_frame, test_size=0.2, random_state=42)
    return features_columns_data_frame_for_training, features_columns_data_frame_for_testing, target_column_data_frame_for_training, target_column_data_frame_for_testing

This version feels like information overload, yet it conveys nothing extra and only distracts the reader. So give functions meaningful names that strike a balance between descriptiveness and brevity. Whether the name needs to state that the dataset is loaded from a CSV file depends on the context and the actual requirements of the code.

Functions

Functions should be sized and scoped appropriately. Keep them short, ideally no longer than 20 lines, and extract large blocks into new functions. More importantly, a function should do one thing, not several. If another task needs doing, it belongs in another function. For example:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_clean_feature_engineer_and_split(data_path):
    # Load data
    df = pd.read_csv(data_path)
    
    # Clean data
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    
    # Feature engineering
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    
    # Data preprocessing
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    
    # Split data
    features = df.drop('Survived', axis=1)
    target = df['Survived']
    features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=42)
    return features_train, features_test, target_train, target_test

Did you notice where the rules above are violated?

Although the function is not long, it clearly violates the rule that a function should do only one thing. The comments also hint that each block could live in its own function, because a block should not need a comment of its own in the first place (more on this in the next section).

A refactored version:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_data(data_path):
    return pd.read_csv(data_path)

def clean_data(df):
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df

def feature_engineering(df):
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df

def preprocess_features(df):
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df

def split_data(df, target_name='Survived'):
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)

if __name__ == "__main__":
  data_path = 'data.csv'
  df = load_data(data_path)
  df = clean_data(df)
  df = feature_engineering(df)
  df = preprocess_features(df)
  X_train, X_test, y_train, y_test = split_data(df)

In this refactored snippet, each function does exactly one thing, which makes the code much easier to read. Testing becomes easier too, because each function can be tested independently of the others.
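
For example, here is a minimal sketch of how clean_data could now be tested in isolation with pytest, assuming the functions above live in a module named train:

import pandas as pd

from train import clean_data  # assuming the functions above live in train.py


def test_clean_data_removes_missing_and_non_positive_ages():
    raw = pd.DataFrame({"Age": [25, -1, None], "Fare": [7.5, 8.0, 9.0]})

    cleaned = clean_data(raw)

    # Only the row with a valid, positive age should survive.
    assert list(cleaned["Age"]) == [25]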

Even comments are no longer needed, because the function names now read like comments themselves.

Comments

Comments are sometimes useful, but sometimes they are just a sign of bad code.

The proper use of comments is to compensate for our failure to express ourselves in code.

Whenever you feel the need to add a comment, ask whether it is really necessary, or whether the code could be moved into a new function whose name makes clear what is going on, so that the comment becomes unnecessary.

Let's revisit the code example from the Functions section:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_clean_feature_engineer_and_split(data_path):
    # Load data
    df = pd.read_csv(data_path)
    
    # Clean data
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    
    # Feature engineering
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    
    # Data preprocessing
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    
    # Split data
    features = df.drop('Survived', axis=1)
    target = df['Survived']
    features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=42)
    return features_train, features_test, target_train, target_test

The comments here describe what each block does, but in fact they are just an indicator of bad code. Following the advice from the previous section, putting these blocks into separate functions with descriptive names improves readability and removes the need for comments.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_data(data_path):
    return pd.read_csv(data_path)

def clean_data(df):
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df

def feature_engineering(df):
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df

def preprocess_features(df):
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df

def split_data(df, target_name='Survived'):
    features = df.drop(target_name, axis=1)
    target = df['Survived']
    return train_test_split(features, target, test_size=0.2, random_state=42)

if __name__ == "__main__":
  data_path = 'data.csv'
  df = load_data(data_path)
  df = clean_data(df)
  df = feature_engineering(df)
  df = preprocess_features(df)
  X_train, X_test, y_train, y_test = split_data(df)

The code now reads like a coherent story; no comments are needed to understand what happens. But one final piece is missing: docstrings. Docstrings are a Python standard intended to make code readable and understandable. Every function in production code should contain a docstring describing its intent, its input parameters, and its return values. These docstrings can be consumed directly by tools such as Sphinx, whose purpose is to generate documentation for the code.

Adding docstrings to the snippet above:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_data(data_path):
    """
    Load data from a CSV file into a pandas DataFrame.
    
    Args:
      data_path (str): The file path to the dataset.
    
    Returns:
      DataFrame: The loaded dataset.
    """
    return pd.read_csv(data_path)

def clean_data(df):
    """
    Clean the DataFrame by removing rows with missing values and 
    filtering out non-positive ages.
    
    Args:
      df (DataFrame): The input dataset.
    
    Returns:
      DataFrame: The cleaned dataset.
    """
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df

def feature_engineering(df):
    """
    Perform feature engineering on the DataFrame, including age 
    grouping and adult identification.
    
    Args:
      df (DataFrame): The input dataset.
    
    Returns:
      DataFrame: The dataset with new features added.
    """
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df

def preprocess_features(df):
    """
    Preprocess features by standardizing the 'Age' and 'Fare' 
    columns using StandardScaler.
    
    Args:
      df (DataFrame): The input dataset.
    
    Returns:
      DataFrame: The dataset with standardized features.
    """
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df

def split_data(df, target_name='Survived'):
    """
    Split the dataset into training and testing sets.
    
    Args:
      df (DataFrame): The input dataset.
      target_name (str): The name of the target variable column.
    
    Returns:
      tuple: Contains the training features, testing features, 
             training target, and testing target datasets.
    """
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)

if __name__ == "__main__":
    data_path = 'data.csv'
    df = load_data(data_path)
    df = clean_data(df)
    df = feature_engineering(df)
    df = preprocess_features(df)
    X_train, X_test, y_train, y_test = split_data(df)

IDEs such as VSCode usually offer docstring extensions that automatically insert a docstring skeleton when you open a multi-line string under a function definition.

This helps you quickly get the format you have chosen right.

Formatting

Formatting is a crucial concept.

Code is read far more often than it is written. Don't make people read badly formatted, hard-to-follow code.

Python has the PEP 8 style guide[1], which can be used to improve code readability.

The style guide includes important rules such as the following (a short sketch after the list illustrates them):

  • Use four spaces per indentation level
  • Limit lines to a maximum of 79 characters
  • Avoid extraneous whitespace in certain situations (for example immediately inside parentheses, or between a comma and a closing parenthesis)
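
A small before-and-after sketch of these whitespace rules:

# Not PEP 8 compliant: two-space indentation, stray whitespace
# inside the brackets and before the colon.
def scale( values ) :
  return [ v * 2 for v in values ]


# PEP 8 compliant: four-space indentation, no unnecessary whitespace.
def scale(values):
    return [v * 2 for v in values]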

But remember that formatting rules exist to improve readability. Sometimes following a rule to the letter makes no sense and would actually hurt readability; in those cases it is fine to ignore it.

Other important formatting rules mentioned in the book Clean Code include (see the sketch after the list):

  • Keep files a reasonable size (roughly 200 to 500 lines) to encourage better understanding
  • Use blank lines to separate different concepts (for example, between the block that initializes an ML model and the block that runs training)
  • Define caller functions above the callees they use, to create a natural top-down reading flow
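
Here is a minimal sketch of the last two rules, with a blank line separating the two conceptual steps inside the caller, and the caller defined above its callees (the function names are illustrative):

def run_training(df):
    """Caller: defined first, so the file reads top-down like a story."""
    model = initialize_model()

    return fit_model(model, df)


def initialize_model():
    """Callee: defined below the function that uses it."""


def fit_model(model, df):
    """Callee: defined below the function that uses it."""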

So agree with your team on the rules to follow, and stick to them! You can lean on IDE extensions to support compliance; VSCode, for example, offers several. You can also use Python packages such as Pylint[2] and autopep8[3] to format your Python scripts: Pylint is a static code analyzer that automatically scores your code, while autopep8 automatically reformats code to conform to PEP 8.

Let's use the earlier snippet to explore this further.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_data(data_path):
    return pd.read_csv(data_path)

def clean_data(df):
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df

def feature_engineering(df):
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df

def preprocess_features(df):
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df

def split_data(df, target_name='Survived'):
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)

if __name__ == "__main__":
  data_path = 'data.csv'
  df = load_data(data_path)
  df = clean_data(df)
  df = feature_engineering(df)
  df = preprocess_features(df)
  X_train, X_test, y_train, y_test = split_data(df)

Save it to a file named train.py and run Pylint to check the score for this snippet:

pylint train.py

The output:

************* Module train
train.py:29:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:30:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:31:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:32:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:33:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:34:0: C0304: Final newline missing (missing-final-newline)
train.py:34:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:1:0: C0114: Missing module docstring (missing-module-docstring)
train.py:5:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:5:14: W0621: Redefining name 'data_path' from outer scope (line 29) (redefined-outer-name)
train.py:8:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:8:15: W0621: Redefining name 'df' from outer scope (line 30) (redefined-outer-name)
train.py:13:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:13:24: W0621: Redefining name 'df' from outer scope (line 30) (redefined-outer-name)
train.py:18:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:18:24: W0621: Redefining name 'df' from outer scope (line 30) (redefined-outer-name)
train.py:23:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:23:15: W0621: Redefining name 'df' from outer scope (line 30) (redefined-outer-name)
train.py:29:2: C0103: Constant name "data_path" doesn't conform to UPPER_CASE naming style (invalid-name)

------------------------------------------------------------------
Your code has been rated at 3.21/10

Only 3.21 points out of 10.

You can either fix these issues by hand and rerun Pylint, or use the autopep8 package to resolve some of them automatically. Let's take the second route.

autopep8 --in-place --aggressive --aggressive train.py

The train.py script now looks like this:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_data(data_path):
    return pd.read_csv(data_path)


def clean_data(df):
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df


def feature_engineering(df):
    df['AgeGroup'] = pd.cut(
        df['Age'], bins=[
            0, 18, 65, 99], labels=[
            'child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df


def preprocess_features(df):
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df


def split_data(df, target_name='Survived'):
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)


if __name__ == "__main__":
    data_path = 'data.csv'
    df = load_data(data_path)
    df = clean_data(df)
    df = feature_engineering(df)
    df = preprocess_features(df)
    X_train, X_test, y_train, y_test = split_data(df)

Running Pylint again yields 5.71 out of 10, mostly because the functions are still missing docstrings:

************* Module train
train.py:1:0: C0114: Missing module docstring (missing-module-docstring)
train.py:6:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:6:14: W0621: Redefining name 'data_path' from outer scope (line 38) (redefined-outer-name)
train.py:10:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:10:15: W0621: Redefining name 'df' from outer scope (line 39) (redefined-outer-name)
train.py:16:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:16:24: W0621: Redefining name 'df' from outer scope (line 39) (redefined-outer-name)
train.py:25:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:25:24: W0621: Redefining name 'df' from outer scope (line 39) (redefined-outer-name)
train.py:31:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:31:15: W0621: Redefining name 'df' from outer scope (line 39) (redefined-outer-name)
train.py:38:4: C0103: Constant name "data_path" doesn't conform to UPPER_CASE naming style (invalid-name)

------------------------------------------------------------------
Your code has been rated at 5.71/10 (previous run: 3.21/10, +2.50)

Now I have added the docstrings and fixed the remaining issues.

The final code looks like this:

"""
This script aims at providing an end-to-end training pipeline.

Author: Patrick 

Date: 2/14/2024
"""

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_data(data_path):
    """
    Load dataset from a specified CSV file into a pandas DataFrame.

    Args:
        data_path (str): The file path to the dataset.

    Returns:
        DataFrame: The loaded dataset.
    """
    return pd.read_csv(data_path)


def clean_data(df):
    """
    Clean the input DataFrame by removing rows with 
    missing values and filtering out entries with non-positive ages.

    Args:
        df (DataFrame): The input dataset.

    Returns:
       DataFrame: The cleaned dataset.
    """
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df


def feature_engineering(df):
    """
    Perform feature engineering on the DataFrame, 
    including creating age groups and determining if the individual is an adult.

    Args:
        df (DataFrame): The input dataset.

    Returns:
        DataFrame: The dataset with new features added.
    """
    df['AgeGroup'] = pd.cut(
        df['Age'], bins=[
            0, 18, 65, 99], labels=[
            'child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df


def preprocess_features(df):
    """
    Preprocess the 'Age' and 'Fare' features of the 
    DataFrame using StandardScaler to standardize the features.

    Args:
        df (DataFrame): The input dataset.

    Returns:
        DataFrame: The dataset with standardized features.
    """
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df


def split_data(df, target_name='Survived'):
    """
    Split the DataFrame into training and testing sets.

    Args:
        df (DataFrame): The dataset to split.
        target_name (str, optional): The name of the target variable column. Defaults to 'Survived'.

    Returns:
        tuple: The training and testing features and target datasets.
    """
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)


if __name__ == "__main__":
    data = load_data("data.csv")
    data = clean_data(data)
    data = feature_engineering(data)
    data = preprocess_features(data)
    X_train, X_test, y_train, y_test = split_data(data)

Running Pylint now returns a 10:

pylint train.py

-------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 7.50/10, +2.50)

This highlights how powerful Pylint is: it helps you streamline your code and conform to PEP 8 quickly.

Error Handling

Error handling is another key concept. It ensures that your code neither crashes nor produces wrong results when it encounters unexpected situations.

Suppose, for example, that you have deployed a model behind an API and users can send data to it. A user might send malformed data, and if your application just crashes, it leaves a bad impression, and the user may blame your application for being poorly built.

It would be far better if the user received a clear error code and a message that points out exactly what went wrong on their side. That is precisely what exceptions in Python are for.

For example, a user might upload a CSV file to your application, which loads it into a pandas data frame and passes the data to the model for prediction. You would then have a function like this:

import pandas as pd

def load_data(data_path):
    """
    Load dataset from a specified CSV file into a pandas DataFrame.

    Args:
        data_path (str): The file path to the dataset.

    Returns:
        DataFrame: The loaded dataset.
    """
    return pd.read_csv(data_path)

So far, so good. But what happens if the user does not provide a CSV file?

你的程序?qū)⒈罎ⅲ⒊霈F(xiàn)以下錯(cuò)誤信息:

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

Behind an API, this would surface to the user as a bare HTTP 500 response saying "internal server error". The user might blame your application for it, because they have no way of telling that the error was their own fault. A better approach is to add a try-except block and catch the FileNotFoundError to handle this case properly.

import pandas as pd
import logging

def load_data(data_path):
    """
    Load dataset from a specified CSV file into a pandas DataFrame.

    Args:
        data_path (str): The file path to the dataset.

    Returns:
        DataFrame: The loaded dataset.
    """
    try:
        return pd.read_csv(data_path)
    except FileNotFoundError:
        logging.error("The file at path %s does not exist. Please ensure that you have uploaded the file properly.", data_path)

So far we only log the error message. Better still is to define a custom exception that the API layer can then handle in order to return a specific error code to the user.

import pandas as pd
import logging

class DataLoadError(Exception):
    """Exception raised when the data cannot be loaded."""
    def __init__(self, message="Data could not be loaded"):
        self.message = message
        super().__init__(self.message)

def load_data(data_path):
    """
    Load dataset from a specified CSV file into a pandas DataFrame.

    Args:
        data_path (str): The file path to the dataset.

    Returns:
        DataFrame: The loaded dataset.
    """
    try:
        return pd.read_csv(data_path)
    except FileNotFoundError:
        logging.error("The file at path %s does not exist. Please ensure that you have uploaded the file properly.", data_path)
        raise DataLoadError(f"The file at path {data_path} does not exist. Please ensure that you have uploaded the file properly.")

Then, in the main function of your API:

try:
    df = load_data('path/to/data.csv')
    # Further processing and model prediction
except DataLoadError as e:
    # Return a response to the user with the error message,
    # for example: return Response({"error": str(e)}, status=400)
    ...

The user now receives a 400 (Bad Request) status code along with a message explaining what caused the error.

They now know what to do and will no longer blame the program for misbehaving.
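
To make this concrete, here is a minimal sketch of such an API handler using Flask. The /predict route and the churn_pipeline module are hypothetical; the point is simply to catch the domain exception and translate it into a 400 response:

from flask import Flask, jsonify, request

from churn_pipeline import DataLoadError, load_data  # hypothetical module

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    try:
        df = load_data(request.json["data_path"])
        # Further processing and model prediction would go here.
        return jsonify({"rows_received": len(df)}), 200
    except DataLoadError as error:
        # The domain exception becomes a 400 Bad Request, so the user
        # knows the problem is on their side.
        return jsonify({"error": str(error)}), 400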

面向?qū)ο缶幊?/h2>

Object-oriented programming (OOP) is an important programming paradigm in Python that even beginners should get familiar with. So what is OOP?

Object-oriented programming is a style of programming that encapsulates data and behavior into single objects, giving a program a clear structure.

Adopting OOP brings several major benefits:

  • Encapsulation hides internal details and improves code modularity.
  • Inheritance enables code reuse and speeds up development.
  • Complex problems can be broken down into small objects and tackled one at a time.
  • Readability and maintainability improve.

OOP has other advantages as well, but these are the most important ones.

Now let's look at a simple example: a class named TrainingPipeline with a few basic methods:

from abc import ABC, abstractmethod

class TrainingPipeline(ABC):
    def __init__(self, data_path, target_name):
        """
        Initialize the TrainingPipeline.

        Args:
            data_path (str): The file path to the dataset.
            target_name (str): Name of the target column.
        """
        self.data_path = data_path
        self.target_name = target_name
        self.data = None
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None

    @abstractmethod
    def load_data(self):
        """Load dataset from data path."""
        pass

    @abstractmethod
    def clean_data(self):
        """Clean the data."""
        pass

    @abstractmethod
    def feature_engineering(self):
        """Perform feature engineering."""
        pass

    @abstractmethod
    def preprocess_features(self):
        """Preprocess features."""
        pass

    @abstractmethod
    def split_data(self):
        """Split data into training and testing sets."""
        pass

    def run(self):
        """Run the training pipeline."""
        self.load_data()
        self.clean_data()
        self.feature_engineering()
        self.preprocess_features()
        self.split_data()

This is an abstract base class that only defines abstract methods, which classes derived from it must implement.

That makes it very useful for defining a blueprint, or template, that all subclasses must follow.

Here is an example subclass:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

class ChurnPredictionTrainPipeline(TrainingPipeline):
    def load_data(self):
        """Load dataset from data path."""
        self.data = pd.read_csv(self.data_path)

    def clean_data(self):
        """Clean the data."""
        self.data.dropna(inplace=True)

    def feature_engineering(self):
        """Perform feature engineering."""
        categorical_cols = self.data.select_dtypes(include=['object', 'category']).columns
        self.data = pd.get_dummies(self.data, columns=categorical_cols, drop_first=True)

    def preprocess_features(self):
        """Preprocess features."""
        numerical_cols = self.data.select_dtypes(include=['int64', 'float64']).columns
        scaler = StandardScaler()
        self.data[numerical_cols] = scaler.fit_transform(self.data[numerical_cols])

    def split_data(self):
        """Split data into training and testing sets."""
        features = self.data.drop(self.target_name, axis=1)
        target = self.data[self.target_name]
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            features, target, test_size=0.2, random_state=42)

The benefit of this design is that you can build an application that invokes the training pipeline's methods automatically, and you can create different training pipeline classes. They remain interchangeable because each must follow the blueprint defined in the abstract base class.
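
For illustration, a caller might look like this minimal sketch (assuming the two classes above live in a module named pipelines):

from pipelines import ChurnPredictionTrainPipeline  # hypothetical module name

pipeline = ChurnPredictionTrainPipeline(data_path="churn.csv", target_name="Churn")
pipeline.run()  # runs load, clean, engineer, preprocess and split in order

print(pipeline.X_train.shape, pipeline.X_test.shape)

Swapping in a different TrainingPipeline subclass would not change this calling code at all.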

Testing

Testing can make or break an entire project.

Testing does add some development time, but in the long run it greatly improves code quality, maintainability, and reliability.

Testing is essential to a project's success. Writing tests costs time at first, but it is an investment well worth making. Skipping tests may speed up development in the short term, but in the long run the lack of tests exacts a heavy price:

  • Once the codebase grows, any small change can cause unexpected breakage
  • New releases need extensive fixes, giving customers a poor experience
  • Developers become afraid to touch the codebase, and new feature releases stall

Following the principles of test-driven development (TDD) is therefore essential for code quality and development efficiency. The three core rules of TDD are:

  1. Write a failing unit test before you write any production code
  2. Do not write more of a unit test than is sufficient to fail
  3. Do not write more production code than is sufficient to pass the currently failing test

This test-first approach forces developers to think about the design of their code before writing it; a short sketch of the cycle follows.
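
As a small, hedged sketch of this red-green cycle with pytest (the helper normalize_column is made up for this example):

import pandas as pd


# Step 1 (red): the test is written first and fails, because
# normalize_column does not exist yet.
def test_normalize_column_scales_to_unit_range():
    df = pd.DataFrame({"Age": [0, 50, 100]})

    result = normalize_column(df, "Age")

    assert result["Age"].min() == 0.0
    assert result["Age"].max() == 1.0


# Step 2 (green): write just enough production code to make the test pass.
def normalize_column(df, column):
    df = df.copy()
    col = df[column]
    df[column] = (col - col.min()) / (col.max() - col.min())
    return df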

Python has excellent testing frameworks such as unittest and pytest, with pytest being the more approachable of the two thanks to its concise syntax. Despite the short-term extra effort, testing is absolutely necessary for a project's long-term success.

Let's look again at the ChurnPredictionTrainPipeline class from the previous section:

import pandas as pd
from sklearn.preprocessing import StandardScaler

class ChurnPredictionTrainPipeline(TrainingPipeline):
    def load_data(self):
        """Load dataset from data path."""
        self.data = pd.read_csv(self.data_path)

    ...

Using pytest, let's add unit tests for loading the data:

from unittest.mock import patch

import pandas as pd
import pytest

from churn_library import ChurnPredictionTrainPipeline

@pytest.fixture
def path():
    """
    Return the path to the test csv data file.
    """
    return r"./data/bank_data.csv"

def test_import_data_returns_dataframe(path):
    """
    Test that import data can load the CSV file into a pandas dataframe.
    """
    churn_predictor = ChurnPredictionTrainPipeline(path, "Churn")
    churn_predictor.load_data()

    assert isinstance(churn_predictor.data, pd.DataFrame)


def test_import_data_raises_exception():
    """
    Test that exception of "FileNotFoundError" gets raised in case the CSV
    file does not exist.
    """
    with pytest.raises(FileNotFoundError):
        churn_predictor = ChurnPredictionTrainPipeline("non_existent_file.csv",
                                                       "Churn")
        churn_predictor.load_data()


def test_import_data_reads_csv(path):
    """
    Test that the pandas.read_csv function gets called.
    """
    with patch("pandas.read_csv") as mock_csv:
        churn_predictor = ChurnPredictionTrainPipeline(path, "Churn")
        churn_predictor.load_data()
        mock_csv.assert_called_once_with(path)

These unit tests cover the following:

  • Test that the CSV file can be loaded into a pandas data frame.
  • Test that a FileNotFoundError is raised when the CSV file does not exist.
  • Test that pandas' read_csv function gets called.

This process is not strictly TDD, since I had already written the code before adding the unit tests. Ideally, though, you would write these unit tests even before implementing the load_data function.

Conclusion

Four simple design rules aim to make code cleaner, more readable, and easier to maintain. The four rules are:

  1. Run all the tests (most important)
  2. Eliminate duplication
  3. Express the programmer's intent
  4. Minimize the number of classes and methods (least important)

The first three rules focus on refactoring. Don't chase perfection when first writing the code; write something simple, even "ugly", get it working, and then refactor it to follow these rules until it becomes elegant (a tiny example of eliminating duplication follows).
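
As a tiny illustration of rules 2 and 3, repeated scaling logic can be pulled into a single intent-revealing helper (the column names are illustrative):

import pandas as pd


def standardize_column(df, column):
    """One home for the scaling logic instead of copy-pasting it per column."""
    df[column] = (df[column] - df[column].mean()) / df[column].std()
    return df


# Before: the same expression duplicated for every column.
# df['Age'] = (df['Age'] - df['Age'].mean()) / df['Age'].std()
# df['Fare'] = (df['Fare'] - df['Fare'].mean()) / df['Fare'].std()

# After: one call per column, and the name states the intent.
df = pd.DataFrame({"Age": [22, 35, 58], "Fare": [7.25, 71.28, 26.55]})
for column in ["Age", "Fare"]:
    df = standardize_column(df, column)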

This "make it work, then refactor" approach is the recommended way to program: don't strive for perfection from the start; get the code running and the feature implemented first, then refactor repeatedly, gradually applying the four simple design rules to raise code quality.

Writing clean code is critical to the success of a software project, but it takes discipline and constant practice. As data scientists, we tend to focus on running code in Jupyter Notebooks, finding a good model, and hitting the desired metrics, while neglecting how clean the code is. But writing clean code is a required skill for data scientists too, because it ensures models get into production faster.

Whenever we write code that will be reused, we should insist on keeping it clean. Start simple, don't aim for perfection at the outset, and polish the code iteratively. And never forget to write unit tests for your functions, to make sure they work correctly and to avoid major problems when you extend them later.

Sticking to principles such as eliminating duplication and expressing intent will keep you away from the "never touch a running system" mindset. I am still learning these principles and applying them in my daily work; they genuinely help, but mastering them takes a long time and sustained effort.

Finally, automate as much as you can and use your IDE's extensions to help you follow the clean code rules and work more efficiently.

References

[1] PEP 8 style guide: https://peps.python.org/pep-0008/

[2] Pylint: https://pylint.readthedocs.io/en/stable/

[3] autopep8: https://pypi.org/project/autopep8/
