Automating Model Selection and Experimentation with LLMs

Translator | 布加迪

Reviewer | 重樓

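Large language models (LLMs) have become everyday tools: from answering questions to generating task lists, they simplify our work in many ways. Individuals and businesses alike now use LLMs to help get work done.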

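Code generation and evaluation have recently become important features in many commercial products that help developers work with code. LLMs can be taken further and applied to data science work, especially model selection and experimentation.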

This article explores how to automate model selection and experimentation.

Automating Model Selection and Experimentation with LLMs

We will set up the dataset for model training and the code for the automation. In this example, we use the credit card fraud dataset from Kaggle. Below is the preprocessing I did to prepare it.

import pandas as pd

df = pd.read_csv('fraud_data.csv')
df = df.drop(['trans_date_trans_time', 'merchant', 'dob', 'trans_num', 'merch_lat', 'merch_long'], axis =1)

df = df.dropna().reset_index(drop = True)
df.to_csv('fraud_data.csv', index = False)

We use only part of the dataset and drop every row with missing values. This is not an optimal process, but our focus here is model selection and experimentation.

Next, we prepare a folder for the project and put all related files there. First, we create a requirements.txt file for the environment. You can fill it with the following packages.

openai
pandas
scikit-learn
pyyaml

Next, we use a YAML file for all the relevant metadata. This includes the OpenAI API key, the models to test, the evaluation metrics, and the location of the dataset.

llm_api_key: "YOUR-OPENAI-API-KEY"
default_models:
  - LogisticRegression
  - DecisionTreeClassifier
  - RandomForestClassifier
metrics: ["accuracy", "precision", "recall", "f1_score"]
dataset_path: "fraud_data.csv"
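
Keeping the API key in the YAML file is convenient for a demo, but one may prefer to read it from an environment variable and to check the configuration before any training starts. The following is a minimal sketch under those assumptions. `validate_config`, `REQUIRED_KEYS`, and `PLACEHOLDER_KEY` are illustrative names that do not appear in the original script; `OPENAI_API_KEY` is the environment variable the OpenAI client itself reads by default.

```python
import os

# Keys the config file shown above is expected to contain (illustrative helper).
REQUIRED_KEYS = ("llm_api_key", "default_models", "metrics", "dataset_path")
PLACEHOLDER_KEY = "YOUR-OPENAI-API-KEY"


def validate_config(config, known_models, env=os.environ):
    """Return a checked copy of the config dict, raising ValueError on problems."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"Missing config keys: {missing}")

    unknown = [name for name in config["default_models"] if name not in known_models]
    if unknown:
        raise ValueError(f"Unknown models in default_models: {unknown}")

    checked = dict(config)  # leave the caller's dict untouched
    # Assumption: fall back to the environment when the key is empty or a placeholder.
    if not checked["llm_api_key"] or checked["llm_api_key"] == PLACEHOLDER_KEY:
        checked["llm_api_key"] = env.get("OPENAI_API_KEY", "")
    if not checked["llm_api_key"]:
        raise ValueError("No API key found in the config or in OPENAI_API_KEY")
    return checked
```

In the script this could be called as `validate_config(load_config(), model_mapping)` once both are defined, since membership tests work on the dictionary's keys.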

Then we import the packages used in the process. We rely on Scikit-Learn for modeling and use OpenAI's GPT-4 as the LLM.

import pandas as pd
import yaml
import ast
import re
import sklearn
from openai import OpenAI
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In addition, we set up helper functions and a model lookup table to support the process. The functions below cover configuration loading, dataset loading, and data preprocessing.

model_mapping = {
    "LogisticRegression": LogisticRegression,
    "DecisionTreeClassifier": DecisionTreeClassifier,
    "RandomForestClassifier": RandomForestClassifier
}

def load_config(config_path='config.yaml'):
    with open(config_path, 'r') as file:
        config = yaml.safe_load(file)
    return config

def load_data(dataset_path):
    return pd.read_csv(dataset_path)

def preprocess_data(df):
    label_encoders = {}
    for column in df.select_dtypes(include=['object']).columns:
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])
        label_encoders[column] = le
    return df, label_encoders
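
To make the preprocessing step concrete, here is a pure-Python sketch of what label encoding does: each distinct value in a column is replaced by an integer code, assigned in sorted order of the values, which is also how `LabelEncoder` orders its classes. The function name `label_encode` and the category strings are made up for illustration.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted order of the values."""
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[value] for value in values], classes


codes, classes = label_encode(["grocery_pos", "travel", "grocery_pos", "home"])
print(codes)    # [0, 2, 0, 1]
print(classes)  # ['grocery_pos', 'home', 'travel']
```

The codes carry an arbitrary ordering. Tree-based models usually tolerate that, while a linear model such as LogisticRegression may read the codes as magnitudes, so one-hot encoding is a common alternative when that matters.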

In the same file, we set up the LLM to play the role of a machine learning expert. We use the code below to call it.

def call_llm(prompt, api_key):
    client = OpenAI(api_key=api_key)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an expert in machine learning and able to evaluate the model well."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content.strip()
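
If more repeatable answers are wanted, the request can carry generation parameters. The sketch below only assembles the keyword arguments, so it runs without network access; `build_chat_request` and `SYSTEM_PROMPT` are illustrative names, and the default `temperature=0.0` is an assumption rather than a setting from the original script. `temperature` itself is a standard parameter of the chat completions endpoint.

```python
SYSTEM_PROMPT = (
    "You are an expert in machine learning and able to evaluate the model well."
)


def build_chat_request(prompt, model="gpt-4", temperature=0.0):
    """Assemble keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        # Lower values tend to make the output less varied between runs.
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
    }
```

Inside `call_llm`, the request would then become `client.chat.completions.create(**build_chat_request(prompt))`.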

You can swap the LLM for any model you like, such as an open-source model from Hugging Face, but we recommend sticking with OpenAI for now.

In the code below, I prepare functions that clean up the LLM result. They ensure the output can be used in the later model selection and experimentation steps.

def clean_hyperparameter_suggestion(suggestion):
    pattern = r'\{.*?\}'
    match = re.search(pattern, suggestion, re.DOTALL)
    if match:
        cleaned_suggestion = match.group(0)
        return cleaned_suggestion
    else:
        print("Could not find a dictionary in the hyperparameter suggestion.")
        return None

def extract_model_name(llm_response, available_models):
    for model in available_models:
        pattern = r'\b' + re.escape(model) + r'\b'
        if re.search(pattern, llm_response, re.IGNORECASE):
            return model
    return None

def validate_hyperparameters(model_class, hyperparameters):
    valid_params = model_class().get_params()
    invalid_params = []
    for param, value in hyperparameters.items():
        if param not in valid_params:
            invalid_params.append(param)
        else:
            if param == 'max_features' and value == 'auto':
                print(f"Invalid value for parameter '{param}': '{value}'")
                invalid_params.append(param)
    if invalid_params:
        print(f"Invalid hyperparameters for {model_class.__name__}: {invalid_params}")
        return False
    return True

def correct_hyperparameters(hyperparameters, model_name):
    corrected = False
    if model_name == "RandomForestClassifier":
        if 'max_features' in hyperparameters and hyperparameters['max_features'] == 'auto':
            print("Correcting 'max_features' from 'auto' to 'sqrt' for RandomForestClassifier.")
            hyperparameters['max_features'] = 'sqrt'
            corrected = True
    return hyperparameters, corrected
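
The cleaning step can be tried without calling the API. The sketch below combines extraction and parsing in one hypothetical helper, `parse_hyperparameters`, and feeds it a made-up reply. Unlike `clean_hyperparameter_suggestion` above, it uses a greedy pattern so that a nested dictionary (for example a `class_weight` value) is kept whole; the trade-off is that stray braces later in the reply would be swallowed too.

```python
import ast
import re


def parse_hyperparameters(llm_text):
    """Extract the outermost {...} span from an LLM reply and parse it safely."""
    match = re.search(r"\{.*\}", llm_text, re.DOTALL)  # greedy: first '{' to last '}'
    if match is None:
        return None
    try:
        # literal_eval only accepts Python literals, so no code is executed.
        parsed = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None
    return parsed if isinstance(parsed, dict) else None


reply = "Sure, try: {'n_estimators': 200, 'class_weight': {0: 1, 1: 10}} Good luck!"
print(parse_hyperparameters(reply))
# {'n_estimators': 200, 'class_weight': {0: 1, 1: 10}}
print(parse_hyperparameters("I would keep the defaults."))
# None
```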

Then we need a function that instantiates the model and evaluates the training run. The code below trains a model given the split datasets, the name of the model to look up, and the hyperparameters. It returns the metrics and the model object.

def train_and_evaluate(X_train, X_test, y_train, y_test, model_name, hyperparameters=None):
    if model_name not in model_mapping:
        print(f"Valid model names are: {list(model_mapping.keys())}")
        return None, None

    model_class = model_mapping.get(model_name)
    try:
        if hyperparameters:
            hyperparameters, corrected = correct_hyperparameters(hyperparameters, model_name)
            if not validate_hyperparameters(model_class, hyperparameters):
                return None, None
            model = model_class(**hyperparameters)
        else:
            model = model_class()
    except Exception as e:
        print(f"Error instantiating model with hyperparameters: {e}")
        return None, None
    try:
        model.fit(X_train, y_train)
    except Exception as e:
        print(f"Error during model fitting: {e}")
        return None, None


    y_pred = model.predict(X_test)
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average='weighted', zero_division=0),
        "recall": recall_score(y_test, y_pred, average='weighted', zero_division=0),
        "f1_score": f1_score(y_test, y_pred, average='weighted', zero_division=0)
    }
    return metrics, model
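
The function above reports weighted averages. On a heavily imbalanced target, which a fraud label usually is, weighted scores can be dominated by the majority class, so it may help to also look at the positive class on its own. The pure-Python sketch below uses made-up labels to show the gap; `positive_class_scores` is a hypothetical helper. In the script itself, calling `recall_score(y_test, y_pred)` with its default `average='binary'` would give the scikit-learn equivalent for the positive class.

```python
def positive_class_scores(y_true, y_pred, positive=1):
    """Accuracy plus precision and recall computed for the positive class only."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# 95 legitimate and 5 fraudulent transactions; the model catches only one fraud.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1, 0, 0, 0, 0]
print(positive_class_scores(y_true, y_pred))
# {'accuracy': 0.96, 'precision': 1.0, 'recall': 0.2}
```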

With everything in place, we can set up the automation. There are several steps we can automate, including:

1. Train and evaluate all models

2. Let the LLM select the best model

3. Ask about hyperparameter tuning for the best model

4. Run the hyperparameter tuning automatically if the LLM suggests it

def run_llm_based_model_selection_experiment(df, config):
    #Model Training
    X = df.drop("is_fraud", axis=1)
    y = df["is_fraud"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    available_models = config['default_models']
    model_performance = {}

    for model_name in available_models:
        print(f"Training model: {model_name}")
        metrics, _ = train_and_evaluate(X_train, X_test, y_train, y_test, model_name)
        model_performance[model_name] = metrics
        print(f"Model: {model_name} | Metrics: {metrics}")

    #LLM selecting the best model
    sklearn_version = sklearn.__version__
    prompt = (
        f"I have trained the following models with these metrics: {model_performance}. "
        "Which model should I select based on the best performance?"
    )
    best_model_response = call_llm(prompt, config['llm_api_key'])
    print(f"LLM response for best model selection:\n{best_model_response}")

    best_model = extract_model_name(best_model_response, available_models)
    if not best_model:
        print("Error: Could not extract a valid model name from LLM response.")
        return
    print(f"LLM selected the best model: {best_model}")

    #Check for hyperparameter tuning
    prompt_tuning = (
        f"The selected model is {best_model}. Can you suggest hyperparameters for better performance? "
        "Please provide them in Python dictionary format, like {'max_depth': 5, 'min_samples_split': 4}. "
        f"Ensure that all suggested hyperparameters are valid for scikit-learn version {sklearn_version}, "
        "and avoid using deprecated or invalid values such as 'max_features': 'auto'. "
        "Don't provide any explanation or return in any other format."
    )
    tuning_suggestion = call_llm(prompt_tuning, config['llm_api_key'])
    print(f"Hyperparameter tuning suggestion received:\n{tuning_suggestion}")

    cleaned_suggestion = clean_hyperparameter_suggestion(tuning_suggestion)
    if cleaned_suggestion is None:
        suggested_params = None
    else:
        try:
            suggested_params = ast.literal_eval(cleaned_suggestion)
            if not isinstance(suggested_params, dict):
                print("Hyperparameter suggestion is not a valid dictionary.")
                suggested_params = None
        except (ValueError, SyntaxError) as e:
            print(f"Error parsing hyperparameter suggestion: {e}")
            suggested_params = None

    #Automatically run hyperparameter tuning if suggested
    if suggested_params:
        print(f"Running {best_model} with suggested hyperparameters: {suggested_params}")
        tuned_metrics, _ = train_and_evaluate(
            X_train, X_test, y_train, y_test, best_model, hyperparameters=suggested_params
        )
        print(f"Metrics after tuning: {tuned_metrics}")
    else:
        print("No valid hyperparameters were provided for tuning.")

In the code above, I specify how the LLM evaluates each of our models based on the experiment. We use the following prompt to select a model based on its performance.

prompt = (
        f"I have trained the following models with these metrics: {model_performance}. "
        "Which model should I select based on the best performance?")

You can always change the prompt to apply different rules for model selection.
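
As one possible variation, the selection rule can be made explicit instead of leaving "best performance" open to interpretation. The sketch below is an assumption about how such a prompt might be built; `build_selection_prompt` and its `priority_metric` parameter are hypothetical, and the metric values are made up.

```python
def build_selection_prompt(model_performance, priority_metric="recall"):
    """Build a model-selection prompt that names the metric to optimise."""
    lines = [f"- {name}: {metrics}" for name, metrics in model_performance.items()]
    return (
        "I have trained the following models with these metrics:\n"
        + "\n".join(lines)
        + f"\nSelect the model with the highest {priority_metric}; "
        "break ties using f1_score. Answer with the model name only."
    )


example = {
    "LogisticRegression": {"recall": 0.91, "f1_score": 0.90},
    "RandomForestClassifier": {"recall": 0.97, "f1_score": 0.97},
}
print(build_selection_prompt(example))
```

Asking for the model name only also makes the reply easier for `extract_model_name` to handle, since that function returns the first configured model it finds mentioned anywhere in the response.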

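Once the best model is selected, I use the following prompt to ask which hyperparameters should be used in the next step. I also state the Scikit-Learn version, because the accepted hyperparameters change between versions.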

prompt_tuning = (
        f"The selected model is {best_model}. Can you suggest hyperparameters for better performance? "
        "Please provide them in Python dictionary format, like {'max_depth': 5, 'min_samples_split': 4}. "
        f"Ensure that all suggested hyperparameters are valid for scikit-learn version {sklearn_version}, "
        "and avoid using deprecated or invalid values such as 'max_features': 'auto'. "
        "Don't provide any explanation or return in any other format.")

You can change the prompt in any way you like, for example by asking for more aggressive hyperparameter tuning or by adding another technique.

I put all the code above in a file named automated_model_llm.py. Finally, add the following code to run the whole process.

def main():
    config = load_config()
    df = load_data(config['dataset_path'])
    df, _ = preprocess_data(df)
    run_llm_based_model_selection_experiment(df, config)


if __name__ == "__main__":
    main()

Once everything is ready, you can run the following command to execute the script.

python automated_model_llm.py

Output:

LLM selected the best model: RandomForestClassifier
Hyperparameter tuning suggestion received:
{
'n_estimators': 100,
'max_depth': None,
'min_samples_split': 2,
'min_samples_leaf': 1,
'max_features': 'sqrt',
'bootstrap': True
}
Running RandomForestClassifier with suggested hyperparameters: {'n_estimators': 100, 'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'bootstrap': True}
Metrics after tuning: {'accuracy': 0.9730041532071989, 'precision': 0.9722907483489197, 'recall': 0.9730041532071989, 'f1_score': 0.9724045530119824}

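This is sample output from my experiment, and yours may differ. You can adjust the prompt and the generation parameters to get more varied or stricter LLM output. Either way, with the code structured properly, LLMs can be applied to automate model selection and experimentation.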
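
One detail worth checking in a run like this: the hyperparameters suggested in the sample output appear to match the defaults of RandomForestClassifier in recent scikit-learn releases, so the "tuned" model is effectively the baseline trained again. A small comparison against the baseline metrics already stored in `model_performance` makes that visible. The sketch below is one way to do it; `compare_metrics`, `min_gain`, and the numbers are illustrative and not part of the original script.

```python
def compare_metrics(baseline, tuned, min_gain=0.001):
    """Return per-metric deltas and whether any metric improved by at least min_gain."""
    deltas = {name: tuned[name] - baseline[name] for name in baseline if name in tuned}
    improved = any(delta >= min_gain for delta in deltas.values())
    return deltas, improved


baseline = {"accuracy": 0.90, "f1_score": 0.88}
tuned = {"accuracy": 0.95, "f1_score": 0.88}
deltas, improved = compare_metrics(baseline, tuned)
print(improved)  # True
```

In the script, this could be called as `compare_metrics(model_performance[best_model], tuned_metrics)` after the tuning step, provided both runs returned metrics rather than `None`.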

Conclusion

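LLMs have been applied to many use cases, including code generation. By using an LLM such as an OpenAI GPT model, we can readily delegate model selection and experimentation to it, as long as we structure its output correctly. In this example, we used a sample dataset to experiment with several models and let the LLM select one and suggest an experiment to improve it.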

Original title: Model Selection and Experimentation Automation with LLMs, by Cornellius Yudha Wijaya

Editor: 姜華  Source: 51CTO Featured Content