
How to Train a Final Machine Learning Model

For those who are just getting started with machine learning, or who are moving into it from another field, "how to train the final model" is a classic question. Dr. Jason Brownlee wrote an article specifically to answer it (original link: http://machinelearningmastery.com/train-final-machine-learning-model/). OPEN01 Tech prepared this Chinese translation of the article in the hope that it will be of some help to readers who are still learning.

Original author: Dr. Jason Brownlee

Chinese translation: R.

Invited proofreader: Dr. Xu.Tang

Source: OPEN01 Tech (WeChat official account: open01tech)

How to Train a Final Machine Learning Model

The machine learning model that we use to make predictions on new data is called the final model.

There can be confusion in applied machine learning about how to train a final model.

This error is seen with beginners to the field who ask questions such as:

• How do I predict with cross validation?

• Which model do I choose from cross-validation?

• Do I use the model after preparing it on the training dataset?

This post will clear up the confusion.

In this post, you will discover how to finalize your machine learning model in order to make predictions on new data.

Let’s get started.

What is a Final Model?

A final machine learning model is a model that you use to make predictions on new data.

That is, given new examples of input data, you want to use the model to predict the expected output. This may be a classification (assign a label) or a regression (a real value).

For example, whether the photo is a picture of a dog or a cat, or the estimated number of sales for tomorrow.

The goal of your machine learning project is to arrive at a final model that performs the best, where “best” is defined by:

• Data: the historical data that you have available.

• Time: the time you have to spend on the project.

• Procedure: the data preparation steps, algorithm or algorithms, and the chosen algorithm configurations.

In your project, you gather the data, spend the time you have, and discover the data preparation procedures, algorithm to use, and how to configure it.

The final model is the pinnacle of this process, the end you seek in order to start actually making predictions.

The Purpose of Train/Test Sets

Why do we use train and test sets?

Creating a train and test split of your dataset is one method to quickly evaluate the performance of an algorithm on your problem.

The training dataset is used to prepare a model, to train it.

We pretend the test dataset is new data where the output values are withheld from the algorithm. We gather predictions from the trained model on the inputs from the test dataset and compare them to the withheld output values of the test set.

Comparing the predictions and withheld outputs on the test dataset allows us to compute a performance measure for the model on the test dataset. This is an estimate of the skill of the algorithm trained on the problem when making predictions on unseen data.
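To make the procedure concrete, here is a minimal scikit-learn sketch of a train/test evaluation. The synthetic dataset and the kNN classifier are placeholders of my own choosing, not something prescribed by the article:

```python
# Minimal train/test evaluation sketch (synthetic data and kNN are placeholders)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)                 # prepare the model on the training set
predictions = model.predict(X_test)         # predict the withheld test inputs
print(accuracy_score(y_test, predictions))  # compare against the withheld test outputs
```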

Let’s unpack this further.

When we evaluate an algorithm, we are in fact evaluating all steps in the procedure, including how the training data was prepared (e.g. scaling), the choice of algorithm (e.g. kNN), and how the chosen algorithm was configured (e.g. k=3).

The performance measure calculated on the predictions is an estimate of the skill of the whole procedure.
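One way to keep the whole procedure together is to evaluate a single pipeline object, so that data preparation and algorithm configuration are scored as one unit. A sketch along those lines, reusing the scaling and kNN (k=3) examples above on a synthetic dataset:

```python
# Evaluate the whole procedure (scaling + kNN with k=3) as one pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

procedure = Pipeline([
    ("scale", StandardScaler()),                  # data preparation
    ("knn", KNeighborsClassifier(n_neighbors=3))  # algorithm and its configuration
])
procedure.fit(X_train, y_train)
print(procedure.score(X_test, y_test))  # estimated skill of the whole procedure
```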

We generalize the performance measure from:

• “the skill of the procedure on the test set”

to

• “the skill of the procedure on unseen data”.

This is quite a leap and requires that:

• The procedure is sufficiently robust that the estimate of skill is close to what we actually expect on unseen data.

• The choice of performance measure accurately captures what we are interested in measuring in predictions on unseen data.

• The choice of data preparation is well understood and repeatable on new data, and reversible if predictions need to be returned to their original scale or related to the original input values (a small sketch of this follows the list).

• The choice of algorithm makes sense for its intended use and operational environment (e.g. complexity or chosen programming language).
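As a small illustration of the reversibility point above (the numbers and the choice of a standard scaler are placeholders, not from the article), a transform fitted on the training targets can be inverted to return predictions to their original scale:

```python
# Reversible data preparation: scale regression targets, then invert predictions
import numpy as np
from sklearn.preprocessing import StandardScaler

y_train = np.array([[120.0], [95.0], [150.0], [110.0]])  # hypothetical targets in original units
scaler = StandardScaler().fit(y_train)         # learned on training data, repeatable on new data
y_scaled = scaler.transform(y_train)           # a model would be trained on these scaled targets
scaled_preds = np.array([[0.5], [-1.0]])       # placeholder model output in scaled units
print(scaler.inverse_transform(scaled_preds))  # reversible: back to the original scale
```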

A lot rides on the estimated skill of the whole procedure on the test set.

In fact, using the train/test method of estimating the skill of the procedure on unseen data often has a high variance (unless we have a heck of a lot of data to split). This means that when it is repeated, it gives different results, often very different results.

The outcome is that we may be quite uncertain about how well the procedure actually performs on unseen data and how one procedure compares to another.
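This variance is easy to observe by simply repeating the split with different random seeds; a small sketch, again using synthetic data and kNN purely as placeholders:

```python
# Repeating the train/test split shows how much the skill estimate can vary
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, random_state=1)
scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
print(min(scores), max(scores))  # the spread hints at the variance of the estimate
```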

Often, time permitting, we prefer to use k-fold cross-validation instead.

The Purpose of k-fold Cross Validation

Why do we use k-fold cross validation?

Cross-validation is another method to estimate the skill of a method on unseen data, like using a train-test split.

Cross-validation systematically creates and evaluates multiple models on multiple subsets of the dataset.

This, in turn, provides a population of performance measures.

• We can calculate the mean of these measures to get an idea of how well the procedure performs on average.

• We can calculate the standard deviation of these measures to get an idea of how much the skill of the procedure is expected to vary in practice.
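A minimal sketch of this with scikit-learn's cross_val_score, once more using a synthetic dataset and kNN as placeholders:

```python
# k-fold cross-validation gives a population of scores; summarize with mean and std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=1)
model = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross-validation
print("mean:", scores.mean())  # how well the procedure performs on average
print("std:", scores.std())    # how much the skill is expected to vary in practice
```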

This is also helpful for providing a more nuanced comparison of one procedure to another when you are trying to choose which algorithm and data preparation procedures to use.

Also, this information is invaluable as you can use the mean and spread to give a confidence interval on the expected performance of a machine learning procedure in practice.
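For example (with purely illustrative numbers), a mean accuracy of 0.80 and a standard deviation of 0.03 across the folds would give a rough interval of 0.80 ± 2 × 0.03, i.e. somewhere around 0.74 to 0.86, as the skill you might expect from the procedure in practice.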

Both train-test splits and k-fold cross validation are examples of resampling methods.

Why do we use Resampling Methods?

The problem with applied machine learning is that we are trying to model the unknown.

On a given predictive modeling problem, the ideal model is one that performs the best when making predictions on new data.

We don’t have new data, so we have to pretend with statistical tricks.

The train-test split and k-fold cross validation are called resampling methods. Resampling methods are statistical procedures for sampling a dataset and estimating an unknown quantity.

In the case of applied machine learning, we are interested in estimating the skill of a machine learning procedure on unseen data. More specifically, the skill of the predictions made by a machine learning procedure.

Once we have the estimated skill, we are finished with the resampling method.

• If you are using a train-test split, that means you can discard the split datasets and the trained model.

• If you are using k-fold cross-validation, that means you can throw away all of the trained models.

They have served their purpose and are no longer needed.

You are now ready to finalize your model.

How to Finalize a Model?

You finalize a model by applying the chosen machine learning procedure on all of your data.
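In scikit-learn terms, that can be as simple as the sketch below, where the pipeline stands in for whatever procedure you selected and the synthetic dataset stands in for all of your data:

```python
# Finalize: fit the chosen procedure on every example you have (no hold-out split)
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=1)  # placeholder for ALL of your data
final_model = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
final_model.fit(X, y)  # the finalized model, ready to be saved or used for predictions
```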

That’s it.

With the finalized model, you can:

• Save the model for later or operational use.

• Make predictions on new data.
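A minimal sketch of both steps, using joblib for persistence; the model, the file name, and the "new" inputs are all placeholders of my own choosing:

```python
# Save the finalized model, reload it later, and predict on new inputs
import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=1)
final_model = KNeighborsClassifier(n_neighbors=3).fit(X, y)  # stands in for your finalized model

joblib.dump(final_model, "final_model.joblib")  # save for later or operational use
loaded = joblib.load("final_model.joblib")      # reload when predictions are needed
new_X = np.random.rand(3, X.shape[1])           # placeholder for genuinely new input rows
print(loaded.predict(new_X))                    # make predictions on new data
```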

What about the cross-validation models or the train-test datasets?

They’ve been discarded. They are no longer needed. They have served their purpose to help you choose a procedure to finalize.

About the author: Dr. Jason Brownlee

Dr. Jason Brownlee is a husband, proud father, academic researcher, author, professional developer and a machine learning practitioner. He is dedicated to helping developers get started and get good at applied machine learning.

Invited proofreader: Dr. Xu.Tang

Dr. Xu.Tang holds a PhD in Statistics from the National University of Singapore, previously served as a data analysis manager at Dagong Global, and is now a senior data mining and analytics specialist at OPEN01 Tech.

About OPEN01 Tech:

OPEN01 Tech (OPEN01) is dedicated to providing real-time, efficient, multi-dimensional data analysis products and services for users across industries, built on world-leading AI and big-data processing technology, a distinctive IT architecture, and deep learning and pattern recognition algorithms. Its core team brings together big-data experts from MIT, Harvard University, the State University of New York, and the University of Cambridge, as well as strategy and operations experts from Roland Berger and Accenture.
