Python Machine Learning: Hyperparameter Tuning
1. What Are Hyperparameters?
Hyperparameters are parameters of a machine learning or deep learning algorithm that must be set before training; they are not learned from the training data. The algorithm itself usually only specifies each hyperparameter's valid range and meaning, and the right setting for the same hyperparameter of the same algorithm differs across application scenarios.
So how should hyperparameters be set? There seems to be no shortcut: try different values, compare the results, and keep the best.
This article surveys several such approaches:
- RandomSearch
- GridSearch
- Bayesian optimization
2. GridSearchCV
Brute-force enumeration is a simple and effective way to find optimal hyperparameters, but for an algorithm with a large hyperparameter space it wastes a great deal of time, especially when there are several hyperparameters. GridSearchCV shrinks the space: it searches only the hyperparameters the practitioner considers important, over a limited set of fixed values, and combines this with cross-validation to select the best configuration.
Take lightgbm as an example:
```python
import pandas as pd
import numpy as np
import math
import warnings
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

lg = lgb.LGBMClassifier(silent=False)
param_dist = {"max_depth": [2, 3, 4, 5, 7, 10],
              "n_estimators": [50, 100, 150, 200],
              "min_child_samples": [2, 3, 4, 5, 6]}

grid_search = GridSearchCV(estimator=lg, n_jobs=10, param_grid=param_dist,
                           cv=5, scoring='f1', verbose=5)
# X_train and y are assumed to be a prepared feature matrix and label vector
grid_search.fit(X_train, y)
grid_search.best_estimator_, grid_search.best_score_

# Fitting 5 folds for each of 120 candidates, totalling 600 fits
# [Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
# [Parallel(n_jobs=10)]: Done  52 tasks      | elapsed:    2.5s
# [Parallel(n_jobs=10)]: Done 142 tasks      | elapsed:    6.6s
# [Parallel(n_jobs=10)]: Done 268 tasks      | elapsed:   14.0s
# [Parallel(n_jobs=10)]: Done 430 tasks      | elapsed:   25.5s
# [Parallel(n_jobs=10)]: Done 600 out of 600 | elapsed:   40.6s finished
# (LGBMClassifier(max_depth=10, min_child_samples=6, n_estimators=200,
#                 silent=False), 0.6359524127649383)
```
From the above, a GridSearchCV run consists of:
- the model estimator: lgb.LGBMClassifier
- param_grid: the model's hyperparameters; the example uses 3 parameters with 6, 4, and 5 candidate values respectively, for a search space of 120 combinations
- cv: the number of cross-validation folds (5 in the example), so the model is trained 120 * 5 = 600 times in total
- scoring: the metric used to judge model quality, such as the F1 score
- return values: the best model, best_estimator_, and the best score, best_score_
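Besides the best estimator and score, every evaluated combination is recorded in the cv_results_ attribute. A quick sketch, reusing the grid_search object fitted above, to inspect the top candidates:

```python
import pandas as pd

# one row per hyperparameter combination, with the mean and standard
# deviation of the 'f1' score across the 5 cross-validation folds
results = pd.DataFrame(grid_search.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())
```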
3. RandomizedSearchCV
Like GridSearchCV, RandomizedSearchCV (random search) also looks for the best hyperparameters within a limited space (the hyperparameters the practitioner considers important). The difference is that the values searched are not fixed: they are drawn at random from a specified range, and the combinations tried are random as well. This randomness may cover ground that GridSearchCV's limited fixed grid misses, but it may also do worse.
- Better than grid search in various senses but still expensive to guarantee good coverage
```python
import pandas as pd
import numpy as np
import math
import warnings
import lightgbm as lgb
from scipy.stats import uniform
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

lg = lgb.LGBMClassifier(silent=False)
param_dist = {"max_depth": range(2, 15, 1),
              "n_estimators": range(50, 200, 4),
              "min_child_samples": [2, 3, 4, 5, 6]}

random_search = RandomizedSearchCV(estimator=lg, n_jobs=10,
                                   param_distributions=param_dist,
                                   n_iter=100, cv=5, scoring='f1', verbose=5)
random_search.fit(X_train, y)
random_search.best_estimator_, random_search.best_score_

# Fitting 5 folds for each of 100 candidates, totalling 500 fits
# [Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
# [Parallel(n_jobs=10)]: Done  52 tasks      | elapsed:    6.6s
# [Parallel(n_jobs=10)]: Done 142 tasks      | elapsed:   12.9s
# [Parallel(n_jobs=10)]: Done 268 tasks      | elapsed:   22.9s
# [Parallel(n_jobs=10)]: Done 430 tasks      | elapsed:   36.2s
# [Parallel(n_jobs=10)]: Done 500 out of 500 | elapsed:   42.0s finished
# (LGBMClassifier(max_depth=11, min_child_samples=3, n_estimators=198,
#                 silent=False), 0.628180299445963)
```
As the above shows, it is largely similar to GridSearchCV; the differences are:
- n_iter: the number of random parameter combinations to evaluate
- param_distributions: the ranges to sample values from; besides a list, these can also be distributions such as scipy's uniform (see the sketch below)
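For instance, a sketch of a distribution-based search space (the parameter ranges here are illustrative, not recommendations):

```python
from scipy.stats import randint, uniform

param_dist = {
    "learning_rate": uniform(0.01, 0.3),   # continuous values in [0.01, 0.31]
    "max_depth": randint(2, 15),           # integers in [2, 15)
    "min_child_samples": [2, 3, 4, 5, 6],  # a plain list is sampled uniformly
}
```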
4. Bayesian Optimization
Both GridSearchCV and RandomizedSearchCV evaluate the model over all, or a random subset of, the parameter combinations within a limited, user-specified range. The "best" parameters found therefore depend on that prior choice of parameters and ranges, and are not necessarily optimal; moreover, brute-force search over a large candidate space is very time-consuming.
Consider parameter search from a different angle: the goal is to pick the parameter combination under which the trained model performs best on the given dataset, so it can be framed as an optimization problem.
We usually compute optima iteratively with gradient descent, but gradient descent requires the function being optimized to be fixed and differentiable, such as a cross-entropy loss. In parameter search we must find the best model among many (each with different hyperparameters); quality is determined by the model itself, and the model is a black box whose structure is unknown and which may not even be convex, so gradient descent cannot be applied.
In this setting, Bayesian optimization is one solution. It models the search space with a Gaussian process and, iteration by iteration, uses that process to compute new parameters expected to improve on the current optimum.
The generic algorithm is as follows:
- Input: f is the model, M is the Gaussian-process surrogate, X is the parameter space, and S is the acquisition function used to select parameters
- Initialize the dataset D of (x, y) pairs the Gaussian process is fitted to, where x is a hyperparameter setting and y is the result of running the model with x (e.g. an accuracy score)
- Iterate T times:
  - fit the Gaussian process to the current dataset D
  - using the fitted surrogate and the acquisition function S (e.g. Expected Improvement), choose a parameter setting xi from the search space that is expected to improve on the current optimum
  - plug xi into the model f (i.e. train a model) to obtain the corresponding yi (the new model's accuracy, etc.)
  - add (xi, yi) to the dataset D and iterate again
As this shows, each step of Bayesian optimization exploits the previous parameter choices, whereas every trial in GridSearchCV and RandomizedSearchCV is independent of the others. A minimal sketch of the loop follows.
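To make the loop concrete, here is a minimal, self-contained sketch of the procedure above for a one-dimensional toy objective, using scikit-learn's GaussianProcessRegressor as the surrogate M and Expected Improvement as the acquisition S (the objective f, the search range, and the iteration counts are all made up for illustration):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

def f(x):
    # toy black-box objective to maximize; stands in for "train a model
    # with hyperparameter x and return its score"
    return -(x - 2.0) ** 2 + 1.0

rng = np.random.default_rng(0)
candidates = np.linspace(-5, 5, 500).reshape(-1, 1)   # parameter space X

# initialize the dataset D with a few random evaluations
X_obs = rng.uniform(-5, 5, size=(3, 1))
y_obs = f(X_obs).ravel()

gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(),
                              normalize_y=True, alpha=1e-6)
for _ in range(20):                                   # iterate T times
    gp.fit(X_obs, y_obs)                              # fit the GP surrogate M to D
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    # acquisition S: Expected Improvement over the best observation so far
    best = y_obs.max()
    z = (mu - best) / sigma
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)].reshape(1, 1)  # propose x_i
    y_next = f(x_next).ravel()                        # evaluate the model f
    X_obs = np.vstack([X_obs, x_next])                # add (x_i, y_i) to D
    y_obs = np.concatenate([y_obs, y_next])

print("best x:", X_obs[np.argmax(y_obs)][0], "best y:", y_obs.max())
```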
That covers the basic theory of Bayesian optimization. Many third-party libraries implement it, such as advisor, bayesian-optimization, Scikit-Optimize, and GPyOpt. This article uses GPyOpt and bayesian-optimization as examples.
```
pip install gpyopt
pip install bayesian-optimization
pip install scikit-optimize
```
GPyOpt example:
```python
import GPy
import GPyOpt
from GPyOpt.methods import BayesianOptimization
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from xgboost import XGBRegressor
import numpy as np

iris = load_iris()
X = iris.data
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=14)

# hyperparameter search space
bds = [{'name': 'learning_rate', 'type': 'continuous', 'domain': (0, 1)},
       {'name': 'gamma', 'type': 'continuous', 'domain': (0, 5)},
       {'name': 'max_depth', 'type': 'continuous', 'domain': (1, 50)}]

# Optimization objective: the model f
def cv_score(parameters):
    parameters = parameters[0]
    score = cross_val_score(
        XGBRegressor(learning_rate=parameters[0],
                     gamma=int(parameters[1]),
                     max_depth=int(parameters[2])),
        X, y, scoring='neg_mean_squared_error').mean()
    return np.array(score)

# acquisition_type selects the acquisition function
optimizer = GPyOpt.methods.BayesianOptimization(f=cv_score,              # function to optimize
                                                domain=bds,              # box-constraints of the problem
                                                acquisition_type='LCB',  # LCB acquisition
                                                acquisition_weight=0.1)  # exploration/exploitation trade-off
# run the optimization loop (the iteration budget here is an illustrative choice)
optimizer.run_optimization(max_iter=20)

# the score is neg-MSE (larger is better), so take the best observed point
x_best = optimizer.X[np.argmax(optimizer.Y)]
print("Best parameters: learning_rate=" + str(x_best[0]) + ",gamma=" + str(x_best[1]) + ",max_depth=" + str(x_best[2]))
# Best parameters: learning_rate=0.4272184438229706,gamma=1.4805727469635759,max_depth=41.8460390442754
```
bayesian-optimization example:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
from bayes_opt import BayesianOptimization

iris = load_iris()
X = iris.data
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=14)

# hyperparameter search space: parameter name -> (lower, upper) bounds
bds = {'learning_rate': (0, 1),
       'gamma': (0, 5),
       'max_depth': (1, 50)}

# Optimization objective
def cv_score(learning_rate, gamma, max_depth):
    return cross_val_score(
        XGBRegressor(learning_rate=learning_rate,
                     gamma=int(gamma),
                     max_depth=int(max_depth)),
        X, y, scoring='neg_mean_squared_error').mean()

rf_bo = BayesianOptimization(cv_score, bds)
rf_bo.maximize()
rf_bo.max
```
```
|   iter    |  target   |   gamma   | learni... | max_depth |
-------------------------------------------------------------
|  1        | -0.0907   |  0.7711   |  0.1819   |  20.33    |
|  2        | -0.1339   |  4.933    |  0.6599   |  8.972    |
|  3        | -0.07285  |  1.55     |  0.8247   |  33.94    |
|  4        | -0.1359   |  4.009    |  0.3994   |  25.55    |
|  5        | -0.08773  |  1.666    |  0.9551   |  48.67    |
|  6        | -0.05654  |  0.0398   |  0.3707   |  1.221    |
|  7        | -0.08425  |  0.6883   |  0.2564   |  33.25    |
|  8        | -0.1113   |  3.071    |  0.8913   |  1.051    |
|  9        | -0.9167   |  0.0      |  0.0      |  2.701    |
|  10       | -0.05267  |  0.0538   |  0.1293   |  1.32     |
|  11       | -0.08506  |  1.617    |  1.0      |  32.68    |
|  12       | -0.09036  |  2.483    |  0.2906   |  33.21    |
|  13       | -0.08969  |  0.4662   |  0.3612   |  34.74    |
|  14       | -0.0723   |  1.295    |  0.2061   |  1.043    |
|  15       | -0.07531  |  1.903    |  0.1182   |  35.11    |
|  16       | -0.08494  |  2.977    |  1.0      |  34.57    |
|  17       | -0.08506  |  1.231    |  1.0      |  36.05    |
|  18       | -0.07023  |  2.81     |  0.838    |  36.16    |
|  19       | -0.9167   |  1.94     |  0.0      |  36.99    |
|  20       | -0.09041  |  3.894    |  0.9442   |  35.52    |
|  21       | -0.1182   |  3.188    |  0.01882  |  35.14    |
|  22       | -0.08521  |  0.931    |  0.05693  |  31.66    |
|  23       | -0.1003   |  2.26     |  0.07555  |  31.78    |
|  24       | -0.1018   |  0.08563  |  0.9838   |  32.22    |
|  25       | -0.1017   |  0.8288   |  0.9947   |  30.57    |
|  26       | -0.9167   |  1.943    |  0.0      |  30.2     |
|  27       | -0.08506  |  1.518    |  1.0      |  35.04    |
|  28       | -0.08494  |  3.464    |  1.0      |  32.36    |
|  29       | -0.1224   |  4.296    |  0.4472   |  33.47    |
|  30       | -0.1017   |  0.0      |  1.0      |  35.86    |
=============================================================

{'target': -0.052665895082105285,
 'params': {'gamma': 0.05379782654053811,
            'learning_rate': 0.1292986176550608,
            'max_depth': 1.3198257775801387}}
```
Note that bayesian-optimization only supports maximization; if your score is better when smaller, negate it to turn the problem into a maximization.
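A minimal sketch, assuming a hypothetical cv_rmse helper that returns a cross-validated RMSE (smaller is better):

```python
# bayes_opt always maximizes the objective, so negate a smaller-is-better
# metric; cv_rmse is a hypothetical helper, not part of bayes_opt
def objective(learning_rate, gamma, max_depth):
    return -cv_rmse(learning_rate, gamma, max_depth)
```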
The two libraries' results are not entirely consistent, which may come down to differences in implementation and in the chosen acquisition algorithm, as well as in the initial dataset used to fit the Gaussian process.
5. Summary
This article introduced three hyperparameter optimization strategies; hopefully it helps. In brief:
- GridSearchCV: grid search. Given hyperparameters and candidate values, it enumerates every combination to find the best parameters. The candidate values are a prior you must choose, and not too many of them, or the number of combinations (and the runtime) blows up; narrowing the ranges heuristically helps.
- RandomizedSearchCV: random search. The values searched are not fixed; they are drawn at random from a given range or distribution.
- Bayesian optimization: uses a Gaussian process to search iteratively. Each iteration fits the Gaussian process to the results gathered so far and proposes parameters expected to beat the current best. The GPyOpt library is recommended.