AutoML: Ensemble! A Showdown of Automated Ensemble Learning: AutoKeras vs. Auto-Sklearn vs. TPOT vs. FLAML

Alan Wang

50 min read · Jan 24, 2022

Photo by Marjan Blan | @marjanblan on Unsplash

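AutoML (automated machine learning) keeps getting more popular, but did you know that ensemble learning can be automated too? In fact, some packages have been doing it for a while; you may just not have noticed.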

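What is AutoML, and what makes it distinctive? | 數位時代 BusinessNext

In mid-January this year, Google announced its automated machine learning tool (Cloud AutoML), sparking heated discussion across Taiwan's industry. What exactly does automated machine learning mean, and how might it affect Taiwanese industry?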

www.bnext.com.tw

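A book I worked on earlier was about AutoKeras and also briefly covered FLAML, another AutoML package. Around that time I tried Auto-Sklearn as well, which had been released earlier, but my first impression was not especially favorable.

After looking closer, I realized that Auto-Sklearn is itself built on ensemble learning, and that even FLAML and AutoKeras have some related features tucked away. For this post I also looked into TPOT, another older package that likewise touches on ensemble learning.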

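[Day 17] Ensemble Learning - iT 邦幫忙: helping each other solve problems and saving an IT person's day

Ensemble learning refers to systematically combining several supervised learning models, in the hope that many models together produce a stronger one. In many data science competitions, ensemble…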

ithelp.ithome.com.tw

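Both Auto-Sklearn and FLAML rely on scikit-learn under the hood, and scikit-learn itself ships a whole family of ensemble classes, from classifiers such as random forests and gradient tree boosting to bagging, voting, and stacking. But rather than laboriously assembling and tuning the models one by one yourself, letting an AutoML tool optimize this "strength in numbers" is an appealing alternative.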
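For a sense of what assembling the models yourself involves, here is a minimal hand-built sketch using scikit-learn's VotingClassifier on the breast cancer dataset used later in this post. The choice of three base models and of soft voting is my own illustration, not something any of the AutoML packages prescribes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

# Three different base models; soft voting averages their predicted probabilities.
voter = VotingClassifier(
    estimators=[
        ('lr', make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(random_state=42)),
    ],
    voting='soft')
voter.fit(x_train, y_train)

test_accuracy = voter.score(x_test, y_test)
print('Voting ensemble accuracy:', round(test_accuracy, 5))
```

Every base model here still runs with near-default hyperparameters; tuning each of them, and deciding which ones to keep, is the tedious part that the AutoML tools take over.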

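Today we look at how automated ensemble learning is used in several AutoML packages and compare their results on classification tasks with two datasets. To keep the comparison fair, all training in this post ran on Google Colab with a standard runtime.

Of course, one big issue with ensemble learning is that, in theory, it takes longer to train, and people rarely discuss what you get back for that time. Here I try to give each package roughly the same training time and see how far each one gets.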

Meet the Team

AutoKeras

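AutoKeras is an AutoML package for deep learning. It runs an efficient neural architecture search guided by Bayesian optimization over network morphisms; in other words, it tries to improve a neural network by adjusting its architecture. You specify the number of models to try (100 by default) and can choose the search strategy (greedy, Bayesian, random, or hyperband). My guess is that they are all essentially built on Bayesian optimization; greedy is best for improving a single model, whereas Bayesian and hyperband suit fast exploration.

On the surface AutoKeras has no ensemble learning feature, but you can get one by defining a multi-task model: put several small models inside one model, feed them the same input, then merge and filter all their outputs with fully connected layers (effectively another model). This can be seen as a form of bagging or stacking. AutoKeras models even come with their own categorical encoder, which saves a little effort. That said, given the compute that neural networks demand, training a dozen or so sub-models at once would probably be quite difficult.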

Auto-Sklearn

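Built on scikit-learn, it uses meta-learning to select models, Bayesian optimization to tune hyperparameters, and bagging ensemble selection to build the final model (50 models by default). Of the four packages in this post, it is the only one guaranteed to produce an ensemble model. You specify the total search time (3600 seconds by default), and by default each model gets 1/10 of the total time for training.

One annoyance is that it still depends on older versions of scikit-learn and some other packages, so installing it on your own machine easily causes conflicts with other packages.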
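To illustrate the idea behind ensemble selection, here is a simplified pure-Python sketch: models are added greedily, with replacement, choosing at each step whichever model gives the best validation accuracy for the averaged predictions, and a model's final weight is the fraction of slots it fills. The function name, the toy probabilities, and the use of accuracy as the metric are all assumptions for illustration; this is not Auto-Sklearn's actual implementation.

```python
from collections import Counter


def greedy_ensemble_selection(val_probs, y_val, ensemble_size):
    """Greedily fill `ensemble_size` slots with models (repeats allowed)."""
    n_samples = len(y_val)
    chosen = []                        # indices of the selected models
    running_sum = [0.0] * n_samples    # summed class-1 probabilities so far
    final_accuracy = 0.0

    def accuracy(prob_sum, n_members):
        # Average the probabilities, threshold at 0.5, compare with the labels.
        hits = sum((p / n_members >= 0.5) == bool(t)
                   for p, t in zip(prob_sum, y_val))
        return hits / n_samples

    for _ in range(ensemble_size):
        best_idx, best_acc = None, -1.0
        for idx, probs in enumerate(val_probs):
            candidate = [s + p for s, p in zip(running_sum, probs)]
            acc = accuracy(candidate, len(chosen) + 1)
            if acc > best_acc:
                best_idx, best_acc = idx, acc
        chosen.append(best_idx)
        running_sum = [s + p for s, p in zip(running_sum, val_probs[best_idx])]
        final_accuracy = best_acc

    weights = {idx: count / ensemble_size
               for idx, count in Counter(chosen).items()}
    return weights, final_accuracy


# Toy validation data: class-1 probabilities from three hypothetical models.
y_val = [1, 0, 1, 0, 1, 0, 1, 0]
val_probs = [
    [0.9, 0.2, 0.4, 0.1, 0.8, 0.6, 0.7, 0.3],  # model 0
    [0.6, 0.4, 0.7, 0.6, 0.3, 0.2, 0.9, 0.1],  # model 1
    [0.3, 0.7, 0.8, 0.2, 0.6, 0.4, 0.2, 0.6],  # model 2
]
weights, val_accuracy = greedy_ensemble_selection(
    val_probs, y_val, ensemble_size=10)
print(weights, val_accuracy)
```

Under this scheme, a model picked exactly once in an ensemble of size 50 would end up with a weight of 1/50 = 0.02.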

TPOT（Tree-based Pipeline Optimization Tool）

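Also built on scikit-learn, it focuses on using genetic programming and tree structures to produce the best-performing scikit-learn pipeline. After training, it can export this pipeline as a plain scikit-learn code file. The pipeline will quite likely use some ensemble method (for example a random forest or a boosting model, possibly combined with stacking). You specify how many generations to search (100 by default) and can also set factors such as the mutation rate and crossover rate.

The package also has a derived TPOT-NN option that brings PyTorch neural networks into training, but given the training time I will not use it here.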
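The evolutionary loop can be sketched with a toy example. Below, each individual is a list of three made-up hyperparameter values, and the fitness function is an arbitrary stand-in for a cross-validation score. Real TPOT evolves whole pipeline trees, so treat this purely as an illustration of selection, crossover, and mutation; every name in it is hypothetical.

```python
import random


def evolve(fitness, gene_choices, population_size=20, generations=20,
           mutation_rate=0.9, crossover_rate=0.1, seed=0):
    """Toy genetic search: keep the fitter half, refill with altered copies."""
    rng = random.Random(seed)
    population = [[rng.choice(choices) for choices in gene_choices]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Selection: the fitter half survives unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[:population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            child = list(rng.choice(survivors))
            roll = rng.random()
            if roll < crossover_rate:
                # Crossover: splice the tail of another survivor onto the child.
                other = rng.choice(survivors)
                cut = rng.randrange(1, len(child))
                child = child[:cut] + other[cut:]
            elif roll < crossover_rate + mutation_rate:
                # Mutation: resample one gene from its allowed values.
                position = rng.randrange(len(child))
                child[position] = rng.choice(gene_choices[position])
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


gene_choices = [
    [0.001, 0.01, 0.1, 1.0],  # hypothetical learning rates
    [1, 3, 5, 7, 9],          # hypothetical tree depths
    [50, 100, 200],           # hypothetical numbers of estimators
]


def toy_fitness(genes):
    # Stand-in for a CV score: best (zero) at learning_rate=0.1, depth=5, 100 trees.
    learning_rate, depth, n_estimators = genes
    return -(abs(learning_rate - 0.1) + abs(depth - 5)
             + abs(n_estimators - 100) / 100)


best = evolve(toy_fitness, gene_choices)
print(best, toy_fitness(best))
```

In this sketch, population_size and generations play the same role as the TPOT arguments of the same names: together they bound how many candidates get evaluated.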

FLAML（A Fast Library for Automated Machine Learning & Tuning）

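It used to be called A Fast and Lightweight AutoML, but it seems to have been renamed…

It uses new algorithms developed by Microsoft, Frugal Optimization and BlendSearch, to cut the time and compute cost of hyperparameter search. The focus is on fast modeling, so although it searches several model types, by default it returns only the best-performing one. If you set the ensemble argument to True, however, it uses scikit-learn's StackingClassifier, and you can optionally choose which model serves as the stacking ensemble's meta-model (final estimator).

Besides some of scikit-learn's built-in models, FLAML uses recently popular boosting ensemble models such as LightGBM and XGBoost, which can achieve quite good results even on their own. Like Auto-Sklearn, FLAML is constrained by training time (3600 seconds by default), and at the start of training it reports its estimates of the minimum necessary time and of the time needed for sufficient training. So if you have clear time constraints for modeling, FLAML is a handy choice.

FLAML also supports other tasks with different models: time-series forecasting with Prophet/ARIMA/SARIMAX, and NLP with Hugging Face transformers on PyTorch. Those are outside the scope of this post.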

The first dataset

Photo by National Cancer Institute on Unsplash

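Let's start small with scikit-learn's built-in breast cancer dataset (2 classes, 30 features, 569 samples). The code below loads the dataset and splits it into training and test sets: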

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

AutoKeras: one mind, many tasks

Install AutoKeras:

!pip3 install autokeras

To build a multi-task model we need the AutoModel class, which lets us define the model search space in a style similar to the Keras Functional API:

import autokeras as ak
import tensorflow as tf

input_node = ak.StructuredDataInput()  # input layer

model1 = ak.StructuredDataBlock()(input_node)  # model 1
model2 = ak.StructuredDataBlock()(input_node)  # model 2
model3 = ak.StructuredDataBlock()(input_node)  # model 3

output_node = ak.Merge()([model1, model2, model3])  # merge layer
output_node = ak.ClassificationHead()(output_node)  # classification head

clf = ak.AutoModel(
    inputs=input_node, outputs=output_node,
    max_trials=50, overwrite=True)  # search 50 models
clf.fit(
    x_train, y_train, 
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=5)])

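All three models simply reuse the StructuredDataBlock class, which is basically a few Dense (fully connected) layers, possibly with a categorical encoding layer (though this dataset has no text to encode) and a normalization layer. We let AutoKeras tune the three sub-models directly; their outputs are combined by the Merge class, either by add (element-wise addition) or by concatenate (joining the vectors end to end). The ClassificationHead class is a fully connected layer plus dropout, which filters the results of the three models.

The training output looks like this: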
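The difference between the two merge modes is easy to see on plain lists. This is only a sketch of the arithmetic with made-up branch outputs and hypothetical helper names; it does not call AutoKeras.

```python
def merge_add(branches):
    # Element-wise sum: the output keeps the width of a single branch.
    return [sum(values) for values in zip(*branches)]


def merge_concatenate(branches):
    # Concatenation: the output width is the sum of all branch widths.
    return [value for branch in branches for value in branch]


# Made-up outputs of three sub-models for a single sample.
branch_outputs = [[0.1, 0.9], [0.3, 0.7], [0.2, 0.8]]

added = merge_add(branch_outputs)
concatenated = merge_concatenate(branch_outputs)
print(len(added), len(concatenated))  # 2 6
```

With add, the branches must all have the same width and their individual contributions are blended together; with concatenate, the layer that follows can still weight each branch's outputs separately.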

Trial 50 Complete [00h 00m 14s]
val_loss: 0.06873828172683716

Best val_loss So Far: 0.038515619933605194
Total elapsed time: 00h 09m 42s
INFO:tensorflow:Oracle triggered exit
Epoch 1/36
15/15 [==============================] - 2s 6ms/step - loss: 0.3130 - accuracy: 0.8659
Epoch 2/36
15/15 [==============================] - 0s 5ms/step - loss: 0.1947 - accuracy: 0.9297
Epoch 3/36
15/15 [==============================] - 0s 5ms/step - loss: 0.1646 - accuracy: 0.9341
Epoch 4/36
15/15 [==============================] - 0s 6ms/step - loss: 0.1405 - accuracy: 0.9473
Epoch 5/36
15/15 [==============================] - 0s 6ms/step - loss: 0.1227 - accuracy: 0.9495
Epoch 6/36
15/15 [==============================] - 0s 5ms/step - loss: 0.1119 - accuracy: 0.9582
Epoch 7/36
15/15 [==============================] - 0s 6ms/step - loss: 0.1008 - accuracy: 0.9670
Epoch 8/36
15/15 [==============================] - 0s 5ms/step - loss: 0.0933 - accuracy: 0.9692
Epoch 9/36
15/15 [==============================] - 0s 5ms/step - loss: 0.0827 - accuracy: 0.9780
Epoch 10/36
15/15 [==============================] - 0s 5ms/step - loss: 0.0763 - accuracy: 0.9780
Epoch 11/36
15/15 [==============================] - 0s 5ms/step - loss: 0.0708 - accuracy: 0.9824
Epoch 12/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0701 - accuracy: 0.9780
Epoch 13/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0618 - accuracy: 0.9824
Epoch 14/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0605 - accuracy: 0.9824
Epoch 15/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0503 - accuracy: 0.9868
Epoch 16/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0439 - accuracy: 0.9846
Epoch 17/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0512 - accuracy: 0.9824
Epoch 18/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0413 - accuracy: 0.9890
Epoch 19/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0405 - accuracy: 0.9846
Epoch 20/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0311 - accuracy: 0.9890
Epoch 21/36
15/15 [==============================] - 0s 5ms/step - loss: 0.0319 - accuracy: 0.9912
Epoch 22/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0293 - accuracy: 0.9934
Epoch 23/36
15/15 [==============================] - 0s 5ms/step - loss: 0.0248 - accuracy: 0.9934
Epoch 24/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0218 - accuracy: 0.9912
Epoch 25/36
15/15 [==============================] - 0s 5ms/step - loss: 0.0216 - accuracy: 0.9890
Epoch 26/36
15/15 [==============================] - 0s 5ms/step - loss: 0.0202 - accuracy: 0.9978
Epoch 27/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0199 - accuracy: 0.9978
Epoch 28/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0153 - accuracy: 0.9978
Epoch 29/36
15/15 [==============================] - 0s 7ms/step - loss: 0.0118 - accuracy: 0.9978
Epoch 30/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0165 - accuracy: 0.9934
Epoch 31/36
15/15 [==============================] - 0s 5ms/step - loss: 0.0098 - accuracy: 1.0000
Epoch 32/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0084 - accuracy: 1.0000
Epoch 33/36
15/15 [==============================] - 0s 5ms/step - loss: 0.0077 - accuracy: 1.0000
Epoch 34/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0062 - accuracy: 1.0000
Epoch 35/36
15/15 [==============================] - 0s 5ms/step - loss: 0.0087 - accuracy: 0.9978
Epoch 36/36
15/15 [==============================] - 0s 6ms/step - loss: 0.0070 - accuracy: 1.0000
INFO:tensorflow:Assets written to: ./auto_model/best_model/assets
<tensorflow.python.keras.callbacks.History at 0x7fb54cb8f210>

Training took about 10 minutes. Let's plot the model architecture:

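from tensorflow.keras.utils import plot_model

model = clf.export_model()  # export the best model found as a Keras model
plot_model(model)

You can see that the three sub-models differ slightly, and that the merge method chosen was concatenate.

Now let's check how the model predicts on the test set: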

predicted = clf.predict(x_test).flatten()

from sklearn.metrics import mean_squared_error, accuracy_score, average_precision_score

print('Prection loss (MSE):', mean_squared_error(
    y_test, predicted).round(5))
print('Prection accuracy:', accuracy_score(
    y_test, predicted).round(5))
print('Prection PR AUC:', average_precision_score(
    y_test, predicted).round(5))

This gives:

4/4 [==============================] - 0s 4ms/step
Prection loss (MSE): 0.03509
Prection accuracy: 0.96491
Prection PR AUC: 0.962
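A side note on these numbers: because both the labels and the predictions are hard 0/1 values, each squared error is either 0 or 1, so the MSE is simply the misclassification rate, that is, 1 minus the accuracy. The small sketch below checks this on made-up labels; the figures above are consistent with 4 misclassified samples out of 114.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up hard labels: 2 of the 8 predictions are wrong.
y_true = [0, 1, 1, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

toy_mse = mse(y_true, y_pred)
toy_accuracy = accuracy(y_true, y_pred)
print(toy_mse, toy_accuracy, toy_mse + toy_accuracy)  # 0.25 0.75 1.0

# The same arithmetic for 4 errors among 114 test samples.
print(round(4 / 114, 5), round(110 / 114, 5))  # 0.03509 0.96491
```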

Next, look at the prediction quality for each class:

from sklearn.metrics import classification_report

print(classification_report(
    y_test, predicted, target_names=('Malignant', 'Benign')))

This produces:

              precision    recall  f1-score   support

   Malignant       0.95      0.95      0.95        43
      Benign       0.97      0.97      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.96      0.96       114
weighted avg       0.96      0.96      0.96       114

Auto-Sklearn: open auditions

Install Auto-Sklearn:

!pip3 install auto-sklearn

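After installing on Colab you will see error messages about incompatible packages, but restarting the runtime is enough to fix that.

Auto-Sklearn offers two classifiers; we will use the experimental Auto-Sklearn 2.0 classifier:

from autosklearn.experimental.askl2 import AutoSklearn2Classifier

clf = AutoSklearn2Classifier(time_left_for_this_task=600)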
clf.fit(x_train, y_train)

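This trains for 600 seconds; by default each model gets one tenth of that (60 seconds), or you can set the limit with the per_run_time_limit argument.

One problem with Auto-Sklearn is that it prints very little. Only after training finishes can we view its model leaderboard:

clf.leaderboard()

As you can see, Auto-Sklearn lists the best-performing models, such as the Online Passive-Aggressive, Multi-layer Perceptron, and Gradient Boosting classifiers, along with their weights in the ensemble. You can inspect the contents of the final ensemble like this: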

print(clf.show_models())

The text output is very long, but the beginning of each entry looks like this:

[(0.020000, SimpleClassificationPipeline({'balancing:strategy': 'none', 'classifier:__choice__': 'mlp', ...)
(0.020000, SimpleClassificationPipeline({'balancing:strategy': 'none', 'classifier:__choice__': 'mlp', ...)
(0.020000, SimpleClassificationPipeline({'balancing:strategy': 'weighting', 'classifier:__choice__': 'passive_aggressive', ...)
(0.020000, SimpleClassificationPipeline({'balancing:strategy': 'weighting', 'classifier:__choice__': 'mlp', ...)
(0.020000, SimpleClassificationPipeline({'balancing:strategy': 'none', 'classifier:__choice__': 'gradient_boosting', ...)
(0.020000, SimpleClassificationPipeline({'balancing:strategy': 'none', 'classifier:__choice__': 'random_forest', ...)
(0.020000, SimpleClassificationPipeline({'balancing:strategy': 'none', 'classifier:__choice__': 'sgd', ...)
(0.020000, SimpleClassificationPipeline({'balancing:strategy': 'weighting', 'classifier:__choice__': 'extra_trees', ...)]

The first number is the weight, followed by the model name. There may in fact be more entries than these, since the default ensemble size is 50 (set with the ensemble_size argument of AutoSklearn2Classifier).

The model's predictions on the test set are as follows (the code is the same as before):

Prection loss (MSE): 0.00877
Prection accuracy: 0.99123
Prection PR AUC: 0.98611

              precision    recall  f1-score   support

   Malignant       1.00      0.98      0.99        43
      Benign       0.99      1.00      0.99        71

    accuracy                           0.99       114
   macro avg       0.99      0.99      0.99       114
weighted avg       0.99      0.99      0.99       114

TPOT: breeding a champion

Install the package:

!pip3 install tpot

Below we have TPOT run 20 generations, with 20 models in each generation:

from tpot import TPOTClassifier

clf = TPOTClassifier(
    population_size=20, generations=20, verbosity=2)
clf.fit(x_train, y_train)

The training output is:

Generation 1 - Current best internal CV score: 0.9758241758241759
Generation 2 - Current best internal CV score: 0.9758241758241759
Generation 3 - Current best internal CV score: 0.9758241758241759
Generation 4 - Current best internal CV score: 0.9824175824175825
Generation 5 - Current best internal CV score: 0.9824175824175825
Generation 6 - Current best internal CV score: 0.9824175824175825
Generation 7 - Current best internal CV score: 0.9824175824175825
Generation 8 - Current best internal CV score: 0.9846153846153847
Generation 9 - Current best internal CV score: 0.9846153846153847
Generation 10 - Current best internal CV score: 0.9846153846153847
Generation 11 - Current best internal CV score: 0.9868131868131869
Generation 12 - Current best internal CV score: 0.9868131868131869
Generation 13 - Current best internal CV score: 0.9868131868131869
Generation 14 - Current best internal CV score: 0.9868131868131869
Generation 15 - Current best internal CV score: 0.9868131868131869
Generation 16 - Current best internal CV score: 0.989010989010989
Generation 17 - Current best internal CV score: 0.989010989010989
Generation 18 - Current best internal CV score: 0.989010989010989
Generation 19 - Current best internal CV score: 0.989010989010989
Generation 20 - Current best internal CV score: 0.989010989010989

Best pipeline: MLPClassifier(SelectFwe(StandardScaler(XGBClassifier(input_matrix, learning_rate=0.001, max_depth=5, min_child_weight=20, n_estimators=100, n_jobs=1, subsample=0.6000000000000001, verbosity=0)), alpha=0.042), alpha=0.01, learning_rate_init=0.001)
TPOTClassifier(generations=20, population_size=20, verbosity=2)

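The whole process took a little over 10 minutes. As mentioned earlier, TPOT's main purpose is to produce an optimized scikit-learn pipeline. The fitted model can export it as a .py file: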

clf.export('./tpot_pipeline.py')

We can read the file to see what's inside:

with open('tpot_pipeline.py') as f:
    for line in f:
        print(line, end='')

This prints:

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFwe, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler
from tpot.builtins import StackingEstimator
from xgboost import XGBClassifier

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=None)

# Average CV score on the training set was: 0.989010989010989
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=XGBClassifier(learning_rate=0.001, max_depth=5, min_child_weight=20, n_estimators=100, n_jobs=1, subsample=0.6000000000000001, verbosity=0)),
    StandardScaler(),
    SelectFwe(score_func=f_classif, alpha=0.042),
    MLPClassifier(alpha=0.01, learning_rate_init=0.001)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

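Looking closely, TPOT has generated some data-loading boilerplate (you must replace the CSV path yourself) along with the model pipeline. The pipeline uses stacking: an XGBoost classifier with 100 boosted trees is wrapped in a StackingEstimator so that its predictions are passed along as extra features, and after scaling and feature selection a multi-layer perceptron acts as the meta-model. This also shows how much modeling code an AutoML package can save us.

This model's predictions on the test set: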
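To make the idea of passing predictions along as extra features concrete, here is a rough scikit-learn-only sketch. The PredictionAsFeature class is a hypothetical stand-in written for illustration (it is not TPOT's StackingEstimator), and GradientBoostingClassifier stands in for XGBClassifier so that the example needs nothing beyond scikit-learn and NumPy.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class PredictionAsFeature(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: append an estimator's class probabilities to X."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # Fit a fresh copy so the estimator passed in stays untouched.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        return np.hstack([X, self.estimator_.predict_proba(X)])


x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

pipeline = make_pipeline(
    PredictionAsFeature(GradientBoostingClassifier(random_state=42)),
    StandardScaler(),
    MLPClassifier(max_iter=1000, random_state=42),  # plays the meta-model role
)
pipeline.fit(x_train, y_train)

augmented = pipeline.steps[0][1].transform(x_test)
print(augmented.shape)  # (114, 32): 30 original features + 2 probability columns
print(pipeline.score(x_test, y_test))
```

The final classifier therefore sees both the raw features and the first model's opinion, which is what makes this arrangement a form of stacking rather than a plain preprocessing chain.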

Prection loss (MSE): 0.03509
Prection accuracy: 0.96491
Prection PR AUC: 0.95417

              precision    recall  f1-score   support

   Malignant       0.98      0.93      0.95        43
      Benign       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114

FLAML: master of time

Install the package:

!pip3 install flaml

When training a FLAML model, remember to set the ensemble argument to True to enable ensemble learning:

from flaml import AutoML

clf = AutoML()
clf.fit(x_train, y_train, task='classification',
    ensemble=True, time_budget=600)

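Training time is again 600 seconds. Unlike Auto-Sklearn, FLAML spends the whole budget training multiple models, running for as long as it is given. It also tells you up front its estimates of the necessary and sufficient time budgets, and at the end of training it reports how long it took to find the current best model.

FLAML's model list is predetermined, but you can pass a list to the estimator_list argument to choose your own (see the official documentation for model names; note that catboost, i.e. the CatBoost model, must be installed separately). Be aware that some models may raise errors when ensemble learning is enabled.

The training log: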

[flaml.automl: 01-23 17:40:52] {2007} INFO - task = classification
[flaml.automl: 01-23 17:40:52] {2009} INFO - Data split method: stratified
[flaml.automl: 01-23 17:40:52] {2013} INFO - Evaluation method: cv
[flaml.automl: 01-23 17:40:52] {2113} INFO - Minimizing error metric: 1-roc_auc
[flaml.automl: 01-23 17:40:52] {2170} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl: 01-23 17:40:52] {2437} INFO - iteration 0, current learner lgbm
[flaml.automl: 01-23 17:40:52] {2551} INFO - Estimated sufficient time budget=527s. Estimated necessary time budget=12s.

...(omitted)...

[flaml.automl: 01-23 17:50:51] {2603} INFO -  at 599.6s, estimator lgbm's best error=0.0065, best estimator xgb_limitdepth's best error=0.0058
[flaml.automl: 01-23 17:50:51] {2437} INFO - iteration 1445, current learner xgb_limitdepth
[flaml.automl: 01-23 17:50:51] {2603} INFO -  at 599.7s, estimator xgb_limitdepth's best error=0.0058, best estimator xgb_limitdepth's best error=0.0058
[flaml.automl: 01-23 17:50:51] {2437} INFO - iteration 1446, current learner lgbm
[flaml.automl: 01-23 17:50:51] {2603} INFO -  at 599.8s, estimator lgbm's best error=0.0065, best estimator xgb_limitdepth's best error=0.0058
[flaml.automl: 01-23 17:50:51] {2437} INFO - iteration 1447, current learner xgb_limitdepth
[flaml.automl: 01-23 17:50:51] {2603} INFO -  at 599.9s, estimator xgb_limitdepth's best error=0.0058, best estimator xgb_limitdepth's best error=0.0058

...(omitted)...

[flaml.automl: 01-23 17:50:55] {2767} INFO - ensemble: StackingClassifier(estimators=[('xgb_limitdepth',
    <flaml.model.XGBoostLimitDepthEstimator object at 0x7f1814b2bfd0>),
    ('lgbm', <flaml.model.LGBMEstimator object at 0x7f1814af4810>),
    ('extra_tree', <flaml.model.ExtraTreesEstimator object at 0x7f1814adf650>),
    ('xgboost', <flaml.model.XGBoostSklearnEstimator object at 0x7f1814a59a10>),
    ('rf', <flaml.model.RandomForestEstimator object at 0x7f1814a664d0>)],
    n_jobs=-1, passthrough=True)
[flaml.automl: 01-23 17:50:55] {2199} INFO - fit succeeded
[flaml.automl: 01-23 17:50:55] {2201} INFO - Time taken to find the best model: 150.72784996032715

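At the start of training FLAML lists the models it will use (such as LightGBM, XGBoost, random forest, and Extra Trees) and its estimated training time; at the end it shows the time taken to find the best model and the final best model. Here, stacking gathers the models above that performed well enough, and the default meta-model of StackingClassifier is logistic regression. You can also read the clf.model attribute to inspect this model again.

Later we will see how to specify the meta-model FLAML uses for ensembling.

The model's predictions: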

Prection loss (MSE): 0.02632
Prection accuracy: 0.97368
Prection PR AUC: 0.9673

              precision    recall  f1-score   support

   Malignant       0.98      0.95      0.96        43
      Benign       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

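The second dataset

Next, let's go bigger with a credit card transaction dataset that records more than 280,000 European credit card transactions over two days in September 2013: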

Credit Card Fraud Detection

Anonymized credit card transactions labeled as fraudulent or genuine

www.kaggle.com

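Besides the time, the class (0 = normal transaction, 1 = fraudulent transaction), and the amount, the dataset contains 28 features obtained through PCA whose meanings are not disclosed. Notably, class 1 makes up only 0.172% of the data. This extreme imbalance makes it harder for a model to learn the characteristics of class 1, so we also get to see how each package copes with this kind of dataset.

First upload the dataset, load it, and drop the time column:

import pandas as pd

df = pd.read_csv('./creditcard.csv')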
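A quick bit of arithmetic shows why accuracy alone says little on data this skewed, and why the per-class precision, recall, and f1-score matter more. The 0.172% share comes from the text above; the confusion counts in the second half are made-up numbers for illustration, not results from any of the packages.

```python
fraud_rate = 0.00172  # share of class 1 quoted above (0.172%)

# A "model" that always predicts class 0 is right on every normal transaction.
always_normal_accuracy = 1 - fraud_rate
print(round(always_normal_accuracy, 5))  # 0.99828


def precision_recall_f1(tp, fp, fn):
    """Class-1 metrics from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Always predicting "normal" on a hypothetical test set with 98 frauds:
print(precision_recall_f1(tp=0, fp=0, fn=98))  # (0.0, 0.0, 0.0)

# A hypothetical classifier that catches 80 of the 98 frauds with 5 false alarms:
print(precision_recall_f1(tp=80, fp=5, fn=18))
```

So a model can score above 99.8% accuracy while being useless for fraud detection, which is why the comparisons below focus on the class 1 rows.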
df.pop('Time')
df.info()

The output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 30 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   V1      284807 non-null  float64
 1   V2      284807 non-null  float64
 2   V3      284807 non-null  float64
 3   V4      284807 non-null  float64
 4   V5      284807 non-null  float64
 5   V6      284807 non-null  float64
 6   V7      284807 non-null  float64
 7   V8      284807 non-null  float64
 8   V9      284807 non-null  float64
 9   V10     284807 non-null  float64
 10  V11     284807 non-null  float64
 11  V12     284807 non-null  float64
 12  V13     284807 non-null  float64
 13  V14     284807 non-null  float64
 14  V15     284807 non-null  float64
 15  V16     284807 non-null  float64
 16  V17     284807 non-null  float64
 17  V18     284807 non-null  float64
 18  V19     284807 non-null  float64
 19  V20     284807 non-null  float64
 20  V21     284807 non-null  float64
 21  V22     284807 non-null  float64
 22  V23     284807 non-null  float64
 23  V24     284807 non-null  float64
 24  V25     284807 non-null  float64
 25  V26     284807 non-null  float64
 26  V27     284807 non-null  float64
 27  V28     284807 non-null  float64
 28  Amount  284807 non-null  float64
 29  Class   284807 non-null  int64  
dtypes: float64(29), int64(1)
memory usage: 65.2 MB

Then split the dataset into training and test sets:

train, test = train_test_split(df, test_size=0.2, random_state=42)

x_train = train.drop(['Class'], axis=1)
y_train = train.Class
x_test = test.drop(['Class'], axis=1)
y_test = test.Class

Results on the second dataset

This time I tried to give each model roughly half an hour of training:

AutoKeras

Here I changed how the model is written: each sub-model consists of a normalization layer and fully connected layers, each with a different fixed dropout rate (0%, 25%, 50%), so they are guaranteed to differ somewhat. With the larger dataset, though, trying 10 models already takes a bit over half an hour:

from keras_tuner.engine.hyperparameters import Choice

input_node = ak.StructuredDataInput()

model1 = ak.Normalization()(input_node)
model1 = ak.DenseBlock(
    dropout=Choice(name='dropout', values=[0])
    )(model1)

model2 = ak.Normalization()(input_node)
model2 = ak.DenseBlock(
    dropout=Choice(name='dropout', values=[0.25])
    )(model2)

model3 = ak.Normalization()(input_node)
model3 = ak.DenseBlock(
    dropout=Choice(name='dropout', values=[0.5])
    )(model3)

output_node = ak.Merge()([model1, model2, model3])
output_node = ak.ClassificationHead()(output_node)

clf = ak.AutoModel(
    inputs=input_node, outputs=output_node, 
    max_trials=10, overwrite=True)
clf.fit(
    x_train, y_train, 
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=5)])

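The trained model looks like this:

This time the first model (left) happens to have three Dense layers, the second has two, and the third has one. I left out the layer shapes; otherwise the text in each layer would be too small to read XD

The test-set evaluation for the second dataset is the same as before, so I won't repeat the code. The predictions: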

Prection loss (MSE): 0.00058
Prection accuracy: 0.99942
Prection PR AUC: 0.68014

              precision    recall  f1-score   support

      Normal       1.00      1.00      1.00     56864
       Fraud       0.87      0.79      0.82        98

    accuracy                           1.00     56962
   macro avg       0.93      0.89      0.91     56962
weighted avg       1.00      1.00      1.00     56962

Auto-Sklearn

For Auto-Sklearn, apart from setting time_left_for_this_task to 1800 seconds, the calls are unchanged:

clf = AutoSklearn2Classifier(time_left_for_this_task=1800)
clf.fit(x_train, y_train)

The model leaderboard it produced:

The predictions:

Prection loss (MSE): 0.0004
Prection accuracy: 0.9996
Prection PR AUC: 0.76761

              precision    recall  f1-score   support

      Normal       1.00      1.00      1.00     56864
       Fraud       0.95      0.81      0.87        98

    accuracy                           1.00     56962
   macro avg       0.98      0.90      0.94     56962
weighted avg       1.00      1.00      1.00     56962

TPOT

Likewise, for TPOT only the population_size and generations arguments changed. Even so, the package was extremely slow on this dataset and ended up taking almost 2 hours:

from tpot import TPOTClassifier

clf = TPOTClassifier(
    population_size=5, generations=5, verbosity=2)
clf.fit(x_train, y_train)

The training output:

Optimization Progress: 42/? [1:47:39<00:00, 164.51s/pipeline]
Generation 1 - Current best internal CV score: 0.9994996598564814
Generation 2 - Current best internal CV score: 0.9994996598564814
Generation 3 - Current best internal CV score: 0.9994996598564814
Generation 4 - Current best internal CV score: 0.9994996598564814
Generation 5 - Current best internal CV score: 0.9994996598564814

Best pipeline: ExtraTreesClassifier(input_matrix, bootstrap=False, criterion=gini, max_features=0.8500000000000001, min_samples_leaf=11, min_samples_split=16, n_estimators=100)
TPOTClassifier(generations=5, population_size=5, verbosity=2)

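This time the resulting pipeline does not use stacking; it is a single extremely randomized trees (Extra Trees) classifier containing 100 trees, which still counts as ensemble learning.

The model's predictions: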
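To see why a single Extra Trees classifier still counts as an ensemble, the sketch below fits one with the hyperparameters reported above and checks that its predicted probabilities are just the average over its individual trees. The breast cancer data is used here only as a small stand-in for the credit card dataset, and random_state is my own addition for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier

x, y = load_breast_cancer(return_X_y=True)  # small stand-in dataset

forest = ExtraTreesClassifier(
    bootstrap=False, criterion='gini', max_features=0.85,
    min_samples_leaf=11, min_samples_split=16, n_estimators=100,
    random_state=42)
forest.fit(x, y)

n_trees = len(forest.estimators_)
print(n_trees)  # 100

# Average the per-tree probabilities by hand and compare with the forest.
tree_average = np.mean(
    [tree.predict_proba(x) for tree in forest.estimators_], axis=0)
matches_forest = np.allclose(tree_average, forest.predict_proba(x))
print(matches_forest)  # True
```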

Prection loss (MSE): 0.00056
Prection accuracy: 0.99944
Prection PR AUC: 0.68185

              precision    recall  f1-score   support

      Normal       1.00      1.00      1.00     56864
       Fraud       0.90      0.76      0.82        98

    accuracy                           1.00     56962
   macro avg       0.95      0.88      0.91     56962
weighted avg       1.00      1.00      1.00     56962

FLAML

I made two attempts with FLAML. The first simply changed the training time to 1800 seconds:

clf = AutoML()
clf.fit(x_train, y_train, task='classification', 
        ensemble=True, time_budget=1800)

The resulting final model:

StackingClassifier(estimators=[
  ('lgbm', <flaml.model.LGBMEstimator object at 0x7f1815977790>),
  ('xgb_limitdepth', <flaml.model.XGBoostLimitDepthEstimator object at 0x7f1815a00f90>)],
  n_jobs=-1, passthrough=True)

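Although four or five models were being trained, the stacking ensemble included only the LightGBM and XGBoost models (the latter, judging by its name, has a tree depth limit). Its predictions: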

Prection loss (MSE): 0.00077
Prection accuracy: 0.99923
Prection PR AUC: 0.56311

              precision    recall  f1-score   support

      Normal       1.00      1.00      1.00     56864
       Fraud       0.88      0.64      0.74        98

    accuracy                           1.00     56962
   macro avg       0.94      0.82      0.87     56962
weighted avg       1.00      1.00      1.00     56962

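Interestingly, of the four packages FLAML is the worst at predicting class 1 (as measured by the f1-score and PR AUC).

The official documentation mentions that you can pass weights to fit() through the sample_weight argument (set by hand or generated with compute_sample_weight) to make the model pay more attention to class 1. But there are two problems:

StackingClassifier does not pass the sample_weight argument on to the underlying models (so an error is raised once training finishes);
Even without ensemble learning, the weights still need manual tuning; otherwise, even with automatically computed weights, the model is far too eager to flag normal transactions as fraud.

FLAML does not seem to have a good solution for this at the moment, so I tried another approach: manually specifying the meta-model of the stacking ensemble. Remember the MLP that TPOT used earlier?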
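For reference, this is roughly what automatically computed weights look like with scikit-learn's compute_sample_weight. The label array below is made up (998 normal samples and 2 frauds); the sketch only shows how the 'balanced' option scales the weights and does not involve FLAML at all.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Made-up, heavily imbalanced labels: 998 normal (0) and 2 fraud (1).
y_toy = np.array([0] * 998 + [1] * 2)

# 'balanced' assigns n_samples / (n_classes * class_count) to each sample.
weights = compute_sample_weight(class_weight='balanced', y=y_toy)

print(weights[0], weights[-1])  # weight of a normal sample, of a fraud sample
print(weights[y_toy == 0].sum(), weights[y_toy == 1].sum())  # per-class totals
```

Each fraud sample here weighs roughly 500 times as much as a normal one, so both classes end up with the same total weight. Weights this aggressive are one plausible reason a model would start flagging normal transactions as fraud unless they are toned down by hand.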

from sklearn.neural_network import MLPClassifier

clf = AutoML()
clf.fit(x_train, y_train, task='classification', 
        estimator_list=[
            'lgbm', 'xgb_limitdepth', 'rf', 'extra_tree'
        ],
        ensemble={
            'final_estimator': MLPClassifier(),
            'passthrough': False,
        }, 
        time_budget=1800)

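The passthrough argument controls whether the original features are passed to the meta-model; when it is False, only the predictions of the base models are passed on. I also used estimator_list to select a few of the models that had performed well earlier, although this made no difference to the result (the stacking model still contained only LightGBM and XGBoost-limitdepth).

Checking the predictions again: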
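The effect of passthrough can be checked directly with scikit-learn's StackingClassifier, which is what FLAML builds on here. This sketch uses the breast cancer dataset and two scikit-learn tree ensembles as stand-ins for the FLAML estimators; the build_stack helper is a hypothetical name of mine.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)


def build_stack(passthrough):
    # Two base models feeding an MLP meta-model (final estimator).
    return StackingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
            ('extra_tree', ExtraTreesClassifier(n_estimators=50, random_state=0)),
        ],
        final_estimator=MLPClassifier(max_iter=1000, random_state=0),
        passthrough=passthrough)


narrow = build_stack(passthrough=False).fit(x_train, y_train)
wide = build_stack(passthrough=True).fit(x_train, y_train)

# transform() returns exactly what the meta-model receives as input.
narrow_inputs = narrow.transform(x_test)
wide_inputs = wide.transform(x_test)
print(narrow_inputs.shape[1])  # 2: one probability column per base model
print(wide_inputs.shape[1])    # 32: those 2 columns plus the 30 raw features
```

With passthrough=False the meta-model only has to learn how to combine the base models' opinions, which is a much smaller problem than re-learning from all the raw features.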

Prection loss (MSE): 0.00042
Prection accuracy: 0.99958
Prection PR AUC: 0.75847

              precision    recall  f1-score   support

      Normal       1.00      1.00      1.00     56864
       Fraud       0.94      0.81      0.87        98

    accuracy                           1.00     56962
   macro avg       0.97      0.90      0.93     56962
weighted avg       1.00      1.00      1.00     56962

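This time the ability to predict class 1 improved markedly!

Conclusion

On the classification tasks for these two datasets, Auto-Sklearn exceeded my expectations: its ensemble models clearly performed best on both, although most of the other packages were not far behind. On the imbalanced data, AutoKeras and Auto-Sklearn did better (though FLAML clawed one back after its meta-model was changed).

These packages all include some data preprocessing to varying degrees; some try to balance the data, impute missing values, normalize, and so on, but you may need to spend time with the documentation and source code to learn the details. They are also all still under development to some extent, so their performance may well change in the future.

In my own experience, using AutoKeras for ensemble learning on structured-data tasks rarely beats a single model by much, while in its strongest areas, images and NLP, some models are simply far stronger than the rest. In my book I predicted this post's second dataset with a single neural network, and it was actually slightly more accurate than the three sub-models above. Likewise, my ResNet + Xception combination could not beat a single EfficientNet, and several text models combined with several vocabulary models did no better than a single BERT.

Also, in FLAML for instance, if you notice that a few models converge especially well, training with only those (so that they get a larger share of the compute) may give better results. For example, LightGBM or XGBoost trained on their own for a regression task can often easily beat a whole range of scikit-learn models within the same time, leaving even Auto-Sklearn's ensembles far behind. So pulling in every model is not always better; you should pick the models best suited to the situation at hand.

And today, AutoML may well do this better than most people, with models built in just a few lines of code as shown above. The rest depends on how much time you are willing to give it.

In any case, here is a short summary of my impressions of the four packages:

AutoKeras: can train powerful neural networks, but the time required is hard to estimate and the hardware demands are higher. Even on simpler datasets, fully connected models need many trials before they converge clearly, and multiple sub-models are not necessarily better than a tuned single model. The package's strengths remain in areas such as images and NLP.
Auto-Sklearn: lets you set a time limit, and the ensemble models it produces predict extremely well, but training may need to run a little longer to make sure enough models converge (this may take several attempts). Training prints almost nothing, so progress is hard to check, and its dependencies are older and slow to update, which makes local installation a hassle.
TPOT: genetic programming is fun, and it does estimate the run time, but frankly neither its results nor its run time are the best, unless you really need an optimized model pipeline and reproducible code for a scikit-learn-only environment.
FLAML: lets you set a time limit, fits quickly, tells you up front roughly how much training time the model needs, and gives rich feedback. Its ensemble feature is much simpler than Auto-Sklearn's and includes fewer models (although models like LightGBM are available), so its strength still lies in finding the single best model.

Those interested in digging deeper can refer to: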

API documentation

Papers behind the packages
