Model Stacking: The Swiss Army Knife for Improving Accuracy

聞浩凱 Hao-Kai Wen
6 min read · Feb 6, 2024


Model stacking is an umbrella term for widely used techniques that combine multiple heterogeneous models to improve accuracy. It is especially suited to use cases where accuracy is paramount, and it appears frequently in Kaggle competitions. For example, in an age-prediction problem, a CNN and a Transformer may reach similar overall accuracy yet perform differently on different data; fusing the outputs of these models compensates for the weaknesses of any single model type and yields higher accuracy.

Note that any technique matching the description above counts as model stacking, so implementations vary widely, and most raise data-leakage concerns: when holdout-set data is used during training, accuracy is overestimated and the model overfits. This article therefore presents one of the most practical and widely used implementations, one that also eliminates the data-leakage concern.

Terminology

The original data is split into a training set and a holdout set.

  • At Level i, the user defines Mi models.
  • K-fold cross-validation is used.
  • With each successive level, the feature dimension of every example in the training set and holdout set grows longer, while the number of examples never changes.

Characteristics of the training set

  • Depending on the training method, it can be further split into sub-training, sub-validation, and sub-test sets.
  • Ground truth is always available.

Characteristics of the holdout set

  • It may be real-world data that arrives after the model is deployed.
  • It may be data split off in advance to evaluate model performance.
  • Ground truth may be absent.

Overview of the training process

Train the M0 Level-0 models on the training set. Then append each model's predictions (e.g., rain probability or a rating score) and features (e.g., the flattened last layer of a deep-learning model) to the feature dimension of every example in the training set.
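The augmentation step can be sketched as follows. This is a minimal illustration with numpy only; the two lambdas are hypothetical stand-ins for trained Level-0 models, and the shapes are invented for the example. (It also skips the fold-based mechanism the article describes later, which is what makes this step leakage-free in practice.)

```python
import numpy as np

# Hypothetical setup: 8 training examples with 3 original features.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))

# Stand-ins for two trained Level-0 models: each maps a batch of rows
# to one prediction per row (e.g., an age estimate).
model_a = lambda rows: rows @ np.array([1.0, 0.5, -0.2])
model_b = lambda rows: rows.mean(axis=1)

# Append each model's prediction as a new column: the number of rows
# stays the same, the feature dimension grows by one per model.
preds = np.column_stack([model_a(X), model_b(X)])
X_level1 = np.hstack([X, preds])

print(X_level1.shape)  # (8, 5): 3 original features + 2 predictions
```

The same pattern extends to appending flattened feature maps: they simply contribute more columns than a scalar prediction does.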

The feature dimension grows longer at every level.

Level 1 works the same way, except its input training set now carries the previous level's predictions and features. The predictions must be included, which yields a residual-prediction effect; the features are optional, depending on the model type.

Level 2 works the same way but is the final level. As a special case, it produces no new training set; it outputs the predictions, from which the error is computed.

To make this concrete, consider an age-prediction example with two levels in total: Level 0 uses pre-trained VGG16 and ResNet50, and their predictions and flattened feature maps are placed into the next level's training set. For simplicity, we set aside the question of fusing the ResNet50 features with the upper level's predictions, and the second level uses only XGBoost.
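A toy end-to-end version of a two-level stack can be sketched with plain numpy. Least-squares fits stand in for the real base models and for XGBoost (all names and shapes here are invented for the illustration); each Level-0 model deliberately sees only half the features, so each is weak in a different way, mirroring the CNN-vs-Transformer situation above. Note the combiner is fit in-sample for brevity; the fold mechanism described later in the article is what removes the leakage this shortcut introduces.

```python
import numpy as np

# Synthetic regression task: 20 samples, 4 features, known linear target.
rng = np.random.default_rng(3)
X = rng.normal(size=(20, 4))
y = X @ np.array([3.0, 0.0, -2.0, 1.0])

# Level 0: two weak regressors, each seeing only half the features.
w_a, *_ = np.linalg.lstsq(X[:, :2], y, rcond=None)
w_b, *_ = np.linalg.lstsq(X[:, 2:], y, rcond=None)
preds = np.column_stack([X[:, :2] @ w_a, X[:, 2:] @ w_b])

# Level 1: the combiner is fit on the Level-0 predictions alone.
w_c, *_ = np.linalg.lstsq(preds, y, rcond=None)
final = preds @ w_c

# The stacked prediction is never worse (in-sample, squared error) than
# either base model, since the combiner could always pick one of them.
mse = lambda p: float(((p - y) ** 2).mean())
print(mse(final) <= mse(preds[:, 0]), mse(final) <= mse(preds[:, 1]))
# prints: True True
```

The last comparison holds by construction: the least-squares combiner optimizes over all linear mixes of the base predictions, including the mix that reproduces either base model exactly.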

Repeatedly training on the same training set and putting the predictions back into it — doesn't that leak data?

No. A mechanism guarantees that the columns added to example A of the training set never come from a model that has seen example A, which prevents leakage. The mechanism is described below.

Overview of the inference process

With every level's models already trained:

Use the Level-0 models to predict on the holdout set, obtain predictions and features, and append them to the holdout set.

Level 1 works the same way, producing a new holdout set.

Level 2 averages its models' outputs to produce the final prediction.

Putting data back into the holdout set — doesn't that leak data?

No. The models at Levels 0, 1, and 2 were never trained on the holdout set.

Training mechanism from Level L to L+1

So far I haven't said how the individual models are trained, so let's look at that now, using K-fold cross-validation with K=2 as the example:

Fold 1: train the M models on Split 2 and predict on Split 1. Append the predictions and features to the feature dimension of Split 1.

Fold 2 works the same way.

Now both Split 1 and Split 2 have new columns, yet Split 1 was never used to train the Fold 1 models, and Split 2 was never used to train the Fold 2 models. This is the leakage-avoidance mechanism mentioned earlier.

Reshuffle the data, and it becomes the new training set.
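The K=2 fold mechanism above can be sketched in a few lines. Least-squares fits serve as hypothetical stand-in models (any trainable model plugs in the same way), and the data is synthetic; the key point is that each sample's new column is produced by a model that never trained on that sample.

```python
import numpy as np

# Synthetic training set: 10 samples, 3 features, noisy linear target.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=10)

split1, split2 = np.arange(5), np.arange(5, 10)
oof = np.empty(10)  # out-of-fold predictions, one per sample

for train_idx, pred_idx in [(split2, split1), (split1, split2)]:
    # Each fold's model is trained only on the *other* split...
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    # ...and predicts only the split it has never seen.
    oof[pred_idx] = X[pred_idx] @ w

# Every sample's appended column comes from a model that never saw it.
X_next = np.hstack([X, oof[:, None]])
print(X_next.shape)  # (10, 4)
```

Afterwards, the rows can be shuffled and used as the next level's training set, exactly as described above.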

Inference mechanism from Level L to L+1

Infer on the holdout set with the models trained in Fold 1 to obtain predictions and features.

Do the same with the models trained in Fold 2, obtaining their predictions and features.

Fuse the Fold 1 predictions and features with Fold 2's, e.g., by averaging the predicted ages and averaging the feature vectors.

The fused result becomes the new holdout set.
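The fusion step at inference time can be sketched like this. The two lambdas are hypothetical stand-ins for the fold models trained above; at inference, every fold model sees the whole holdout set, and their outputs are averaged before being appended.

```python
import numpy as np

# Stand-ins for the models trained in Fold 1 and Fold 2 (their weights
# differ slightly because they were trained on different splits).
fold1_model = lambda rows: rows @ np.array([2.0, -1.0, 0.5])
fold2_model = lambda rows: rows @ np.array([1.9, -1.1, 0.6])

# Hypothetical holdout set: 6 samples, 3 features.
rng = np.random.default_rng(2)
X_holdout = rng.normal(size=(6, 3))

# Fuse by averaging the per-fold predictions, then append the fused
# column so the holdout set matches the next level's input layout.
fused = (fold1_model(X_holdout) + fold2_model(X_holdout)) / 2
X_holdout_next = np.hstack([X_holdout, fused[:, None]])

print(X_holdout_next.shape)  # (6, 4)
```

Feature vectors from the two folds would be averaged elementwise in the same way before being appended as additional columns.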

