Exploring Machine Learning: Personal Insights on Three Books

聞浩凱 Hao-Kai Wen
3 min read · Feb 2, 2024


Recently (this was originally written on August 25, 2023), I read three books during my commute: "Designing Machine Learning Systems," "Machine Learning System Design Interview," and "Machine Learning Design Patterns." Here are my thoughts on each.

Not recommended: "Machine Learning Design Patterns"

The book leans on too many code examples and is overly concrete; I think pointing readers to the key ideas would have been enough without so much code. Most of the content was already familiar to me, though it does have some merits, such as giving a classifier a neutral class to handle ambiguous cases in practice, and hashed features. Overall it felt like a poor return on the time spent.

Mildly recommended: "Machine Learning System Design Interview"

The book covers search, recommendation, ranking, and click prediction, and it takes into account how quickly data iterates in practice. After reading it, I understood how to extract features from different types of data, which model architectures and candidate models fit specific problems, and the relevant design considerations, such as one-stage versus two-stage detectors and early versus late feature fusion. For topics you have never worked on, it lets you grasp the key points quickly, but for topics you have already worked on, it offers few new perspectives.
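
As a quick aside, here is my own minimal sketch of the early- versus late-fusion distinction (not code from the book): random arrays stand in for, say, text and image embeddings, and the 50/50 average of the two per-modality scores is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(500, 16))    # stand-in for text embeddings
image_feats = rng.normal(size=(500, 32))   # stand-in for image embeddings
y = rng.integers(0, 2, size=500)           # toy binary labels

# Early fusion: concatenate modality features, then train a single model.
early_X = np.concatenate([text_feats, image_feats], axis=1)
early_model = LogisticRegression(max_iter=1000).fit(early_X, y)

# Late fusion: train one model per modality, then combine their scores.
text_model = LogisticRegression(max_iter=1000).fit(text_feats, y)
image_model = LogisticRegression(max_iter=1000).fit(image_feats, y)
late_scores = 0.5 * text_model.predict_proba(text_feats)[:, 1] \
            + 0.5 * image_model.predict_proba(image_feats)[:, 1]
```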

Most recommended: "Designing Machine Learning Systems"

Chapters 8, 9, 6, and 4 in particular are worth reading. Anyone without experience in machine learning for internet products will find them eye-opening, and they cover issues that other books rarely discuss but that are worth thinking about. Below I comment on the chapters I found most interesting:

Chapter 8: Data Distribution Shifts and Monitoring

There are three types of data distribution shift, one of which is covariate shift. As an example, take a user-interest prediction model with age as one of its features. Suppose that today elderly users prefer watching TV while middle-aged users prefer playing on their phones, and the model is trained now. Twenty years later, if preferences do not change as people age, the model's accuracy will drop sharply, because by then the elderly users are today's middle-aged users, and they prefer phones. There are many ways to detect data drift: one is to build auxiliary metrics and watch whether model accuracy is degrading; another is statistical testing, for which Alibi Detect supports tests on data of various dimensionalities. I found this topic genuinely interesting. The chapter also covers degenerate feedback loops: if the labels of the next batch of training data are influenced by the previous model's predictions, the system becomes self-fulfilling, much like screening resumes not on ability but on whether candidates share the alma maters of people already at the company. The book also discusses strategies for correcting this.
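
To make the statistical-testing idea concrete, here is a minimal sketch of my own (not the book's code) that compares a training-time window of a feature against a recent production window with a two-sample Kolmogorov-Smirnov test from SciPy; tools like Alibi Detect wrap similar tests behind a detector API. The window sizes and the 0.05 threshold are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(reference: np.ndarray, current: np.ndarray,
                         alpha: float = 0.05) -> bool:
    """Flag drift in one numeric feature with a two-sample KS test.

    `reference` is the feature as seen at training time, `current` is a
    recent window from production traffic. Returns True if the two samples
    look like they come from different distributions.
    """
    statistic, p_value = ks_2samp(reference, current)
    return p_value < alpha

# Toy example: the "age" feature shifts toward older users over time.
rng = np.random.default_rng(0)
train_ages = rng.normal(loc=35, scale=8, size=5_000)  # training-time ages
prod_ages = rng.normal(loc=45, scale=8, size=5_000)   # serving-time ages

print(detect_feature_drift(train_ages, prod_ages))  # True -> investigate
```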

Chapter 9: Continual Learning and Testing in Production

How do you weigh retraining a new model from scratch against fine-tuning the old model as the new version? And how do you decide when to retrain at all? Possible triggers include performance decay, a fixed time schedule, model drift, and the amount of newly accumulated data. When deciding whether to roll out a new model, A/B testing needs a fairly large amount of traffic, so alternatives include interleaving the two models' predictions for users and, though harder to implement, multi-armed bandit algorithms that pick the best option.
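
To illustrate the bandit alternative, here is a minimal epsilon-greedy sketch of my own (not from the book) that sends a small share of traffic to a random model version and otherwise serves whichever version has the better observed reward; the reward definition and the epsilon value are assumptions for the example.

```python
import random

class EpsilonGreedyRouter:
    """Route requests between model versions, favoring the best one so far."""

    def __init__(self, model_names, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {name: 0 for name in model_names}
        self.total_reward = {name: 0.0 for name in model_names}

    def choose(self) -> str:
        # Explore with probability epsilon, otherwise exploit the best arm.
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))
        return max(self.counts,
                   key=lambda n: self.total_reward[n] / max(self.counts[n], 1))

    def record(self, name: str, reward: float) -> None:
        # Reward could be a click, a purchase, or any other online metric.
        self.counts[name] += 1
        self.total_reward[name] += reward

router = EpsilonGreedyRouter(["model_v1", "model_v2"])
chosen = router.choose()           # which model serves this request
router.record(chosen, reward=1.0)  # user clicked -> positive feedback
```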

Chapter 6: Model Development and Offline Evaluation

Experiment tracking and version control, with tools such as DVC and Weights & Biases. In practice, many companies simply don't do this, even though it is just as important as version control for code.
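
For anyone who has never set this up, the sketch below shows roughly what experiment tracking looks like with the Weights & Biases Python client (running it requires a wandb account and login); the project name and logged metrics are placeholders, and DVC would play the analogous role for versioning data and model artifacts.

```python
import wandb

# One run per training job; config captures the hyperparameters you want
# to compare across runs later.
run = wandb.init(project="user-interest-model",
                 config={"learning_rate": 1e-3, "epochs": 5})

for epoch in range(run.config.epochs):
    train_loss = 1.0 / (epoch + 1)  # placeholder for a real training metric
    wandb.log({"epoch": epoch, "train_loss": train_loss})

run.finish()
```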

Chapter 4: Feature Engineering

The book's treatment of data leakage is well organized. For example, computing statistics from the full dataset before splitting it, and then using them in training, is a leak. It also recommends watching for suspiciously effective features and trying to explain why they work so well, since the likely reason is that they are leaking information. For feature selection and interpretation, it mentions XGBoost's built-in feature importance and SHAP.
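
As a minimal sketch of the split-first rule (my own illustration, with synthetic data), the scaler below is fit only on the training split, so no statistics from the held-out data leak into training; inspecting XGBoost's feature_importances_ or SHAP values would be the natural next step for any suspiciously strong feature.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1_000) > 0).astype(int)

# Split FIRST, then compute normalization statistics on the training split only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)  # a leaky version would fit on all of X
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

print("test accuracy:", model.score(scaler.transform(X_test), y_test))
```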

In short, I strongly recommend Chapters 8, 9, 6, and 4 of "Designing Machine Learning Systems."
