NTU DBME5028 Week5 Introduction to Machine Learning
1. Introduction to Machine Learning
Learn from Hands-on
Wei-Hsiang, Yu
Data Scientist, aetherAI
Fall 2021
2. Recap – Core idea of machine learning
『Field of study that gives computers the ability to learn without being explicitly programmed.』
- Arthur Lee Samuel, 1959
3. General workflow of machine learning process
Step 1: Define the problem
Step 2: Collect & clean the data
Step 3: Select & build the model
Step 4: Evaluate key metrics
Step 5: Make a good-looking presentation
4. General workflow of machine learning process
● Common problems in medical "imaging": classification, detection, segmentation
Step 1: Define the problem
Step 2: Collect & clean the data
Step 3: Select & build the model
Step 4: Evaluate key metrics
Step 5: Make a good-looking presentation
5. General workflow of machine learning process
● Many other types of problems are beyond the scope of this discussion
Step 1: Define the problem
Step 2: Collect & clean the data
Step 3: Select & build the model
Step 4: Evaluate key metrics
Step 5: Make a good-looking presentation
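A minimal end-to-end sketch of Steps 2–4 in scikit-learn; the dataset and model here are illustrative stand-ins, not part of the course material:

```python
# Minimal sketch of Steps 2-4: collect/clean data, build a model, evaluate.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Step 2: collect data (a built-in toy dataset stands in for real data).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 3: select & build a model.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Step 4: evaluate a key metric (AUROC here).
probs = model.predict_proba(X_test)[:, 1]
print("Test AUROC:", roc_auc_score(y_test, probs))
```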
12. General workflow of machine learning process
● An important metric for classification tasks
○ Area Under the Receiver Operating Characteristic curve (AUROC / ROC-AUC): the ability of your model to separate the target distribution from the noise distribution.
Step 4: Evaluate key metrics
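To make the "separate the target distribution from the noise distribution" intuition concrete, this small simulation (my own illustration, not from the slides) scores two overlapping Gaussians and computes the AUROC, which equals the probability that a random positive outscores a random negative:

```python
# AUROC as separability between a target and a noise score distribution.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, 1000)  # "noise" scores (label 0)
pos = rng.normal(1.5, 1.0, 1000)  # target scores (label 1), shifted right

y_true = np.r_[np.zeros(1000), np.ones(1000)]
y_score = np.r_[neg, pos]
print(roc_auc_score(y_true, y_score))  # ~0.85 for this amount of separation
```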
17. Issue – overfitting: What to do when you run into it
General idea for handling overfitting
- Find ways to screw up your model!
Common ways to handle overfitting (some are not covered today; a minimal sketch follows the list)
● Train / Test Split
● EarlyStopping
● Regularization
● Data augmentation
● Maybe imbalanced data?
● Modify the loss function
● …
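As a concrete illustration of the first two items (a hold-out split plus early stopping), here is a minimal sketch; the dataset, model, and patience settings are illustrative assumptions, not the course's code:

```python
# Hold-out split + a hand-rolled early-stopping loop (illustrative only).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import log_loss

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)

model = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)
best_loss, patience, bad_epochs = np.inf, 5, 0
for epoch in range(200):
    # partial_fit runs one epoch; monitor validation loss after each.
    model.partial_fit(X_train, y_train, classes=np.unique(y))
    val_loss = log_loss(y_val, model.predict_proba(X_val))
    if val_loss < best_loss - 1e-4:
        best_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # stop once validation stops improving
            print(f"Early stop at epoch {epoch}; best val loss {best_loss:.4f}")
            break
```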
22. Issue – overfitting: Regularization
[Figure: code screenshots of regularization settings in PyTorch, scikit-learn, XGBoost, and CatBoost]
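Since the screenshots do not survive as text, here is a hedged reconstruction of the kind of L2 regularization knobs each of those libraries exposes; the parameter values are illustrative, not the slide's:

```python
# Typical L2-style regularization knobs in four libraries (values illustrative).
import torch
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# PyTorch: weight_decay on the optimizer adds an L2 penalty to the weights.
net = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(net.parameters(), lr=0.01, weight_decay=1e-4)

# scikit-learn: C is the INVERSE regularization strength (smaller = stronger L2).
clf = LogisticRegression(C=0.1)

# XGBoost: reg_lambda (L2) and reg_alpha (L1) penalize leaf weights.
xgb = XGBClassifier(reg_lambda=1.0, reg_alpha=0.0)

# CatBoost: l2_leaf_reg is the L2 penalty on leaf values.
cat = CatBoostClassifier(l2_leaf_reg=3.0, verbose=0)
```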
23. Issue – evaluation metrics: How to convince readers that model A is better than model B
● A common problem in most CS papers: the performance is only slightly better than before. Is that luck, or does the method really work?
https://arxiv.org/pdf/1608.06993.pdf
24. Issue – evaluation metrics: How to convince readers that model A is better than model B
https://arxiv.org/pdf/2105.11293.pdf
https://arxiv.org/pdf/1911.06667.pdf
25. Issue – evaluation metrics: How to convince readers that model A is better than model B
● In many medical journals
https://pubmed.ncbi.nlm.nih.gov/30312179/
https://pubmed.ncbi.nlm.nih.gov/32140566/
26. Issue – evaluation metrics: How to convince readers that model A is better than model B
● Estimate a confidence interval and its "significance" (NOTE: never use the word "significant" loosely)
○ Each experiment run yields one result (e.g., Acc, AUC, Recall, mAP, …)
■ After running N experiments, compute the interval with statistical methods
○ Analytical solution
○ Simulation solution (a bootstrap sketch follows below)
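One common form of the simulation solution is bootstrapping the test set. This sketch (the function name and its defaults are my own assumption, not from the slides) resamples predictions with replacement to get a 95% percentile CI for AUC:

```python
# Simulation solution: bootstrap a 95% CI for test-set AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    n, stats = len(y_true), []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, n)          # resample with replacement
        if len(np.unique(y_true[idx])) < 2:  # AUC needs both classes present
            continue
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return lo, hi
```

Called as `bootstrap_auc_ci(y_test, probs)`, it returns the interval to report alongside the point estimate.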
27. Estimation of confidence interval: Basic statistics recap
Central Limit Theorem: if simple random samples of size n are drawn from a population with mean μ and standard deviation σ, then for large enough n the sampling distribution of the sample mean is approximately normal.
Population distribution, sample distribution, and sampling distribution
Sampling distribution (of the mean): if you randomly draw one value from a distribution, how likely is it to fall between a and b? (~68% falls within 1 standard deviation; ~95% within 2 standard deviations)
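A quick numerical check of the CLT; the skewed exponential population and the sample sizes are illustrative choices of mine:

```python
# CLT demo: means of n draws from a skewed population look normal for large n.
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # mu = 2, sigma = 2, skewed

for n in (2, 30, 200):
    # 10,000 sample means, each from a sample of size n.
    sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(f"n={n:>3}: mean={sample_means.mean():.3f}, "
          f"std={sample_means.std():.3f} (theory sigma/sqrt(n)={2/np.sqrt(n):.3f})")
```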
28. Estimation of confidence interval
● Hypothesis testing and interval estimation
Example: in an experiment, two groups of mice were injected (one group with the drug, the other with saline), and a physiological index was measured:
GroupA: 86,72,74,85,76,79,82,83,83,79,82
GroupB: 81,77,63,75,69,86,81,60
Question: does the drug affect this physiological index?
● Null hypothesis (H0): μA = μB
● t-test: t = (x̄A − x̄B) / √(s_p² (1/nA + 1/nB)), where s_p² is the pooled variance
● Confidence estimation:
○ Reject H0 if the intervals do not overlap
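In code, the same test is one call to scipy (shown with Welch's unequal-variance variant, a design choice on my part); the per-group CIs follow the slide's interval-overlap rule of thumb:

```python
# Two-sample t-test on the slide's mouse data, plus per-group 95% CIs.
import numpy as np
from scipy import stats

group_a = [86, 72, 74, 85, 76, 79, 82, 83, 83, 79, 82]  # drug
group_b = [81, 77, 63, 75, 69, 86, 81, 60]              # saline

t, p = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test
print(f"t = {t:.3f}, p = {p:.3f}")

# 95% CI for each group mean; non-overlapping intervals would reject H0
# under the slide's rule of thumb.
for name, g in (("A", group_a), ("B", group_b)):
    g = np.asarray(g, dtype=float)
    ci = stats.t.interval(0.95, len(g) - 1, loc=g.mean(), scale=stats.sem(g))
    print(name, f"mean={g.mean():.1f}, 95% CI=({ci[0]:.1f}, {ci[1]:.1f})")
```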
37. Today NOT Going To Cover
● Tree-based methods
○ Decision Tree, Random Forest, GBM, XGBoost
● Some references for you to study
○ Decision Tree
○ Bagging: learning from bootstrapped samples (trees are independent)
■ Random Forest
○ Boosting: additive learning (later trees correct the errors of earlier trees)
■ GBM
○ Combined
■ XGBoost
■ LightGBM & CatBoost
● You can play around with the sample code
○ https://github.com/Kaminyou/110-1-NTU-DBME5028/tree/main/week5-machine_learning
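Even though tree methods are out of scope today, a minimal bagging-vs-boosting contrast in scikit-learn may help orient the references above; the dataset and hyperparameters are illustrative, and this is not the linked repo's code:

```python
# Bagging (independent trees, averaged) vs. boosting (sequential error-fitting).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: each tree trains on a bootstrap sample; predictions are averaged.
rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Boosting: trees are added one at a time, each fitting the ensemble's errors.
gbm = GradientBoostingClassifier(n_estimators=200, random_state=0)

for name, model in (("RandomForest", rf), ("GBM", gbm)):
    print(name, cross_val_score(model, X, y, cv=5).mean())
```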