NTU DBME5028 Week5 Introduction to Machine Learning
1. Introduction to Machine Learning
Learn from Hands-on
Wei-Hsiang, Yu
Data Scientist, aetherAI
Fall 2021
2. Recap – Core idea of machine learning
『Field of study that gives computers the ability to learn without being explicitly programmed.』
- Arthur Lee Samuel, 1959
3. General workflow of machine learning process
Step 1: Define the problem
Step 2: Collect & clean the data
Step 3: Select & build the model
Step 4: Evaluate key metrics
Step 5: Make a good-looking presentation
4. General workflow of machine learning process
● Common problems in medical "imaging": classification, detection, segmentation
Step 1: Define the problem
Step 2: Collect & clean the data
Step 3: Select & build the model
Step 4: Evaluate key metrics
Step 5: Make a good-looking presentation
5. General workflow of machine learning process
● Many other types of problems are beyond the scope of this discussion
Step 1: Define the problem
Step 2: Collect & clean the data
Step 3: Select & build the model
Step 4: Evaluate key metrics
Step 5: Make a good-looking presentation
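A minimal end-to-end sketch of Steps 2–4 in scikit-learn; the dataset and model here are illustrative stand-ins, not part of the course material:

```python
# Minimal sketch of Steps 2-4: collect/clean data, build a model, evaluate.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Step 2: collect data (a built-in toy dataset stands in for real data).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 3: select & build a model.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Step 4: evaluate a key metric (AUROC here).
probs = model.predict_proba(X_test)[:, 1]
print("Test AUROC:", roc_auc_score(y_test, probs))
```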
12. General workflow of machine learning process
● An important metric for classification tasks
○ Area Under the Receiver Operating Characteristic curve (AUROC / ROC-AUC): the ability of your model to separate the target distribution from the noise distribution.
Step 4: Evaluate key metrics
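To make the "separate the target distribution from the noise distribution" intuition concrete, this small simulation (my own illustration, not from the slides) scores two overlapping Gaussians and computes the AUROC, which equals the probability that a random positive outscores a random negative:

```python
# AUROC as separability between a target and a noise score distribution.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, 1000)  # "noise" scores (label 0)
pos = rng.normal(1.5, 1.0, 1000)  # target scores (label 1), shifted right

y_true = np.r_[np.zeros(1000), np.ones(1000)]
y_score = np.r_[neg, pos]
print(roc_auc_score(y_true, y_score))  # ~0.85 for this amount of separation
```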
17. Issue – overfitting: What to do when you run into it
General idea for handling overfitting
- Find ways to screw up your model!
Common ways to handle overfitting (some are not covered today; a minimal sketch follows the list)
● Train / Test Split
● EarlyStopping
● Regularization
● Data augmentation
● Maybe imbalanced data?
● Modify the loss function
● …
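As a concrete illustration of the first two items (a hold-out split plus early stopping), here is a minimal sketch; the dataset, model, and patience settings are illustrative assumptions, not the course's code:

```python
# Hold-out split + a hand-rolled early-stopping loop (illustrative only).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import log_loss

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)

model = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)
best_loss, patience, bad_epochs = np.inf, 5, 0
for epoch in range(200):
    # partial_fit runs one epoch; monitor validation loss after each.
    model.partial_fit(X_train, y_train, classes=np.unique(y))
    val_loss = log_loss(y_val, model.predict_proba(X_val))
    if val_loss < best_loss - 1e-4:
        best_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # stop once validation stops improving
            print(f"Early stop at epoch {epoch}; best val loss {best_loss:.4f}")
            break
```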
22. Issue – overfitting: Regularization
[Figure: code screenshots of regularization settings in PyTorch, scikit-learn, XGBoost, and CatBoost]
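Since the screenshots do not survive as text, here is a hedged reconstruction of the kind of L2 regularization knobs each of those libraries exposes; the parameter values are illustrative, not the slide's:

```python
# Typical L2-style regularization knobs in four libraries (values illustrative).
import torch
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# PyTorch: weight_decay on the optimizer adds an L2 penalty to the weights.
net = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(net.parameters(), lr=0.01, weight_decay=1e-4)

# scikit-learn: C is the INVERSE regularization strength (smaller = stronger L2).
clf = LogisticRegression(C=0.1)

# XGBoost: reg_lambda (L2) and reg_alpha (L1) penalize leaf weights.
xgb = XGBClassifier(reg_lambda=1.0, reg_alpha=0.0)

# CatBoost: l2_leaf_reg is the L2 penalty on leaf values.
cat = CatBoostClassifier(l2_leaf_reg=3.0, verbose=0)
```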
23. Issue – evaluation metrics: How to convince readers that model A is better than model B
● A common problem in most CS papers: the performance is only slightly better than before. Is that luck, or does the method really work?
https://arxiv.org/pdf/1608.06993.pdf
24. Issue – evaluation metrics: How to convince readers that model A is better than model B
https://arxiv.org/pdf/2105.11293.pdf
https://arxiv.org/pdf/1911.06667.pdf
25. Issue – evaluation metrics: How to convince readers that model A is better than model B
● In many medical journals
https://pubmed.ncbi.nlm.nih.gov/30312179/
https://pubmed.ncbi.nlm.nih.gov/32140566/
26. Issue – evaluation metrics: How to convince readers that model A is better than model B
● Estimate a confidence interval and its "significance" (NOTE: never use the word "significant" loosely)
○ Each experiment run yields one result (e.g., Acc, AUC, Recall, mAP, …)
■ After running N experiments, compute the interval with statistical methods
○ Analytical solution
○ Simulation solution (a bootstrap sketch follows below)
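One common form of the simulation solution is bootstrapping the test set. This sketch (the function name and its defaults are my own assumption, not from the slides) resamples predictions with replacement to get a 95% percentile CI for AUC:

```python
# Simulation solution: bootstrap a 95% CI for test-set AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    n, stats = len(y_true), []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, n)          # resample with replacement
        if len(np.unique(y_true[idx])) < 2:  # AUC needs both classes present
            continue
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return lo, hi
```

Called as `bootstrap_auc_ci(y_test, probs)`, it returns the interval to report alongside the point estimate.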
27. Estimation of confidence interval: Basic statistics recap
Central Limit Theorem: if simple random samples of size n are drawn from a population with mean μ and standard deviation σ, then for large enough n the sampling distribution of the sample mean is approximately normal.
Population distribution, sample distribution, and sampling distribution
Sampling distribution (of the mean): if you randomly draw one value from a distribution, how likely is it to fall between a and b? (~68% falls within 1 standard deviation; ~95% within 2 standard deviations)
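A quick numerical check of the CLT; the skewed exponential population and the sample sizes are illustrative choices of mine:

```python
# CLT demo: means of n draws from a skewed population look normal for large n.
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # mu = 2, sigma = 2, skewed

for n in (2, 30, 200):
    # 10,000 sample means, each from a sample of size n.
    sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(f"n={n:>3}: mean={sample_means.mean():.3f}, "
          f"std={sample_means.std():.3f} (theory sigma/sqrt(n)={2/np.sqrt(n):.3f})")
```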
28. Estimation of confidence interval
● Hypothesis testing and interval estimation
Example: in an experiment, two groups of mice were injected (one group with the drug, the other with saline), and a physiological index was measured:
GroupA: 86,72,74,85,76,79,82,83,83,79,82
GroupB: 81,77,63,75,69,86,81,60
Question: does the drug affect this physiological index?
● Null hypothesis (H0): μA = μB
● t-test: t = (x̄A − x̄B) / √(s_p² (1/nA + 1/nB)), where s_p² is the pooled variance
● Confidence estimation:
○ Reject H0 if the intervals do not overlap
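In code, the same test is one call to scipy (shown with Welch's unequal-variance variant, a design choice on my part); the per-group CIs follow the slide's interval-overlap rule of thumb:

```python
# Two-sample t-test on the slide's mouse data, plus per-group 95% CIs.
import numpy as np
from scipy import stats

group_a = [86, 72, 74, 85, 76, 79, 82, 83, 83, 79, 82]  # drug
group_b = [81, 77, 63, 75, 69, 86, 81, 60]              # saline

t, p = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test
print(f"t = {t:.3f}, p = {p:.3f}")

# 95% CI for each group mean; non-overlapping intervals would reject H0
# under the slide's rule of thumb.
for name, g in (("A", group_a), ("B", group_b)):
    g = np.asarray(g, dtype=float)
    ci = stats.t.interval(0.95, len(g) - 1, loc=g.mean(), scale=stats.sem(g))
    print(name, f"mean={g.mean():.1f}, 95% CI=({ci[0]:.1f}, {ci[1]:.1f})")
```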
37. Today NOT Going To Cover
● Tree-based methods
○ Decision Tree, Random Forest, GBM, XGBoost
● Some references for you to study
○ Decision Tree
○ Bagging: learning from bootstrapped samples (trees are independent)
■ Random Forest
○ Boosting: additive learning (later trees correct the errors of earlier trees)
■ GBM
○ Combined
■ XGBoost
■ LightGBM & CatBoost
● You can play around with the sample code
○ https://github.com/Kaminyou/110-1-NTU-DBME5028/tree/main/week5-machine_learning
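Even though tree methods are out of scope today, a minimal bagging-vs-boosting contrast in scikit-learn may help orient the references above; the dataset and hyperparameters are illustrative, and this is not the linked repo's code:

```python
# Bagging (independent trees, averaged) vs. boosting (sequential error-fitting).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: each tree trains on a bootstrap sample; predictions are averaged.
rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Boosting: trees are added one at a time, each fitting the ensemble's errors.
gbm = GradientBoostingClassifier(n_estimators=200, random_state=0)

for name, model in (("RandomForest", rf), ("GBM", gbm)):
    print(name, cross_val_score(model, X, y, cv=5).mean())
```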