Random Forest Intro [Random Forest Explained]

Hyunwoo Kim (김현우), a.k.a. 순록킴
yBigTa, 9th cohort
Psychology & Computer Science

the Tree
Contents { random forest,
ensemble,
bias & variance,
bagging };
Bias
Variance
Ensemble
Bootstrap

A confidence interval for
the mean CSAT (수능) score?

The exam board (KICE) knows the population parameters

It has a model for the grade cutoffs and the mean

Individual score
Population parameter
Model value
Data
Residual
Error
Estimate

𝑿𝒊, 𝒀𝒊
Individual score
Residual

Problem of Overfitting
Bias & Variance

Bias - Variance trade off [or dilemma]
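
For squared-error loss the trade-off has a standard closed form. As a sketch, assuming y = f(x) + ε with E[ε] = 0 and Var(ε) = σ² (this notation is an assumption, not taken from the slides), the expected test error of a fitted model at a point x splits into three parts:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Lowering one of the first two terms by changing model complexity tends to raise the other; the third term cannot be reduced by any model.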

So what about overfitting?
High Variance, Low Bias

Grow the tree large (= many leaf nodes)
and high variance becomes the problem
Bias - Variance dilemma [or trade-off]
But prune the tree down to a small one
and now high bias becomes the problem

A solution to overfitting?

Random Forest
an ensemble learning method
that constructs a multitude of decision trees

How the Forest works _classification

How it works _classification
Chicken Pizza Chicken Chicken
= a chicken day
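
The vote on this slide can be sketched in a few lines. This is a minimal illustration, assuming each tree has already produced a class label; `majority_vote` is a hypothetical helper name, and ties are broken by whichever label was seen first.

```python
from collections import Counter


def majority_vote(tree_votes):
    """Return the label predicted by the largest number of trees."""
    # most_common(1) returns a one-element list: [(label, count)]
    label, _count = Counter(tree_votes).most_common(1)[0]
    return label


# One vote per tree, as on the slide
votes = ["chicken", "pizza", "chicken", "chicken"]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```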

How the Forest works _regression

How it works _regression
2.5 4 3.7 2.9
= 3.275 ≈ 3.3 bottles
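
For regression the forest averages the tree outputs instead of voting. A minimal sketch, assuming each tree has already returned a number; `average_prediction` is a hypothetical helper name. For the four values on the slide the average works out to 3.275.

```python
from statistics import mean


def average_prediction(tree_outputs):
    """Return the mean of the individual tree predictions."""
    return mean(tree_outputs)


# One numeric prediction per tree, as on the slide
outputs = [2.5, 4, 3.7, 2.9]
forest_prediction = average_prediction(outputs)
print(forest_prediction)
```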

Resampling methods
1. Cross-validation
- Validation set approach
- Leave-One-Out Cross-validation
- k-Fold Cross-validation
2. Bootstrap

Validation set approach
Cross-validation
Training Set Validation Set [or Hold-out set]
Randomly
usually almost Half

Validation set approach
Cross-validation
• Randomly
• Almost half
Undermines model stability
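
A rough sketch of the validation set approach, working on row indices only. The function name `validation_split` and the `seed` parameter are assumptions made for illustration. Because the split is random, two runs can hold out different halves and report different errors, which is where the instability comes from.

```python
import random


def validation_split(n, seed=None):
    """Randomly split row indices 0..n-1 into two halves."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)          # random assignment of rows
    half = n // 2                 # roughly half goes to training
    return indices[:half], indices[half:]


# Two different seeds stand in for two independent runs
train_a, valid_a = validation_split(10, seed=0)
train_b, valid_b = validation_split(10, seed=1)
print(train_a, valid_a)
print(train_b, valid_b)
```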

Leave-one-out CV [LOOCV]
Cross-validation
Training Set

Cross-validation
Leave-one-out CV [LOOCV]
• No randomness
• Always gives the same result
• Each fit uses n-1 observations (data points)
→ The model has to be fit n times
→ Computational problems
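
The points above can be seen in a small sketch that only generates the index splits; `loocv_splits` is a hypothetical helper name. There is no random number generator involved, and the loop runs n times, once per held-out observation.

```python
def loocv_splits(n):
    """Yield (train_indices, test_indices) for leave-one-out CV."""
    for held_out in range(n):
        # Train on every observation except the one held out
        train = [i for i in range(n) if i != held_out]
        yield train, [held_out]


loo_splits = list(loocv_splits(5))
for train, test in loo_splits:
    print(train, test)
```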

k-Fold CV
Cross-validation
Training Set    Test set
Randomly

k-Fold CV
Cross-validation
• k-Fold: k is the number of splits (folds)
• Usually 10 folds
• Lower variance than LOOCV
• Computational advantage
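
A minimal sketch of k-fold index generation, assuming a shuffled round-robin assignment of rows to folds (one of several reasonable ways to do it). The name `k_fold_splits` and its parameters are hypothetical. Only k fits are needed instead of n.

```python
import random


def k_fold_splits(n, k=10, seed=None):
    """Yield (train_indices, test_indices) for each of k folds."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


kfold_list = list(k_fold_splits(23, k=10, seed=0))
for train, test in kfold_list:
    print(len(train), len(test))
```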

Cross-Validation
Validation set approach    LOOCV       k-Fold CV
1-Fold                     n-Fold      k-Fold
Bias                       Bias        Bias
Variance                   Variance    Variance

Bootstrap
"Bootstrap is one of the biggest statistical breakthroughs
in the 21st century."
Harvard statistics professor, Joe Blitzstein

Bootstrap
pull oneself up by one's bootstraps
An idiom for accomplishing the impossible

Bootstrap
pull oneself up by one's bootstraps
To improve one's situation on one's own, without anyone's help

Bootstrap
To improve one's situation on one's own, without anyone's help
Training & Testing

Decision Trees have high variance
Bootstrap
a slight change in sample,
A huge change in result

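Just as properties of a population can be estimated from a sample,
properties of the sample can be estimated from resamples.
That is, from a given sample we draw further samples (resamples)
many times over (1,000 to 10,000 times, or more)
to find out what distribution the sample mean, variance, and so on follow.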
Bootstrap
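
A minimal sketch of the resampling idea, using the sample mean as the statistic. The scores are made-up illustration data, `bootstrap_means` is a hypothetical helper name, and the percentile interval at the end is one simple way (among several) to turn the resampled means into a rough 95% interval.

```python
import random
from statistics import mean


def bootstrap_means(sample, n_resamples=1000, seed=None):
    """Return the mean of each bootstrap resample of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Draw n values from the sample WITH replacement
        resample = rng.choices(sample, k=n)
        means.append(mean(resample))
    return means


# Hypothetical exam scores, for illustration only
scores = [72, 85, 90, 64, 78, 88, 95, 70, 81, 76]
boot = sorted(bootstrap_means(scores, n_resamples=1000, seed=0))

# Rough 95% percentile interval: cut 2.5% off each tail
ci_low, ci_high = boot[25], boot[974]
print(ci_low, ci_high)
```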

1. [C] sum, total amount
2. [U, C] (technical) aggregate (for construction materials)
Aggregating

[Figure: a forest of many decision trees, each labeled "Tree"]

Averaging a set of observations reduces variance
Bootstrap
A slight change in sample,
Still a slight change in split
Even if I go check my exam paper and get my own score raised,
it has little effect on the class average
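
The variance reduction from averaging can be written down directly. As a sketch, assume B tree predictions that each have variance σ² and pairwise correlation ρ (symbols introduced here, not taken from the slides):

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

With uncorrelated trees (ρ = 0) this is σ²/B. With correlated trees the first term does not shrink as B grows, so lowering the correlation between trees also matters.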

A random sample
of predictors per split
Major, GPA, school, English test score, CSAT score
√p
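
A small sketch of drawing the candidate predictors for one split. Rounding √p to the nearest integer is an assumption made here; implementations differ on how they round. `sample_predictors` is a hypothetical helper name, and the predictor names are English renderings of the slide's example.

```python
import math
import random


def sample_predictors(predictors, rng):
    """Pick about sqrt(p) predictors at random, without replacement."""
    m = max(1, round(math.sqrt(len(predictors))))
    return rng.sample(predictors, m)


predictors = ["major", "GPA", "school", "English score", "CSAT score"]
rng = random.Random(0)

# A fresh random subset is drawn at every split of every tree
candidates = sample_predictors(predictors, rng)
print(candidates)
```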

Out-of-Bag Error Estimation
OOB

Out-of-Bag ?
[Figure: a forest of many decision trees, each labeled "Tree"]

Out-of-Bag
Sampling with replacement
Some data points
are never drawn
Validation set
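
A minimal sketch of why out-of-bag rows exist. Drawing n rows with replacement leaves each row out with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 for large n, so roughly a third of the data is available as a validation set for each tree. `bootstrap_with_oob` is a hypothetical helper name.

```python
import random


def bootstrap_with_oob(n, rng):
    """Draw n row indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    # Rows that were never drawn are "out of bag" for this tree
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob


rng = random.Random(0)
n = 1000
in_bag, oob = bootstrap_with_oob(n, rng)
oob_fraction = len(oob) / n
print(oob_fraction)
```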

Tree
& advantages
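1. Easy to understand: you can chew on it, pull it apart, taste it, and enjoy it [White box]
2. Needs little data cleaning: put it straight in
3. Handles numerical and categorical data alike: just put it in
4. Handy for seeing what patterns the data has: try putting it in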
