Lightgbm_suman

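- LightGBM
1. Reducing the amount of data
- GOSS (Gradient-based One-Side Sampling)
: Selects a subset of the data before computing the tree's information gain
2. Bundling features
- EFB (Exclusive Feature Bundling)
: Uses graph coloring and a histogram-based algorithm to
reduce the number of features effectively with little loss of split-point accuracy
1) Which features should be bundled? / What?
2) How should they be bundled? / How?
: Goal: reduce the number of data instances and features to compute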

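1. GOSS (Gradient-based One-Side Sampling)
Before computing information gain, keep the data instances with large gradients
- Algorithm
1) Sort the data by absolute gradient and select the top a x 100% = topSet
2) Randomly sample b x 100% of the instances from the remaining data = randSet
3) When computing information gain, amplify randSet by (1 - a) / b
-> Focuses on under-trained instances without changing the original data distribution much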
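
The three GOSS steps above can be sketched in plain Python. This is a minimal illustration, not LightGBM's implementation: the function name `goss_sample`, the `seed` argument, and the choice to size both sets as a fraction of the full data are assumptions made for the example.

```python
import random


def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return (indices, weights) chosen by a GOSS-style sampling step.

    Hypothetical helper for illustration only.
    """
    n = len(gradients)
    rng = random.Random(seed)

    # 1) Sort instances by absolute gradient, largest first, and keep the top a * n.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_k = int(a * n)
    top_set = order[:top_k]

    # 2) Randomly sample b * n instances from the remaining (small-gradient) data.
    rest = order[top_k:]
    rand_k = min(int(b * n), len(rest))
    rand_set = rng.sample(rest, rand_k)

    # 3) Amplify the sampled small-gradient instances by (1 - a) / b.
    amplify = (1.0 - a) / b
    indices = top_set + rand_set
    weights = [1.0] * len(top_set) + [amplify] * len(rand_set)
    return indices, weights


grads = [0.1 * i for i in range(100)]
idx, w = goss_sample(grads, a=0.2, b=0.1)
```

With a = 0.2 and b = 0.1, each randSet instance is weighted by 8, so the 10 sampled instances stand in for the 80 small-gradient instances they were drawn from.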


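2. EFB (Exclusive Feature Bundling)
High-dimensional data is usually sparse, so features rarely take nonzero values at the same time
ex. One-Hot encoding
[Figure: nonzero values of feature1, feature2, feature3, with the conflicts marked]
- Formulated as a Graph Coloring problem
: NP-hard, so no polynomial-time algorithm for the optimal solution is known
- Therefore, a greedy algorithm is applied
: greedy search based on conflicts and vertex degree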

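a. Build a graph with a weight on every edge
(weight = total number of conflicts between two features; no edge if there are none)
b. Sort the features by vertex degree in descending order
c. Assign each feature to an existing bundle if the conflict stays small (controlled by gamma),
otherwise create a new bundle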
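
Steps a to c can be sketched as follows. This is a simplified illustration under stated assumptions: `greedy_bundling` and `max_conflicts` (standing in for the gamma budget) are hypothetical names, and a bundle's conflict count is approximated by summing pairwise conflicts.

```python
def greedy_bundling(columns, max_conflicts=0):
    """columns: dict mapping feature name -> list of values.

    Returns a list of bundles, each a list of feature names.
    Hypothetical helper for illustration only.
    """
    names = list(columns)
    nonzero = {f: {i for i, v in enumerate(columns[f]) if v != 0} for f in names}

    # a. Conflict graph: edge weight = number of rows where both features are nonzero.
    conflict = {f: {} for f in names}
    for x in range(len(names)):
        for y in range(x + 1, len(names)):
            f, g = names[x], names[y]
            count = len(nonzero[f] & nonzero[g])
            if count > 0:
                conflict[f][g] = count
                conflict[g][f] = count

    # b. Sort features by vertex degree, descending.
    order = sorted(names, key=lambda f: len(conflict[f]), reverse=True)

    # c. Put each feature in the first bundle that stays within the conflict
    #    budget, or open a new bundle.
    bundles = []
    for f in order:
        for bundle in bundles:
            added = sum(conflict[f].get(g, 0) for g in bundle["features"])
            if bundle["conflicts"] + added <= max_conflicts:
                bundle["features"].append(f)
                bundle["conflicts"] += added
                break
        else:
            bundles.append({"features": [f], "conflicts": 0})
    return [b["features"] for b in bundles]


columns = {"f1": [1, 0, 0, 0], "f2": [0, 2, 0, 0], "f3": [3, 0, 0, 4]}
strict = greedy_bundling(columns, max_conflicts=0)
relaxed = greedy_bundling(columns, max_conflicts=1)
```

Here f1 and f3 are both nonzero in the first row, so they are separated when no conflict is allowed and merged once one conflict is tolerated.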

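- Histogram-based algorithm (For Continuous Feature)
: Instead of searching for split points over the sorted feature values, divide the values into discrete bins.
[Figure: splitting a feature into intervals using the optimal bins]
- parameter
tree_method = approx (xgb)
tree_method = hist (xgb)
max_bin (lightgbm; always histogram-based, no tree_method parameter)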
http://mlexplained.com/2018/01/05/lightgbm-and-xgboost-explained/
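
The histogram idea above can be sketched in a few lines. The assumptions here are mine, not the slides': equal-width bins, a squared-loss style gain (sum of gradients squared over count), and the hypothetical name `histogram_best_split`. Real libraries choose bin boundaries differently.

```python
def histogram_best_split(values, gradients, n_bins=4):
    """Find the best split among bin boundaries instead of among all sorted values.

    Hypothetical helper for illustration only.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # constant feature: everything falls into bin 0

    # Accumulate gradient sums and counts per bin (the "histogram").
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)
        grad_sum[b] += g
        count[b] += 1

    total_g, total_n = sum(grad_sum), len(values)
    best_gain, best_threshold = 0.0, None
    left_g, left_n = 0.0, 0

    # Scan only the n_bins - 1 boundaries between bins.
    for b in range(n_bins - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n - total_g ** 2 / total_n
        if gain > best_gain:
            best_gain, best_threshold = gain, lo + (b + 1) * width
    return best_threshold, best_gain


threshold, gain = histogram_best_split(
    [1, 2, 3, 4, 5, 6, 7, 8], [-1, -1, -1, -1, 1, 1, 1, 1], n_bins=4
)
```

The scan costs one pass over the data to fill the bins plus one pass over the bins, rather than one candidate per distinct sorted value.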

- Bundling
feature A : [0, 10] , feature B : [0, 20]
-> add an offset of 10 to feature B : [10, 30] (feature A stays [0, 10])
-> New_feature C = A + B = [0, 30]
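
The offset trick in this example can be sketched as below. The helpers `bundle_features` and `split_bundle` are hypothetical names, and the sketch assumes the two features are strictly exclusive (never nonzero in the same row).

```python
def bundle_features(col_a, col_b, a_max):
    """Merge two mutually exclusive columns by shifting B's values past A's range."""
    merged = []
    for a, b in zip(col_a, col_b):
        if a != 0 and b != 0:
            raise ValueError("conflict: both features are nonzero in the same row")
        if a != 0:
            merged.append(a)          # A keeps its original range [0, a_max]
        elif b != 0:
            merged.append(b + a_max)  # B is shifted into (a_max, a_max + b_max]
        else:
            merged.append(0)
    return merged


def split_bundle(value, a_max):
    """Recover which original feature a bundled value came from."""
    if value == 0:
        return ("none", 0)
    if value <= a_max:
        return ("A", value)
    return ("B", value - a_max)


feature_c = bundle_features([3, 0, 0, 10], [0, 5, 0, 0], a_max=10)
```

Because the ranges no longer overlap, a single histogram over the bundled feature still separates the values of A from those of B.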

3. Ignoring sparse inputs
zero_as_missing = True (default = False)
: Treats every 0 as a missing value, which speeds up training
: Zero values are then not used when searching for the best split
[Figure: zero_as_missing = False (default) vs. zero_as_missing = True]

4. Leaf-wise growth
Leaf-wise (best-first)
- Node Expansion : "best" node first
best = the node with the maximum reduction of impurity
- Only Binary Tree
- Splitting rules
Numeric: similar
Nominal: provided for 2-class and multi-class
Level-wise (depth-first)
- Node Expansion : fixed order
left -> right
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149.2862&rep=rep1&type=pdf
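
The difference in node expansion order can be illustrated with a toy tree. Everything here is hypothetical: the `TREE` table of split gains is made up, and the level-wise variant is modeled as a plain left-to-right queue.

```python
import heapq
from collections import deque

# Hypothetical candidate splits: node -> (gain of splitting it, children).
TREE = {
    "root": (10.0, ["L", "R"]),
    "L": (2.0, ["LL", "LR"]),
    "R": (7.0, ["RL", "RR"]),
    "LL": (0.0, []), "LR": (0.0, []), "RL": (0.0, []), "RR": (0.0, []),
}


def leaf_wise_order(tree, root, max_splits):
    """Best-first: always split the open leaf with the largest gain."""
    heap = [(-tree[root][0], root)]
    order = []
    while heap and len(order) < max_splits:
        _, node = heapq.heappop(heap)
        children = tree[node][1]
        if not children:
            continue  # nothing left to split under this node
        order.append(node)
        for child in children:
            heapq.heappush(heap, (-tree[child][0], child))
    return order


def level_wise_order(tree, root, max_splits):
    """Fixed order: split nodes left to right, ignoring their gains."""
    queue = deque([root])
    order = []
    while queue and len(order) < max_splits:
        node = queue.popleft()
        children = tree[node][1]
        if not children:
            continue
        order.append(node)
        queue.extend(children)
    return order


best_first = leaf_wise_order(TREE, "root", max_splits=2)
fixed_order = level_wise_order(TREE, "root", max_splits=2)
```

With a budget of two splits, the best-first strategy spends its second split on R (gain 7.0), while the fixed-order strategy spends it on L (gain 2.0).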

- Readthedocs tuning guide
1. For Faster Speed
- bagging_fraction , bagging_freq
- feature_fraction
- max_bin : small
- save_binary
: to speed up data loading in future learning
2. For Better Accuracy
- max_bin : Use large number
- learning_rate : small with large num_iter
- num_leaves : large (may cause over-fitting)
3. Deal with Over-fitting
- max_bin : small
- num_leaves : small
- lambda_l1 , lambda_l2 , min_gain_to_split
- max_depth : small (to avoid growing deep tree)
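- Kaggle recommended guide
- num_leaves
: 2^(max_depth) is recommended, but set it lower when that exceeds 80
- is_unbalance = True
- scale_pos_weight : numeric weight for the positive class (an alternative to is_unbalance; set only one of the two)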
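
As a rough sketch, the tuning notes above could translate into a parameter dictionary like the one below. The numeric values are placeholders rather than recommendations. `suggest_num_leaves` is a hypothetical helper that reads "set it lower when that exceeds 80" as a simple cap at 80, which is an assumption. The parameter names are LightGBM's; the dictionary would be passed to `lightgbm.train`, which is not called here.

```python
def suggest_num_leaves(max_depth, cap=80):
    """Start from 2^max_depth and cap it (the cap value is an assumed reading)."""
    return min(2 ** max_depth, cap)


max_depth = 8

params = {
    "objective": "binary",
    "learning_rate": 0.05,        # small, paired with a large number of iterations
    "max_depth": max_depth,       # keep small to avoid growing deep trees
    "num_leaves": suggest_num_leaves(max_depth),
    "max_bin": 255,               # larger for accuracy, smaller for speed / less over-fitting
    "feature_fraction": 0.8,      # feature sub-sampling
    "bagging_fraction": 0.8,      # row sub-sampling ...
    "bagging_freq": 5,            # ... performed every 5 iterations
    "lambda_l1": 0.1,             # regularization against over-fitting
    "lambda_l2": 0.1,
    "min_gain_to_split": 0.01,
    "is_unbalance": True,         # do not also set scale_pos_weight
}
```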

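- Things I was curious about
1. How often is binning performed?
- lightgbm fixes the bin boundaries once at Dataset construction and rebuilds histograms over them for each tree
- xgb offers an option to re-bin at every split
2. Is there a way to see how EFB bundled the features?
- Not currently supported; the bundles are built internally and do not seem to be exposed
3. Difference between the Histogram-based algorithm and Leaf-wise growth
- The former discretizes continuous features into bins
- The latter is a tree-growth method (best-first decision tree)
4. Ignoring Sparse inputs
- Zero values are excluded when computing splits -> faster computation, but at the risk of information loss
5. How are categorical features handled?
6. How does the Histogram-based algorithm find the optimal bins?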

Handling Categorical Features / Searching for Optimal Bins
1. One-Hot encoding
2. On grouping for maximum homogeneity
3. Papers related to the Histogram-based algorithm

1. One-Hot encoding
- A standard way to preprocess categorical features
- Avoids the artificial ordering that label encoding can introduce
Drawback : with many classes the tree grows deep -> longer computation time, risk of over-fitting
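
A small sketch of why one-hot encoding gets expensive with many classes: every distinct class becomes its own column, and a tree can separate only one class per split on those columns. The helper `one_hot` is a hypothetical name written for this example.

```python
def one_hot(values):
    """Return (categories, rows) where each row has a single 1 for its category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


categories, rows = one_hot(["red", "blue", "red", "green"])
```

A feature with K classes yields K sparse columns, so isolating several classes takes several successive splits.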

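2. On Grouping for Maximum Homogeneity , Fisher 1958
The method LightGBM builds on; effective for high cardinality
- Algorithm
1) Sort the classes of the categorical feature by some numeric value
2) Choose the desired number of groups (G)
3) Respecting the sorted order, compute D for each candidate grouping and use the one with the smallest D
D = \sum_{i=1}^{K} w_i (a_i - \bar{a}_i)^2
w_i : weight , a_i : mean value within class i , \bar{a}_i : mean value within the group containing class i , K : number of classes in the feature
http://www.csiss.org/SPACE/workshops/2004/SAC/files/fisher.pdf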

- Algorithm
Example: split 10 income deciles into 3 groups (G=3)
- feature_A : income decile (categorical 1~10 , K = 10)
- feature_B : income per person (numeric)
- Sorting
: by the mean income of each class of A
- Why sort
: the groups are formed respecting the sorted order
(if i < j < k then a_i < a_j < a_k)

- Algorithm (same example, continued)
Number of cases = C(K-1, G-1) contiguous groupings of the sorted classes (C(9, 2) = 36 for K = 10, G = 3)
Too much computation when K and G are large
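
The grouping search can be sketched by brute force: enumerate every contiguous grouping of the sorted classes and keep the one with the smallest weighted within-group squared deviation D. The function name `fisher_grouping` and the toy class means are assumptions for illustration. For K sorted classes and G groups the loop visits C(K-1, G-1) candidates.

```python
from itertools import combinations


def fisher_grouping(a, w, n_groups):
    """a: class values already sorted, w: class weights.

    Returns (D, bounds) for the best contiguous grouping, where group g
    covers a[bounds[g]:bounds[g + 1]]. Hypothetical helper for illustration.
    """
    k = len(a)
    best = None
    # Choose n_groups - 1 cut positions among the k - 1 gaps between classes.
    for cuts in combinations(range(1, k), n_groups - 1):
        bounds = (0,) + cuts + (k,)
        d = 0.0
        for start, end in zip(bounds, bounds[1:]):
            w_sum = sum(w[start:end])
            mean = sum(wi * ai for wi, ai in zip(w[start:end], a[start:end])) / w_sum
            d += sum(wi * (ai - mean) ** 2 for wi, ai in zip(w[start:end], a[start:end]))
        if best is None or d < best[0]:
            best = (d, bounds)
    return best


# Made-up class means for 10 classes, already sorted.
class_means = [1, 2, 3, 10, 11, 12, 30, 31, 32, 33]
d_min, bounds = fisher_grouping(class_means, [1] * 10, n_groups=3)
n_candidates = len(list(combinations(range(1, 10), 2)))
```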

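- Algorithm + LightGBM idea
+ Sort the classes by sum_gradient / sum_hessian
+ Fix G = 2 (2 subsets) -> less computation
Number of cases = K - 1 (one candidate split between each pair of adjacent sorted classes)
The documentation does not state clearly which numeric value a is used in the D formula
From context, it is probably sum_gradient / sum_hessian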
https://lightgbm.readthedocs.io/en/latest/Features.html#references
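
A minimal sketch of the idea above: sort the classes by sum_gradient / sum_hessian, then scan the K - 1 cut points of that ordering for the best two-subset split. The gain formula (sum of gradients squared over sum of hessians) and the name `best_categorical_split` are assumptions for illustration. LightGBM's actual implementation adds smoothing and regularization that are omitted here.

```python
def best_categorical_split(stats):
    """stats: dict mapping category -> (sum_gradient, sum_hessian).

    Returns (left_categories, gain) for the best split into 2 subsets.
    Hypothetical helper for illustration only.
    """
    # Sort categories by their gradient/hessian ratio.
    order = sorted(stats, key=lambda c: stats[c][0] / stats[c][1])
    total_g = sum(g for g, _ in stats.values())
    total_h = sum(h for _, h in stats.values())

    best_gain, best_left = float("-inf"), None
    left_g = left_h = 0.0
    # Only K - 1 candidate cuts along the sorted order.
    for k in range(len(order) - 1):
        g, h = stats[order[k]]
        left_g += g
        left_h += h
        right_g, right_h = total_g - left_g, total_h - left_h
        gain = left_g ** 2 / left_h + right_g ** 2 / right_h - total_g ** 2 / total_h
        if gain > best_gain:
            best_gain, best_left = gain, order[: k + 1]
    return best_left, best_gain


stats = {"a": (-4.0, 4.0), "b": (3.0, 3.0), "c": (-1.0, 2.0), "d": (5.0, 10.0)}
left, gain = best_categorical_split(stats)
```

In this toy input the two classes with negative ratios end up together on the left side.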

3. Three papers related to the Histogram-based algorithm
1) McRank: Learning to Rank Using Multiple Classification and GB
Records how MSE and loss change with the binning intervals, compared with the raw numeric feature

2) CLOUDS: A Decision Tree Classifier for Large Datasets
Searches for the optimal interval using the gini value and the gradient within each interval
- Sampling the Splitting points with Estimation (SSE)
a. Divide the numeric attribute into q intervals containing the same number of points
b. Compute the gini index at each interval boundary; the minimum is gini_min
c. Compute a lower bound of the gini index within each interval = gini_est
d. Discard the intervals where gini_est > gini_min
e. Use a hill-climbing algorithm that considers the left boundary and the
minimum gradient within the interval to set the boundary.

3) Communication and Memory Efficient Parallel Decision Tree Construction
Computes the gain from several AVC-sets and compares them to search for the interval

References
https://swalloow.github.io
https://www.researchgate.net/figure/Training-of-an-AdaBoost-classifier-The-first-classifier-trains-on-unweighted-data-then_fig3_306054843
https://hackernoon.com/gradient-descent-aynk-7cbe95a778da
https://www.kaggle.com/c/home-credit-default-risk/discussion/59806
https://www.microsoft.com/en-us/research/wp-content/uploads/2017/11/lightgbm.pdf
https://xgboost.readthedocs.io/en/latest/parameter.html
- histogram-based algorithm
https://papers.nips.cc/paper/3270-mcrank-learning-to-rank-using-multiple-classification-and-gradient-boosting.pdf
https://pdfs.semanticscholar.org/3194/5a077ef8fa29e3359a312b1ae29e8ea53469.pdf
https://pdfs.semanticscholar.org/e0e7/31805c073c4589375c8b8f65769834201114.pdf
