7. Recap ‑ Random Forest
Training is driven by a node queue (training ends when the queue is empty)
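The queue-driven loop can be sketched as follows. This is a toy illustration with hypothetical types (a mean-threshold "split"), not Spark's implementation:

```scala
import scala.collection.mutable

// Hypothetical node type: depth, the samples routed to it, and its children.
case class Node(depth: Int, samples: Seq[Double], var split: Option[Double] = None,
                var left: Option[Node] = None, var right: Option[Node] = None)

def train(root: Node, maxDepth: Int): Unit = {
  val queue = mutable.Queue(root)
  while (queue.nonEmpty) {                 // training ends when the queue is empty
    val node = queue.dequeue()
    if (node.depth < maxDepth && node.samples.distinct.size > 1) {
      val threshold = node.samples.sum / node.samples.size // toy split: the mean
      node.split = Some(threshold)
      val (l, r) = node.samples.partition(_ < threshold)
      node.left = Some(Node(node.depth + 1, l))
      node.right = Some(Node(node.depth + 1, r))
      queue.enqueue(node.left.get)         // splittable children go back on the queue
      queue.enqueue(node.right.get)
    }
  }
}
```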
8. Distributed Random Forest
def predict(features: Vector): Double = {
  (algo, combiningStrategy) match {
    case (Regression, Sum) =>
      predictBySumming(features)
    case (Regression, Average) =>
      predictBySumming(features) / sumWeights
    case (Classification, Sum) => // binary classification
      val prediction = predictBySumming(features)
      if (prediction > 0.0) 1.0 else 0.0
    case (Classification, Vote) =>
      predictByVoting(features)
    case _ =>
      throw new IllegalArgumentException("unsupported (algo, combiningStrategy) pair")
  }
}
Each tree can be built independently, which makes the algorithm well suited to distributed processing
Aggregation Strategy (Categorical / Continuous)
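The averaging and voting strategies can be sketched on toy per-tree outputs (plain Scala, not Spark's `TreeEnsembleModel` internals):

```scala
// Continuous aggregation: average the tree outputs (regression).
def predictByAveraging(treeOutputs: Seq[Double]): Double =
  treeOutputs.sum / treeOutputs.size

// Categorical aggregation: majority vote over predicted labels (classification).
def predictByVoting(treeOutputs: Seq[Double]): Double =
  treeOutputs.groupBy(identity).maxBy(_._2.size)._1
```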
9. Recap ‑ Gradient Boosting Tree
Each subsequent model compensates for the prediction errors of the previous one (sequential)
When resampling, misclassified examples are given higher weights
10. Distributed Gradient Boosting
Boosting: c_n = c_{n-1} + k_n
Gradient Boosting: c_n = c_{n-1} + μ ∂J/∂c_{n-1}
Implemented by applying a gradient update after each Distributed Decision Tree is built
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala#L247
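The additive update above can be sketched in one dimension. A minimal sketch assuming squared-error loss (so the negative gradient is just the residual) and a constant-valued "weak learner", not Spark's GBT:

```scala
// Toy sequential gradient boosting: each round fits the residuals of the
// current ensemble and adds the fit scaled by learning rate mu.
def boost(y: Seq[Double], rounds: Int, mu: Double): Seq[Double] = {
  var pred = Seq.fill(y.size)(0.0)
  for (_ <- 1 to rounds) {
    val residuals = y.zip(pred).map { case (t, p) => t - p } // -∂J/∂c for squared error
    val step = residuals.sum / residuals.size                // weak learner: mean residual
    pred = pred.map(_ + mu * step)                           // c_n = c_{n-1} + mu * h_n
  }
  pred
}
```

With enough rounds the predictions converge toward the target mean, illustrating how later stages keep shrinking the errors left by earlier ones.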
11. Distributed Decision Tree Building
1. Parallelize Node Building at Each Level ‑ imbalance problem
2. Parallelize Split Finding on Each Node
3. Parallelize Split Finding at Each Level by Features
12. Distributed Decision Tree Building
# flatMap
input: Instance
output: list((split, label))
# reduceByKey
input: (split, list(label))
output: (split, labelHistograms)
With k features, m splits, and n instances, this takes O(k * m * n)
Communication overhead becomes a problem
Avoid the map step's shuffle and object-creation overhead
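The flatMap/reduceByKey pattern above can be simulated with plain Scala collections, with `groupBy` standing in for the shuffle. The `Instance` type here is hypothetical, and a single feature is used for brevity:

```scala
case class Instance(feature: Double, label: Double)

// flatMap: each instance emits one record per candidate split -> O(m * n) records.
// reduceByKey (here: groupBy + count): aggregate records into per-split label histograms.
def labelHistograms(data: Seq[Instance],
                    splits: Seq[Double]): Map[(Double, Boolean, Double), Int] =
  data.flatMap { inst =>
    splits.map(s => (s, inst.feature <= s, inst.label)) // (split, side, label)
  }.groupBy(identity)
   .map { case (key, records) => key -> records.size }  // histogram counts
```

In real Spark the emitted records are shuffled across the cluster, which is exactly the communication cost the slide warns about.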
20. TrainValidationSplit, CrossValidator
// TrainValidationSplit
val tvs = new TrainValidationSplit()
  .setEstimator(pipeline) // an Estimator must be set before calling fit
  .setEstimatorParamMaps(factorGrid)
  .setEvaluator(new RegressionEvaluator)
  .setTrainRatio(r)
// CrossValidator
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(factorGrid)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setNumFolds(k)
// Best Model
val model = tvs.fit(data)
model.bestModel.extractParamMap
Both are given a grid of ParamMaps; CrossValidator takes more time (k folds versus a single split)
Up to Spark 2.2, tuning was performed with data parallelism only
21. Spark 2.3: Model Parallelism
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(new BinaryClassificationEvaluator)
.setEstimatorParamMaps(paramGrid)
.setParallelism(10)
Limits of data parallelism in CrossValidator
Depending on the cluster configuration, resources can go severely underutilized
Runtime grows dramatically as the number of hyperparameters increases
Spark 2.3 adds the setParallelism option (SPARK-19357)
It is left as an option because the right value depends heavily on cluster resources;
choose it experimentally so the cluster is not overloaded
22. Spark 2.3: Model Parallelism
With a Pipeline and CrossValidator in play, this typically yields a 2-3x speedup
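The essence of setParallelism is that independent (estimator, paramMap) fits run concurrently on a bounded pool. A minimal sketch of that idea with a toy `fit` function, not Spark's actual tuning scheduler:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Fit each hyperparameter setting as an independent task on a pool of
// `parallelism` threads, mirroring CrossValidator.setParallelism(n).
def fitAll(params: Seq[Double], parallelism: Int)(fit: Double => Double): Seq[Double] = {
  val pool = Executors.newFixedThreadPool(parallelism)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val futures = params.map(p => Future(fit(p)))   // one fit per param setting
    Await.result(Future.sequence(futures), Duration.Inf)
  } finally pool.shutdown()
}
```

The pool size plays the role of the parallelism value: too high and concurrent fits contend for cluster resources, which is why the slide says to choose it experimentally.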
23. Task in Distributed ML
Some algorithms still lack a fully distributed implementation (e.g. DBSCAN)
In some algorithms the same solver or optimizer can behave slightly differently when distributed
Data-quality issues (missing values, skew, outliers)
Network and system latency can drag down overall performance
25. Distributed Deep Learning on GPU Cluster
Issues with GPU clusters, memory, and communication (direct RDMA)
The key is to increase computation relative to communication cost
http://hoondongkim.blogspot.kr/search/label/GPU
26. Large Scale Distributed Deep Networks ‑ Google
Paper: Large Scale Distributed Deep Networks ‑ Google
Google DistBelief : Downpour SGD, L‑BFGS
Networks with billions of parameters can be trained on 10,000+ CPU cores
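Downpour SGD's core idea is that model replicas read a possibly stale copy of the shared parameters, compute gradients on their own data shard, and push updates asynchronously. A toy single-machine sketch of that idea (threads standing in for replicas, a shared variable standing in for the parameter server; not DistBelief's implementation):

```scala
// Shared "parameter server": one parameter, updated under a lock.
object ParamServer {
  @volatile var w: Double = 0.0
  def push(grad: Double, lr: Double): Unit = synchronized { w -= lr * grad }
}

// Each replica repeatedly: read (possibly stale) w, compute a gradient on its
// shard for J = mean((w - x)^2) / 2, and push the update asynchronously.
def runReplicas(shards: Seq[Seq[Double]], steps: Int, lr: Double): Double = {
  ParamServer.w = 0.0 // reset for repeatable runs
  val threads = shards.map { shard =>
    new Thread(() => {
      for (_ <- 1 to steps) {
        val wLocal = ParamServer.w // stale read is tolerated by design
        val grad = shard.map(x => wLocal - x).sum / shard.size
        ParamServer.push(grad, lr)
      }
    })
  }
  threads.foreach(_.start()); threads.foreach(_.join())
  ParamServer.w
}
```

The final value depends on how the asynchronous updates interleave, which is exactly the staleness trade-off Downpour SGD accepts in exchange for scalability.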
34. Common Design for Distributed Machine Learning, Deep Learning
Common Modeling
Data / Model Parallelism
Distributed DecisionTree, RandomForest, GBT
Distributed GridSearch, TrainValidationSplit, CrossValidator
Large Scale Distributed Deep Learning
35. Reference
Paper: Large Scale Distributed Deep Networks ‑ Google
Paper: One weird trick for parallelizing CNN ‑ Google
Youtube: Scalable Distributed Decision Trees in Spark MLlib
Blog: Parallel Gradient Boosting Decision Trees
Blog: Model Parallelism with Spark ML Tuning
Docs: Parallel Learning in LightGBM
Docs: MXNet Architecture