7. Recap ‑ Random Forest
Training is driven by a node queue (training ends when the queue is empty)
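The queue-driven loop can be sketched as follows. This is a toy illustration with hypothetical types (a mean-threshold "split"), not Spark's implementation:

```scala
import scala.collection.mutable

// Hypothetical node type: depth, the samples routed to it, and its children.
case class Node(depth: Int, samples: Seq[Double], var split: Option[Double] = None,
                var left: Option[Node] = None, var right: Option[Node] = None)

def train(root: Node, maxDepth: Int): Unit = {
  val queue = mutable.Queue(root)
  while (queue.nonEmpty) {                 // training ends when the queue is empty
    val node = queue.dequeue()
    if (node.depth < maxDepth && node.samples.distinct.size > 1) {
      val threshold = node.samples.sum / node.samples.size // toy split: the mean
      node.split = Some(threshold)
      val (l, r) = node.samples.partition(_ < threshold)
      node.left = Some(Node(node.depth + 1, l))
      node.right = Some(Node(node.depth + 1, r))
      queue.enqueue(node.left.get)         // splittable children go back on the queue
      queue.enqueue(node.right.get)
    }
  }
}
```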
8. Distributed Random Forest
def predict(features: Vector): Double = {
  (algo, combiningStrategy) match {
    case (Regression, Sum) =>
      predictBySumming(features)
    case (Regression, Average) =>
      predictBySumming(features) / sumWeights
    case (Classification, Sum) => // binary classification
      val prediction = predictBySumming(features)
      if (prediction > 0.0) 1.0 else 0.0
    case (Classification, Vote) =>
      predictByVoting(features)
    case _ =>
      throw new IllegalArgumentException("unsupported (algo, combiningStrategy) pair")
  }
}
Each tree can be built independently, which makes the algorithm well suited to distributed processing
Aggregation Strategy (Categorical / Continuous)
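The averaging and voting strategies can be sketched on toy per-tree outputs (plain Scala, not Spark's `TreeEnsembleModel` internals):

```scala
// Continuous aggregation: average the tree outputs (regression).
def predictByAveraging(treeOutputs: Seq[Double]): Double =
  treeOutputs.sum / treeOutputs.size

// Categorical aggregation: majority vote over predicted labels (classification).
def predictByVoting(treeOutputs: Seq[Double]): Double =
  treeOutputs.groupBy(identity).maxBy(_._2.size)._1
```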
9. Recap ‑ Gradient Boosting Tree
Each subsequent model compensates for the prediction errors of the previous one (sequential)
When resampling, misclassified examples are given higher weights
10. Distributed Gradient Boosting
Boosting: c_n = c_{n-1} + k_n
Gradient Boosting: c_n = c_{n-1} + μ ∂J/∂c_{n-1}
Implemented by applying a gradient update after each Distributed Decision Tree is built
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala#L247
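The additive update above can be sketched in one dimension. A minimal sketch assuming squared-error loss (so the negative gradient is just the residual) and a constant-valued "weak learner", not Spark's GBT:

```scala
// Toy sequential gradient boosting: each round fits the residuals of the
// current ensemble and adds the fit scaled by learning rate mu.
def boost(y: Seq[Double], rounds: Int, mu: Double): Seq[Double] = {
  var pred = Seq.fill(y.size)(0.0)
  for (_ <- 1 to rounds) {
    val residuals = y.zip(pred).map { case (t, p) => t - p } // -∂J/∂c for squared error
    val step = residuals.sum / residuals.size                // weak learner: mean residual
    pred = pred.map(_ + mu * step)                           // c_n = c_{n-1} + mu * h_n
  }
  pred
}
```

With enough rounds the predictions converge toward the target mean, illustrating how later stages keep shrinking the errors left by earlier ones.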
11. Distributed Decision Tree Building
1. Parallelize Node Building at Each Level ‑ imbalance problem
2. Parallelize Split Finding on Each Node
3. Parallelize Split Finding at Each Level by Features
12. Distributed Decision Tree Building
# flatMap
input: Instance
output: list((split, label))
# reduceByKey
input: (split, list(label))
output: (split, labelHistograms)
With k features, m splits, and n instances, this takes O(k * m * n)
Communication overhead becomes a problem
Avoid the map step's shuffle and object-creation overhead
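The flatMap/reduceByKey pattern above can be simulated with plain Scala collections, with `groupBy` standing in for the shuffle. The `Instance` type here is hypothetical, and a single feature is used for brevity:

```scala
case class Instance(feature: Double, label: Double)

// flatMap: each instance emits one record per candidate split -> O(m * n) records.
// reduceByKey (here: groupBy + count): aggregate records into per-split label histograms.
def labelHistograms(data: Seq[Instance],
                    splits: Seq[Double]): Map[(Double, Boolean, Double), Int] =
  data.flatMap { inst =>
    splits.map(s => (s, inst.feature <= s, inst.label)) // (split, side, label)
  }.groupBy(identity)
   .map { case (key, records) => key -> records.size }  // histogram counts
```

In real Spark the emitted records are shuffled across the cluster, which is exactly the communication cost the slide warns about.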
20. TrainValidationSplit, CrossValidator
// TrainValidationSplit
val tvs = new TrainValidationSplit()
  .setEstimator(pipeline) // an Estimator must be set before calling fit
  .setEstimatorParamMaps(factorGrid)
  .setEvaluator(new RegressionEvaluator)
  .setTrainRatio(r)
// CrossValidator
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(factorGrid)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setNumFolds(k)
// Best Model
val model = tvs.fit(data)
model.bestModel.extractParamMap
Both are given a grid of ParamMaps; CrossValidator takes more time (k folds versus a single split)
Up to Spark 2.2, tuning was performed with data parallelism only
21. Spark 2.3: Model Parallelism
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(new BinaryClassificationEvaluator)
.setEstimatorParamMaps(paramGrid)
.setParallelism(10)
Limits of data parallelism in CrossValidator
Depending on the cluster configuration, resources can go severely underutilized
Runtime grows dramatically as the number of hyperparameters increases
Spark 2.3 adds the setParallelism option (SPARK-19357)
It is left as an option because the right value depends heavily on cluster resources;
choose it experimentally so the cluster is not overloaded
22. Spark 2.3: Model Parallelism
With a Pipeline and CrossValidator in play, this typically yields a 2-3x speedup
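The essence of setParallelism is that independent (estimator, paramMap) fits run concurrently on a bounded pool. A minimal sketch of that idea with a toy `fit` function, not Spark's actual tuning scheduler:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Fit each hyperparameter setting as an independent task on a pool of
// `parallelism` threads, mirroring CrossValidator.setParallelism(n).
def fitAll(params: Seq[Double], parallelism: Int)(fit: Double => Double): Seq[Double] = {
  val pool = Executors.newFixedThreadPool(parallelism)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val futures = params.map(p => Future(fit(p)))   // one fit per param setting
    Await.result(Future.sequence(futures), Duration.Inf)
  } finally pool.shutdown()
}
```

The pool size plays the role of the parallelism value: too high and concurrent fits contend for cluster resources, which is why the slide says to choose it experimentally.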
23. Task in Distributed ML
Some algorithms still lack a fully distributed implementation (e.g. DBSCAN)
In some algorithms the same solver or optimizer can behave slightly differently when distributed
Data-quality issues (missing values, skew, outliers)
Network and system latency can drag down overall performance
25. Distributed Deep Learning on GPU Cluster
Issues with GPU clusters, memory, and communication (direct RDMA)
The key is to increase computation relative to communication cost
http://hoondongkim.blogspot.kr/search/label/GPU
26. Large Scale Distributed Deep Networks ‑ Google
Paper: Large Scale Distributed Deep Networks ‑ Google
Google DistBelief : Downpour SGD, L‑BFGS
Networks with billions of parameters can be trained on 10,000+ CPU cores
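Downpour SGD's core idea is that model replicas read a possibly stale copy of the shared parameters, compute gradients on their own data shard, and push updates asynchronously. A toy single-machine sketch of that idea (threads standing in for replicas, a shared variable standing in for the parameter server; not DistBelief's implementation):

```scala
// Shared "parameter server": one parameter, updated under a lock.
object ParamServer {
  @volatile var w: Double = 0.0
  def push(grad: Double, lr: Double): Unit = synchronized { w -= lr * grad }
}

// Each replica repeatedly: read (possibly stale) w, compute a gradient on its
// shard for J = mean((w - x)^2) / 2, and push the update asynchronously.
def runReplicas(shards: Seq[Seq[Double]], steps: Int, lr: Double): Double = {
  ParamServer.w = 0.0 // reset for repeatable runs
  val threads = shards.map { shard =>
    new Thread(() => {
      for (_ <- 1 to steps) {
        val wLocal = ParamServer.w // stale read is tolerated by design
        val grad = shard.map(x => wLocal - x).sum / shard.size
        ParamServer.push(grad, lr)
      }
    })
  }
  threads.foreach(_.start()); threads.foreach(_.join())
  ParamServer.w
}
```

The final value depends on how the asynchronous updates interleave, which is exactly the staleness trade-off Downpour SGD accepts in exchange for scalability.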
34. Common Design for Distributed Machine Learning, Deep Learning
Common Modeling
Data / Model Parallelism
Distributed DecisionTree, RandomForest, GBT
Distributed GridSearch, TrainValidationSplit, CrossValidator
Large Scale Distributed Deep Learning
35. Reference
Paper: Large Scale Distributed Deep Networks ‑ Google
Paper: One weird trick for parallelizing CNN ‑ Google
Youtube: Scalable Distributed Decision Trees in Spark MLlib
Blog: Parallel Gradient Boosting Decision Trees
Blog: Model Parallelism with Spark ML Tuning
Docs: Parallel Learning in LightGBM
Docs: MXNet Architecture