
Common Design for Distributed Machine Learning, Deep Learning



  1. Common Design for Distributed Machine Learning, Deep Learning
  2. Common Modeling: once a problem has been solved with machine learning, scaling becomes the next issue, along with how much the model's results can be trusted ("generalization error").
  3. Data / Model Parallelism
  4. Spark Scale-out Learning
  5. Data Parallelism: DecisionTree, RandomForest, GradientBoostingTree
  6. Recap - Random Forest: training is driven by a node queue (training ends when the queue is empty).
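      A minimal sketch of this node-queue loop in plain Scala (hypothetical Node type and placeholder findBestSplit,
      not Spark MLlib's actual implementation): nodes waiting to be split sit in a queue, and training ends once the
      queue is empty.

      import scala.collection.mutable

      // Hypothetical node type: holds the row indices that reach this node.
      case class Node(depth: Int, rows: Vector[Int],
                      var left: Option[Node] = None, var right: Option[Node] = None)

      // Placeholder split finder: returns the two row partitions, or None if the node is too small to split.
      def findBestSplit(rows: Vector[Int]): Option[(Vector[Int], Vector[Int])] =
        if (rows.size < 2) None else Some(rows.splitAt(rows.size / 2))

      def train(root: Node, maxDepth: Int): Unit = {
        val queue = mutable.Queue(root)          // nodes still to be split
        while (queue.nonEmpty) {                 // training terminates when the queue is empty
          val node = queue.dequeue()
          if (node.depth < maxDepth) {
            findBestSplit(node.rows).foreach { case (l, r) =>
              node.left  = Some(Node(node.depth + 1, l))
              node.right = Some(Node(node.depth + 1, r))
              queue.enqueue(node.left.get)
              queue.enqueue(node.right.get)
            }
          }
        }
      }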
  7. Distributed Random Forest: each tree can be built independently, which makes the algorithm well suited to distributed processing. Aggregation strategy (categorical / continuous):
      def predict(features: Vector): Double = {
        (algo, combiningStrategy) match {
          case (Regression, Sum) => predictBySumming(features)
          case (Regression, Average) => predictBySumming(features) / sumWeights
          case (Classification, Sum) => // binary
            val prediction = predictBySumming(features)
            if (prediction > 0.0) 1.0 else 0.0
          case (Classification, Vote) => predictByVoting(features)
        }
      }
  8. Recap - Gradient Boosting Tree: each new model sequentially corrects the prediction errors of the previous ones; when resampling, misclassified examples are given higher weights.
  9. Distributed Gradient Boosting
      Boosting: c_n = c_{n-1} + k_n
      Gradient Boosting: c_n = c_{n-1} + μ_n · ∂J/∂c_{n-1}
      Implemented by performing a gradient update after each distributed DecisionTree.
      https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala#L247
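      A minimal plain-Scala sketch of this update for squared-error loss (fitTree is a hypothetical stand-in for
      the distributed tree learner, not Spark's GradientBoostedTrees): each round fits a new tree to the residuals
      of the current model, which for squared error are exactly the (negative) gradient of J with respect to the
      previous prediction c_{n-1}, and adds it with step size μ.

      // Stand-in "tree": ignores the inputs and predicts the mean of its targets.
      def fitTree(xs: Array[Double], targets: Array[Double]): Double => Double = {
        val mean = targets.sum / targets.length
        _ => mean
      }

      def boost(xs: Array[Double], ys: Array[Double], rounds: Int, mu: Double): Double => Double = {
        var model: Double => Double = _ => 0.0
        for (_ <- 1 to rounds) {
          // Residuals of the current ensemble: the gradient signal the next tree will fit.
          val residuals = xs.zip(ys).map { case (x, y) => y - model(x) }
          val tree = fitTree(xs, residuals)
          val prev = model
          model = x => prev(x) + mu * tree(x)   // c_n = c_{n-1} + mu_n * k_n
        }
        model
      }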
  10. Distributed Decision Tree Building
      1. Parallelize node building at each level - imbalance problem
      2. Parallelize split finding on each node
      3. Parallelize split finding at each level by features
  11. Distributed Decision Tree Building
      #flatMap      input: Instance                output: list(split, label)
      #reduceByKey  input: (split, list(label))    output: (split, labelHistograms)
      With k features, m splits, and n instances this costs O(k * m * n) and creates a communication-overhead problem.
      Avoid the map function (shuffle and object-creation overhead).
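      A sketch of this flatMap/reduceByKey pattern in Spark (hypothetical Instance and SplitCandidate types with toy
      data): every instance emits one record per candidate split, which is exactly where the O(k * m * n) records
      and the shuffle/communication overhead come from. Spark MLlib instead aggregates per-partition statistics to
      avoid this.

      import org.apache.spark.sql.SparkSession

      case class Instance(features: Array[Double], label: Int)          // hypothetical training instance
      case class SplitCandidate(featureIndex: Int, threshold: Double)   // hypothetical candidate split

      val spark = SparkSession.builder().appName("split-histograms").getOrCreate()
      val sc = spark.sparkContext

      val data = sc.parallelize(Seq(
        Instance(Array(1.0, 0.5), 0),
        Instance(Array(2.0, 1.5), 1)
      ))
      val candidates = Seq(SplitCandidate(0, 1.5), SplitCandidate(1, 1.0))
      val numClasses = 2

      // flatMap: each instance emits ((split, side), label) for every candidate split -> O(k * m * n) records.
      val keyed = data.flatMap { inst =>
        candidates.map { s =>
          val goesLeft = inst.features(s.featureIndex) <= s.threshold
          ((s, goesLeft), inst.label)
        }
      }

      // reduceByKey: build a label histogram per (split, side); this shuffle is the communication cost.
      val histograms = keyed
        .map { case (key, label) => (key, Array.tabulate(numClasses)(c => if (c == label) 1L else 0L)) }
        .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })

      histograms.collect().foreach { case (k, h) => println(s"$k -> ${h.mkString(",")}") }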
  12. Distributed Decision Tree Building
  13. [SPARK-3161] Cache example-node map for DecisionTree training (https://issues.apache.org/jira/browse/SPARK-3161)
      After the findBestSplits step, when moving down to the next level of the tree, pass only the lowest split nodes instead of the whole set of tree nodes.
      This can be slower for shallow tree models (an update every iteration), but for deep tree models it is an essential option.
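      In spark.ml this idea surfaces as the cacheNodeIds parameter on the tree estimators (with checkpointInterval
      to keep the cached lineage short); a small usage sketch, with column names assumed for illustration.

      import org.apache.spark.ml.classification.RandomForestClassifier

      // cacheNodeIds caches the example-to-node mapping between iterations instead of
      // re-passing all tree nodes; it mainly pays off for deep trees.
      val rf = new RandomForestClassifier()
        .setFeaturesCol("features")
        .setLabelCol("label")
        .setMaxDepth(20)
        .setCacheNodeIds(true)       // cache the example -> node index mapping
        .setCheckpointInterval(10)   // periodically checkpoint the cached node IDs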
  14. Spark DecisionTree DAG
  15. Model Parallelism: GridSearch, TrainValidationSplit, CrossValidator
  16. Hyperparameter Tuning: GridSearch vs RandomSearch. GridSearch is an embarrassingly parallel algorithm.
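      Why grid search parallelizes so easily: each ParamMap in the grid is an independent training job. A small
      spark.ml sketch (LogisticRegression chosen here only as an example estimator).

      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.tuning.ParamGridBuilder

      // The 3 x 2 = 6 candidate models below can be fitted with no coordination between them.
      val lr = new LogisticRegression()
      val paramGrid = new ParamGridBuilder()
        .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
        .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
        .build()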
  17. TrainValidationSplit, CrossValidator: TrainValidationSplit splits the data into training/validation sets at a specified ratio; CrossValidator splits the training set into K folds for cross-validation (to guard against overfitting).
  18. TrainValidationSplit, CrossValidator
      // TrainValidationSplit
      val tvs = new TrainValidationSplit()
        .setEstimatorParamMaps(factorGrid)
        .setEvaluator(new RegressionEvaluator)
        .setTrainRatio(r)
      // CrossValidator
      val cv = new CrossValidator()
        .setEstimatorParamMaps(factorGrid)
        .setEvaluator(new BinaryClassificationEvaluator)
        .setNumFolds(k)
      // Best Model
      val model = tvs.fit(data)
      model.bestModel.extractParamMap
      Both take a ParamMap grid; CrossValidator takes more time. Up to Spark 2.2 this runs with data parallelism only.
  19. Spark 2.3: Model Parallelism
      val cv = new CrossValidator()
        .setEstimator(pipeline)
        .setEvaluator(new BinaryClassificationEvaluator)
        .setEstimatorParamMaps(paramGrid)
        .setParallelism(10)
      Limits of data parallelism in CrossValidator: depending on the cluster layout, resources can be left badly underutilized, and runtime explodes as the number of hyperparameters grows.
      Spark 2.3 adds the setParallelism option (SPARK-19357).
      It is left as an option because it depends heavily on cluster resources; an appropriate value should be chosen experimentally so the cluster is not overloaded.
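      Putting slides 18-19 together, a minimal end-to-end sketch (LogisticRegression, the column names, and the
      trainingData DataFrame are assumed for illustration); from Spark 2.3 onward setParallelism controls how many
      candidate models are fitted concurrently.

      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
      import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

      val lr = new LogisticRegression().setFeaturesCol("features").setLabelCol("label")

      val paramGrid = new ParamGridBuilder()
        .addGrid(lr.regParam, Array(0.01, 0.1))
        .addGrid(lr.maxIter, Array(50, 100))
        .build()

      val cv = new CrossValidator()
        .setEstimator(lr)
        .setEvaluator(new BinaryClassificationEvaluator())
        .setEstimatorParamMaps(paramGrid)
        .setNumFolds(3)
        .setParallelism(4)   // fit up to 4 candidate models at once; tune to the cluster's free resources

      // val cvModel = cv.fit(trainingData)   // trainingData is an assumed DataFrame
      // cvModel.bestModel.extractParamMap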
  20. Spark 2.3: Model Parallelism: with Pipeline and CrossValidator both taken into account, this typically gives a 2-3x speedup.
  21. Tasks in Distributed ML
      Some algorithms still have no fully distributed implementation (e.g. DBSCAN).
      In some algorithms the same solver/optimizer can behave slightly differently when distributed.
      Data-related problems: missing values, skew, outliers.
      Overall performance degradation caused by network and system latency.
  22. Distributed Deep Learning
  23. Distributed Deep Learning on GPU Cluster
      Issues around the GPU cluster, memory, and communication (Direct RDMA).
      The key is to increase computing cost rather than communication cost.
      http://hoondongkim.blogspot.kr/search/label/GPU
  24. Large Scale Distributed Deep Networks - Google
      Paper: Large Scale Distributed Deep Networks - Google
      Google DistBelief: Downpour SGD, L-BFGS
      Networks with billions of parameters can be trained on 10,000 CPU cores.
  25. Stochastic Gradient Descent
      mini-batch size = m: Batch Gradient Descent
      mini-batch size = 1: Stochastic Gradient Descent
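      A toy plain-Scala illustration of that spectrum for a one-dimensional least-squares model: batchSize equal to
      the dataset size gives batch gradient descent, batchSize = 1 gives pure SGD, and anything in between is
      mini-batch SGD.

      import scala.util.Random

      // Minimize J(w) = 1/2 * mean((w*x - y)^2) with mini-batch gradient descent.
      def sgd(data: Array[(Double, Double)], batchSize: Int, lr: Double, epochs: Int): Double = {
        var w = 0.0
        val rng = new Random(42)
        for (_ <- 1 to epochs) {
          val shuffled = rng.shuffle(data.toSeq)
          for (batch <- shuffled.grouped(batchSize)) {
            // Gradient of J over the batch: mean of (w*x - y) * x.
            val grad = batch.map { case (x, y) => (w * x - y) * x }.sum / batch.size
            w -= lr * grad
          }
        }
        w
      }

      val data = Array.tabulate(100)(i => (i.toDouble / 100, 3.0 * i / 100))   // y = 3x
      val wBatch = sgd(data, batchSize = data.length, lr = 0.5, epochs = 200)  // batch gradient descent
      val wSgd   = sgd(data, batchSize = 1,           lr = 0.5, epochs = 200)  // pure SGD
      println(f"batch GD w = $wBatch%.3f, SGD w = $wSgd%.3f")                  // both approach 3.0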
  26. Sync / Async Stochastic Gradient Descent
      W_{i+1} = W_i − λ · Σ_{j=1..N} ΔW_{i,j}
      Parameter Averaging: problematic with Momentum, Adagrad
      Update-based Approaches: transmit the update values instead of the parameters
      Synchronous vs Asynchronous methods
      Parameter Server: dmlc/ps-lite
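      A toy synchronous, update-based step in plain Scala (no real cluster or parameter server; the "workers" are
      just data shards processed in a loop), applying W_{i+1} = W_i − λ Σ_j ΔW_{i,j}.

      // Toy synchronous data-parallel SGD step for a 1-D least-squares model.
      // Each "worker" holds a data shard and returns its local gradient contribution ΔW_{i,j}.
      def localGradient(w: Double, shard: Array[(Double, Double)]): Double =
        shard.map { case (x, y) => (w * x - y) * x }.sum / shard.length

      def syncStep(w: Double, shards: Seq[Array[(Double, Double)]], lambda: Double): Double = {
        // In a real cluster these run on separate workers; the driver waits for all of them (synchronous).
        val updates = shards.map(shard => localGradient(w, shard))
        w - lambda * updates.sum          // W_{i+1} = W_i - lambda * sum_j DeltaW_{i,j}
      }

      val data = Array.tabulate(1000)(i => (i.toDouble / 1000, 2.0 * i / 1000))   // y = 2x
      val shards = data.grouped(250).toSeq                                        // 4 "workers"
      var w = 0.0
      for (_ <- 1 to 300) w = syncStep(w, shards, lambda = 0.1)
      println(f"w after synchronous updates: $w%.3f")                             // approaches 2.0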
  27. Deep Learning - Model Parallelism: what if the model is larger than a single machine's memory? VGG, GoogLeNet...
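      A toy illustration of the idea in plain Scala (two hypothetical "devices", tiny random-free weights): the
      weights of each half of the model stay on their own device, and only the intermediate activation crosses the
      boundary.

      class DeviceA { // holds the layer-1 weights
        private val w1 = Array.fill(4, 8)(0.01)
        def forward(x: Array[Double]): Array[Double] =
          w1.map(row => math.max(0.0, row.zip(x).map { case (a, b) => a * b }.sum)) // ReLU(W1 x)
      }
      class DeviceB { // holds the layer-2 weights
        private val w2 = Array.fill(2, 4)(0.01)
        def forward(h: Array[Double]): Array[Double] =
          w2.map(row => row.zip(h).map { case (a, b) => a * b }.sum)                // W2 h
      }

      val x = Array.fill(8)(1.0)
      val activation = new DeviceA().forward(x)       // computed where the layer-1 weights live
      val output = new DeviceB().forward(activation)  // only the activation was communicated across devices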
  28. Deep Learning - Model Parallelism
  29. Parallelizing Convolutional Neural Networks
      Paper: One weird trick for parallelizing CNN - Google
  30. Common Design for Distributed Machine Learning, Deep Learning
      Common Modeling
      Data / Model Parallelism
      Distributed DecisionTree, RandomForest, GBT
      Distributed GridSearch, TrainValidationSplit, CrossValidator
      Large Scale Distributed Deep Learning
  31. Reference
      Paper: Large Scale Distributed Deep Networks - Google
      Paper: One weird trick for parallelizing CNN - Google
      YouTube: Scalable Distributed Decision Trees in Spark MLlib
      Blog: Parallel Gradient Boosting Decision Trees
      Blog: Model Parallelism with Spark ML Tuning
      Docs: Parallel Learning in LightGBM
      Docs: MXNet Architecture
