Implementation of linear regression and logistic regression on Spark

This presentation was developed for a course project at the Technical University of Madrid. The course, Massively Parallel Machine Learning, was supervised by Alberto Mozo and Bruno Ordozgoiti.


1. Parallel implementation of ML algorithms on Spark
Dalei Li, EIT Digital
https://github.com/lidalei/LinearLogisticRegSpark
2. Overview
• Linear regression + l2 regularization
  • Normal equation
• Logistic regression + l2 regularization
  • Gradient descent
  • Newton's method
• Hyper-parameter optimization
• Experiments
3. Tools
• IntelliJ + sbt
• Scala 2.11.8 + Spark 2.0.1
4. Linear regression
• Problem formulation
• Closed-form solution
• Computation reformulation
5. Linear regression
• Data set - UCI YearPredictionMSD, text file
• 515,345 songs (90 numerical audio features, year)
• Core computation - the normal-equation terms and RMSE, implemented as per-row outer products + vector additions (see the sketch below)
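As a concrete illustration of the "outer product + vector addition" formulation, here is a minimal sketch of accumulating the normal-equation terms X^T X and X^T y with treeAggregate and solving with Breeze (which delegates to LAPACK). The function and variable names are illustrative assumptions, not the repository's exact code.

    import breeze.linalg.{DenseMatrix, DenseVector, diag}
    import org.apache.spark.rdd.RDD

    // Minimal sketch, assuming rows arrive as (features, label) pairs.
    def solveNormalEquation(data: RDD[(Array[Double], Double)],
                            numFeatures: Int,
                            lambda: Double): DenseVector[Double] = {
      // Accumulate X^T X as a sum of per-row outer products and
      // X^T y as a sum of label-scaled feature vectors.
      val (xtx, xty) = data.treeAggregate(
        (DenseMatrix.zeros[Double](numFeatures, numFeatures),
         DenseVector.zeros[Double](numFeatures)))(
        seqOp = { case ((m, v), (x, y)) =>
          val xv = DenseVector(x)
          (m += xv * xv.t, v += xv * y)  // outer product + vector addition
        },
        combOp = { case ((m1, v1), (m2, v2)) => (m1 += m2, v1 += v2) }
      )
      // Add l2 regularization to the diagonal, then solve with LAPACK.
      (xtx + diag(DenseVector.fill(numFeatures)(lambda))) \ xty
    }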
6. Workflow
Read file (Spark SQL text) => RegexTokenizer => StandardScaler (center data) => add l2 regularization => solve normal equation (LAPACK) => evaluation (RMSE)
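The ingestion part of this workflow might look roughly like the sketch below; the column names and the token-to-vector conversion are assumptions (YearPredictionMSD stores the year first, followed by 90 comma-separated features).

    import org.apache.spark.ml.feature.{RegexTokenizer, StandardScaler}
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("LinearRegression").getOrCreate()
    import spark.implicits._

    // Read the raw text with the Spark SQL text source, one line per song.
    val raw = spark.read.text("YearPredictionMSD.txt")

    // Split each line on commas.
    val tokenized = new RegexTokenizer()
      .setInputCol("value").setOutputCol("tokens")
      .setPattern(",")
      .transform(raw)

    // The year comes first, the 90 audio features follow.
    val parsed = tokenized.map { row =>
      val values = row.getAs[Seq[String]]("tokens").map(_.toDouble)
      (Vectors.dense(values.tail.toArray), values.head)
    }.toDF("features", "label")

    // Center (and scale) the features before solving the normal equation.
    val scaled = new StandardScaler()
      .setInputCol("features").setOutputCol("scaledFeatures")
      .setWithMean(true).setWithStd(true)
      .fit(parsed).transform(parsed)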
7. Validation
Spark ML linear regression with the normal-equation solver vs. my implementation (both with 0.1 l2 regularization). The data set was randomly split into 70% train + 30% test. The RMSEs on the test set are also nearly identical, with less than 0.5% difference.
8. Logistic regression
• Problem formulation
• Gradient descent
• Newton's method
• Computation reformulation - gradient and Hessian matrix
9. Logistic regression
• Data set - UCI HIGGS, csv file
• 11 million instances (21 + 7 numerical features, binary label)
• Core computation - gradient and Hessian matrix
• treeReduce lowers the pressure of the final aggregation on the driver (see the sketch below)
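A minimal sketch of that core computation, assuming each row is a case class Instance(features, label): every partition accumulates a partial gradient and Hessian, and treeAggregate merges the partials in a tree so the driver does not have to combine one result per partition itself. Names are illustrative, not the repository's exact code.

    import breeze.linalg.{DenseMatrix, DenseVector}
    import breeze.numerics.sigmoid
    import org.apache.spark.rdd.RDD

    case class Instance(features: DenseVector[Double], label: Double)

    def gradientAndHessian(data: RDD[Instance],
                           theta: DenseVector[Double],
                           lambda: Double): (DenseVector[Double], DenseMatrix[Double]) = {
      val n = theta.length
      val (grad, hess) = data.treeAggregate(
        (DenseVector.zeros[Double](n), DenseMatrix.zeros[Double](n, n)))(
        seqOp = { case ((g, h), Instance(x, y)) =>
          val p = sigmoid(theta dot x)        // predicted probability
          (g += x * (p - y),                  // gradient of the cross entropy
           h += (x * x.t) * (p * (1.0 - p)))  // Hessian via outer products
        },
        combOp = { case ((g1, h1), (g2, h2)) => (g1 += g2, h1 += h2) }
      )
      // l2 regularization contributes lambda * theta and lambda * I.
      (grad += theta * lambda, hess += DenseMatrix.eye[Double](n) * lambda)
    }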
10. Workflow
Read file (Spark SQL csv) => VectorAssembler => DF to RDD (Scala case class Instance(features, label)) => gradient descent (gradient with l2 regularization) / Newton's method (append an all-one column for the bias) => evaluation (cross entropy, confusion matrix)
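On top of the sketch above, one Newton iteration solves H d = g with Breeze's LAPACK-backed \ operator rather than inverting H explicitly; instances, numFeatures, maxIterations, and lambda are assumed to be in scope.

    import breeze.linalg.DenseVector

    // The all-one column appended for Newton's method makes the last
    // component of theta the bias term.
    var theta = DenseVector.zeros[Double](numFeatures + 1)
    for (_ <- 0 until maxIterations) {
      val (g, h) = gradientAndHessian(instances, theta, lambda)
      theta -= h \ g  // Newton update: theta := theta - H^{-1} g
    }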
11. Validation
Spark ML logistic regression with L-BFGS vs. my implementation of Newton's method. The data set was randomly split into 70% train + 30% test. The learned thetas are almost identical; the last component is the bias.
12. Hyper-parameter optimization
• Grid search to find the hyper-parameters with the best generalization error
• Estimate generalization error
• k-Fold cross validation
A hyper-parameter is a parameter used in the training process but not part of the classifier itself. It controls what kind of parameters can, or tend to, be selected. For example, polynomial expansion makes it possible to learn a non-linear relationship between the label and the features.
13. Grid search
• Grid - [polynomial expansion degree] x [l2 regularization]
• Polynomial expansion is a memory killer
  • Degree 3 on 7 features results in 119 features (see the sketch below)
• Be careful when exploiting parallelism
To increase temporal locality, accesses to a data frame should be clustered in time. Polynomial expansion does not include a constant column.
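Spark's PolynomialExpansion of degree d on n input features produces C(n + d, d) - 1 output columns (the constant term is excluded), so 7 features at degree 3 give C(10, 3) - 1 = 120 - 1 = 119, as stated above. A minimal sketch; 'dataset' is an assumed DataFrame with a vector column named "features":

    import org.apache.spark.ml.feature.PolynomialExpansion

    // Degree 3 on the 7 high-level features yields 119 output features.
    val expanded = new PolynomialExpansion()
      .setInputCol("features")
      .setOutputCol("polyFeatures")
      .setDegree(3)
      .transform(dataset)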
14. K-Fold
Spark SQL data frame => persist + randomSplit => [([train_i], test)], i.e., [([DF], DF)] => map => [(train, test)], i.e., [(union[DF], DF)] (see the sketch below)
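A minimal sketch of that construction, with illustrative names: persist once, split into k folds, and pair each fold (test) with the union of the remaining k - 1 folds (train).

    import org.apache.spark.sql.DataFrame

    def kFoldSplits(df: DataFrame, k: Int): Seq[(DataFrame, DataFrame)] = {
      // Persist before splitting: every fold reads the same parent.
      val folds = df.persist().randomSplit(Array.fill(k)(1.0 / k), seed = 42L)
      folds.indices.map { i =>
        // Union the other k - 1 folds into the training set for fold i.
        val train = folds.indices.filter(_ != i).map(j => folds(j)).reduce(_ union _)
        (train, folds(i))
      }
    }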
15. k-Fold with polynomial expansion (PE)
[Diagram: k-fold cross validation combined with polynomial expansion]
16. Experiments
• Spark 2.0.2 standalone mode, 3 cores + 5GB mem per executor; in total, 3 physical machines with 12GB mem + 8 cores each
• Driver - executes the Scala program
• Worker - executes tasks
• Executor - each application runs one or more executor processes on worker nodes
• Job - triggered by an action
• Task - a unit of work executed on an executor; one task per partition, and the number of partitions is at least the number of 128MB blocks (if set manually, use 2-4 partitions per CPU in the cluster)
• Stage - a set of tasks
• Local file - an exact copy of the read-in file (same path, same content) must exist on each worker node
• http://spark.apache.org/docs/latest/cluster-overview.html
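The 3 cores + 5GB per-executor setting could be requested at submit time roughly as follows; the class name and jar path are placeholders, not the repository's actual artifact.

    spark-submit \
      --master spark://b2.lxd:7077 \
      --class Main \
      --executor-memory 5G \
      --executor-cores 3 \
      target/scala-2.11/linear-logistic-reg_2.11-1.0.jar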
17. Performance test
• ML settings
  • Logistic regression on HIGGS
  • Train-test split, 70% + 30%
  • Only the 7 high-level features were used
• Test unit 1 - 100 iterations of full gradient descent + training error on the training set, initial learning rate 0.001, l2 regularization 0.1
• Test unit 2 - compute the confusion matrix on the test set and make predictions
18. Performance and speedup curve
Running time vs. #executors (average of 2 runs), plotted as training time (s) and training speedup:

  configuration: local  1 executor  2 executors  3 executors  4 executors  5 executors
  speedup:       1      1.822       2.372        2.693        3.641        4.43

Except for local mode, all tests had enough memory. In local mode the data could not be persisted in memory, so its running time is much higher. Adding executors reduces the running time roughly linearly.
19. Grid search
• 10% of the original data, i.e., 1.1 million instances, 7 high-level features only
• Grid
  • Polynomial degrees - 1, 2, 3
  • l2 regularization - 0, 0.001, 0.01, 0.1, 0.5
• 3-fold cross validation
• 100 iterations of gradient descent with initial learning rate 0.01
• 2 executors with 10GB mem + 5 cores each
• Result - 4400s training time, final test accuracy 62.4%
Confusion matrix: truePositive: 117605, trueNegative: 88664, falsePositive: 66529, falseNegative: 57786
20. Conclusion
• Persist data that is used more than once, including when the pipeline branches (sketched below)
• Change the default cluster settings, e.g., executor memory defaults to 1GB per executor
• Use the Spark UI to find bottlenecks
• Use Spark built-in functions where possible; they are also good examples when implementing missing functions
• Don't use accumulators in a transformation, unless approximate results are acceptable
• Always start from small data to debug faster
• Future work - obey the train-test split
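The first point in code form, a minimal sketch with illustrative names ('scaled' is the DataFrame from the earlier sketch; trainModel and evaluate are hypothetical calls): a DataFrame feeding two branches is persisted once so neither branch recomputes the upstream pipeline.

    import org.apache.spark.storage.StorageLevel

    val prepared = scaled.persist(StorageLevel.MEMORY_AND_DISK)

    // Two branches over the same parent: without persist, each action
    // would recompute the whole read/tokenize/scale pipeline.
    val Array(train, test) = prepared.randomSplit(Array(0.7, 0.3))
    val model = trainModel(train)   // hypothetical training call
    evaluate(model, test)           // hypothetical evaluation call

    prepared.unpersist()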
21. Q&A
• Thank you!
• Useful links
  • Master - spark://ip:7077, e.g., spark://b2.lxd:7077
  • Cluster - http://ip:8080/
  • Spark UI - http://ip:4040/
  • https://spark.apache.org/docs/latest/programming-guide.html
  • http://spark.apache.org/docs/latest/submitting-applications.html; package a jar with sbt package
22. Backup slides
23. Training time vs. # executors
[Chart: training time (s) and test accuracy for local and 1-5 executors]
24. Spark UI
Jobs timeline
25. Spark UI
Executor summary
26. Numerical stability
