In this presentation we compare the performance of Spark implementations of important ML algorithms with optimized single-node implementations, and highlight the significant improvements that can be achieved.
Map Reduce
• Allows the distribution of large-data computations across a cluster
• Computations typically composed of a sequence of MR operations
[Diagram: Big Data → Map() → Reduce() → Output]
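The MR model above can be sketched in a few lines of plain Python. This is a single-process illustration of the map/shuffle/reduce phases (word count), not a distributed implementation; the function names are my own.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (key, value) pairs -- here (word, 1) for each word."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key, as the MR framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values -- here, sum the counts."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big compute", "big cluster"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"big": 3, "data": 1, "compute": 1, "cluster": 1}
```

In a real cluster, the map and reduce phases run on different nodes and the shuffle moves data between them; chaining several such passes gives the "sequence of MR operations" described above.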
Optimizing MR
• Many companies have significant legacy MR code
– Either direct MR or indirect usage via Pig
• A variety of techniques to accelerate MR
– Apache Tez
– Tachyon or Apache Ignite
– Apache SystemML
Spark
• Several significant advancements over MR
– Generalizes two-stage MR into arbitrary DAGs
– Enables in-memory dataset caching
– Improved usability
• Reduced disk reads/writes deliver significant speedups
– Especially for iterative algorithms common in ML
Spark Tuning
• Increased reliance on memory introduces a greater requirement for tuning
• Need to understand memory requirements for caching
• Significant performance benefits associated with “getting it right”
• Auto-tuning is coming…
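Understanding caching memory requirements usually starts with back-of-envelope arithmetic like the following. Every number here is an illustrative assumption (dataset shape, overhead factor, executor count), not a Spark constant.

```python
# Back-of-envelope sizing for caching a dense numeric dataset in memory.
rows, cols = 100_000_000, 20        # hypothetical dataset shape
bytes_per_value = 8                 # 64-bit doubles
raw_bytes = rows * cols * bytes_per_value       # 16 GB of raw values

overhead = 2.0                      # assumed JVM object/serialization overhead
cached_bytes = raw_bytes * overhead             # ~32 GB needed to cache

executors = 16                      # hypothetical cluster
per_executor_gb = cached_bytes / executors / 1e9  # cache memory per executor
```

If `per_executor_gb` exceeds the memory actually reserved for caching on each executor, Spark will evict (or spill) partitions and the expected speedup evaporates, which is why "getting it right" matters.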
Optimization opportunities
• Spark delivers improved ML performance using reduced cluster resources
• Enables numerous opportunities
– Reduced time to insights
– Reduced cluster size
– Eliminate subsampling
– AutoML
AutoML
• Data sets increasingly large and complex
• Increasingly difficult to intuitively “know” the optimal
– Feature engineering
– Choice of algorithm
– Optimize parameterization of algorithm(s)
• Significant manual trial-and-error
• Cult of the algorithm
Feature Engineering
• Essential for model performance, efficacy, robustness, and simplicity
– Feature extraction
– Feature selection
– Feature construction
– Feature elimination
• Domain/dataset knowledge is important, but basic automation is feasible
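One example of the "basic automation" that is feasible without domain knowledge is eliminating (near-)constant features by their variance. This is a plain-Python sketch of the idea; the function names and threshold are my own.

```python
def column_variances(rows):
    """Population variance of each column of a row-major dataset."""
    n, n_cols = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    return [sum((r[j] - means[j]) ** 2 for r in rows) / n
            for j in range(n_cols)]

def drop_constant_features(rows, min_variance=1e-12):
    """Basic automated elimination: drop columns that carry no signal."""
    keep = [j for j, v in enumerate(column_variances(rows)) if v > min_variance]
    return keep, [[r[j] for j in keep] for r in rows]

data = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 1.0],
    [3.0, 5.0, 0.0],
]
kept, reduced = drop_constant_features(data)
# kept == [0, 2]: the constant middle column is eliminated
```

Real feature-selection pipelines layer smarter criteria (correlation with the target, model-based importance) on top of simple filters like this one.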
Algorithm selection
• Select dependent column
• Indicate classification or regression
• Press “go”
• Algorithms run in parallel across the cluster
• Minimally provides a good starting point
• Significantly reduces “busy work”
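The "press go" workflow above amounts to evaluating every candidate algorithm concurrently and keeping the best validation score. A minimal sketch, with hypothetical stand-in scorers in place of real trained models:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for candidate algorithms: each would train on the
# data set and return a validation score for the chosen dependent column.
def logistic_like(data): return 0.81
def tree_like(data):     return 0.86
def boosted_like(data):  return 0.84

candidates = {"logistic": logistic_like, "tree": tree_like, "gbt": boosted_like}
data = None  # placeholder for the actual dataset

# "Press go": evaluate every candidate in parallel, keep the best performer.
with ThreadPoolExecutor() as pool:
    futures = {name: pool.submit(fn, data) for name, fn in candidates.items()}
    scores = {name: f.result() for name, f in futures.items()}

best = max(scores, key=scores.get)
# best == "tree"
```

On a cluster the parallelism comes from distributing training jobs rather than threads, but the select-the-max logic is the same.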
Hyperparameter optimization
• Are the default parameters optimal?
• How do I adjust intelligently?
– Number of trees? Depth of trees? Splitting criteria?
• Tedious trial and error
• Overfitting danger
• Intelligent automatic search
Algorithm tuning
• Gradient boosted tree parameterization, e.g.:
– # of trees
– Maximum tree depth
– Loss function
– Minimum node split size
– Bagging rate
– Shrinkage
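A search over the parameters listed above can be sketched as random search: sample configurations from a grid and keep the best-scoring one. The grid values and the scoring function here are illustrative assumptions; a real search would train and validate a gradient boosted tree per configuration.

```python
import random

random.seed(0)

# Search space mirroring the GBT parameters listed above (values assumed).
grid = {
    "n_trees":        [50, 100, 200],
    "max_depth":      [3, 5, 7],
    "min_split_size": [2, 10, 50],
    "bagging_rate":   [0.5, 0.8, 1.0],
    "shrinkage":      [0.01, 0.1, 0.3],
}

def sample(grid):
    """Draw one random configuration from the grid."""
    return {k: random.choice(v) for k, v in grid.items()}

def score(params):
    """Hypothetical validation score; stands in for train-and-evaluate."""
    return -abs(params["max_depth"] - 5) - abs(params["shrinkage"] - 0.1)

best = max((sample(grid) for _ in range(20)), key=score)
```

Random search is a simple instance of the "intelligent automatic search" idea; smarter strategies (Bayesian optimization, successive halving) reuse the same loop with a cleverer sampler.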
AutoML
[Diagram: AutoML pipeline. The data set feeds N candidate algorithms (Alg #1 … Alg #N); 1) investigate the N ML algorithms, 2) tune the top-performing algorithms, with feature engineering and feature elimination narrowing the candidates between stages.]
Spark is for large datasets
• If your data fits on a single node…
• Other high-performance options exist
[Chart: runtime comparison of random forest implementations*]
*http://datascience.la/benchmarking-random-forest-implementations/
*http://haifengl.github.io/smile/index.html
Data set size
• Large data lakes can consist of many small files
• Memory per node is increasing rapidly
*http://www.kdnuggets.com/2015/11/big-ram-big-data-size-datasets.html
Operationalization
• What happens after the models are created?
• How does the business benefit from the insights?
• Operationalization is frequently the weak link
– Operationalizing PowerPoint?
– Hand rolled scoring flows
PFA
• Portable Format for Analytics (PFA)
• Successor to PMML
• Significant flexibility in encapsulating complex data preprocessing
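Since a PFA document is just JSON, model exchange reduces to serializing a document with `input`, `output`, and `action` fields. The sketch below builds the minimal example from the PFA spec's introduction (an engine that adds 100 to its input); real models encode trees, regressions, and preprocessing in the same structure.

```python
import json

# Minimal PFA document, following the spec's introductory example:
# a scoring engine whose action adds 100 to its numeric input.
pfa = {
    "input": "double",
    "output": "double",
    "action": [{"+": ["input", 100]}],
}

doc = json.dumps(pfa)       # the portable artifact handed to a PFA engine
```

Because the artifact is plain JSON rather than code, the same document can be executed by any conformant PFA engine, which is what makes it a practical operationalization target.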
Conclusions
• Spark delivers significant performance improvements over MR
– Can introduce more tuning requirements
• Provides an opportunity for AutoML
– Automatically determine good solutions
• Understand when it's appropriate
• Don't forget about operationalization