In this presentation we compare the performance of Spark implementations of important ML algorithms with optimized single-node implementations, and highlight the significant improvements that can be achieved.
Map Reduce
• Allows the distribution of large-data computations across a cluster
• Computations typically composed of a sequence of MR operations
[Diagram: Big Data → Map() → Reduce() → Output]
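The MR model above can be sketched in a few lines of plain Python. This is a single-process illustration of the map/shuffle/reduce phases (word count), not a distributed implementation; the function names are my own.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (key, value) pairs -- here (word, 1) for each word."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key, as the MR framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values -- here, sum the counts."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big compute", "big cluster"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"big": 3, "data": 1, "compute": 1, "cluster": 1}
```

In a real cluster, the map and reduce phases run on different nodes and the shuffle moves data between them; chaining several such passes gives the "sequence of MR operations" described above.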
Optimizing MR
• Many companies have significant legacy MR code
– Either direct MR or indirect usage via Pig
• A variety of techniques to accelerate MR
– Apache Tez
– Tachyon or Apache Ignite
– Apache SystemML
Spark
• Several significant advancements over MR
– Generalizes two-stage MR into arbitrary DAGs
– Enables in-memory dataset caching
– Improved usability
• Reduced disk reads/writes deliver significant speedups
– Especially for iterative algorithms common in ML
Spark Tuning
• Increased reliance on memory introduces a greater requirement for tuning
• Need to understand memory requirements for caching
• Significant performance benefits associated with “getting it right”
• Auto-tuning is coming…
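Understanding caching memory requirements usually starts with back-of-envelope arithmetic like the following. Every number here is an illustrative assumption (dataset shape, overhead factor, executor count), not a Spark constant.

```python
# Back-of-envelope sizing for caching a dense numeric dataset in memory.
rows, cols = 100_000_000, 20        # hypothetical dataset shape
bytes_per_value = 8                 # 64-bit doubles
raw_bytes = rows * cols * bytes_per_value       # 16 GB of raw values

overhead = 2.0                      # assumed JVM object/serialization overhead
cached_bytes = raw_bytes * overhead             # ~32 GB needed to cache

executors = 16                      # hypothetical cluster
per_executor_gb = cached_bytes / executors / 1e9  # cache memory per executor
```

If `per_executor_gb` exceeds the memory actually reserved for caching on each executor, Spark will evict (or spill) partitions and the expected speedup evaporates, which is why "getting it right" matters.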
Optimization opportunities
• Spark delivers improved ML performance using reduced cluster resources
• Enables numerous opportunities
– Reduced time to insights
– Reduced cluster size
– Eliminate subsampling
– AutoML
AutoML
• Data sets increasingly large and complex
• Increasingly difficult to intuitively “know” the optimal
– Feature engineering
– Choice of algorithm
– Optimize parameterization of algorithm(s)
• Significant manual trial-and-error
• Cult of the algorithm
Feature Engineering
• Essential for model performance, efficacy, robustness, and simplicity
– Feature extraction
– Feature selection
– Feature construction
– Feature elimination
• Domain/dataset knowledge is important, but basic automation is feasible
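One example of the "basic automation" that is feasible without domain knowledge is eliminating (near-)constant features by their variance. This is a plain-Python sketch of the idea; the function names and threshold are my own.

```python
def column_variances(rows):
    """Population variance of each column of a row-major dataset."""
    n, n_cols = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    return [sum((r[j] - means[j]) ** 2 for r in rows) / n
            for j in range(n_cols)]

def drop_constant_features(rows, min_variance=1e-12):
    """Basic automated elimination: drop columns that carry no signal."""
    keep = [j for j, v in enumerate(column_variances(rows)) if v > min_variance]
    return keep, [[r[j] for j in keep] for r in rows]

data = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 1.0],
    [3.0, 5.0, 0.0],
]
kept, reduced = drop_constant_features(data)
# kept == [0, 2]: the constant middle column is eliminated
```

Real feature-selection pipelines layer smarter criteria (correlation with the target, model-based importance) on top of simple filters like this one.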
Algorithm selection
• Select dependent column
• Indicate classification or regression
• Press “go”
• Algorithms run in parallel across the cluster
• Minimally provides a good starting point
• Significantly reduces “busy work”
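The "press go" workflow above amounts to evaluating every candidate algorithm concurrently and keeping the best validation score. A minimal sketch, with hypothetical stand-in scorers in place of real trained models:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for candidate algorithms: each would train on the
# data set and return a validation score for the chosen dependent column.
def logistic_like(data): return 0.81
def tree_like(data):     return 0.86
def boosted_like(data):  return 0.84

candidates = {"logistic": logistic_like, "tree": tree_like, "gbt": boosted_like}
data = None  # placeholder for the actual dataset

# "Press go": evaluate every candidate in parallel, keep the best performer.
with ThreadPoolExecutor() as pool:
    futures = {name: pool.submit(fn, data) for name, fn in candidates.items()}
    scores = {name: f.result() for name, f in futures.items()}

best = max(scores, key=scores.get)
# best == "tree"
```

On a cluster the parallelism comes from distributing training jobs rather than threads, but the select-the-max logic is the same.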
Hyperparameter optimization
• Are the default parameters optimal?
• How do I adjust intelligently?
– Number of trees? Depth of trees? Splitting criteria?
• Tedious trial and error
• Overfitting danger
• Intelligent automatic search
Algorithm tuning
• Gradient boosted tree parameterization, e.g.:
– # of trees
– Maximum tree depth
– Loss function
– Minimum node split size
– Bagging rate
– Shrinkage
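A search over the parameters listed above can be sketched as random search: sample configurations from a grid and keep the best-scoring one. The grid values and the scoring function here are illustrative assumptions; a real search would train and validate a gradient boosted tree per configuration.

```python
import random

random.seed(0)

# Search space mirroring the GBT parameters listed above (values assumed).
grid = {
    "n_trees":        [50, 100, 200],
    "max_depth":      [3, 5, 7],
    "min_split_size": [2, 10, 50],
    "bagging_rate":   [0.5, 0.8, 1.0],
    "shrinkage":      [0.01, 0.1, 0.3],
}

def sample(grid):
    """Draw one random configuration from the grid."""
    return {k: random.choice(v) for k, v in grid.items()}

def score(params):
    """Hypothetical validation score; stands in for train-and-evaluate."""
    return -abs(params["max_depth"] - 5) - abs(params["shrinkage"] - 0.1)

best = max((sample(grid) for _ in range(20)), key=score)
```

Random search is a simple instance of the "intelligent automatic search" idea; smarter strategies (Bayesian optimization, successive halving) reuse the same loop with a cleverer sampler.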
AutoML
[Diagram: AutoML pipeline. The data set feeds N candidate algorithms (Alg #1 … Alg #N); 1) investigate the N ML algorithms, 2) tune the top-performing algorithms, with feature engineering and feature elimination narrowing the candidates between stages.]
Spark is for large datasets
• If your data fits on a single node…
• Other high-performance options exist
[Chart: runtime comparison of random forest implementations*]
*http://datascience.la/benchmarking-random-forest-implementations/
*http://haifengl.github.io/smile/index.html
Data set size
• Large data lakes can consist of many small files
• Memory per node is increasing rapidly
*http://www.kdnuggets.com/2015/11/big-ram-big-data-size-datasets.html
Operationalization
• What happens after the models are created?
• How does the business benefit from the insights?
• Operationalization is frequently the weak link
– Operationalizing PowerPoint?
– Hand rolled scoring flows
PFA
• Portable Format for Analytics (PFA)
• Successor to PMML
• Significant flexibility in encapsulating complex data preprocessing
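Since a PFA document is just JSON, model exchange reduces to serializing a document with `input`, `output`, and `action` fields. The sketch below builds the minimal example from the PFA spec's introduction (an engine that adds 100 to its input); real models encode trees, regressions, and preprocessing in the same structure.

```python
import json

# Minimal PFA document, following the spec's introductory example:
# a scoring engine whose action adds 100 to its numeric input.
pfa = {
    "input": "double",
    "output": "double",
    "action": [{"+": ["input", 100]}],
}

doc = json.dumps(pfa)       # the portable artifact handed to a PFA engine
```

Because the artifact is plain JSON rather than code, the same document can be executed by any conformant PFA engine, which is what makes it a practical operationalization target.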
Conclusions
• Spark delivers significant performance improvements over MR
– Can introduce more tuning requirements
• Provides an opportunity for AutoML
– Automatically determine good solutions
• Understand when it's appropriate
• Don't forget about operationalization