Index conf sparkml-feb20-n-pentreath

DBG / Feb 20, 2018 / © 2018 IBM Corporation
Apache Spark 2.3 Machine Learning Update
Nick Pentreath
Principal Engineer, IBM
@MLnick

About
@MLnick on Twitter & Github
Principal Engineer, IBM
Spark Technology Center & Cognitive Open Technologies
Machine Learning & AI
Apache Spark committer & PMC
Author of Machine Learning with Spark
Various conferences & meetups

Agenda
Apache Spark 2.3 - MLlib Highlights
Deeper Dive into New Scalability Features
Summary and Future Directions

Machine Learning Highlights
Apache Spark 2.3 Release
• First-class support for loading image data into DataFrames
• Enhanced usability and scalability of commonly used feature transformers
  • New one-hot encoder supporting multiple columns and fixing a major issue with the existing transformer
  • Multiple column support for QuantileDiscretizer and Bucketizer transformers (StringIndexer support is WIP)
• Scaling out cross-validation with parallel pipeline evaluation
• Scalable feature hashing transformer with multiple column support
• Robust linear regression with Huber loss
• Further parity work for DataFrame statistics functions
• Improved support for creating custom pipeline components in Python

Performance Issues
Multi-Column Transformers
• Most commonly used pipeline components operate on a single column at a time
• Impacts "wide" datasets: many feature columns that need to be processed in the same manner
  • StringIndexer, QuantileDiscretizer, Bucketizer, OneHotEncoder; also impacts RFormula
• Pipeline needs separate transformers for each column
  • Makes constructing pipeline cumbersome
  • Negatively impacts performance:
    • Inefficient computation
    • SQL query planning cost overhead (also grows with each additional pipeline stage)
• Solution
  • Multiple column support for relevant transformers
    • QuantileDiscretizer, Bucketizer, OneHotEncoder in 2.3
    • StringIndexer: WIP
  • New OneHotEncoderEstimator
    • Supports multi-column transform
    • Also fixes the issue that the legacy encoder is a Transformer, not a Model, so it cannot be used on new data where the number of categories is different from training data
  • RFormula also benefits by using new OneHotEncoderEstimator and will use updated StringIndexer
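The Transformer-versus-Model point above can be illustrated with a toy, pure-Python sketch. This is not Spark code: `ToyOneHotModel` and `fit_one_hot` are hypothetical names, and unlike Spark's encoder this toy keeps every category rather than dropping the last one. The idea it shows is that a fitted model stores the category count of each column from the training data, so new data is encoded into vectors of the same size, and several columns are handled together.

```python
from typing import Dict, List

Row = Dict[str, int]  # column name -> category index (e.g. from an indexer)


class ToyOneHotModel:
    """Fitted state: number of categories seen per column during training."""

    def __init__(self, sizes: Dict[str, int]) -> None:
        self.sizes = sizes

    def transform(self, rows: List[Row]) -> List[Dict[str, List[float]]]:
        out = []
        for row in rows:
            encoded = {}
            for col, size in self.sizes.items():
                idx = row[col]
                if idx >= size:
                    raise ValueError(f"unseen category {idx} in column {col!r}")
                vec = [0.0] * size
                vec[idx] = 1.0
                encoded[col] = vec
            out.append(encoded)
        return out


def fit_one_hot(rows: List[Row], cols: List[str]) -> ToyOneHotModel:
    """Learn the vector size of every requested column in one pass."""
    sizes = {col: 0 for col in cols}
    for row in rows:
        for col in cols:
            sizes[col] = max(sizes[col], row[col] + 1)
    return ToyOneHotModel(sizes)


train = [{"a": 0, "b": 2}, {"a": 1, "b": 0}]
model = fit_one_hot(train, ["a", "b"])
# New data with fewer distinct categories still gets training-sized vectors.
encoded = model.transform([{"a": 0, "b": 1}])
print(encoded)
```

A stateless encoder has no fitted sizes to fall back on, which is the gap the estimator/model split is meant to close.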

Performance Test Results
Multi-Column Transformers
Multi- vs single-column Execution Time (chart: Fit and Transform times for the single-column and multi-column pipelines)
• Pipeline with single-column transformers is 2.7x slower than multi-column version
• Input DataFrame with 1 million rows and 100 random numerical columns
• Pipeline consists of quantile discretization into 10 buckets for each column, followed by one-hot encoding
Setup: 3 executors, 64 GB memory, 144 partitions; execution times averaged over 3 runs

Scaling Hyperparameter Optimization
Parallel Cross-Validation
• Hyperparameter optimization (HPO) using cross-validation is computationally expensive
  • The number of parameter combinations (and thus pipelines being evaluated) can increase extremely rapidly
• Common use case is tuning many moderately-sized models
  • Want to optimize utilization of valuable cluster resources
  • Currently computed serially in Spark ML
• Solution
  • Enable parallel evaluation of pipelines in TrainValidationSplit and CrossValidator
  • A set of candidate pipelines is trained and evaluated concurrently
  • Configurable level of parallelism
  • Dedicated threadpool
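As a rough illustration of the idea above (not Spark's implementation), the sketch below builds a small parameter grid and evaluates the candidates on a dedicated thread pool with a configurable level of parallelism. The grid values and the `train_and_evaluate` scoring function are made up so that the example runs without a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Any, Dict, List, Tuple

# Hypothetical grid: 3 x 3 x 2 = 18 candidates from only three parameters.
grid = {
    "reg_param": [0.01, 0.1, 1.0],
    "max_iter": [10, 50, 100],
    "fit_intercept": [True, False],
}
candidates: List[Dict[str, Any]] = [
    dict(zip(grid, values)) for values in product(*grid.values())
]


def train_and_evaluate(params: Dict[str, Any]) -> float:
    """Stand-in for fitting one pipeline and scoring it on validation data."""
    # Made-up score; a real job would train and evaluate a model here.
    return -abs(params["reg_param"] - 0.1) - 0.001 * params["max_iter"]


def evaluate_all(parallelism: int) -> List[Tuple[Dict[str, Any], float]]:
    # The dedicated pool bounds how many candidates are in flight at once.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        scores = list(pool.map(train_and_evaluate, candidates))
    return list(zip(candidates, scores))


results = evaluate_all(parallelism=3)
best_params, best_score = max(results, key=lambda item: item[1])
print(len(candidates), best_params, best_score)
```

With `parallelism=1` the same code degenerates to serial evaluation, which makes it easy to compare the two modes.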

TrainValidationSplit Execution Time (chart: Serial vs Parallel (p=3), x-axis from 1e+06 to 5e+06)
CrossValidator Execution Time (chart: Serial vs Parallel (p=3), x-axis from 1e+06 to 5e+06)
10 partitions, 30 core cluster, execution times averaged over 5 runs

• 2x to 2.7x speed-up in execution time with parallelism level of 3
• Balance more parallelism with available cluster resources to achieve ideal utilization
  • Roughly # cores / data partitions …
  • … but watch memory usage
  • At most p models in memory at once, as well as cached training and evaluation data
  • Should use lower parallelism for larger models and data sizes
• Parallelism level < 10 is a good "rule of thumb" for many clusters
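The sizing guidance above can be written down as a small helper. This is only a sketch of the heuristic as stated on the slide: `suggest_parallelism` and its `max_level` default are hypothetical, and memory pressure (p models plus cached data) is not modelled, so treat the result as a starting point.

```python
def suggest_parallelism(total_cores: int, data_partitions: int, max_level: int = 9) -> int:
    """Rough starting point: cores per data partition, clamped to [1, max_level]."""
    if total_cores < 1 or data_partitions < 1:
        raise ValueError("cores and partitions must be positive")
    by_cores = total_cores // data_partitions
    # max_level=9 encodes the "parallelism level < 10" rule of thumb.
    return max(1, min(by_cores, max_level))


# 30 cores and 10 partitions, as in the benchmark setup above.
print(suggest_parallelism(30, 10))
```

For the 30-core, 10-partition setup this gives 3, the parallelism level used in the benchmark.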

Scalable Feature Extraction
Feature Hashing
• Use hashing trick to map feature values to indices in a feature vector of much lower size => dimensionality reduction
• Pros
  • Fast & simple
  • Preserves sparsity
  • Memory efficient
  • Potential for feature engineering on the fly
• Cons
  • No inverse mapping from indices -> feature names
  • Hash collisions
• New FeatureHasher transformer
  • Modelled on scikit-learn and Vowpal Wabbit
  • Operates on multiple columns, performing feature extraction and vectorization in one pass
  • Handles categorical and numerical features
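A minimal pure-Python sketch of the hashing trick described above, under stated assumptions: it uses MD5 from the standard library for a deterministic hash (Spark's FeatureHasher uses MurmurHash3 instead), and `bucket` and `hash_features` are hypothetical helper names. Categorical values are hashed as "column=value" with an indicator of 1.0, numerical values are stored at the index of the hashed column name, and colliding features are summed.

```python
import hashlib
from typing import Dict, Union

Value = Union[str, float, int]


def bucket(key: str, num_features: int) -> int:
    """Deterministically hash a string key into [0, num_features)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features


def hash_features(row: Dict[str, Value], num_features: int) -> Dict[int, float]:
    """Return a sparse vector (index -> value) for one multi-column row."""
    vec: Dict[int, float] = {}
    for col, value in row.items():
        if isinstance(value, str):
            # Categorical: hash "column=value", indicator value 1.0.
            idx, x = bucket(f"{col}={value}", num_features), 1.0
        else:
            # Numerical: hash the column name, keep the numeric value.
            idx, x = bucket(col, num_features), float(value)
        # Colliding features are summed into the same slot.
        vec[idx] = vec.get(idx, 0.0) + x
    return vec


row = {"country": "ZA", "device": "mobile", "clicks": 3}
sparse = hash_features(row, num_features=2 ** 18)
print(sparse)
```

Because only non-zero slots are stored, the output stays sparse however large `num_features` is; the cost is that an index cannot be mapped back to a feature name.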

Feature Hashing
Criteo DAC Dataset AUC (chart: Hashing vs Full encoding, x-axis from 18 to 24)
• Criteo Display Advertising Challenge Data
  • 45 million examples
  • 34 million features
  • High cardinality categorical features
  • Extreme sparsity
• Also tested on Criteo 1T log data
  • 7 day subset: 1.5 billion examples
  • 300 million features
  • 2^24 hashed features (16m)
  • Impossible with Spark ML full encoding approach (OOM, 2 GB broadcast limit)
Setup: 4 executors, 64 GB memory, 192 partitions

Summary
• More progress on feature parity between RDD- and DataFrame-based APIs (mllib vs ml)
• Native image support
• New scalability features for commonly-used components
• Plenty of bug fixes and usability enhancements
• http://spark.apache.org

Potential Directions for 2.4+
• Continued focus on enhancing usability and scalability of commonly used feature transformers
  • Complete multi-column support for string indexing
  • This also feeds into RFormula performance
  • Complete Python API coverage
• Parallel cross-validation: smarter caching and execution planning
• Feature parity and enhancements for evaluation metrics, in particular ranking metrics
• Gradient Boosted Trees
  • Room for major improvements
  • Potential to incorporate latest enhancements from popular open-source tools such as XGBoost and LightGBM
  • See https://github.com/zhengruifeng/SparkGBM for example
• Feature hashing
  • Signed hash functions
  • Feature crossing / namespace support
  • DictVectorizer (SPARK-19962)
• What about 3.0?

Thank you!
Nick Pentreath
Principal Engineer
—
nickp@za.ibm.com
@MLnick
ibm.com

