DBG / Feb 20, 2018 / © 2018 IBM Corporation
Apache Spark 2.3

Machine Learning Update

—

Nick Pentreath

Principal Engineer, IBM



@MLnick
About
@MLnick on Twitter & Github
Principal Engineer, IBM
Spark Technology Center & Cognitive Open
Technologies
Machine Learning & AI
Apache Spark committer & PMC
Author of Machine Learning with Spark
Various conferences & meetups
Agenda
Apache Spark 2.3 - MLlib Highlights
Deeper Dive into New Scalability Features
Summary and Future Directions
Machine Learning Highlights
Apache Spark 2.3 Release
• First-class support for loading image data into
DataFrames
• Enhanced usability and scalability of commonly
used feature transformers
• New one hot encoder supporting multiple columns
and fixing a major issue with existing transformer
• Multiple column support for QuantileDiscretizer and
Bucketizer transformers (StringIndexer support is
WIP)
• Scaling out cross-validation with parallel
pipeline evaluation
• Scalable feature hashing transformer with
multiple column support
• Robust linear regression with Huber loss
• Further parity work for DataFrame statistics
functions
• Improved support for creating custom pipeline
components in Python
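The robust regression item above relies on the Huber loss, which is quadratic for small residuals and linear for large ones, so outliers pull the fit less than they do under squared error. A minimal pure-Python sketch of the textbook loss follows; it is not Spark's implementation, and the name `huber_loss` and the default threshold `delta` are illustrative assumptions.

```python
def huber_loss(residual: float, delta: float = 1.35) -> float:
    """Huber loss for one residual (prediction minus label).

    `delta` is an assumed threshold separating the two regimes.
    """
    a = abs(residual)
    if a <= delta:
        # Small residual: behaves like squared error.
        return 0.5 * residual * residual
    # Large residual: grows linearly, which limits the influence of outliers.
    return delta * (a - 0.5 * delta)
```

The two branches meet at `abs(residual) == delta`, so the loss stays continuous across the switch.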
Performance Issues
Multi-Column Transformers
• Most commonly used pipeline components
operate on a single column at a time
• Impacts “wide” datasets – many feature columns that
need to be processed in the same manner
• StringIndexer, QuantileDiscretizer, Bucketizer,
OneHotEncoder, also impacts RFormula
• Pipeline needs separate transformers for each
column
• Makes constructing pipeline cumbersome
• Negatively impacts performance:
• Inefficient computation
• SQL query planning cost overhead (also grows with
each additional pipeline stage)
• Solution
• Multiple column support for relevant transformers
• QuantileDiscretizer, Bucketizer, OneHotEncoder in
2.3
• StringIndexer – WIP
• New OneHotEncoderEstimator
• Supports multi-column transform
• Also fixes a flaw in the legacy encoder: it is a
Transformer, not an Estimator that fits a Model, so it
cannot be used reliably on new data where the number
of categories differs from the training data
• RFormula also benefits by using new
OneHotEncoderEstimator and will use updated
StringIndexer
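The estimator/model split described above can be illustrated without Spark. The sketch below is a simplified pure-Python stand-in, not the OneHotEncoderEstimator API: the class names, the dict-based rows, and the assumption that inputs are already category indices are all hypothetical, and it omits options the real component has. It shows the two points from the slide: one `fit` pass handles several columns, and the vector sizes are frozen at fit time so new data is encoded consistently.

```python
from typing import Dict, List, Sequence


class OneHotModel:
    """Fitted encoder: the vector size per column is frozen at fit time."""

    def __init__(self, sizes: Dict[str, int]):
        self.sizes = sizes

    def transform(self, row: Dict[str, int]) -> Dict[str, List[float]]:
        out = {}
        for col, size in self.sizes.items():
            idx = row[col]
            if not 0 <= idx < size:
                raise ValueError(f"unseen category index {idx} in column {col!r}")
            vec = [0.0] * size
            vec[idx] = 1.0
            out[col] = vec
        return out


class OneHotEstimator:
    """Learns category counts for several columns in a single pass."""

    def __init__(self, input_cols: Sequence[str]):
        self.input_cols = list(input_cols)

    def fit(self, rows: Sequence[Dict[str, int]]) -> OneHotModel:
        sizes = {col: 0 for col in self.input_cols}
        for row in rows:  # one pass over the data covers every column
            for col in self.input_cols:
                sizes[col] = max(sizes[col], row[col] + 1)
        return OneHotModel(sizes)


# Usage: fit once on training rows, then reuse the model on new rows.
train = [{"a": 0, "b": 2}, {"a": 1, "b": 0}]
model = OneHotEstimator(["a", "b"]).fit(train)
encoded = model.transform({"a": 1, "b": 0})
```

Because the sizes live in the fitted model, a new row with fewer distinct categories still produces vectors of the training-time length, which is the property the legacy transformer lacked.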
Performance Test Results
Multi-Column Transformers
[Chart: Multi- vs single-column execution time. Fit and Transform times for the single-column and multi-column pipelines; y-axis 0 to 30, units not shown]
• Pipeline with single-column transformers is
2.7x slower than multi-column version
• Input DataFrame with 1 million rows and 100
random numerical columns
• Pipeline consists of quantile discretization into
10 buckets for each column, followed by one-
hot-encoding
Setup: 3 executors, 64 GB memory, 144 partitions; execution times averaged over 3 runs
Scaling Hyperparameter Optimization
Parallel Cross-Validation
• HPO using cross-validation is computationally
expensive
• The number of parameter combinations
(and thus pipelines being evaluated) can
increase extremely rapidly
• Common use case is tuning many moderately-
sized models
• Want to optimize utilization of valuable cluster
resources
• Before 2.3, candidates were evaluated serially in Spark ML
• Solution
• Enable parallel evaluation of pipelines in
TrainValidationSplit and CrossValidator
• A set of candidate pipelines are trained and
evaluated concurrently
• Configurable level of parallelism
• Dedicated threadpool
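The idea above, evaluating candidate parameter settings concurrently on a dedicated thread pool with a configurable level of parallelism, can be sketched in plain Python. This is an illustration of the pattern, not Spark's code: `param_grid`, `parallel_search`, and `toy_score` are hypothetical names, and the scoring function stands in for "fit a pipeline and evaluate it". In Spark 2.3 the level is set through the `parallelism` parameter on `CrossValidator` and `TrainValidationSplit`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Callable, Dict, List, Tuple

ParamMap = Dict[str, float]


def param_grid(grid: Dict[str, List[float]]) -> List[ParamMap]:
    """Expand a grid into every parameter combination."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def parallel_search(
    candidates: List[ParamMap],
    fit_and_score: Callable[[ParamMap], float],
    parallelism: int = 1,
) -> Tuple[ParamMap, float]:
    """Score candidates on a dedicated thread pool; higher scores are better."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # pool.map preserves the order of `candidates`.
        scores = list(pool.map(fit_and_score, candidates))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


def toy_score(params: ParamMap) -> float:
    """Stand-in for training and evaluating one pipeline (assumed metric)."""
    return -((params["reg"] - 0.1) ** 2) - (params["depth"] - 5.0) ** 2


# 3 x 2 values already give 6 candidates; grids grow multiplicatively.
grid = param_grid({"reg": [0.01, 0.1, 1.0], "depth": [3.0, 5.0]})
best_params, best_score = parallel_search(grid, toy_score, parallelism=3)
```

Threads suit this pattern when each candidate mostly waits on work submitted to the cluster, so the driver only coordinates.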
Performance Test Results
Parallel Cross-Validation
[Chart: TrainValidationSplit execution time, Serial vs Parallel (p=3); x-axis 1e+06 to 5e+06, y-axis 0 to 22,000, units not shown]
[Chart: CrossValidator execution time, Serial vs Parallel (p=3); x-axis 1e+06 to 5e+06, y-axis 0 to 80,000, units not shown]
Setup: 10 partitions, 30-core cluster; execution times averaged over 5 runs
Performance Test Results
Parallel Cross-Validation
• 2x to 2.7x speedup in execution time with a
parallelism level of 3
• Balance more parallelism with available cluster
resources to achieve ideal utilization
• Roughly (number of cores) / (number of data partitions) …
• … but watch memory usage
• At most p models in memory at once as well as
cached training and evaluation data
• Should use lower parallelism for larger models and
data sizes
• Parallelism level < 10 is a good “rule of thumb”
for many clusters
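The sizing heuristic above is simple arithmetic. A small sketch, where the function name and the `max_level` default are assumptions that encode the "below 10" rule of thumb; it deliberately ignores memory, which the slide says must be checked separately:

```python
def suggested_parallelism(total_cores: int, data_partitions: int, max_level: int = 9) -> int:
    """Roughly cores / partitions, at least 1 and kept below 10 by default."""
    if total_cores < 1 or data_partitions < 1:
        raise ValueError("cores and partitions must be positive")
    return max(1, min(max_level, total_cores // data_partitions))


# The benchmark setup reported for these tests: 30 cores, 10 partitions.
level = suggested_parallelism(total_cores=30, data_partitions=10)
```

For the 30-core, 10-partition benchmark cluster this gives 3, which matches the parallelism level used in the reported tests.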
Scalable Feature Extraction
Feature Hashing
• Use hashing trick to map feature values to
indices in a feature vector of much lower size
=> dimensionality reduction
• Pros
• Fast & simple
• Preserves sparsity
• Memory efficient
• Potential for feature engineering on the fly
• Cons
• No inverse mapping from indices -> feature names
• Hash collisions
• New FeatureHasher transformer
• Modelled on scikit-learn and Vowpal Wabbit
• Operates on multiple columns, performing feature
extraction and vectorization in one pass
• Handles categorical and numerical features
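The hashing trick described above can be sketched in a few lines of plain Python. This is an illustration, not the FeatureHasher implementation: `hash_features` is a hypothetical name, `zlib.crc32` is swapped in as a convenient standard-library hash, and the default vector size is an assumption. Categorical values are hashed as `column=value` with an indicator of 1.0, numerical values are hashed by column name and keep their value, and every column is handled in one pass over the row.

```python
import zlib
from typing import Dict, Mapping, Union

Value = Union[str, bool, int, float]


def hash_features(row: Mapping[str, Value], num_features: int = 2 ** 18) -> Dict[int, float]:
    """Map one row to a sparse vector {index: value} via the hashing trick."""
    vec: Dict[int, float] = {}
    for col, value in row.items():
        if isinstance(value, (str, bool)):
            # Categorical: hash "column=value" and store an indicator.
            key, weight = f"{col}={value}", 1.0
        else:
            # Numerical: hash the column name and keep the numeric value.
            key, weight = col, float(value)
        index = zlib.crc32(key.encode("utf-8")) % num_features
        # Colliding features are summed into the same slot.
        vec[index] = vec.get(index, 0.0) + weight
    return vec


# Usage: mixed categorical and numerical columns, no fitted vocabulary needed.
sparse = hash_features({"country": "ZA", "device": "mobile", "clicks": 3.0})
```

Only non-zero entries are stored, which preserves sparsity, and no index-to-name dictionary is kept, which is why the mapping cannot be inverted.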
Performance Test Results
Feature Hashing
[Chart: Criteo DAC dataset AUC, Hashing vs Full encoding; x-axis 18 to 24, y-axis 0.740 to 0.770]
• Criteo Display Advertising Challenge Data
• 45 million examples
• 34 million features
• High cardinality categorical features
• Extreme sparsity
• Also tested on Criteo 1T log data
• 7 day subset – 1.5 billion examples
• 300 million features
• 2^24 hashed features (16M)
• Impossible with Spark ML full encoding approach
(OOM, 2 GB broadcast limit)
Setup: 4 executors, 64 GB memory, 192 partitions
Summary
Summary and Future Directions
• More progress on feature parity between RDD-
and DataFrame-based APIs (mllib vs ml)
• Native image support
• New scalability features for commonly-used
components
• Plenty of bug fixes and usability enhancements
• http://spark.apache.org
Potential Directions for 2.4+
Summary and Future Directions
• Continued focus on enhancing usability and
scalability of commonly used feature
transformers
• Complete multi-column support for string indexing
• This also feeds into RFormula performance
• Complete Python API coverage
• Parallel cross-validation – smarter caching and
execution planning
• Feature parity and enhancements for evaluation
metrics – in particular ranking metrics
• Gradient Boosted Trees
• Room for major improvements
• Potential to incorporate latest enhancements from
popular open-source tools such as XGBoost and
LightGBM
• See https://github.com/zhengruifeng/SparkGBM for
example
• Feature hashing
• Signed hash functions
• Feature crossing / namespace support
• DictVectorizer (SPARK-19962)
• What about 3.0?
Thank you!
Nick Pentreath
Principal Engineer
—
nickp@za.ibm.com
@MLnick
ibm.com
Links & References
Apache Spark Downloads
MLlib User Guide
Multi-column Support JIRA
Parallel Cross-validation JIRA
Feature Hashing JIRA
Index conf sparkai-feb20-n-pentreath
 

Index conf sparkml-feb20-n-pentreath

  • 1. DBG / Feb 20, 2018 / © 2018 IBM Corporation Apache Spark 2.3
 Machine Learning Update
 —
 Nick Pentreath
 Principal Engineer, IBM
 
 @MLnick
  • 2. About
 • @MLnick on Twitter & GitHub
 • Principal Engineer, IBM
 • Spark Technology Center & Cognitive Open Technologies
 • Machine Learning & AI
 • Apache Spark committer & PMC
 • Author of Machine Learning with Spark
 • Various conferences & meetups
  • 3. Agenda
 • Apache Spark 2.3 - MLlib Highlights
 • Deeper Dive into New Scalability Features
 • Summary and Future Directions
  • 4. Machine Learning Highlights
 Apache Spark 2.3 Release
 • First-class support for loading image data into DataFrames
 • Enhanced usability and scalability of commonly used feature transformers
 • New one hot encoder supporting multiple columns and fixing a major issue with existing transformer
 • Multiple column support for QuantileDiscretizer and Bucketizer transformers (StringIndexer support is WIP)
 • Scaling out cross-validation with parallel pipeline evaluation
 • Scalable feature hashing transformer with multiple column support
 • Robust linear regression with Huber loss
 • Further parity work for DataFrame statistics functions
 • Improved support for creating custom pipeline components in Python
  • 5. Performance Issues
 Multi-Column Transformers
 • Most commonly used pipeline components operate on a single column at a time
 • Impacts “wide” datasets – many feature columns that need to be processed in the same manner
 • StringIndexer, QuantileDiscretizer, Bucketizer, OneHotEncoder, also impacts RFormula
 • Pipeline needs separate transformers for each column
 • Makes constructing pipeline cumbersome
 • Negatively impacts performance:
 • Inefficient computation
 • SQL query planning cost overhead (also grows with each additional pipeline stage)
 • Solution
 • Multiple column support for relevant transformers
 • QuantileDiscretizer, Bucketizer, OneHotEncoder in 2.3
 • StringIndexer – WIP
 • New OneHotEncoderEstimator
 • Supports multi-column transform
 • Also fixes the issue that the legacy encoder is a Transformer, not a Model; so it cannot be used on new data where the number of categories is different from training data
 • RFormula also benefits by using new OneHotEncoderEstimator and will use updated StringIndexer
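The Transformer-versus-Model point on this slide can be illustrated with a minimal pure-Python sketch. It is not Spark's OneHotEncoderEstimator; `fit_one_hot` and `OneHotModel` are hypothetical names. The idea shown is only that category sizes are learned once at fit time and reused, so new data is encoded to the same vector width as the training data.

```python
class OneHotModel:
    """Hypothetical fitted encoder: remembers the category count per column."""

    def __init__(self, sizes):
        self.sizes = sizes  # column name -> number of categories seen in training

    def transform(self, row):
        """Encode one row of category indices using the sizes learned at fit time."""
        encoded = []
        for col in sorted(self.sizes):
            size, idx = self.sizes[col], row[col]
            if not 0 <= idx < size:
                raise ValueError(f"unseen category index {idx} in column {col!r}")
            one_hot = [0] * size
            one_hot[idx] = 1
            encoded.extend(one_hot)
        return encoded


def fit_one_hot(rows):
    """Learn the number of categories for every column in a single pass."""
    sizes = {}
    for row in rows:
        for col, idx in row.items():
            sizes[col] = max(sizes.get(col, 0), idx + 1)
    return OneHotModel(sizes)


train = [{"color": 0, "size": 1}, {"color": 2, "size": 0}]
model = fit_one_hot(train)

# The new row never shows color index 2, yet the output width stays at 3 + 2 = 5.
encoded = model.transform({"color": 1, "size": 0})
```

A stateless encoder that inferred the width from each incoming dataset would produce vectors of different lengths for training and new data, which is the issue the slide describes.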
  • 6. Performance Test Results
 Multi-Column Transformers
 [Figure: Multi- vs single-column execution time, split into Fit and Transform, for single-column and multi-column pipelines; axis 0 to 30]
 • Pipeline with single-column transformers is 2.7x slower than multi-column version
 • Input DataFrame with 1 million rows and 100 random numerical columns
 • Pipeline consists of quantile discretization into 10 buckets for each column, followed by one-hot encoding
 3 executors 64 GB memory, 144 partitions, execution times averaged over 3 runs
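As a rough illustration of what the benchmarked pipeline computes, the sketch below discretizes several numeric columns into 10 quantile buckets and one-hot encodes the bucket index, handling all columns in one fit and one transform call. It is plain Python rather than Spark code, the function names are hypothetical, and it says nothing about the timings above.

```python
import bisect
import random
import statistics


def fit_splits(columns, num_buckets=10):
    """Learn num_buckets - 1 quantile cut points for every column at once."""
    return {name: statistics.quantiles(values, n=num_buckets)
            for name, values in columns.items()}


def transform(columns, splits, num_buckets=10):
    """Bucketize each value, then one-hot encode the bucket index, for all columns."""
    n_rows = len(next(iter(columns.values())))
    rows = []
    for i in range(n_rows):
        encoded = []
        for name in sorted(columns):
            # bisect_right over 9 cut points yields a bucket index in 0..9
            bucket = bisect.bisect_right(splits[name], columns[name][i])
            one_hot = [0] * num_buckets
            one_hot[bucket] = 1
            encoded.extend(one_hot)
        rows.append(encoded)
    return rows


random.seed(0)
columns = {f"c{j}": [random.random() for _ in range(200)] for j in range(3)}
splits = fit_splits(columns)
encoded_rows = transform(columns, splits)
```

Each output row has 10 slots per input column with exactly one slot set per column.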
  • 7. Scaling Hyperparameter Optimization
 Parallel Cross-Validation
 • HPO using cross-validation is computationally expensive
 • The number of parameter combinations (and thus pipelines being evaluated) can increase extremely rapidly
 • Common use case is tuning many moderately sized models
 • Want to optimize utilization of valuable cluster resources
 • Currently computed serially in Spark ML
 • Solution
 • Enable parallel evaluation of pipelines in TrainValidationSplit and CrossValidator
 • A set of candidate pipelines is trained and evaluated concurrently
 • Configurable level of parallelism
 • Dedicated threadpool
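The idea of evaluating candidate pipelines concurrently on a dedicated thread pool with a configurable parallelism level can be sketched in plain Python. This is not Spark's implementation: `evaluate` is a hypothetical stand-in for fitting and scoring one pipeline, and the toy metric is invented purely so the example runs.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Hypothetical stand-in for fitting one pipeline and scoring it on held-out data."""
    # Toy metric: best at reg == 0.1, slightly better with more iterations.
    return -abs(params["reg"] - 0.1) + 0.001 * params["max_iter"]


def grid(param_lists):
    """Expand lists of values into every parameter combination."""
    keys = sorted(param_lists)
    return [dict(zip(keys, values))
            for values in product(*(param_lists[k] for k in keys))]


def parallel_search(param_lists, parallelism=3):
    """Score all candidates, with at most `parallelism` evaluations in flight."""
    candidates = grid(param_lists)
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        scores = list(pool.map(evaluate, candidates))  # results keep candidate order
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best], len(candidates)


best_params, best_score, n_candidates = parallel_search(
    {"reg": [0.01, 0.1, 1.0], "max_iter": [10, 50]}, parallelism=3)
```

Even this small grid has 3 x 2 = 6 candidates, which shows how quickly the number of pipelines grows. In Spark 2.3 the corresponding knob is the parallelism parameter on CrossValidator and TrainValidationSplit.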
  • 8. Performance Test Results
 Parallel Cross-Validation
 [Figure: TrainValidationSplit execution time, serial vs. parallel (p=3); x-axis 1e+06 to 5e+06, y-axis 0 to 22,000]
 [Figure: CrossValidator execution time, serial vs. parallel (p=3); x-axis 1e+06 to 5e+06, y-axis 0 to 80,000]
 10 partitions, 30 core cluster, execution times averaged over 5 runs
  • 9. Performance Test Results
 Parallel Cross-Validation
 • 2–2.7x speedup in execution time with parallelism level of 3
 • Balance more parallelism with available cluster resources to achieve ideal utilization
 • Roughly # cores / data partitions …
 • … but watch memory usage
 • At most p models in memory at once as well as cached training and evaluation data
 • Should use lower parallelism for larger models and data sizes
 • Parallelism level < 10 is a good “rule of thumb” for many clusters
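The sizing heuristic on this slide can be written as a small helper. This is one possible reading of "roughly # cores / data partitions" combined with the "< 10" rule of thumb; `suggest_parallelism` is a hypothetical function, not a Spark API, and it does not account for memory, which the slide flags as a separate constraint.

```python
def suggest_parallelism(total_cores, data_partitions, upper_bound=10):
    """Heuristic starting point: cores per partition, kept within 1..upper_bound - 1."""
    if total_cores <= 0 or data_partitions <= 0:
        raise ValueError("total_cores and data_partitions must be positive")
    return max(1, min(total_cores // data_partitions, upper_bound - 1))


# The benchmark configuration from the previous slide: 30 cores, 10 partitions.
benchmark_level = suggest_parallelism(30, 10)
```

For the benchmark cluster this gives 30 // 10 = 3, the parallelism level used in the reported tests. Larger models or datasets would argue for a lower value than the heuristic suggests.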
  • 10. Scalable Feature Extraction
 Feature Hashing
 • Use hashing trick to map feature values to indices in a feature vector of much lower size => dimensionality reduction
 • Pros
 • Fast & simple
 • Preserves sparsity
 • Memory efficient
 • Potential for feature engineering on the fly
 • Cons
 • No inverse mapping from indices -> feature names
 • Hash collisions
 • New FeatureHasher transformer
 • Modelled on scikit-learn and Vowpal Wabbit
 • Operates on multiple columns, performing feature extraction and vectorization in one pass
 • Handles categorical and numerical features
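A minimal sketch of the hashing trick over several columns in one pass is shown below. It is not Spark's FeatureHasher: MD5 from the standard library is swapped in as the hash function purely for portability, and the function names are hypothetical. Categorical values are hashed as "column=value" with weight 1.0, and numeric values are hashed by column name with the number as the weight.

```python
import hashlib


def hash_index(key, num_features):
    """Map a string key to a stable index in [0, num_features)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features


def hash_features(row, num_features=2 ** 18):
    """Turn a dict of raw column values into a sparse {index: value} vector."""
    vector = {}
    for col, value in row.items():
        if isinstance(value, (str, bool)):
            key, weight = f"{col}={value}", 1.0   # categorical: one slot per value
        else:
            key, weight = col, float(value)       # numeric: one slot per column
        idx = hash_index(key, num_features)
        vector[idx] = vector.get(idx, 0.0) + weight  # colliding keys are summed
    return vector


row = {"country": "ZA", "device": "mobile", "clicks": 3}
sparse = hash_features(row)
```

No vocabulary is built or stored, which is where the speed and memory benefits come from. The cost is that an index cannot be mapped back to a feature name, and distinct features may share a slot.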
  • 11. Scalable Feature Extraction
 Feature Hashing
  • 12. Performance Test Results
 Feature Hashing
 [Figure: AUC on the Criteo DAC dataset, hashing vs. full encoding; AUC axis 0.740 to 0.770, x-axis 18 to 24]
 • Criteo Display Advertising Challenge Data
 • 45 million examples
 • 34 million features
 • High cardinality categorical features
 • Extreme sparsity
 • Also tested on Criteo 1T log data
 • 7 day subset – 1.5 billion examples
 • 300 million features
 • 2^24 hashed features (16M)
 • Impossible with Spark ML full encoding approach (OOM, 2 GB broadcast limit)
 4 executors 64 GB memory, 192 partitions
  • 13. Summary
 Summary and Future Directions
 • More progress on feature parity between RDD- and DataFrame-based APIs (mllib vs ml)
 • Native image support
 • New scalability features for commonly-used components
 • Plenty of bug fixes and usability enhancements
 • http://spark.apache.org
  • 14. Potential Directions for 2.4+
 Summary and Future Directions
 • Continued focus on enhancing usability and scalability of commonly used feature transformers
 • Complete multi-column support for string indexing
 • This also feeds into RFormula performance
 • Complete Python API coverage
 • Parallel cross-validation – smarter caching and execution planning
 • Feature parity and enhancements for evaluation metrics – in particular ranking metrics
 • Gradient Boosted Trees
 • Room for major improvements
 • Potential to incorporate latest enhancements from popular open-source tools such as XGBoost and LightGBM
 • See https://github.com/zhengruifeng/SparkGBM for example
 • Feature hashing
 • Signed hash functions
 • Feature crossing / namespace support
 • DictVectorizer (SPARK-19962)
 • What about 3.0?
  • 15. Thank you!
 Nick Pentreath
 Principal Engineer
 nickp@za.ibm.com
 @MLnick
 ibm.com
  • 16. Links & References
 • Apache Spark Downloads
 • MLlib User Guide
 • Multi-column Support JIRA
 • Parallel Cross-validation JIRA
 • Feature Hashing JIRA