Quick! Quick! Exploration!:
A framework for searching a predictive
model on Apache Spark
Masato Asahara*, Yoshiki Takahashi+
and Kazuyuki Shudo+
* NEC Corporation, + Tokyo Institute of Technology
Jun/21/2018 @DataWorks Summit 2018
2 © NEC Corporation 2018
Who are we?
▌Masato Asahara (Ph.D.)
▌Principal Software Architect and Researcher
at NEC System Platform Research Laboratory
Masato Asahara (Ph.D.) is currently leading the development of Spark-based
machine learning and data analytics systems that fully automate
predictive modeling.
Masato received his Ph.D. degree from Keio University, and has worked at
NEC for 8 years as a researcher in the field of distributed computing
systems and computing resource management technologies.
▌Yoshiki Takahashi
▌Master's course student at Tokyo Institute of Technology
Yoshiki Takahashi is a master's student in the computer science program
at the graduate school of Tokyo Institute of Technology. His academic
research proposal was accepted at SysML 2018, a venue that has attracted
attention since its earlier days as a NIPS workshop.
He worked on the development of a Spark-based machine learning platform
for automatic predictive modeling during an internship at NEC Data
Science Research Laboratories in 2017. He received his B.S. degree from
Tokyo Institute of Technology in 2017.
3 © NEC Corporation 2018
Agenda
(Figure: exploring 1000+ candidate patterns to find the best model.)
4 © NEC Corporation 2018
Agenda
(Figure: finding the best model, with three themes: Quick! Scalable! Plug-in.)
5 © NEC Corporation 2018
Agenda
Our framework is ×13 faster than MLlib!!
(A cluster with 16 CPU cores, using the HIGGS dataset from the UCI ML repository)
Predictive Modeling Automation Framework
7 © NEC Corporation 2018
Predictive Analysis in the Enterprise
Driver Risk
Assessment
Inventory
Optimization
Churn
Retention
Predictive
Maintenance
Product Price
Optimization
Sales
Optimization
Energy/Water
Operation Mgmt.
8 © NEC Corporation 2018
Pain of Modern Predictive Modeling
▌High skill: precious CS Ph.D.s are required
▌Evolving ML technology: quick trials with new ML algorithms are needed
▌Long time: many tuning parameters to explore
9 © NEC Corporation 2018
Our Framework automates Predictive Modeling!
▌High skill: precious CS Ph.D.s are required
▌Evolving ML technology: quick trials with new ML algorithms are needed
▌Long time: many tuning parameters to explore
10 © NEC Corporation 2018
Values of Our Automation Framework
▌Democratized to business users
▌Quick model selection
▌Easy integration with future ML implementations
Design Challenges and Solutions
12 © NEC Corporation 2018
High Level Architecture
(Architecture diagram: training data and validation data are read from HDFS; the framework runs many training and validation jobs and selects a model according to the given criteria.)
13 © NEC Corporation 2018
Design Challenges
High Scalability
Open for ML Implementations
14 © NEC Corporation 2018
Design Challenges
High Scalability
Open for ML Implementations
15 © NEC Corporation 2018
Just Adding Nodes doesn’t Improve Performance
(Figure: jobs of 5, 4, 1, 2, and 6 minutes are placed naively across workers; one worker finishes early and waits idle while another is still running.)
16 © NEC Corporation 2018
Scheduling as a Combinatorial Optimization Problem
(Figure: jobs with durations, e.g. Job 1 = 2 min, Job 2 = 1 min, Job 3 = 5 min, ..., are fed to the scheduler, which assigns them to workers W1 and W2; each worker finishes at time Ti, and the overall finishing time is the makespan max{T1, T2}.)
17 © NEC Corporation 2018
Scheduling as a Combinatorial Optimization Problem
(Figure: the same setting, now stated as an optimization problem: assign Job 1 (2 min), Job 2 (1 min), Job 3 (5 min), ... to workers W1 and W2 so as to minimize the makespan max{T1, T2}.)
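The slides do not say how the scheduler solves this assignment problem. As a minimal sketch, the code below uses the classic longest-processing-time (LPT) greedy heuristic, which repeatedly gives the longest remaining job to the currently least-loaded worker and approximately minimizes the makespan max{T1, T2}. The job durations are the example values from the figure, and the heuristic itself is an assumption, not necessarily the framework's actual algorithm.

    import heapq

    def lpt_schedule(job_minutes, num_workers):
        """Greedy longest-processing-time (LPT) scheduling:
        approximately minimizes the makespan max{T_1, ..., T_n}."""
        workers = [(0, w, []) for w in range(num_workers)]   # (load, worker id, assigned jobs)
        heapq.heapify(workers)
        for job, minutes in sorted(job_minutes.items(), key=lambda kv: -kv[1]):
            load, w, jobs = heapq.heappop(workers)           # least-loaded worker so far
            jobs.append(job)
            heapq.heappush(workers, (load + minutes, w, jobs))
        return workers

    # Example durations from the figure: Job 1 = 2 min, Job 2 = 1 min, Job 3 = 5 min.
    for load, w, jobs in sorted(lpt_schedule({"Job 1": 2, "Job 2": 1, "Job 3": 5}, 2),
                                key=lambda t: t[1]):
        print(f"W{w + 1}: {jobs} -> T{w + 1} = {load} min")

With these three jobs the sketch assigns Job 3 to one worker (T = 5 min) and Jobs 1 and 2 to the other (T = 3 min), so the makespan is 5 minutes, which is the best achievable here.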
18 © NEC Corporation 2018
Scheduling as a Combinatorial Optimization Problem
(Figure: in reality the job durations are unknown ("??? min"), so the scheduler cannot evaluate max{T1, T2} directly; it first needs an estimate of how long each job will take.)
19 © NEC Corporation 2018
Scheduler Pre-profiles Job Time via Sampled Data
(Figure: for each point in the hyperparameter grid (eta, max_depth, round, ...), the scheduler runs the training on a small sampled dataset and records "this training takes xx.x sec." as its time estimate.)
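A minimal sketch of this pre-profiling step, assuming the XGBoost Python package; the grid point (eta, max_depth, round) and the synthetic data below are placeholders, and the framework's real profiling code is not shown in the slides.

    import time
    import numpy as np
    import xgboost as xgb

    def profile_job_time(X, y, params, num_boost_round, sample_rate=0.01, seed=0):
        """Train on a small sample of the data and report how long it takes."""
        rng = np.random.RandomState(seed)
        idx = rng.rand(len(X)) < sample_rate            # ~1% sample of the training data
        dtrain = xgb.DMatrix(X[idx], label=y[idx])
        start = time.time()
        xgb.train(params, dtrain, num_boost_round=num_boost_round)
        return time.time() - start

    # Hypothetical grid point ("round" corresponds to num_boost_round).
    X, y = np.random.rand(100_000, 28), np.random.randint(0, 2, 100_000)
    secs = profile_job_time(X, y,
                            {"eta": 0.1, "max_depth": 6, "objective": "binary:logistic"},
                            num_boost_round=100)
    print(f"estimated training time on a 1% sample: {secs:.1f} sec")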
20 © NEC Corporation 2018
Scheduler Pre-profiles Job Time via Sampled Data
(Figure: overall workflow: the scheduler first profiles the jobs on the small sampled data, then builds a schedule, and the automated predictive modeling then runs on the entire data.)
21 © NEC Corporation 2018
Preliminary Evaluation: Pre-profiling Takes Little Time
Pre-profiling accounts for only 2.34% of the total execution time. In pre-profiling, we
• sample 1% of the data from the training data
• execute trainings over the same search space as the automated predictive modeling
22 © NEC Corporation 2018
Design Challenges
High Scalability
Open for ML Implementations
23 © NEC Corporation 2018
Reducing the Implementation Cost of Adding a New ML Implementation
Distributed Learning Validation /
Model Selection
24 © NEC Corporation 2018
Naïve Design Requires Many Changes to Plug in a New ML Implementation
(Figure: in the naïve design, the distributed learning component invokes each training directly on library-specific data: TF-format data, XGB-format data, and new-ML-format data. Plugging in a new ML implementation therefore means adding invocation and data-conversion code throughout the component.)
25 © NEC Corporation 2018
Easy Integration with New ML impl. by Encapsulation
(Figure: each training is encapsulated behind a common data format; distributed learning always invokes the same interface, so adding a new ML implementation only requires code inside its own capsule.)
26 © NEC Corporation 2018
Easy Integration with New ML impl. by Encapsulation
(Figure: the same encapsulation is used for validation / model selection; each prediction is invoked through the common data format, so again a new ML implementation only adds code inside its own capsule.)
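The slides describe encapsulation only as a block diagram. One hedged way to realize it in Python is to hide every ML implementation behind the same fit / predict_proba interface over a common in-memory format (NumPy arrays here); the SklearnCapsule and XGBoostCapsule classes below are hypothetical names, not the framework's actual code.

    import numpy as np
    import xgboost as xgb
    from sklearn.linear_model import LogisticRegression

    class SklearnCapsule:
        """Wraps any scikit-learn estimator behind the common format (NumPy arrays)."""
        def __init__(self, estimator):
            self.estimator = estimator
        def fit(self, X, y):
            self.estimator.fit(X, y)
            return self
        def predict_proba(self, X):
            return self.estimator.predict_proba(X)[:, 1]

    class XGBoostCapsule:
        """Wraps XGBoost: converts the common format to DMatrix internally."""
        def __init__(self, params, num_boost_round):
            self.params, self.num_boost_round = params, num_boost_round
        def fit(self, X, y):
            self.booster = xgb.train(self.params, xgb.DMatrix(X, label=y),
                                     num_boost_round=self.num_boost_round)
            return self
        def predict_proba(self, X):
            return self.booster.predict(xgb.DMatrix(X))

    # The rest of the framework only ever calls fit() and predict_proba().
    X, y = np.random.rand(1000, 28), np.random.randint(0, 2, 1000)
    for capsule in (SklearnCapsule(LogisticRegression()),
                    XGBoostCapsule({"eta": 0.1, "max_depth": 6,
                                    "objective": "binary:logistic"}, 50)):
        scores = capsule.fit(X, y).predict_proba(X)

Because the distributed learning and validation components only see this capsule interface, adding a new implementation touches very little surrounding code, which is what the lines-of-code numbers later in the evaluation reflect.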
Evaluation
28 © NEC Corporation 2018
Evaluation Setup
▌Dataset
HIGGS (UCI Machine Learning Repository)
• 1M samples each for the training, validation, and test sets
• 28 features
▌Scheduler Training (pre-profiling)
Executes the same grid as the actual training
Uses a 1% sample of the training data
▌Environment
Apache Spark 2.3.0
Apache Hadoop 3.1.0
▌Exploring Algorithms (a grid-construction sketch follows this list)
Gradient Boosting Tree (GBT)
• XGBoost 0.8
• 864 grid points
Multi-layer Perceptron (MLP)
• TensorFlow 1.8.0
• 324 grid points
Logistic Regression (LR)
• scikit-learn 0.18.1
• 5 grid points
Random Forest (RF)
• scikit-learn 0.18.1
• 18 grid points
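As referenced above, here is a sketch of how such grids might be written down with scikit-learn's ParameterGrid. The slides only give the grid sizes (864, 324, 5, and 18 points), so the parameter names and value ranges below are hypothetical and chosen merely to reproduce those sizes.

    from sklearn.model_selection import ParameterGrid

    grids = {
        # 6 * 12 * 12 = 864 grid points (hypothetical ranges)
        "GBT (XGBoost)": ParameterGrid({
            "eta": [0.01, 0.05, 0.1, 0.2, 0.3, 0.5],
            "max_depth": list(range(3, 15)),
            "num_round": list(range(10, 130, 10)),
        }),
        # 3 * 6 * 3 * 6 = 324 grid points (hypothetical ranges)
        "MLP (TensorFlow)": ParameterGrid({
            "learning_rate": [0.001, 0.01, 0.1],
            "hidden_units": [16, 32, 64, 128, 256, 512],
            "num_layers": [1, 2, 3],
            "batch_size": [32, 64, 128, 256, 512, 1024],
        }),
        # 5 grid points
        "LR (scikit-learn)": ParameterGrid({"C": [0.01, 0.1, 1.0, 10.0, 100.0]}),
        # 3 * 6 = 18 grid points
        "RF (scikit-learn)": ParameterGrid({
            "n_estimators": [100, 300, 500],
            "max_depth": [4, 8, 12, 16, 20, None],
        }),
    }
    print(sum(len(g) for g in grids.values()))   # 864 + 324 + 5 + 18 = 1211

In total this is 1,211 candidate models, matching the "1000+ patterns" mentioned earlier in the deck.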
29 © NEC Corporation 2018
Evaluation Result: Total Execution Time
(Chart: our framework completes the whole search ×13.1 faster than MLlib.)
30 © NEC Corporation 2018
Spark MLlib Focuses on Scaling out for Huge Data Size
(Figure: MLlib partitions the training dataset across cores and shuffles data between them; all cores work together on one model, and only when that training completes do they move on to the next model.)
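For comparison, this is roughly what the MLlib-style baseline looks like with Spark's built-in tuning API: each candidate model is trained over the distributed (shuffled) dataset, and by default the candidates are fitted one after another. The parameter values and the toy DataFrame are illustrative, not the ones used in the evaluation.

    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy stand-in for the real training DataFrame ("features", "label").
    train_df = spark.createDataFrame(
        [(Vectors.dense([float(i % 7), float(i % 5)]), float(i % 2)) for i in range(200)],
        ["features", "label"])

    gbt = GBTClassifier(featuresCol="features", labelCol="label")
    grid = (ParamGridBuilder()
            .addGrid(gbt.maxDepth, [3, 5, 7])
            .addGrid(gbt.maxIter, [20, 50])
            .build())

    # By default, candidate models are trained one after another,
    # each over the distributed (shuffled) dataset.
    tvs = TrainValidationSplit(estimator=gbt,
                               estimatorParamMaps=grid,
                               evaluator=BinaryClassificationEvaluator(metricName="areaUnderROC"),
                               trainRatio=0.8)
    best_model = tvs.fit(train_df).bestModel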
31 © NEC Corporation 2018
We Focus on the Huge Search Space of Parameter Tuning
(Figure: in our framework, each core reads the entire data (no shuffle) and trains its own model; as soon as a core completes a training, it independently moves on to the next model.)
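The slides show this only as a diagram. A common way to get "no shuffle, one model per core" on Spark is to broadcast the dataset and parallelize the hyperparameter grid instead of the data, as in the hedged sketch below; it assumes the dataset fits in memory for broadcasting and is not necessarily how the framework itself is implemented.

    import numpy as np
    from pyspark import SparkContext
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    sc = SparkContext.getOrCreate()

    # Broadcast the data once; every task reads the entire dataset, no shuffle.
    X_tr, y_tr = np.random.rand(100_000, 28), np.random.randint(0, 2, 100_000)
    X_va, y_va = np.random.rand(20_000, 28), np.random.randint(0, 2, 20_000)
    data_bc = sc.broadcast((X_tr, y_tr, X_va, y_va))

    def train_and_score(params):
        X_tr, y_tr, X_va, y_va = data_bc.value
        model = RandomForestClassifier(**params).fit(X_tr, y_tr)
        return params, roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

    # Parallelize the grid, not the data: each core keeps pulling the next model.
    grid = [{"n_estimators": n, "max_depth": d}
            for n in (100, 300, 500) for d in (4, 8, 12, 16, 20, None)]
    results = sc.parallelize(grid, len(grid)).map(train_and_score).collect()
    best_params, best_auc = max(results, key=lambda r: r[1])
    print(best_params, best_auc)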
32 © NEC Corporation 2018
Evaluation Result: Execution Performance for Scalability
(Chart: execution performance when scaling out; the data labels read 72.7%, 78.4%, 81.7%, and 84.7%.)
33 © NEC Corporation 2018
Evaluation Result: Improvement of Error and AUC
                           Classification Accuracy    AUC
Best model*                0.756                      0.837
Gradient Boosting Tree**   0.743 (-0.013)             0.825 (-0.012)
Logistic Regression**      0.642 (-0.114)             0.684 (-0.153)
Random Forest**            0.724 (-0.032)             0.801 (-0.036)
* Best model produced by our framework.
** Using the default hyperparameters of XGBoost and scikit-learn.
34 © NEC Corporation 2018
Evaluation Result: Amount of Code for Adding a New ML Implementation
# Lines of code without comments:
151 lines
292 lines (Python: 116, Scala: 176)
290 lines (Python: 90, Scala: 200)
Summary and Future work
36 © NEC Corporation 2018
Summary – Automation Framework for Predictive Modeling
(Figure: the framework quickly explores many candidate models to find the best one: Quick! Scalable! Plug-in.)
37 © NEC Corporation 2018
Values
▌Democratized to business users
▌Quick model selection
▌Easy integration with future ML implementations
38 © NEC Corporation 2018
Design Challenges (Addressed)
High Scalability
Open for ML Implementations
39 © NEC Corporation 2018
Future Work: Data Structure Conversion for Each ML Implementation
(Figure: the common format is a dense Double[][]; each ML implementation wants its own layout (sparse, column-oriented, or row-oriented), so the data is memory-copied and converted for every implementation.)
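To make the cost concrete, a small illustration of what "memory copy & convert" means in Python: starting from a dense, row-oriented array (the Double[][] common format), each target layout needs its own full copy before training can start. The array shape is an arbitrary example.

    import numpy as np
    from scipy.sparse import csr_matrix

    # Common format: a dense, row-oriented double matrix (Double[][]), ~224 MB here.
    dense_rows = np.random.rand(1_000_000, 28)

    # An ML impl. that wants column-oriented data forces a full copy.
    col_oriented = np.asfortranarray(dense_rows)

    # An ML impl. that wants sparse data forces another copy (plus index arrays).
    sparse = csr_matrix(dense_rows)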
40 © NEC Corporation 2018
A Common Memory Format That Can Be Read Without Copying Would Be Better
(Figure: ideally the common format ("????") could be read zero-copy whether the consumer wants sparse, column-oriented, or row-oriented data. Apache Arrow …?)
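As a rough illustration of the zero-copy idea with Apache Arrow (pyarrow): a numeric Arrow array without nulls can be viewed as a NumPy array without copying. Whether Arrow actually becomes the framework's common format is left as an open question in the slides.

    import numpy as np
    import pyarrow as pa

    # Store a feature column once in Arrow's columnar memory format.
    column = pa.array(np.random.rand(1_000_000))

    # A NumPy consumer gets a zero-copy view of the same underlying buffer.
    np_view = column.to_numpy(zero_copy_only=True)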