Quick! Quick! Exploration!:
A framework for searching a predictive
model on Apache Spark
Masato Asahara*, Yoshiki Takahashi+
and Kazuyuki Shudo+
* NEC Corporation, + Tokyo Institute of Technology
Jun/21/2018 @DataWorks Summit 2018
2 © NEC Corporation 2018
Who are we?
▌Masato Asahara (Ph.D.)
▌Principal Software Architect and Researcher
at NEC System Platform Research Laboratory
Masato Asahara (Ph.D.) is currently leading the development of Spark-based
machine learning and data analytics systems that fully automate
predictive modeling.
Masato received his Ph.D. degree from Keio University, and has worked at
NEC for 8 years as a researcher in the field of distributed computing
systems and computing resource management technologies.
▌Yoshiki Takahashi
▌Master's course student at Tokyo Institute of Technology
Yoshiki Takahashi is a master's student in the computer science program
at the graduate school of Tokyo Institute of Technology. His academic
research proposal was accepted at SysML 2018, a venue that has attracted
attention since its earlier days as a NIPS workshop.
He worked on the development of a Spark-based machine learning platform
for automatic predictive modeling during an internship at NEC Data
Science Research Laboratories in 2017. He received his B.S. degree from
Tokyo Institute of Technology in 2017.
3 © NEC Corporation 2018
Agenda
(Figure: exploring 1000+ candidate patterns to find the best model.)
4 © NEC Corporation 2018
Agenda
(Figure: finding the best model, with three themes: Quick! Scalable! Plug-in.)
5 © NEC Corporation 2018
Agenda
Our framework is ×13 faster than MLlib!!
(A cluster with 16 CPU cores, using the HIGGS dataset from the UCI ML repository)
Predictive Modeling Automation Framework
7 © NEC Corporation 2018
Predictive Analysis in the Enterprise
Driver Risk
Assessment
Inventory
Optimization
Churn
Retention
Predictive
Maintenance
Product Price
Optimization
Sales
Optimization
Energy/Water
Operation Mgmt.
8 © NEC Corporation 2018
Pain of Modern Predictive Modeling
▌High skill: precious CS Ph.D.s are required
▌Evolving ML technology: quick trials with new ML algorithms are needed
▌Long time: many tuning parameters to explore
9 © NEC Corporation 2018
Our Framework automates Predictive Modeling!
▌High skill: precious CS Ph.D.s are required
▌Evolving ML technology: quick trials with new ML algorithms are needed
▌Long time: many tuning parameters to explore
10 © NEC Corporation 2018
Values of Our Automation Framework
▌Democratized to business users
▌Quick model selection
▌Easy integration with future ML implementations
Design Challenges and Solutions
12 © NEC Corporation 2018
High Level Architecture
(Architecture diagram: training data and validation data are read from HDFS; the framework runs many training and validation jobs and selects a model according to the given criteria.)
13 © NEC Corporation 2018
Design Challenges
High Scalability
Open for ML Implementations
14 © NEC Corporation 2018
Design Challenges
High Scalability
Open for ML Implementations
15 © NEC Corporation 2018
Just Adding Nodes doesn’t Improve Performance
(Figure: jobs of 5, 4, 1, 2, and 6 minutes are placed naively across workers; one worker finishes early and waits idle while another is still running.)
16 © NEC Corporation 2018
Scheduling as a Combinatorial Optimization Problem
(Figure: jobs with durations, e.g. Job 1 = 2 min, Job 2 = 1 min, Job 3 = 5 min, ..., are fed to the scheduler, which assigns them to workers W1 and W2; each worker finishes at time Ti, and the overall finishing time is the makespan max{T1, T2}.)
17 © NEC Corporation 2018
Scheduling as a Combinatorial Optimization Problem
(Figure: the same setting, now stated as an optimization problem: assign Job 1 (2 min), Job 2 (1 min), Job 3 (5 min), ... to workers W1 and W2 so as to minimize the makespan max{T1, T2}.)
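The slides do not say how the scheduler solves this assignment problem. As a minimal sketch, the code below uses the classic longest-processing-time (LPT) greedy heuristic, which repeatedly gives the longest remaining job to the currently least-loaded worker and approximately minimizes the makespan max{T1, T2}. The job durations are the example values from the figure, and the heuristic itself is an assumption, not necessarily the framework's actual algorithm.

    import heapq

    def lpt_schedule(job_minutes, num_workers):
        """Greedy longest-processing-time (LPT) scheduling:
        approximately minimizes the makespan max{T_1, ..., T_n}."""
        workers = [(0, w, []) for w in range(num_workers)]   # (load, worker id, assigned jobs)
        heapq.heapify(workers)
        for job, minutes in sorted(job_minutes.items(), key=lambda kv: -kv[1]):
            load, w, jobs = heapq.heappop(workers)           # least-loaded worker so far
            jobs.append(job)
            heapq.heappush(workers, (load + minutes, w, jobs))
        return workers

    # Example durations from the figure: Job 1 = 2 min, Job 2 = 1 min, Job 3 = 5 min.
    for load, w, jobs in sorted(lpt_schedule({"Job 1": 2, "Job 2": 1, "Job 3": 5}, 2),
                                key=lambda t: t[1]):
        print(f"W{w + 1}: {jobs} -> T{w + 1} = {load} min")

With these three jobs the sketch assigns Job 3 to one worker (T = 5 min) and Jobs 1 and 2 to the other (T = 3 min), so the makespan is 5 minutes, which is the best achievable here.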
18 © NEC Corporation 2018
Scheduling as a Combinatorial Optimization Problem
(Figure: in reality the job durations are unknown ("??? min"), so the scheduler cannot evaluate max{T1, T2} directly; it first needs an estimate of how long each job will take.)
19 © NEC Corporation 2018
Scheduler Pre-profiles Job Time via Sampled Data
(Figure: for each point in the hyperparameter grid (eta, max_depth, round, ...), the scheduler runs the training on a small sampled dataset and records "this training takes xx.x sec." as its time estimate.)
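A minimal sketch of this pre-profiling step, assuming the XGBoost Python package; the grid point (eta, max_depth, round) and the synthetic data below are placeholders, and the framework's real profiling code is not shown in the slides.

    import time
    import numpy as np
    import xgboost as xgb

    def profile_job_time(X, y, params, num_boost_round, sample_rate=0.01, seed=0):
        """Train on a small sample of the data and report how long it takes."""
        rng = np.random.RandomState(seed)
        idx = rng.rand(len(X)) < sample_rate            # ~1% sample of the training data
        dtrain = xgb.DMatrix(X[idx], label=y[idx])
        start = time.time()
        xgb.train(params, dtrain, num_boost_round=num_boost_round)
        return time.time() - start

    # Hypothetical grid point ("round" corresponds to num_boost_round).
    X, y = np.random.rand(100_000, 28), np.random.randint(0, 2, 100_000)
    secs = profile_job_time(X, y,
                            {"eta": 0.1, "max_depth": 6, "objective": "binary:logistic"},
                            num_boost_round=100)
    print(f"estimated training time on a 1% sample: {secs:.1f} sec")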
20 © NEC Corporation 2018
Scheduler Pre-profiles Job Time via Sampled Data
(Figure: overall workflow: the scheduler first profiles the jobs on the small sampled data, then builds a schedule, and the automated predictive modeling then runs on the entire data.)
21 © NEC Corporation 2018
Preliminary Evaluation: Pre-profiling Takes Little Time
Pre-profiling accounts for only 2.34% of the total execution time. In pre-profiling, we
• sample 1% of the data from the training data
• execute trainings over the same search space as the automated predictive modeling
22 © NEC Corporation 2018
Design Challenges
High Scalability
Open for ML Implementations
23 © NEC Corporation 2018
Reducing the Implementation Cost of Adding a New ML Implementation
Distributed Learning Validation /
Model Selection
24 © NEC Corporation 2018
Naïve Design Requires Many Changes to Plug in a New ML Implementation
(Figure: in the naïve design, the distributed learning component invokes each training directly on library-specific data: TF-format data, XGB-format data, and new-ML-format data. Plugging in a new ML implementation therefore means adding invocation and data-conversion code throughout the component.)
25 © NEC Corporation 2018
Easy Integration with New ML impl. by Encapsulation
(Figure: each training is encapsulated behind a common data format; distributed learning always invokes the same interface, so adding a new ML implementation only requires code inside its own capsule.)
26 © NEC Corporation 2018
Easy Integration with New ML impl. by Encapsulation
(Figure: the same encapsulation is used for validation / model selection; each prediction is invoked through the common data format, so again a new ML implementation only adds code inside its own capsule.)
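The slides describe encapsulation only as a block diagram. One hedged way to realize it in Python is to hide every ML implementation behind the same fit / predict_proba interface over a common in-memory format (NumPy arrays here); the SklearnCapsule and XGBoostCapsule classes below are hypothetical names, not the framework's actual code.

    import numpy as np
    import xgboost as xgb
    from sklearn.linear_model import LogisticRegression

    class SklearnCapsule:
        """Wraps any scikit-learn estimator behind the common format (NumPy arrays)."""
        def __init__(self, estimator):
            self.estimator = estimator
        def fit(self, X, y):
            self.estimator.fit(X, y)
            return self
        def predict_proba(self, X):
            return self.estimator.predict_proba(X)[:, 1]

    class XGBoostCapsule:
        """Wraps XGBoost: converts the common format to DMatrix internally."""
        def __init__(self, params, num_boost_round):
            self.params, self.num_boost_round = params, num_boost_round
        def fit(self, X, y):
            self.booster = xgb.train(self.params, xgb.DMatrix(X, label=y),
                                     num_boost_round=self.num_boost_round)
            return self
        def predict_proba(self, X):
            return self.booster.predict(xgb.DMatrix(X))

    # The rest of the framework only ever calls fit() and predict_proba().
    X, y = np.random.rand(1000, 28), np.random.randint(0, 2, 1000)
    for capsule in (SklearnCapsule(LogisticRegression()),
                    XGBoostCapsule({"eta": 0.1, "max_depth": 6,
                                    "objective": "binary:logistic"}, 50)):
        scores = capsule.fit(X, y).predict_proba(X)

Because the distributed learning and validation components only see this capsule interface, adding a new implementation touches very little surrounding code, which is what the lines-of-code numbers later in the evaluation reflect.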
Evaluation
28 © NEC Corporation 2018
Evaluation Setup
▌Dataset
HIGGS (UCI Machine Learning Repository)
• 1M samples each for the training, validation, and test sets
• 28 features
▌Scheduler Training (pre-profiling)
Executes the same grid as the actual training
Uses a 1% sample of the training data
▌Environment
Apache Spark 2.3.0
Apache Hadoop 3.1.0
▌Exploring Algorithms (a grid-construction sketch follows this list)
Gradient Boosting Tree (GBT)
• XGBoost 0.8
• 864 grid points
Multi-layer Perceptron (MLP)
• TensorFlow 1.8.0
• 324 grid points
Logistic Regression (LR)
• scikit-learn 0.18.1
• 5 grid points
Random Forest (RF)
• scikit-learn 0.18.1
• 18 grid points
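As referenced above, here is a sketch of how such grids might be written down with scikit-learn's ParameterGrid. The slides only give the grid sizes (864, 324, 5, and 18 points), so the parameter names and value ranges below are hypothetical and chosen merely to reproduce those sizes.

    from sklearn.model_selection import ParameterGrid

    grids = {
        # 6 * 12 * 12 = 864 grid points (hypothetical ranges)
        "GBT (XGBoost)": ParameterGrid({
            "eta": [0.01, 0.05, 0.1, 0.2, 0.3, 0.5],
            "max_depth": list(range(3, 15)),
            "num_round": list(range(10, 130, 10)),
        }),
        # 3 * 6 * 3 * 6 = 324 grid points (hypothetical ranges)
        "MLP (TensorFlow)": ParameterGrid({
            "learning_rate": [0.001, 0.01, 0.1],
            "hidden_units": [16, 32, 64, 128, 256, 512],
            "num_layers": [1, 2, 3],
            "batch_size": [32, 64, 128, 256, 512, 1024],
        }),
        # 5 grid points
        "LR (scikit-learn)": ParameterGrid({"C": [0.01, 0.1, 1.0, 10.0, 100.0]}),
        # 3 * 6 = 18 grid points
        "RF (scikit-learn)": ParameterGrid({
            "n_estimators": [100, 300, 500],
            "max_depth": [4, 8, 12, 16, 20, None],
        }),
    }
    print(sum(len(g) for g in grids.values()))   # 864 + 324 + 5 + 18 = 1211

In total this is 1,211 candidate models, matching the "1000+ patterns" mentioned earlier in the deck.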
29 © NEC Corporation 2018
Evaluation Result: Total Execution Time
(Chart: our framework completes the whole search ×13.1 faster than MLlib.)
30 © NEC Corporation 2018
Spark MLlib Focuses on Scaling out for Huge Data Size
(Figure: MLlib partitions the training dataset across cores and shuffles data between them; all cores work together on one model, and only when that training completes do they move on to the next model.)
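For comparison, this is roughly what the MLlib-style baseline looks like with Spark's built-in tuning API: each candidate model is trained over the distributed (shuffled) dataset, and by default the candidates are fitted one after another. The parameter values and the toy DataFrame are illustrative, not the ones used in the evaluation.

    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy stand-in for the real training DataFrame ("features", "label").
    train_df = spark.createDataFrame(
        [(Vectors.dense([float(i % 7), float(i % 5)]), float(i % 2)) for i in range(200)],
        ["features", "label"])

    gbt = GBTClassifier(featuresCol="features", labelCol="label")
    grid = (ParamGridBuilder()
            .addGrid(gbt.maxDepth, [3, 5, 7])
            .addGrid(gbt.maxIter, [20, 50])
            .build())

    # By default, candidate models are trained one after another,
    # each over the distributed (shuffled) dataset.
    tvs = TrainValidationSplit(estimator=gbt,
                               estimatorParamMaps=grid,
                               evaluator=BinaryClassificationEvaluator(metricName="areaUnderROC"),
                               trainRatio=0.8)
    best_model = tvs.fit(train_df).bestModel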
31 © NEC Corporation 2018
We Focus on the Huge Search Space of Parameter Tuning
(Figure: in our framework, each core reads the entire data (no shuffle) and trains its own model; as soon as a core completes a training, it independently moves on to the next model.)
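The slides show this only as a diagram. A common way to get "no shuffle, one model per core" on Spark is to broadcast the dataset and parallelize the hyperparameter grid instead of the data, as in the hedged sketch below; it assumes the dataset fits in memory for broadcasting and is not necessarily how the framework itself is implemented.

    import numpy as np
    from pyspark import SparkContext
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    sc = SparkContext.getOrCreate()

    # Broadcast the data once; every task reads the entire dataset, no shuffle.
    X_tr, y_tr = np.random.rand(100_000, 28), np.random.randint(0, 2, 100_000)
    X_va, y_va = np.random.rand(20_000, 28), np.random.randint(0, 2, 20_000)
    data_bc = sc.broadcast((X_tr, y_tr, X_va, y_va))

    def train_and_score(params):
        X_tr, y_tr, X_va, y_va = data_bc.value
        model = RandomForestClassifier(**params).fit(X_tr, y_tr)
        return params, roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

    # Parallelize the grid, not the data: each core keeps pulling the next model.
    grid = [{"n_estimators": n, "max_depth": d}
            for n in (100, 300, 500) for d in (4, 8, 12, 16, 20, None)]
    results = sc.parallelize(grid, len(grid)).map(train_and_score).collect()
    best_params, best_auc = max(results, key=lambda r: r[1])
    print(best_params, best_auc)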
32 © NEC Corporation 2018
Evaluation Result: Execution Performance for Scalability
(Chart: execution performance when scaling out; the data labels read 72.7%, 78.4%, 81.7%, and 84.7%.)
33 © NEC Corporation 2018
Evaluation Result: Improvement of Error and AUC
                           Classification Accuracy    AUC
Best model*                0.756                      0.837
Gradient Boosting Tree**   0.743 (-0.013)             0.825 (-0.012)
Logistic Regression**      0.642 (-0.114)             0.684 (-0.153)
Random Forest**            0.724 (-0.032)             0.801 (-0.036)
* Best model produced by our framework.
** Using the default hyperparameters of XGBoost and scikit-learn.
34 © NEC Corporation 2018
Evaluation Result: Amount of Code for Adding a New ML Implementation
# Lines of code without comments:
151 lines
292 lines (Python: 116, Scala: 176)
290 lines (Python: 90, Scala: 200)
Summary and Future work
36 © NEC Corporation 2018
Summary – Automation Framework for Predictive Modeling
(Figure: the framework quickly explores many candidate models to find the best one: Quick! Scalable! Plug-in.)
37 © NEC Corporation 2018
Values
▌Democratized to business users
▌Quick model selection
▌Easy integration with future ML implementations
38 © NEC Corporation 2018
Design Challenges (Addressed)
High Scalability
Open for ML Implementations
39 © NEC Corporation 2018
Future Work: Data Structure Conversion for Each ML Implementation
(Figure: the common format is a dense Double[][]; each ML implementation wants its own layout (sparse, column-oriented, or row-oriented), so the data is memory-copied and converted for every implementation.)
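To make the cost concrete, a small illustration of what "memory copy & convert" means in Python: starting from a dense, row-oriented array (the Double[][] common format), each target layout needs its own full copy before training can start. The array shape is an arbitrary example.

    import numpy as np
    from scipy.sparse import csr_matrix

    # Common format: a dense, row-oriented double matrix (Double[][]), ~224 MB here.
    dense_rows = np.random.rand(1_000_000, 28)

    # An ML impl. that wants column-oriented data forces a full copy.
    col_oriented = np.asfortranarray(dense_rows)

    # An ML impl. that wants sparse data forces another copy (plus index arrays).
    sparse = csr_matrix(dense_rows)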
40 © NEC Corporation 2018
A Common Memory Format That Can Be Read Without Copying Would Be Better
(Figure: ideally the common format ("????") could be read zero-copy whether the consumer wants sparse, column-oriented, or row-oriented data. Apache Arrow …?)
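As a rough illustration of the zero-copy idea with Apache Arrow (pyarrow): a numeric Arrow array without nulls can be viewed as a NumPy array without copying. Whether Arrow actually becomes the framework's common format is left as an open question in the slides.

    import numpy as np
    import pyarrow as pa

    # Store a feature column once in Arrow's columnar memory format.
    column = pa.array(np.random.rand(1_000_000))

    # A NumPy consumer gets a zero-copy view of the same underlying buffer.
    np_view = column.to_numpy(zero_copy_only=True)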