
Big Data Heterogeneous Mixture Learning on Spark


  1. 1. Big Data Heterogeneous Mixture Learning on Spark Masato Asahara and Ryohei Fujimaki NEC Data Science Research Labs. Jun/30/2016 @Hadoop Summit 2016
  2. 2. 2 © NEC Corporation 2016 Who are we? ▌Masato Asahara (Ph.D.) ▌Researcher, NEC Data Science Research Laboratory  Masato received his Ph.D. degree from Keio University and has worked at NEC for 6 years in the research field of distributed computing systems and computing resource management technologies.  Masato is currently leading the development of Spark-based machine learning and data analytics systems, particularly those using NEC's Heterogeneous Mixture Learning technology. ▌Ryohei Fujimaki (Ph.D.) ▌Research Fellow, NEC Data Science Research Laboratory  Ryohei is responsible for the core technologies behind NEC's leading predictive and prescriptive analytics solutions, including the "heterogeneous mixture learning" and "predictive optimization" technologies. In addition to technology R&D, Ryohei is also involved in co-developing cutting-edge advanced analytics solutions with NEC's global business clients and partners in the North American and APAC regions.  Ryohei received his Ph.D. degree from the University of Tokyo and became the youngest research fellow ever in NEC Corporation's 115-year history.
  3. 3. 3 © NEC Corporation 2016 Agenda HML
  4. 4. 4 © NEC Corporation 2016 Agenda HML
  5. 5. 5 © NEC Corporation 2016 Agenda HML
  6. 6. 6 © NEC Corporation 2016 Agenda Micro Modular Server HML
  7. 7. NEC’s Predictive Analytics and Heterogeneous Mixture Learning
  8. 8. 8 © NEC Corporation 2016 Enterprise Applications of HML Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenance Product Price Optimization Sales Optimization Energy/Water Operation Mgmt.
  9. 9. 9 © NEC Corporation 2016 NEC's Heterogeneous Mixture Learning (HML) NEC's machine learning that automatically derives "accurate" and "transparent" formulas behind Big Data. (Diagram: heterogeneous mixture data is split into transparent data segments, and a predictive formula of the form Y = a1x1 + ... + anxn is fit for each segment: "Explore Massive Formulas".)
  10. 10. 10 © NEC Corporation 2016 The HML Model (Diagram: a rule-based tree splits on gender and height (tall/short); leaf formulas use features such as age, smoking, food, weight, sports, tea/coffee and beer/wine.) The health risk of a tall man can be predicted by age, alcohol and smoking!
  11. 11. 11 © NEC Corporation 2016 HML (Heterogeneous Mixture Learning)
  12. 12. 12 © NEC Corporation 2016 HML (Heterogeneous Mixture Learning)
  13. 13. 13 © NEC Corporation 2016 HML Algorithm Segment data by a rule-based tree Feature Selection Pruning Data Segmentation Training data
  14. 14. 14 © NEC Corporation 2016 HML Algorithm Select features and fit predictive models for data segments Feature Selection Pruning Data Segmentation Training data
  15. 15. 15 © NEC Corporation 2016 Enterprise Applications of HML Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenance Energy/Water Operation Mgmt. Sales Optimization Product Price Optimization
  16. 16. 16 © NEC Corporation 2016 Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenance Energy/Water Demand Forecasting Sales Forecasting Product Price Optimization Demand for Big Data Analysis 5 million (customers) × 12 (months) = 60,000,000 training samples
  17. 17. 17 © NEC Corporation 2016 Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenance Energy/Water Operation Mgmt. Product Price Optimization Demand for Big Data Analysis 100 (items) × 1,000 (shops) = 100,000 predictive models Sales Optimization
  18. 18. 18 © NEC Corporation 2016 Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenance Energy/Water Operation Mgmt. Sales Optimization Product Price Optimization Demand for Big Data Analysis
  19. 19. Distributed HML on Spark
  20. 20. 20 © NEC Corporation 2016 Why Spark, not Hadoop? Because Spark's in-memory architecture can run HML faster. (Diagram: HML repeatedly passes over the data for data segmentation, feature selection and pruning; Spark keeps the training data in memory across these passes.)
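To make the in-memory point concrete, here is a minimal Scala sketch of the caching pattern. The object name, HDFS path and CSV parsing are illustrative assumptions, and the two "passes" merely stand in for HML's real segmentation/feature-selection/pruning passes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachedTrainingDataSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hml-cache-sketch"))

    // Parse once, then keep the samples in executor memory; every later pass
    // (segmentation, feature selection, pruning) reuses the cached partitions
    // instead of re-reading and re-parsing the files on HDFS.
    val samples = sc.textFile("hdfs:///data/training.csv") // hypothetical path
      .map(_.split(',').map(_.toDouble))
      .persist(StorageLevel.MEMORY_ONLY)

    val pass1 = samples.map(_.sum).reduce(_ + _) // stand-in for one HML pass
    val pass2 = samples.count()                  // a second pass hits the cache
    println(s"pass1=$pass1, samples=$pass2")
    sc.stop()
  }
}
```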
  21. 21. 21 © NEC Corporation 2016 Data Scalability powered by Spark Handle training data at virtually unlimited scale by adding executors. (Diagram: a driver and several executors read the training data from HDFS.) Add executors, and get more memory and CPU power.
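As a rough illustration of how "just add executors" looks in practice, the sketch below sets the standard Spark-on-YARN executor knobs programmatically. The values are examples only (equivalent to the corresponding spark-submit flags), not the configuration used in the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ExecutorScalingSketch {
  def main(args: Array[String]): Unit = {
    // More executors bring more aggregate memory and CPU cores, so larger
    // training data stays distributed across the cluster without code changes.
    val conf = new SparkConf()
      .setAppName("distributed-hml")
      .set("spark.executor.instances", "32") // number of executors (on YARN)
      .set("spark.executor.cores", "4")      // CPU cores per executor
      .set("spark.executor.memory", "8g")    // heap per executor
    val sc = new SparkContext(conf)
    // ... load training data from HDFS and train as usual ...
    sc.stop()
  }
}
```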
  22. 22. 22 © NEC Corporation 2016 Distributed HML Engine: Architecture (Diagram: the data scientist invokes the Distributed HML Interface (Python), which drives the Distributed HML Core (Scala) running on YARN; training data is read from HDFS, matrix computation uses libraries such as Breeze and BLAS, and the output is an HML model.)
  23. 23. 23 © NEC Corporation 2016 3 Technical Key Points to Run ML Fast on Spark
  24. 24. 24 © NEC Corporation 2016 3 Technical Key Points to Run ML Fast on Spark
  25. 25. 25 © NEC Corporation 2016 Challenge: Avoid Data Shuffling (Diagram: a naïve design shuffles TB-scale training data across the cluster; Distributed HML instead keeps the data where it is and ships only models of 10KB~10MB to the driver, where the HML Model Merge Algorithm combines them.)
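The following Scala sketch illustrates the general pattern; the LocalModel type and its statistics are hypothetical placeholders, not NEC's HML Model Merge Algorithm. Each executor reduces its local partitions to a small summary object, and only those summaries cross the network to be merged.

```scala
import org.apache.spark.rdd.RDD

object ModelMergeSketch {
  // Hypothetical per-partition summary: a few KB of statistics instead of raw rows.
  case class LocalModel(count: Long, sum: Array[Double]) {
    def merge(o: LocalModel): LocalModel =
      LocalModel(count + o.count, sum.zip(o.sum).map { case (a, b) => a + b })
  }

  // Only the small LocalModel objects are sent to the driver; the TB-scale
  // training data itself never moves between nodes.
  def fitWithoutShuffle(data: RDD[Array[Double]]): LocalModel =
    data.mapPartitions { it =>
      val rows = it.toArray
      if (rows.isEmpty) Iterator.empty
      else {
        val sum = new Array[Double](rows.head.length)
        rows.foreach(r => r.indices.foreach(i => sum(i) += r(i)))
        Iterator.single(LocalModel(rows.length, sum))
      }
    }.reduce((a, b) => a.merge(b)) // driver-side merge of per-executor summaries
}
```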
  26. 26. 26 © NEC Corporation 2016 Execution Flow of Distributed HML Executors build a rule-based tree from their local data in parallel Driver Executor Feature Selection Pruning Data Segmentation
  27. 27. 27 © NEC Corporation 2016 Execution Flow of Distributed HML Driver merges the trees Driver Executor Feature Selection Pruning Data Segmentation HML Segment Merge Algorithm
  28. 28. 28 © NEC Corporation 2016 Execution Flow of Distributed HML Driver broadcasts the merged tree to executors Driver Executor Feature Selection Pruning Data Segmentation
  29. 29. 29 © NEC Corporation 2016 Execution Flow of Distributed HML Executors perform feature selection with their local data in parallel Driver Executor Feature Selection Pruning Data Segmentation
  30. 30. 30 © NEC Corporation 2016 Execution Flow of Distributed HML Driver merges the results of feature selection Driver Executor Feature Selection Pruning Data Segmentation HML Feature Merge Algorithm
  31. 31. 31 © NEC Corporation 2016 Execution Flow of Distributed HML Driver merges the results of feature selection Driver Executor Feature Selection Pruning Data Segmentation HML Feature Merge Algorithm
  32. 32. 32 © NEC Corporation 2016 Execution Flow of Distributed HML Driver broadcasts the merged results of feature selection to executors Driver Executor Feature Selection Pruning Data Segmentation (a code sketch of one such round follows)
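A compact Scala sketch of one round of this flow; the TreeSummary type and its toy local statistics are placeholders for illustration, not the actual HML tree or feature-merge algorithms. Executors emit small local summaries, the driver reduces them into one merged object, and a broadcast hands the merged state back to every executor for the next local step.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object MergeBroadcastSketch {
  // Placeholder for the small state exchanged between executors and driver.
  case class TreeSummary(leafCounts: Map[String, Long]) {
    def merge(o: TreeSummary): TreeSummary = TreeSummary(
      (leafCounts.keySet ++ o.leafCounts.keySet)
        .map(k => k -> (leafCounts.getOrElse(k, 0L) + o.leafCounts.getOrElse(k, 0L)))
        .toMap)
  }

  def oneRound(sc: SparkContext, data: RDD[Array[Double]]): TreeSummary = {
    // 1) Executors build local summaries from their own partitions, in parallel.
    val localTrees: RDD[TreeSummary] = data.mapPartitions { it =>
      Iterator.single(TreeSummary(Map("leaf-" + (it.length % 4) -> 1L))) // toy local "tree"
    }
    // 2) Driver merges the local summaries (cheap: the objects are small).
    val merged = localTrees.reduce((a, b) => a.merge(b))
    // 3) Driver broadcasts the merged state so the next local step starts from it.
    val bcTree = sc.broadcast(merged)
    data.foreachPartition { _ =>
      val shared = bcTree.value // each executor reads the merged tree here
    }
    merged
  }
}
```

In Distributed HML the merged objects are the rule-based tree and the selected feature sets, but the communication pattern is the same in both directions: only small models travel, never the training data.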
  33. 33. 33 © NEC Corporation 2016 3 Technical Key Points to Run ML Fast on Spark
  34. 34. 34 © NEC Corporation 2016 Waiting for Other Executors Delays Execution Machine learning is likely to cause an unbalanced computation load across executors. (Diagram: execution timeline of three executors; the slowest one determines the completion time.)
  35. 35. 35 © NEC Corporation 2016 Balance Computational Effort for Each Executor Each executor optimizes all predictive formulas on an equally-divided share of the training data.
  36. 36. 36 © NEC Corporation 2016 Balance Computational Effort for Each Executor Each executor optimizes all predictive formulas on an equally-divided share of the training data. (Diagram: with balanced partitions the executors finish at almost the same time.) Completion time reduced! (see the sketch below)
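A minimal Scala sketch of this balancing idea; the function and parameter names are illustrative, and it assumes per-partition work grows roughly with partition size. Rather than assigning each executor one possibly huge data segment, every task gets an equal slice of all the data and works on all predictive formulas over that slice.

```scala
import org.apache.spark.rdd.RDD

object BalanceSketch {
  // Equal-sized partitions give roughly equal task durations, so no executor
  // sits idle waiting for a straggler that received an oversized segment.
  def balance[T](data: RDD[T], executors: Int, coresPerExecutor: Int): RDD[T] = {
    val tasks = executors * coresPerExecutor // one equally-sized partition per task slot
    data.repartition(tasks)
  }
}
```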
  37. 37. 37 © NEC Corporation 2016 3 Technical Key Points to Run ML Fast on Spark
  38. 38. 38 © NEC Corporation 2016 Machine Learning Consists of Many Matrix Computations Feature Selection Pruning Data Segmentation Basic Matrix Computation Matrix Inversion Combinatorial Optimization … CPU
  39. 39. 39 © NEC Corporation 2016 RDD Design for Leveraging High-Speed Libraries (Diagram: in a straightforward RDD design each element is a single training sample and map processes samples one by one; in Distributed HML's RDD design each element is one matrix object built from all training samples of a partition via mapPartitions, so the matrix computation libraries (Breeze, BLAS) operate on whole matrices at once.)
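The sketch below shows the partition-as-matrix idea in Scala with Breeze; it is an illustration of the pattern, not NEC's actual RDD code. All samples of a partition are packed into one DenseMatrix so that downstream linear algebra runs as a few large BLAS-backed operations instead of millions of tiny per-sample calls.

```scala
import breeze.linalg.DenseMatrix
import org.apache.spark.rdd.RDD

object PartitionMatrixSketch {
  // One DenseMatrix per partition: rows are training samples, columns are features.
  def toMatrices(samples: RDD[Array[Double]]): RDD[DenseMatrix[Double]] =
    samples.mapPartitions { it =>
      val rows = it.toArray
      if (rows.isEmpty) Iterator.empty
      else {
        val m = DenseMatrix.zeros[Double](rows.length, rows.head.length)
        for (i <- rows.indices; j <- rows(i).indices) m(i, j) = rows(i)(j)
        Iterator.single(m) // a single matrix object replaces thousands of sample objects
      }
    }

  // Example downstream use: per-partition Gram matrices X^T * X, combined by one small reduce.
  // val gram = toMatrices(samples).map(x => x.t * x).reduce(_ + _)
}
```

The design choice is simply to trade many small JVM objects for one large, contiguous matrix per partition, which is the shape of data that BLAS-backed libraries handle fastest.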
  40. 40. 40 © NEC Corporation 2016 Performance Degradation by Long-Chained RDD Long-chained RDD operations cause high-cost recalculations Feature Selection Pruning Data Segmentation (Schematic lineage: RDD.map(TreeMapFunc).reduce(TreeReduceFunc).map(FeatureSelectionMapFunc).reduce(FeatureSelectionReduceFunc) ... repeated every iteration ... .map(PruningMapFunc))
  41. 41. 41 © NEC Corporation 2016 Performance Degradation by Long-Chained RDD Long-chained RDD operations cause high-cost recalculations: when cached partitions are lost, e.g. evicted under memory pressure or GC, Spark recomputes them by replaying the whole lineage chain from the start.
  42. 42. 42 © NEC Corporation 2016 Performance Degradation by Long-Chained RDD Cut long-chained operations periodically with checkpoint(): the checkpointed RDD (savedRDD) is written to stable storage and becomes the new lineage root, so any recomputation restarts from there instead of from the original input. (a sketch of this pattern follows)
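A small Scala sketch of this lineage-truncation pattern; the checkpoint interval, directory and helper name are illustrative choices, not taken from the talk.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object CheckpointSketch {
  // Call once on the driver before checkpointing:
  //   sc.setCheckpointDir("hdfs:///tmp/hml-checkpoints") // hypothetical path

  // Periodically materialize the current RDD to stable storage so later stages
  // no longer depend on the full map/reduce chain accumulated by the iterations.
  def truncateLineage[T](rdd: RDD[T], iteration: Int, every: Int = 10): RDD[T] = {
    if (iteration % every == 0) {
      rdd.persist(StorageLevel.MEMORY_AND_DISK) // avoid recomputing while the checkpoint is written
      rdd.checkpoint()                          // mark the RDD for checkpointing
      rdd.count()                               // force an action so the checkpoint is written now
    }
    rdd
  }
}
```

Once the checkpoint files are written, Spark drops the parent lineage of the saved RDD, so later failures or recomputations replay only the stages after the cut.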
  43. 43. Benchmark Performance Evaluations
  44. 44. 44 © NEC Corporation 2016 Eval 1: Prediction Error (vs. Spark MLlib algorithms) Distributed HML achieves low error, competitive with more complex models.
     data                          | # samples  | Distributed HML | Logistic Regression | Decision Tree | Random Forests
     gas sensor array (CO)*        | 4,208,261  | 0.542 (1st)     | 0.597               | 0.587         | 0.576
     household power consumption*  | 2,075,259  | 0.524 (1st)     | 0.531               | 0.529         | 0.655
     HIGGS*                        | 11,000,000 | 0.335 (2nd)     | 0.358               | 0.337         | 0.317
     HEPMASS*                      | 7,000,000  | 0.156 (1st)     | 0.163               | 0.167         | 0.175
     * UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)
  45. 45. 45 © NEC Corporation 2016 Eval 2: Performance Scalability (vs. Spark MLlib algos) Distributed HML is competitive with Spark MLlib implementations. (Charts: scalability over # CPU cores for Distributed HML vs. Logistic Regression on HIGGS* (11M samples) and HEPMASS* (7M samples).) * UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)
  46. 46. Evaluation in Real Case
  47. 47. 47 © NEC Corporation 2016 ATM Cash Balance Prediction Too much cash → high inventory cost; too little cash → high risk of running out of cash. Around 20,000 ATMs
  48. 48. 48 © NEC Corporation 2016 Training Speed: Serial HML* 9 days → Distributed HML 2 hours (110x speedup) ▌Summary of data  # ATMs: around 20,000 (in Japan)  # training samples: around 10M ▌Cluster spec (10 nodes)  # CPU cores: 128 (Xeon E5-2640 v3)  Memory: 2.5TB  Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2) * Run w/ 1 CPU core and 256GB memory
  49. 49. 49 © NEC Corporation 2016 Prediction Error: average error 17% lower with Distributed HML than with Serial HML ▌Summary of data  # ATMs: around 20,000 (in Japan)  # training samples: around 23M ▌Cluster spec (10 nodes)  # CPU cores: 128  Memory: 2.5TB  Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2) (Chart: per-ATM prediction error, Serial HML vs. Distributed HML.)
  50. 50. NEC’s HML Data Analytics Appliance powered by Micro Modular Server
  51. 51. 51 © NEC Corporation 2016 Dilemma between Cloud and On-Premises
     Cloud: quick start; low operation effort for Spark/YARN/HDFS; no space/power cost; no installation effort. But: high pay-for-use charge; complicated application software design; data leakage risk.
     On-Premises: no pay-for-use charge; low data leakage risk; simple application software design. But: complicated operation for Spark/YARN/HDFS; high installation cost; high space/power cost.
  52. 52. 52 © NEC Corporation 2016 NEC's HML Data Analytics Appliance: Concept Micro Modular Server + HML: Plug-and-Analytics for the data scientist! •Quick deployment •Low data leakage risk •Low operation effort for Spark/YARN/HDFS (powered by HDP) •Significant space savings •Low power cost •Simple application software. 328 CPU cores, 1.3TB memory and 5TB SSD in 2U space!
  53. 53. 53 © NEC Corporation 2016 Speed Test (ATM Cash Balance Prediction): Standard Cluster 7275 sec (about 2h) vs. Micro Modular Server 4529 sec (about 1.25h): 1.6x speedup ▌Summary of data  # ATMs: around 20,000 (in Japan)  # training samples: around 10M ▌Standard Cluster spec (10 nodes)  # CPU cores: 128 (Xeon E5-2640 v3)  Memory: 2.5TB  Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)
  54. 54. 54 © NEC Corporation 2016 Toward More Big Data! Micro Modular Server (DX1000) → Scalable Modular Server (DX2000): • 2.3x faster CPU • 1.7x more memory • 3.3x more SSD. 272 high-speed Xeon-D 2.1GHz cores, 2TB memory and 17TB SSD in 3U space
  55. 55. 55 © NEC Corporation 2016 Summary HML Micro Modular Server
  56. 56. 56 © NEC Corporation 2016 Summary HML Micro Modular Server
  57. 57. Appendix
  58. 58. 59 © NEC Corporation 2016 Micro Modular Server (DX1000) Spec.: Server Module
  59. 59. 60 © NEC Corporation 2016 Micro Modular Server (DX1000) Spec.: Module Enclosure
  60. 60. 61 © NEC Corporation 2016 Scalable Modular Server (DX2000) Spec.: Server Module
  61. 61. 62 © NEC Corporation 2016 Scalable Modular Server (DX2000) Spec.: Module Enclosure
