Big Data Heterogeneous Mixture Learning on Spark
- 1. Big Data Heterogeneous
Mixture Learning on Spark
Masato Asahara and Ryohei Fujimaki
NEC Data Science Research Labs.
June 30, 2016 @ Hadoop Summit 2016
- 2. 2 © NEC Corporation 2016
Who we are
▌Masato Asahara (Ph.D.)
▌Researcher, NEC Data Science Research Laboratory
Masato received his Ph.D. from Keio University and has worked at NEC for six
years in the fields of distributed computing systems and computing-resource
management technologies. He currently leads the development of Spark-based
machine learning and data analytics systems, particularly those using NEC’s
Heterogeneous Mixture Learning technology.
▌Ryohei Fujimaki (Ph.D.)
▌Research Fellow, NEC Data Science Research Laboratory
Ryohei is responsible for the core technologies behind NEC’s leading predictive
and prescriptive analytics solutions, including the “heterogeneous mixture
learning” and “predictive optimization” technologies. In addition to technology
R&D, Ryohei is also involved in co-developing cutting-edge advanced analytics
solutions with NEC’s global business clients and partners in the North
American and APAC regions.
Ryohei received his Ph.D. degree from the University of Tokyo, and became
the youngest research fellow ever in NEC Corporation’s 115-year history.
- 3. 3 © NEC Corporation 2016
Agenda
- 4. 4 © NEC Corporation 2016
Agenda
- 5. 5 © NEC Corporation 2016
Agenda
- 6. 6 © NEC Corporation 2016
Agenda
- 8. 8 © NEC Corporation 2016
Enterprise Applications of HML
Driver Risk Assessment, Inventory Optimization, Churn Retention, Predictive Maintenance, Product Price Optimization, Sales Optimization, Energy/Water Operation Mgmt.
- 9. 9 © NEC Corporation 2016
NEC’s Heterogeneous Mixture Learning (HML)
NEC’s machine learning technology that automatically derives accurate and
transparent formulas from Big Data.
[Figure: heterogeneous mixture data (e.g., Mon/Sat/Sun patterns) is explored over massive candidate formulas and split into transparent data segments, each with its own predictive formula, such as Y = a1x1 + … + anxn, Y = b1x2 + … + hnxn, or Y = c1x4 + … + fnxn.]
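The slide’s idea — a transparent gate that routes each sample to a segment, plus one simple formula per segment — can be sketched in a few lines of Python. Everything below (the day-of-week rule, the features, the coefficients) is invented for illustration and is not NEC’s actual HML model:

```python
def assign_segment(x):
    """Rule-based gate: route a sample to a segment with transparent rules."""
    if x["day"] in ("Sat", "Sun"):
        return "weekend"
    return "weekday"

# One sparse linear formula per segment: y = sum(coef_f * x_f) + bias.
SEGMENT_FORMULAS = {
    "weekend": {"temperature": 2.0, "bias": 5.0},
    "weekday": {"visitors": 0.5, "bias": 1.0},
}

def predict(x):
    """Apply the formula of the segment the gate selects."""
    coefs = SEGMENT_FORMULAS[assign_segment(x)]
    return sum(c * x.get(f, 1.0 if f == "bias" else 0.0)
               for f, c in coefs.items())

sample = {"day": "Sat", "temperature": 10.0, "visitors": 100.0}
print(predict(sample))  # weekend formula: 2.0 * 10.0 + 5.0 -> prints 25.0
```

Because each segment uses only a few features, the resulting model stays readable, which is the “transparent” half of the claim above.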
- 10. 10 © NEC Corporation 2016
The HML Model
[Figure: an example HML tree. The data splits on gender and height (tall/short), and different segments select different features, such as age, smoking, food, weight, sports, tea/coffee, and beer/wine. Example insight: the health risk of a tall man can be predicted by age, alcohol, and smoking.]
- 11. 11 © NEC Corporation 2016
HML (Heterogeneous Mixture Learning)
- 12. 12 © NEC Corporation 2016
HML (Heterogeneous Mixture Learning)
- 13. 13 © NEC Corporation 2016
HML Algorithm
Segment data by a rule-based tree
[Figure: the pipeline of data segmentation, feature selection, and pruning over the training data.]
- 14. 14 © NEC Corporation 2016
HML Algorithm
Select features and fit predictive models for data segments
[Figure: the pipeline of data segmentation, feature selection, and pruning over the training data.]
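The stages named on these two slides can be sketched as a toy training loop. The median split, the mean-value “model”, and the pruning test below are illustrative stand-ins for HML’s rule-based tree, sparse formulas, and model selection:

```python
def fit_segment(rows):
    """Stand-in for feature selection + model fitting: predict the mean."""
    ys = [y for _, y in rows]
    return sum(ys) / len(ys)

def train(data):
    # (1) Data segmentation: a one-rule "tree" splitting on the median.
    threshold = sorted(x for x, _ in data)[len(data) // 2]
    segments = {"low": [r for r in data if r[0] < threshold],
                "high": [r for r in data if r[0] >= threshold]}
    # (2) Feature selection / model fitting: one model per segment.
    models = {name: fit_segment(rows) for name, rows in segments.items()}
    # (3) Pruning: collapse the split if both segments predict alike.
    if abs(models["low"] - models["high"]) < 1e-6:
        models = {"all": fit_segment(data)}
    return models, threshold

data = [(1, 10.0), (2, 10.0), (8, 50.0), (9, 50.0)]
models, threshold = train(data)
print(models)  # prints {'low': 10.0, 'high': 50.0}
```

The pruning step is what keeps the tree small: a split that does not change the predictions is removed rather than kept.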
- 15. 15 © NEC Corporation 2016
Enterprise Applications of HML
Driver Risk Assessment, Inventory Optimization, Churn Retention, Predictive Maintenance, Energy/Water Operation Mgmt., Sales Optimization, Product Price Optimization
- 16. 16 © NEC Corporation 2016
Driver Risk Assessment, Inventory Optimization, Churn Retention, Predictive Maintenance, Energy/Water Demand Forecasting, Sales Forecasting, Product Price Optimization
Demand for Big Data Analysis: 5 million customers × 12 months = 60,000,000 training samples
- 17. 17 © NEC Corporation 2016
Driver Risk Assessment, Inventory Optimization, Churn Retention, Predictive Maintenance, Energy/Water Operation Mgmt., Product Price Optimization, Sales Optimization
Demand for Big Data Analysis: 100 items × 1,000 shops = 100,000 predictive models
- 18. 18 © NEC Corporation 2016
Driver Risk Assessment, Inventory Optimization, Churn Retention, Predictive Maintenance, Energy/Water Operation Mgmt., Sales Optimization, Product Price Optimization
Demand for Big Data Analysis
- 20. 20 © NEC Corporation 2016
Why Spark, not Hadoop?
Because Spark’s in-memory architecture runs HML faster
[Figure: the HML stages (data segmentation, feature selection, pruning) iterate over the data across many CPUs; keeping the data in memory avoids re-reading it on every pass.]
- 21. 21 © NEC Corporation 2016
Data Scalability powered by Spark
Handle training data of virtually unlimited scale by adding executors
[Figure: the driver coordinates executors that read the training data from HDFS; add executors, and get more memory and CPU power.]
- 22. 22 © NEC Corporation 2016
Distributed HML Engine: Architecture
[Figure: a data scientist invokes the Distributed HML Interface (Python), which drives the Distributed HML Core (Scala) on YARN. The core reads training data from HDFS, uses matrix computation libraries (Breeze, BLAS), and outputs an HML model.]
- 23. 23 © NEC Corporation 2016
3 Technical Key Points for Running ML Fast on Spark
- 24. 24 © NEC Corporation 2016
3 Technical Key Points for Running ML Fast on Spark
- 25. 25 © NEC Corporation 2016
The Challenge: Avoiding Data Shuffling
[Figure: a naïve design shuffles TB-scale training data between executors; Distributed HML instead sends only small models (10KB~10MB) to the driver, which combines them with the HML model merge algorithm.]
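The contrast can be illustrated with a stand-in merge: each executor fits a kilobyte-scale summary of its local partition, and only those summaries travel to the driver. The weighted-average merge below is hypothetical — the slides name, but do not specify, the HML model merge algorithm:

```python
def fit_local(partition):
    """Executor-side: fit a tiny local model (mean, count) on local data."""
    return (sum(partition) / len(partition), len(partition))

def merge(models):
    """Driver-side: combine local means weighted by sample count."""
    total = sum(n for _, n in models)
    return sum(mean * n for mean, n in models) / total

partitions = [[1.0, 2.0, 3.0], [10.0, 20.0]]   # TB-scale data stays put
models = [fit_local(p) for p in partitions]    # only 10KB~10MB models move
print(merge(models))  # (2.0*3 + 15.0*2) / 5 -> prints 7.2
```

The key property is asymmetry: the data is terabytes and never moves, while the merged objects are kilobytes to megabytes, so the network cost is negligible.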
- 26. 26 © NEC Corporation 2016
Execution Flow of Distributed HML
Executors build a rule-based tree from their local data in parallel
- 27. 27 © NEC Corporation 2016
Execution Flow of Distributed HML
Driver merges the trees
(using the HML segment merge algorithm)
- 28. 28 © NEC Corporation 2016
Execution Flow of Distributed HML
Driver broadcasts the merged tree to executors
- 29. 29 © NEC Corporation 2016
Execution Flow of Distributed HML
Executors perform feature selection with their local data in parallel
- 30. 30 © NEC Corporation 2016
Execution Flow of Distributed HML
Driver merges the results of feature selection
(using the HML feature merge algorithm)
- 31. 31 © NEC Corporation 2016
Execution Flow of Distributed HML
Driver merges the results of feature selection
- 32. 32 © NEC Corporation 2016
Execution Flow of Distributed HML
Driver broadcasts the merged results of feature selection to executors
- 33. 33 © NEC Corporation 2016
3 Technical Key Points for Running ML Fast on Spark
- 34. 34 © NEC Corporation 2016
Waiting for Other Executors Delays Execution
Machine learning tends to produce an unbalanced computational load across executors
[Figure: one executor finishes late while the others sit idle.]
- 35. 35 © NEC Corporation 2016
Balance Computational Effort for Each Executor
Each executor optimizes all predictive formulas on an equally divided share of the training data
- 36. 36 © NEC Corporation 2016
Balance Computational Effort for Each Executor
Each executor optimizes all predictive formulas on an equally divided share of the training data
[Figure: with balanced shares, all executors finish together.] Completion time reduced!
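Why equal division helps can be captured with a one-line cost model: a parallel stage finishes when its slowest executor does, so completion time is the maximum, not the sum, of per-executor work. The segment sizes below are made up to show the skew:

```python
def completion_time(work_per_executor):
    """A stage ends when the most-loaded executor finishes."""
    return max(work_per_executor)

segment_sizes = [80, 10, 10]   # skewed segments, samples per segment

# Naive assignment: one segment per executor; load follows the skew.
unbalanced = completion_time(segment_sizes)

# The slide's approach: every executor gets an equal share of all the
# data and works on all predictive formulas.
n_executors = 3
balanced = completion_time([sum(segment_sizes) / n_executors] * n_executors)

print(unbalanced > balanced)  # prints True
```

With the skewed assignment one executor carries 80 units while two carry 10; with equal shares each carries about 33, so the stage ends much earlier.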
- 37. 37 © NEC Corporation 2016
3 Technical Key Points for Running ML Fast on Spark
- 38. 38 © NEC Corporation 2016
Machine Learning Consists of Many Matrix Computations
[Figure: each HML stage (data segmentation, feature selection, pruning) reduces to basic matrix computation, matrix inversion, combinatorial optimization, and so on, executed on the CPU.]
- 39. 39 © NEC Corporation 2016
RDD Design for Leveraging High-Speed Libraries
Straightforward RDD design: each RDD element is a single training sample (Training sample1, Training sample2, …), processed one by one with map; element-by-element processing cannot exploit fast matrix libraries.
Distributed HML’s RDD design: mapPartition consumes each partition’s iterator of samples (seq(Training sample1, Training sample2, …)) and creates one matrix object per partition, so the matrix computation libraries (Breeze, BLAS) can process a whole partition at once.
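The mapPartition design can be imitated in plain Python — lists of lists stand in for Breeze/BLAS matrices, and plain iterators stand in for RDD partitions; this is a sketch of the idea, not the Scala implementation:

```python
def to_matrix(sample_iter):
    """mapPartition-style function: yield one matrix object per partition,
    instead of one element per training sample."""
    yield [list(sample) for sample in sample_iter]

# Two "partitions" of an RDD of training samples.
partitions = [iter([(1.0, 2.0), (3.0, 4.0)]), iter([(5.0, 6.0)])]
matrices = [m for part in partitions for m in to_matrix(part)]

# A matrix library could now process each whole partition at once.
print(len(matrices))  # one matrix object per partition -> prints 2
```

Batching a partition into one dense matrix is what lets optimized BLAS kernels replace millions of tiny per-sample calls.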
- 40. 40 © NEC Corporation 2016
Performance Degradation by Long-Chained RDD
Long-chained RDD operations cause high-cost recalculations
RDD.map(TreeMapFunc)
   .reduce(TreeReduceFunc)
   .map(FeatureSelectionMapFunc)
   .reduce(FeatureSelectionReduceFunc)
   .map(TreeMapFunc)
   .reduce(TreeReduceFunc)
   .map(FeatureSelectionMapFunc)
   .reduce(FeatureSelectionReduceFunc)
   .map(PruningMapFunc)
- 41. 41 © NEC Corporation 2016
Performance Degradation by Long-Chained RDD
Long-chained RDD operations cause high-cost recalculations
Feature
Selection
Pruning
Data
Segmentation RDD.map(TreeMapFunc)
.map(FeatureSelectionMapFunc)
.reduce(TreeReduceFunc)
.reduce(FeatureSelectionReduceFunc)
.map(TreeMapFunc)
.map(FeatureSelectionMapFunc)
.reduce(TreeReduceFunc)
.reduce(FeatureSelectionReduceFunc)
GC
.map(TreeMapFunc)
CPU
.map(PruningMapFunc)
- 42. 42 © NEC Corporation 2016
Performance Degradation by Long-Chained RDD
Cut long-chained operations periodically by checkpoint()
savedRDD = RDD.map(TreeMapFunc)
              … (the tree and feature-selection map/reduce stages above) …
savedRDD.checkpoint()
[Figure: later stages (.map(TreeMapFunc), …, .map(PruningMapFunc)) restart from savedRDD instead of recomputing the whole chain.]
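The effect of checkpoint() can be imitated without Spark. In this toy lineage model, every evaluation replays the whole chain of stages unless an intermediate result has been materialized first; the counter shows the recomputation cost. This is an analogy, not Spark’s actual mechanics:

```python
calls = {"n": 0}

def stage(x):
    """One map stage; the counter records how often it really executes."""
    calls["n"] += 1
    return x + 1

def evaluate(chain_length, start=0):
    """Recompute the full lineage of `chain_length` stages."""
    for _ in range(chain_length):
        start = stage(start)
    return start

# No checkpoint: two downstream actions each replay all 4 stages.
evaluate(4)
evaluate(4)
no_checkpoint_cost = calls["n"]        # 8 stage executions

# With a checkpoint: materialize once, then reuse the saved value.
calls["n"] = 0
saved = evaluate(4)                    # like savedRDD.checkpoint()
downstream = (saved + 1, saved + 2)    # later actions start from `saved`
with_checkpoint_cost = calls["n"]      # still only 4 stage executions
print(no_checkpoint_cost, with_checkpoint_cost)  # prints 8 4
```

The trade-off is the cost of writing the checkpoint versus the cost of replaying the lineage, which is why the slide cuts the chain periodically rather than after every stage.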
- 44. 44 © NEC Corporation 2016
Eval 1: Prediction Error (vs. Spark MLlib algorithms)
Distributed HML achieves low error, competitive with more complex models

data | # samples | Distributed HML | Logistic Regression | Decision Tree | Random Forests
gas sensor array (CO)* | 4,208,261 | 0.542 (1st) | 0.597 | 0.587 | 0.576
household power consumption* | 2,075,259 | 0.524 (1st) | 0.531 | 0.529 | 0.655
HIGGS* | 11,000,000 | 0.335 (2nd) | 0.358 | 0.337 | 0.317
HEPMASS* | 7,000,000 | 0.156 (1st) | 0.163 | 0.167 | 0.175

* UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)
- 45. 45 © NEC Corporation 2016
Eval 2: Performance Scalability (vs. Spark MLlib algos)
Distributed HML is competitive with Spark MLlib implementations.
[Figure: training time vs. # CPU cores on HIGGS* (11M samples) and HEPMASS* (7M samples), comparing Distributed HML with MLlib Logistic Regression.]
* UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)
- 47. 47 © NEC Corporation 2016
ATM Cash Balance Prediction
Too much cash → high inventory cost; too little cash → high risk of running out of cash. Around 20,000 ATMs.
- 48. 48 © NEC Corporation 2016
Training Speed
Serial HML*: 9 days → Distributed HML: 2 hours (110x speedup)
▌Summary of data
# ATMs: around 20,000 (in Japan)
# training samples: around 10M
▌Cluster spec (10 nodes)
# CPU cores: 128 (Xeon E5-2640 v3)
Memory: 2.5TB
Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)
* Run w/ 1 CPU core and 256GB memory
- 49. 49 © NEC Corporation 2016
Prediction Error
Distributed HML vs. Serial HML: average error 17% lower (better: Distributed HML)
▌Summary of data
# ATMs: around 20,000 (in Japan)
# training samples: around 23M
▌Cluster spec (10 nodes)
# CPU cores: 128
Memory: 2.5TB
Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)
- 50. NEC’s HML Data Analytics Appliance
powered by Micro Modular Server
- 51. 51 © NEC Corporation 2016
Dilemma between Cloud and On-Premises
Cloud
• Pros: quick start; low operation effort for Spark/YARN/HDFS; no space/power cost; no installation effort
• Cons: high pay-for-use charge; complicated application software design; data leakage risk

On-Premises
• Pros: no pay-for-use charge; low data leakage risk; simple application software design
• Cons: complicated operation for Spark/YARN/HDFS; high installation cost; high space/power cost
- 52. 52 © NEC Corporation 2016
NEC’s HML Data Analytics Appliance: Concept
Micro Modular Server: Plug-and-Analytics!
• Quick deploy
• Low data leakage risk
• Low operation effort for Spark/YARN/HDFS, powered by HDP
• Significant space savings
• Low power cost
• Simple application software
328 CPU cores, 1.3TB memory and 5TB SSD in 2U space!
- 53. 53 © NEC Corporation 2016
Speed Test (ATM Cash Balance Prediction)
Standard Cluster: 7275 sec (2h) → Micro Modular Server: 4529 sec (1.25h), a 1.6x speedup
▌Summary of data
# ATMs: around 20,000 (in Japan)
# training samples: around 10M
▌Standard Cluster spec (10 nodes)
# CPU cores: 128 (Xeon E5-2640 v3)
Memory: 2.5TB
Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)
- 54. 54 © NEC Corporation 2016
Toward More Big Data!
Micro Modular Server (DX1000) → Scalable Modular Server (DX2000)
• CPU: 2.3x speedup
• Memory: 1.7x increase
• SSD: 3.3x increase
272 high-speed Xeon-D 2.1GHz cores, 2TB memory and 17TB SSD in 3U space
- 55. 55 © NEC Corporation 2016
Summary
HML + Micro Modular Server
- 56. 56 © NEC Corporation 2016
Summary
HML + Micro Modular Server
- 59. 59 © NEC Corporation 2016
Micro Modular Server (DX1000) Spec.: Server Module
- 60. 60 © NEC Corporation 2016
Micro Modular Server (DX1000) Spec.: Module Enclosure
- 61. 61 © NEC Corporation 2016
Scalable Modular Server (DX2000) Spec.: Server Module
- 62. 62 © NEC Corporation 2016
Scalable Modular Server (DX2000) Spec.: Module Enclosure