SlideShare a Scribd company logo
1 of 62
Download to read offline
Big Data Heterogeneous
Mixture Learning on Spark
Masato Asahara and Ryohei Fujimaki
NEC Data Science Research Labs.
Jun/30/2016 @Hadoop Summit 2016
2 © NEC Corporation 2016
Who we are?
▌Masato Asahara (Ph.D.)
▌Researcher, NEC Data Science Research Laboratory
 Masato received his Ph.D. degree from Keio University, and has worked at
NEC for 6 years in the research field of distributed computing systems and
computing resource management technologies.
 Masato is currently leading developments of Spark-based machine learning
and data analytics systems, particularly using NEC’s Heterogeneous Mixture
Learning technology.
▌Ryohei Fujimaki (Ph.D.)
▌Research Fellow, NEC Data Science Research Laboratory
 Ryohei is responsible for cores to NEC’s leading predictive and prescriptive
analytics solutions, including "heterogeneous mixture learning” and
“predictive optimization” technologies. In addition to technology R&D,
Ryohei is also involved with co-developing cutting-edge advanced analytics
solutions with NEC’s global business clients and partners in the North
American and APAC regions.
 Ryohei received his Ph.D. degree from the University of Tokyo, and became
the youngest research fellow ever in NEC Corporation’s 115-year history.
3 © NEC Corporation 2016
Agenda
HML
4 © NEC Corporation 2016
Agenda
CPU
12
SEP
11
SEP
10
SEP
HML
5 © NEC Corporation 2016
Agenda
HML
6 © NEC Corporation 2016
Agenda
Micro Modular Server
HML
NEC’s Predictive Analytics and
Heterogeneous Mixture Learning
8 © NEC Corporation 2016
Enterprise Applications of HML
Driver Risk
Assessment
Inventory
Optimization
Churn
Retention
Predictive
Maintenance
Product Price
Optimization
Sales
Optimization
Energy/Water
Operation Mgmt.
9 © NEC Corporation 2016
NEC’s Heterogeneous Mixture Learning (HML)
NEC’s machine learning that automatically derives “accurate” and
“transparent” formulas behind Big Data.
Sun
Sat
Y=a1x1+ ・・・ +anxn
Y=a1x1+ ・・・+knx3
Y=b1x2+ ・・・+hnxn
Y=c1x4+ ・・・+fnxn
Mon
Sun
Explore Massive Formulas
Transparent
data segmentation
and predictive formulas
Heterogeneous
mixture data
10 © NEC Corporation 2016
The HML Model
Gender
Height
tallshort
age
smoke
food
weight
sports
sports
tea/coffee
beer/wine
The health risk of a
tall man can be
predicted by age,
alcohol and smoke!
11 © NEC Corporation 2016
HML (Heterogeneous Mixture Learning)
12 © NEC Corporation 2016
HML (Heterogeneous Mixture Learning)
13 © NEC Corporation 2016
HML Algorithm
Segment data by a rule-based tree
Feature
Selection
Pruning
Data
Segmentation
Training data
14 © NEC Corporation 2016
HML Algorithm
Select features and fit predictive models for data segments
Feature
Selection
Pruning
Data
Segmentation
Training data
15 © NEC Corporation 2016
Enterprise Applications of HML
Driver Risk
Assessment
Inventory
Optimization
Churn
Retention
Predictive
Maintenance
Energy/Water
Operation Mgmt.
Sales
Optimization
Product Price
Optimization
16 © NEC Corporation 2016
Driver Risk
Assessment
Inventory
Optimization
Churn
Retention
Predictive
Maintenance
Energy/Water
Demand Forecasting
Sales
Forecasting
Product Price
Optimization
Demand for Big Data Analysis
5 million(customers)×12(months)
=60,000,000 training samples
17 © NEC Corporation 2016
Driver Risk
Assessment
Inventory
Optimization
Churn
Retention
Predictive
Maintenanc
e
Energy/Water
Operation Mgmt.
Product Price
Optimization
Demand for Big Data Analysis
100(items)×1000(shops)
=100,000 predictive models
Sales
Optimization
18 © NEC Corporation 2016
Driver Risk
Assessment
Inventory
Optimization
Churn
Retention
Predictive
Maintenanc
e
Energy/Water
Operation Mgmt.
Sales
Optimization
Product Price
Optimization
Demand for Big Data Analysis
Distributed HML on Spark
20 © NEC Corporation 2016
Why Spark, not Hadoop?
Because Spark’s in-memory architecture can run HML faster
Feature
Selection
Pruning
Data
Segmentation
CPU CPU
Data segmentation Feature selection
CPU CPU CPU
Data segmentation Feature selection
~
Pruning
21 © NEC Corporation 2016
Data Scalability powered by Spark
Treat unlimited scale of training data by adding executors
executor executor executor
CPU CPU CPU
driver
HDFS
Add executors,
and get
more memory
and CPU power
Training data
22 © NEC Corporation 2016
Distributed HML Engine: Architecture
HDFS
Distributed HML
Core (Scala)
YARN
invoke
HML
model
Distributed HML
Interface (Python)
Matrix
computation
libraries
(Breeze, BLAS)
Data Scientist
23 © NEC Corporation 2016
3 Technical Key Points to Fast Run ML on Spark
CPU
12
SEP
11
SEP
10
SEP
24 © NEC Corporation 2016
3 Technical Key Points to Fast Run ML on Spark
CPU
12
SEP
11
SEP
10
SEP
25 © NEC Corporation 2016
Challenge to Avoid Data Shuffling
executor executor executor
driver
Model:
10KB~10MB
HML Model Merge Algorithm
Training data:
TB~
executor executor executor
driver
Naïve Design Distributed HML
26 © NEC Corporation 2016
Execution Flow of Distributed HML
Executors build a rule-based tree from their local data in parallel
Driver
Executor
Feature
Selection
Pruning
Data
Segmentation
27 © NEC Corporation 2016
Execution Flow of Distributed HML
Driver merges the trees
Driver
Executor
Feature
Selection
Pruning
Data
Segmentation
HML Segment Merge Algorithm
28 © NEC Corporation 2016
Execution Flow of Distributed HML
Driver broadcasts the merged tree to executors
Driver
Executor
Feature
Selection
Pruning
Data
Segmentation
29 © NEC Corporation 2016
Execution Flow of Distributed HML
Executors perform feature selection with their local data in parallel
Driver
Executor
Feature
Selection
Pruning
Data
Segmentation
Sat
Sat
Sat
Sat
Sat
30 © NEC Corporation 2016
Execution Flow of Distributed HML
Driver merges the results of feature selection
Driver
Executor
Feature
Selection
Pruning
Data
Segmentation
Sat
Sat
Sat
Sat
Sat
HML Feature
Merge Algorithm
31 © NEC Corporation 2016
Execution Flow of Distributed HML
Driver merges the results of feature selection
Driver
Executor
Feature
Selection
Pruning
Data
Segmentation
Sat
HML Feature
Merge Algorithm
32 © NEC Corporation 2016
Execution Flow of Distributed HML
Driver broadcasts the merged results of feature selection to executors
Driver
Executor
Feature
Selection
Pruning
Data
Segmentation
Sat
SatSatSat
33 © NEC Corporation 2016
3 Technical Key Points to Fast Run ML on Spark
CPU
12
SEP
11
SEP
10
SEP
34 © NEC Corporation 2016
Wait Time for Other Executors Delays Execution
Machine learning is likely to cause unbalanced computation load
Executor
Sat
Sat
Time
Executor Executor
35 © NEC Corporation 2016
Balance Computational Effort for Each Executor
Executor optimizes all predictive formulas with equally-divided training data
Executor
Sat
Sat
Time
Executor Executor
36 © NEC Corporation 2016
Balance Computational Effort for Each Executor
Executor optimizes all predictive formulas with equally-divided training data
Executor
Sat
Sat
Time
Executor Executor
Sat Sat
Sat Sat
Completion time reduced!
37 © NEC Corporation 2016
3 Technical Key Points to Fast Run ML on Spark
1TB
CPU
12
SEP
11
SEP
10
SEP
38 © NEC Corporation 2016
Machine Learning Consists of Many Matrix Computations
Feature
Selection
Pruning
Data
Segmentation
Basic Matrix Computation
Matrix Inversion
Combinatorial Optimization
…
CPU
39 © NEC Corporation 2016
RDD Design for Leveraging High-Speed Libraries
Distributed HML’s RDD Design
Matrix object1
element
…
Straightforward RDD Design
Training sample1
partition
Training sample2
CPU
CPU
Matrix computation libraries
(Breeze, BLAS)
~Training sample1
Training sample2
…
seq(Training sample1,
Training sample2, …)
mapPartition
Matrix object1
…
Training sample1
Training sample2
map
Iterator
Matrix
object
creation
element
40 © NEC Corporation 2016
Performance Degradation by Long-Chained RDD
Long-chained RDD operations cause high-cost recalculations
Feature
Selection
Pruning
Data
Segmentation RDD.map(TreeMapFunc)
.map(FeatureSelectionMapFunc)
.reduce(TreeReduceFunc)
.reduce(FeatureSelectionReduceFunc)
.map(TreeMapFunc)
.map(FeatureSelectionMapFunc)
.reduce(TreeReduceFunc)
.reduce(FeatureSelectionReduceFunc)
.map(PruningMapFunc)
41 © NEC Corporation 2016
Performance Degradation by Long-Chained RDD
Long-chained RDD operations cause high-cost recalculations
Feature
Selection
Pruning
Data
Segmentation RDD.map(TreeMapFunc)
.map(FeatureSelectionMapFunc)
.reduce(TreeReduceFunc)
.reduce(FeatureSelectionReduceFunc)
.map(TreeMapFunc)
.map(FeatureSelectionMapFunc)
.reduce(TreeReduceFunc)
.reduce(FeatureSelectionReduceFunc)
GC
.map(TreeMapFunc)
CPU
.map(PruningMapFunc)
42 © NEC Corporation 2016
Performance Degradation by Long-Chained RDD
Cut long-chained operations periodically by checkpoint()
Feature
Selection
Pruning
Data
Segmentation RDD.map(TreeMapFunc)
.map(FeatureSelectionMapFunc)
.reduce(TreeReduceFunc)
.reduce(FeatureSelectionReduceFunc)
.map(TreeMapFunc)
.map(FeatureSelectionMapFunc)
.reduce(TreeReduceFunc)
.reduce(FeatureSelectionReduceFunc)
GC
.map(TreeMapFunc)
savedRDD
.checkpoint()
=
CPU
~
.map(PruningMapFunc)
Benchmark Performance Evaluations
44 © NEC Corporation 2016
Eval 1: Prediction Error (vs. Spark MLlib algorithms)
Distributed HML achieves low error competitive to a complex model
data # samples Distributed
HML
Logistic
Regression
Decision
Tree
Random
Forests
gas sensor
array (CO)*
4,208,261 0.542 0.597 0.587 0.576
household
power
consumption
*
2,075,259 0.524 0.531 0.529 0.655
HIGGS* 11,000,000 0.335 0.358 0.337 0.317
HEPMASS* 7,000,000 0.156 0.163 0.167 0.175
* UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)
1st
1st
1st
2nd
45 © NEC Corporation 2016
Eval 2: Performance Scalability (vs. Spark MLlib algos)
Distributed HML is competitive with Spark MLlib implementations.
HIGGS* (11M samples) HEPMASS* (7M samples)# CPU cores
Distributed
HML
Logistic
Regression
Distributed
HML
* UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)
Logistic
Regression
Evaluation in Real Case
47 © NEC Corporation 2016
ATM Cash Balance Prediction
Much cash  High
inventory cost
Little cash  High risk of
out of stock
Around 20,000 ATMs
48 © NEC Corporation 2016
Training Speed
2 hours9 days
▌Summary of data
# ATMs: around 20,000 (in Japan)
# training samples: around 10M
▌Cluster spec (10 nodes)
 # CPU cores: 128 (Xeon E5-2640 v3)
 Memory: 2.5TB
 Spark 1.6.1,
Hadoop 2.7.1 (HDP 2.3.2)
* Run w/ 1 CPU core and 256GB memory
110x
Speed up
Serial HML* Distributed HML
49 © NEC Corporation 2016
Prediction Error
▌Summary of data
# ATMs: around 20,000 (in Japan)
# training samples: around 23M
▌Cluster spec (10 nodes)
 # CPU cores: 128
 Memory: 2.5TB
 Spark 1.6.1,
Hadoop 2.7.1 (HDP 2.3.2)
Average Error:
17% lower
Serial
HML
Distributed HMLSerial
HML
Serial
HML
vs
Better: Distributed HML
NEC’s HML Data Analytics Appliance
powered by Micro Modular Server
51 © NEC Corporation 2016
Dilemma between Cloud and On-Premises
Cloud On-Premises
•High pay-for-use charge
•Complicated application
software design
•Data leakage risk
• Complicated operation
for Spark/YARN/HDFS
• High installation cost
• High space/power cost
•Quick start
•Low operation effort for
Spark/YARN/HDFS
•No space/power cost
•No installation effort
• No pay-for-use charge
• Low data leakage risk
• Simple application
software design
52 © NEC Corporation 2016
NEC’s HML Data Analytics Appliance: Concept
Micro Modular Server
Data Scientist ~
Plug-and-Analytics!
•Quick deploy!
•Low data leakage risk
•Low operation effort for
Spark/YARN/HDFS powered by HDP
•Significantly space saving
•Low power cost
•Simple application software
328 CPU cores, 1.3TB memory
and 5TB SSD in 2U space!
HML
53 © NEC Corporation 2016
Speed Test (ATM Cash Balance Prediction)
7275 sec
(2h)
▌Summary of data
# ATMs: around 20,000 (in Japan)
# training samples: around 10M
▌Standard Cluster spec (10 nodes)
 # CPU cores: 128 (Xeon E5-2640 v3)
 Memory: 2.5TB
 Spark 1.6.1,
Hadoop 2.7.1 (HDP 2.3.2)
Standard Cluster Micro Modular Server
4529 sec
(1.25h)
1.6x
Speed up
54 © NEC Corporation 2016
Toward More Big Data!
Micro Modular Server
(DX1000)
Scalable Modular Server
(DX2000)
• CPU 2.3x speed up
• MEM 1.7x increased
• SSD 3.3x increased
272 high-speed Xeon-D 2.1GHz cores, 2TB
memory and 17TB SSD in 3U space
55 © NEC Corporation 2016
Summary
CPU
12
SEP
11
SEP
10
SEP
HMLMicro Modular Server
56 © NEC Corporation 2016
Summary
HMLMicro Modular Server
Appendix
59 © NEC Corporation 2016
Micro Modular Server (DX1000) Spec.: Server Module
60 © NEC Corporation 2016
Micro Modular Server (DX1000) Spec.: Module Enclosure
61 © NEC Corporation 2016
Scalable Modular Server (DX2000) Spec.: Server Module
62 © NEC Corporation 2016
Scalable Modular Server (DX2000) Spec.: Module Enclosure

More Related Content

What's hot

Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
Brock Noland
 
Lessons learned processing 70 billion data points a day using the hybrid cloud
Lessons learned processing 70 billion data points a day using the hybrid cloudLessons learned processing 70 billion data points a day using the hybrid cloud
Lessons learned processing 70 billion data points a day using the hybrid cloud
DataWorks Summit
 
Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...
DataWorks Summit
 

What's hot (20)

Spark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan ZvaraSpark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan Zvara
 
Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Lessons Learned Migrating from IBM BigInsights to Hortonworks Data Platform
Lessons Learned Migrating from IBM BigInsights to Hortonworks Data PlatformLessons Learned Migrating from IBM BigInsights to Hortonworks Data Platform
Lessons Learned Migrating from IBM BigInsights to Hortonworks Data Platform
 
Deep Learning at Scale
Deep Learning at ScaleDeep Learning at Scale
Deep Learning at Scale
 
The Evolution of Big Data Pipelines at Intuit
The Evolution of Big Data Pipelines at Intuit The Evolution of Big Data Pipelines at Intuit
The Evolution of Big Data Pipelines at Intuit
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
 
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
 
Lessons learned processing 70 billion data points a day using the hybrid cloud
Lessons learned processing 70 billion data points a day using the hybrid cloudLessons learned processing 70 billion data points a day using the hybrid cloud
Lessons learned processing 70 billion data points a day using the hybrid cloud
 
Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...
 
Keys for Success from Streams to Queries
Keys for Success from Streams to QueriesKeys for Success from Streams to Queries
Keys for Success from Streams to Queries
 
The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!
 
Self-Service Analytics on Hadoop: Lessons Learned
Self-Service Analytics on Hadoop: Lessons LearnedSelf-Service Analytics on Hadoop: Lessons Learned
Self-Service Analytics on Hadoop: Lessons Learned
 
Real World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in ProductionReal World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in Production
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
 
Building Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataBuilding Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous Data
 
Production Grade Data Science for Hadoop
Production Grade Data Science for HadoopProduction Grade Data Science for Hadoop
Production Grade Data Science for Hadoop
 
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationParis FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks Presentation
 
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...
 

Viewers also liked

Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Daniel Madrigal
 
Supply Chain Management of 7 eleven
Supply Chain Management  of 7 elevenSupply Chain Management  of 7 eleven
Supply Chain Management of 7 eleven
Susheel Racherla
 

Viewers also liked (20)

Launching your advanced analytics program for success in a mature industry
Launching your advanced analytics program for success in a mature industryLaunching your advanced analytics program for success in a mature industry
Launching your advanced analytics program for success in a mature industry
 
Lego-like building blocks of Storm and Spark Streaming Pipelines
Lego-like building blocks of Storm and Spark Streaming PipelinesLego-like building blocks of Storm and Spark Streaming Pipelines
Lego-like building blocks of Storm and Spark Streaming Pipelines
 
Hadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsHadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the experts
 
2015-4 혁신기술로서의 빅데이터 국내 기술수용 초기 특성연구- 김정선
2015-4 혁신기술로서의 빅데이터 국내 기술수용 초기 특성연구- 김정선2015-4 혁신기술로서의 빅데이터 국내 기술수용 초기 특성연구- 김정선
2015-4 혁신기술로서의 빅데이터 국내 기술수용 초기 특성연구- 김정선
 
IoT:what about data storage?
IoT:what about data storage?IoT:what about data storage?
IoT:what about data storage?
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Big Data Ready Enterprise
Big Data Ready Enterprise Big Data Ready Enterprise
Big Data Ready Enterprise
 
Timeline service V2 at the Hadoop Summit SJ 2016
Timeline service V2 at the Hadoop Summit SJ 2016Timeline service V2 at the Hadoop Summit SJ 2016
Timeline service V2 at the Hadoop Summit SJ 2016
 
The Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache FlinkThe Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache Flink
 
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
 
Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success
 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
 
Apache Phoenix + Apache HBase
Apache Phoenix + Apache HBaseApache Phoenix + Apache HBase
Apache Phoenix + Apache HBase
 
Filling the Data Lake
Filling the Data LakeFilling the Data Lake
Filling the Data Lake
 
Supply Chain Management of 7 eleven
Supply Chain Management  of 7 elevenSupply Chain Management  of 7 eleven
Supply Chain Management of 7 eleven
 
Distributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop ClustersDistributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop Clusters
 
7-Eleven Presentation
7-Eleven Presentation7-Eleven Presentation
7-Eleven Presentation
 
Supply Chain Strategy at 7-Eleven
Supply Chain Strategy at 7-ElevenSupply Chain Strategy at 7-Eleven
Supply Chain Strategy at 7-Eleven
 
7 eleven Japan
7 eleven Japan7 eleven Japan
7 eleven Japan
 
Integrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data LakesIntegrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data Lakes
 

Similar to Big Data Heterogeneous Mixture Learning on Spark

Quick! Quick! Exploration!: A framework for searching a predictive model on A...
Quick! Quick! Exploration!: A framework for searching a predictive model on A...Quick! Quick! Exploration!: A framework for searching a predictive model on A...
Quick! Quick! Exploration!: A framework for searching a predictive model on A...
DataWorks Summit
 
Resume_Mahadevan_new (2)
Resume_Mahadevan_new (2)Resume_Mahadevan_new (2)
Resume_Mahadevan_new (2)
Mahadevan N
 

Similar to Big Data Heterogeneous Mixture Learning on Spark (20)

Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On Spark
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
 
Quick! Quick! Exploration!: A framework for searching a predictive model on A...
Quick! Quick! Exploration!: A framework for searching a predictive model on A...Quick! Quick! Exploration!: A framework for searching a predictive model on A...
Quick! Quick! Exploration!: A framework for searching a predictive model on A...
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
 
IBM - Craig Bender
IBM - Craig BenderIBM - Craig Bender
IBM - Craig Bender
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
 
Austin,TX Meetup presentation tensorflow final oct 26 2017
Austin,TX Meetup presentation tensorflow final oct 26 2017Austin,TX Meetup presentation tensorflow final oct 26 2017
Austin,TX Meetup presentation tensorflow final oct 26 2017
 
Trends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systemsTrends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systems
 
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBMPowering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
 
Resume_Mahadevan_new (2)
Resume_Mahadevan_new (2)Resume_Mahadevan_new (2)
Resume_Mahadevan_new (2)
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
IBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERIBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWER
 
The Fast Path to Building Operational Applications with Spark
The Fast Path to Building Operational Applications with SparkThe Fast Path to Building Operational Applications with Spark
The Fast Path to Building Operational Applications with Spark
 
Sql 2016 2017 full
Sql 2016   2017 fullSql 2016   2017 full
Sql 2016 2017 full
 
Insync10 anthony spierings
Insync10 anthony spieringsInsync10 anthony spierings
Insync10 anthony spierings
 
CH02-Computer Organization and Architecture 10e.pptx
CH02-Computer Organization and Architecture 10e.pptxCH02-Computer Organization and Architecture 10e.pptx
CH02-Computer Organization and Architecture 10e.pptx
 
High Performance Computing
High Performance ComputingHigh Performance Computing
High Performance Computing
 
OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020
 

More from DataWorks Summit/Hadoop Summit

How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Recently uploaded

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Recently uploaded (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

Big Data Heterogeneous Mixture Learning on Spark

  • 1. Big Data Heterogeneous Mixture Learning on Spark Masato Asahara and Ryohei Fujimaki NEC Data Science Research Labs. Jun/30/2016 @Hadoop Summit 2016
  • 2. 2 © NEC Corporation 2016 Who we are? ▌Masato Asahara (Ph.D.) ▌Researcher, NEC Data Science Research Laboratory  Masato received his Ph.D. degree from Keio University, and has worked at NEC for 6 years in the research field of distributed computing systems and computing resource management technologies.  Masato is currently leading developments of Spark-based machine learning and data analytics systems, particularly using NEC’s Heterogeneous Mixture Learning technology. ▌Ryohei Fujimaki (Ph.D.) ▌Research Fellow, NEC Data Science Research Laboratory  Ryohei is responsible for cores to NEC’s leading predictive and prescriptive analytics solutions, including "heterogeneous mixture learning” and “predictive optimization” technologies. In addition to technology R&D, Ryohei is also involved with co-developing cutting-edge advanced analytics solutions with NEC’s global business clients and partners in the North American and APAC regions.  Ryohei received his Ph.D. degree from the University of Tokyo, and became the youngest research fellow ever in NEC Corporation’s 115-year history.
  • 3. 3 © NEC Corporation 2016 Agenda HML
  • 4. 4 © NEC Corporation 2016 Agenda CPU 12 SEP 11 SEP 10 SEP HML
  • 5. 5 © NEC Corporation 2016 Agenda HML
  • 6. 6 © NEC Corporation 2016 Agenda Micro Modular Server HML
  • 7. NEC’s Predictive Analytics and Heterogeneous Mixture Learning
  • 8. 8 © NEC Corporation 2016 Enterprise Applications of HML Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenance Product Price Optimization Sales Optimization Energy/Water Operation Mgmt.
  • 9. 9 © NEC Corporation 2016 NEC’s Heterogeneous Mixture Learning (HML) NEC’s machine learning that automatically derives “accurate” and “transparent” formulas behind Big Data. Sun Sat Y=a1x1+ ・・・ +anxn Y=a1x1+ ・・・+knx3 Y=b1x2+ ・・・+hnxn Y=c1x4+ ・・・+fnxn Mon Sun Explore Massive Formulas Transparent data segmentation and predictive formulas Heterogeneous mixture data
  • 10. 10 © NEC Corporation 2016 The HML Model Gender Height tallshort age smoke food weight sports sports tea/coffee beer/wine The health risk of a tall man can be predicted by age, alcohol and smoke!
  • 11. 11 © NEC Corporation 2016 HML (Heterogeneous Mixture Learning)
  • 12. 12 © NEC Corporation 2016 HML (Heterogeneous Mixture Learning)
  • 13. 13 © NEC Corporation 2016 HML Algorithm Segment data by a rule-based tree Feature Selection Pruning Data Segmentation Training data
  • 14. 14 © NEC Corporation 2016 HML Algorithm Select features and fit predictive models for data segments Feature Selection Pruning Data Segmentation Training data
  • 15. 15 © NEC Corporation 2016 Enterprise Applications of HML Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenance Energy/Water Operation Mgmt. Sales Optimization Product Price Optimization
  • 16. 16 © NEC Corporation 2016 Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenance Energy/Water Demand Forecasting Sales Forecasting Product Price Optimization Demand for Big Data Analysis 5 million(customers)×12(months) =60,000,000 training samples
  • 17. 17 © NEC Corporation 2016 Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenanc e Energy/Water Operation Mgmt. Product Price Optimization Demand for Big Data Analysis 100(items)×1000(shops) =100,000 predictive models Sales Optimization
  • 18. 18 © NEC Corporation 2016 Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenanc e Energy/Water Operation Mgmt. Sales Optimization Product Price Optimization Demand for Big Data Analysis
  • 20. 20 © NEC Corporation 2016 Why Spark, not Hadoop? Because Spark’s in-memory architecture can run HML faster Feature Selection Pruning Data Segmentation CPU CPU Data segmentation Feature selection CPU CPU CPU Data segmentation Feature selection ~ Pruning
  • 21. 21 © NEC Corporation 2016 Data Scalability powered by Spark Treat unlimited scale of training data by adding executors executor executor executor CPU CPU CPU driver HDFS Add executors, and get more memory and CPU power Training data
  • 22. 22 © NEC Corporation 2016 Distributed HML Engine: Architecture HDFS Distributed HML Core (Scala) YARN invoke HML model Distributed HML Interface (Python) Matrix computation libraries (Breeze, BLAS) Data Scientist
  • 23. 23 © NEC Corporation 2016 3 Technical Key Points to Fast Run ML on Spark CPU 12 SEP 11 SEP 10 SEP
  • 24. 24 © NEC Corporation 2016 3 Technical Key Points to Fast Run ML on Spark CPU 12 SEP 11 SEP 10 SEP
  • 25. 25 © NEC Corporation 2016 Challenge to Avoid Data Shuffling executor executor executor driver Model: 10KB~10MB HML Model Merge Algorithm Training data: TB~ executor executor executor driver Naïve Design Distributed HML
  • 26. 26 © NEC Corporation 2016 Execution Flow of Distributed HML Executors build a rule-based tree from their local data in parallel Driver Executor Feature Selection Pruning Data Segmentation
  • 27. 27 © NEC Corporation 2016 Execution Flow of Distributed HML Driver merges the trees Driver Executor Feature Selection Pruning Data Segmentation HML Segment Merge Algorithm
  • 28. 28 © NEC Corporation 2016 Execution Flow of Distributed HML Driver broadcasts the merged tree to executors Driver Executor Feature Selection Pruning Data Segmentation
  • 29. 29 © NEC Corporation 2016 Execution Flow of Distributed HML Executors perform feature selection with their local data in parallel Driver Executor Feature Selection Pruning Data Segmentation Sat Sat Sat Sat Sat
  • 30. 30 © NEC Corporation 2016 Execution Flow of Distributed HML Driver merges the results of feature selection Driver Executor Feature Selection Pruning Data Segmentation Sat Sat Sat Sat Sat HML Feature Merge Algorithm
  • 31. 31 © NEC Corporation 2016 Execution Flow of Distributed HML Driver merges the results of feature selection Driver Executor Feature Selection Pruning Data Segmentation Sat HML Feature Merge Algorithm
  • 32. 32 © NEC Corporation 2016 Execution Flow of Distributed HML Driver broadcasts the merged results of feature selection to executors Driver Executor Feature Selection Pruning Data Segmentation Sat SatSatSat
  • 33. 33 © NEC Corporation 2016 3 Technical Key Points to Fast Run ML on Spark CPU 12 SEP 11 SEP 10 SEP
  • 34. 34 © NEC Corporation 2016 Wait Time for Other Executors Delays Execution Machine learning is likely to cause unbalanced computation load Executor Sat Sat Time Executor Executor
  • 35. 35 © NEC Corporation 2016 Balance Computational Effort for Each Executor Executor optimizes all predictive formulas with equally-divided training data Executor Sat Sat Time Executor Executor
  • 36. 36 © NEC Corporation 2016 Balance Computational Effort for Each Executor Executor optimizes all predictive formulas with equally-divided training data Executor Sat Sat Time Executor Executor Sat Sat Sat Sat Completion time reduced!
  • 37. 37 © NEC Corporation 2016 3 Technical Key Points to Fast Run ML on Spark 1TB CPU 12 SEP 11 SEP 10 SEP
  • 38. 38 © NEC Corporation 2016 Machine Learning Consists of Many Matrix Computations Feature Selection Pruning Data Segmentation Basic Matrix Computation Matrix Inversion Combinatorial Optimization … CPU
  • 39. 39 © NEC Corporation 2016 RDD Design for Leveraging High-Speed Libraries Distributed HML’s RDD Design Matrix object1 element … Straightforward RDD Design Training sample1 partition Training sample2 CPU CPU Matrix computation libraries (Breeze, BLAS) ~Training sample1 Training sample2 … seq(Training sample1, Training sample2, …) mapPartition Matrix object1 … Training sample1 Training sample2 map Iterator Matrix object creation element
  • 40. 40 © NEC Corporation 2016 Performance Degradation by Long-Chained RDD Long-chained RDD operations cause high-cost recalculations Feature Selection Pruning Data Segmentation RDD.map(TreeMapFunc) .map(FeatureSelectionMapFunc) .reduce(TreeReduceFunc) .reduce(FeatureSelectionReduceFunc) .map(TreeMapFunc) .map(FeatureSelectionMapFunc) .reduce(TreeReduceFunc) .reduce(FeatureSelectionReduceFunc) .map(PruningMapFunc)
  • 41. 41 © NEC Corporation 2016 Performance Degradation by Long-Chained RDD Long-chained RDD operations cause high-cost recalculations Feature Selection Pruning Data Segmentation RDD.map(TreeMapFunc) .map(FeatureSelectionMapFunc) .reduce(TreeReduceFunc) .reduce(FeatureSelectionReduceFunc) .map(TreeMapFunc) .map(FeatureSelectionMapFunc) .reduce(TreeReduceFunc) .reduce(FeatureSelectionReduceFunc) GC .map(TreeMapFunc) CPU .map(PruningMapFunc)
  • 42. 42 © NEC Corporation 2016 Performance Degradation by Long-Chained RDD Cut long-chained operations periodically by checkpoint() Feature Selection Pruning Data Segmentation RDD.map(TreeMapFunc) .map(FeatureSelectionMapFunc) .reduce(TreeReduceFunc) .reduce(FeatureSelectionReduceFunc) .map(TreeMapFunc) .map(FeatureSelectionMapFunc) .reduce(TreeReduceFunc) .reduce(FeatureSelectionReduceFunc) GC .map(TreeMapFunc) savedRDD .checkpoint() = CPU ~ .map(PruningMapFunc)
  • 44. 44 © NEC Corporation 2016 Eval 1: Prediction Error (vs. Spark MLlib algorithms) Distributed HML achieves low error competitive to a complex model data # samples Distributed HML Logistic Regression Decision Tree Random Forests gas sensor array (CO)* 4,208,261 0.542 0.597 0.587 0.576 household power consumption * 2,075,259 0.524 0.531 0.529 0.655 HIGGS* 11,000,000 0.335 0.358 0.337 0.317 HEPMASS* 7,000,000 0.156 0.163 0.167 0.175 * UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) 1st 1st 1st 2nd
  • 45. 45 © NEC Corporation 2016 Eval 2: Performance Scalability (vs. Spark MLlib algos) Distributed HML is competitive with Spark MLlib implementations. HIGGS* (11M samples) HEPMASS* (7M samples)# CPU cores Distributed HML Logistic Regression Distributed HML * UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) Logistic Regression
  • 47. 47 © NEC Corporation 2016 ATM Cash Balance Prediction Much cash  High inventory cost Little cash  High risk of out of stock Around 20,000 ATMs
  • 48. 48 © NEC Corporation 2016 Training Speed 2 hours9 days ▌Summary of data # ATMs: around 20,000 (in Japan) # training samples: around 10M ▌Cluster spec (10 nodes)  # CPU cores: 128 (Xeon E5-2640 v3)  Memory: 2.5TB  Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2) * Run w/ 1 CPU core and 256GB memory 110x Speed up Serial HML* Distributed HML
  • 49. 49 © NEC Corporation 2016 Prediction Error ▌Summary of data # ATMs: around 20,000 (in Japan) # training samples: around 23M ▌Cluster spec (10 nodes)  # CPU cores: 128  Memory: 2.5TB  Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2) Average Error: 17% lower Serial HML Distributed HMLSerial HML Serial HML vs Better: Distributed HML
  • 50. NEC’s HML Data Analytics Appliance powered by Micro Modular Server
  • 51. 51 © NEC Corporation 2016 Dilemma between Cloud and On-Premises Cloud On-Premises •High pay-for-use charge •Complicated application software design •Data leakage risk • Complicated operation for Spark/YARN/HDFS • High installation cost • High space/power cost •Quick start •Low operation effort for Spark/YARN/HDFS •No space/power cost •No installation effort • No pay-for-use charge • Low data leakage risk • Simple application software design
  • 52. 52 © NEC Corporation 2016 NEC’s HML Data Analytics Appliance: Concept Micro Modular Server Data Scientist ~ Plug-and-Analytics! •Quick deploy! •Low data leakage risk •Low operation effort for Spark/YARN/HDFS powered by HDP •Significantly space saving •Low power cost •Simple application software 328 CPU cores, 1.3TB memory and 5TB SSD in 2U space! HML
  • 53. 53 © NEC Corporation 2016 Speed Test (ATM Cash Balance Prediction) 7275 sec (2h) ▌Summary of data # ATMs: around 20,000 (in Japan) # training samples: around 10M ▌Standard Cluster spec (10 nodes)  # CPU cores: 128 (Xeon E5-2640 v3)  Memory: 2.5TB  Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2) Standard Cluster Micro Modular Server 4529 sec (1.25h) 1.6x Speed up
  • 54. 54 © NEC Corporation 2016 Toward More Big Data! Micro Modular Server (DX1000) Scalable Modular Server (DX2000) • CPU 2.3x speed up • MEM 1.7x increased • SSD 3.3x increased 272 high-speed Xeon-D 2.1GHz cores, 2TB memory and 17TB SSD in 3U space
  • 55. 55 © NEC Corporation 2016 Summary CPU 12 SEP 11 SEP 10 SEP HMLMicro Modular Server
  • 56. 56 © NEC Corporation 2016 Summary HMLMicro Modular Server
  • 57.
  • 59. 59 © NEC Corporation 2016 Micro Modular Server (DX1000) Spec.: Server Module
  • 60. 60 © NEC Corporation 2016 Micro Modular Server (DX1000) Spec.: Module Enclosure
  • 61. 61 © NEC Corporation 2016 Scalable Modular Server (DX2000) Spec.: Server Module
  • 62. 62 © NEC Corporation 2016 Scalable Modular Server (DX2000) Spec.: Module Enclosure