A Database-Hadoop Hybrid Approach to Scalable Machine Learning

My presentation slide at IEEE 2nd International Congress on Big Data on June 30, 2013.
http://www.ieeebigdata.org/2013/


1. A Database-Hadoop Hybrid Approach to Scalable Machine Learning
   Makoto YUI, Isao Kojima (AIST, Japan) <m.yui@aist.go.jp>
   June 30, 2013, IEEE BigData Congress 2013, Santa Clara
2. Outline
   1. Motivation & Problem Description
   2. Our Hybrid Approach to Scalable Machine Learning
      - Architecture
      - Our batch learning scheme on Hive
   3. Experimental Evaluation
   4. Conclusions and Future Directions
3. As we saw in the keynote and the panel discussion (2nd day) of this conference, data analytics and machine learning are obviously getting more attention along with Big Data.
   Suppose then that you are a developer and your manager is willing to ..
   (Cartoon: manager and developer)
4. What are the possible choices out there?
   In-database analytics:
   - MADlib (open-source project led by Greenplum)
   - Bismarck (project at wisc.edu, SIGMOD'12)
   - SAS In-database Analytics
   - Fuzzy Logix (Sybase), and more
   Machine learning on Hadoop:
   - Apache Mahout
   - Vowpal Wabbit (open-source project at MS Research)
   - In-house analytical tools (e.g., Twitter, SIGMOD'12)
   These are the two popular schools of thought for performing large-scale machine learning on data that does not fit in memory.
5. Four issues that need to be considered
   1. Scalability
   Scalability is always a problem when handling Big Data.
6. Four issues that need to be considered
   1. Scalability
   2. Data movement
   Data movement matters because moving data is a critical issue as dataset sizes shift from terabytes to petabytes and beyond.
7. Four issues that need to be considered
   1. Scalability
   2. Data movement
   3. Transactions
   Transactions matter for real-time/online prediction because most transaction records, which are valuable for predictions, are stored in relational databases.
8. Four issues that need to be considered
   1. Scalability
   2. Data movement
   3. Transactions
   4. Latency and throughput
   Latency and throughput are the key issues for achieving online prediction and/or real-time analytics.
9. Which is better? (in-database analytics vs. machine learning on Hadoop, compared on scalability, data movement, transactions, latency, and throughput)
   On scalability, machine learning on Hadoop has the advantage:
   + Fault tolerance
   + Straggler node handling
   + Scale-out
10. Which is better?
    On data movement, it depends on where the data is initially stored and on the purpose of using the data:
    + HDFS is useful for append-only and archiving purposes, and for ETL processing (feature engineering)
    + An RDBMS is reliable as a transactional data store
11. Which is better?
    On transactions, in-database analytics has the advantage:
    + Small-fraction updates
    + Index lookup for online prediction (see the sketch below)
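    To make the index-lookup point concrete, here is a minimal SQL sketch; the prediction_model and instance_features tables, their columns, and the binary-feature assumption are ours for illustration, not the authors' schema:

      -- Index the <feature, weight> model relation so each lookup is cheap
      CREATE INDEX model_feature_idx ON prediction_model (feature);

      -- Logistic-regression score for one record (assumes binary features,
      -- so the dot product reduces to a sum of matched weights)
      SELECT 1.0 / (1.0 + exp(-sum(m.weight))) AS predicted_ctr
      FROM   instance_features f               -- features of the record to score
      JOIN   prediction_model  m ON m.feature = f.feature
      WHERE  f.record_id = 42;                 -- the single record being scored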
12. Which is better?
    On latency and throughput:
    + In-database analytics: incremental learning for each training instance
    + Machine learning on Hadoop: batch processing gives high throughput
    - Machine learning on Hadoop: a high-latency bottleneck in the job submission process
13. Idea behind the DB-Hadoop hybrid approach
    Combine batch learning on Hadoop with incremental learning and prediction in a relational database, so that scalability, data movement, transactions, latency, and throughput are all covered.
    Just an illustration, you know. Next, we will see what happens inside the box.
14. Inside the box (an overview) – how to combine them
    Postgres handles OLTP transactions and holds the training data and prediction model; incremental learning is implemented as a database stored procedure.
    Training data is trickled to Hadoop HDFS little by little, and the Hadoop cluster (node, node, node, ...) brings back prediction models periodically via batch learning.
15. The Detailed Architecture ― Data-to-Prediction Cycle
    Trickle updates from the source database land in a staging table on the relational database side; the updates in the queue are trickled periodically, and the Hadoop cluster pulls them into the training data sink. An incremental learner runs inside the database.
16. The Detailed Architecture ― Data-to-Prediction Cycle
    The batch learning process on the Hadoop cluster builds a model from the training data sink, and the resulting prediction model is exported back to the relational database, keeping the incremental learner's model up to date (a sketch of this export follows).
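    Slide 25 reports that Sqoop performs this model migration. A hedged sketch of such an export, where the JDBC URL, credentials, table name, and HDFS path are all placeholder assumptions:

      # Push the model computed on HDFS into the serving database.
      # All connection details below are illustrative placeholders.
      sqoop export \
        --connect jdbc:postgresql://dbhost/serving_db \
        --username analyst \
        --table prediction_model \
        --export-dir /user/hive/warehouse/prediction_model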
17. The Detailed Architecture ― Data-to-Prediction Cycle
    Exported models are versioned on the database side: insert a new one as it arrives, and select the latest one for serving (one possible realization is sketched below).
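    A minimal way to realize "insert a new one, select the latest one"; the schema is an assumption for illustration, not taken from the paper:

      -- Assumed versioned model store: each export loads rows under a new version
      CREATE TABLE prediction_model (
        version  int,
        feature  int,
        weight   float8
      );

      -- Readers always score against the newest complete version
      SELECT feature, weight
      FROM   prediction_model
      WHERE  version = (SELECT max(version) FROM prediction_model);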
18. The Detailed Architecture ― Data-to-Prediction Cycle
    Transactional updates keep flowing in while online prediction is served; users can control the flow according to their requirements and performance targets.
    - Real-time prediction is possible using database triggers on the staging table (see the trigger sketch below).
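    The trigger-based path could look like this minimal PostgreSQL sketch; every name and the scoring logic are illustrative assumptions:

      -- Score each newly staged record as it arrives (assumed schema,
      -- logistic score over binary features as in the earlier sketch)
      CREATE FUNCTION predict_on_insert() RETURNS trigger AS $$
      BEGIN
        INSERT INTO predictions (record_id, score)
        SELECT NEW.id,
               1.0 / (1.0 + exp(-sum(m.weight)))
        FROM   instance_features f
        JOIN   prediction_model  m ON m.feature = f.feature
        WHERE  f.record_id = NEW.id;
        RETURN NEW;
      END;
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER staging_predict
        AFTER INSERT ON staging_table
        FOR EACH ROW EXECUTE PROCEDURE predict_on_insert();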
19. The Detailed Architecture ― Data-to-Prediction Cycle
    The whole workflow consists of continuous and independent processes.
20. Existing Approach for Parallel Batch Learning ― Machine Learning as User-Defined Aggregates (UDAF)
    Each worker trains on its partition of the training table (tuples of <label, array<features>>); partial models (array<weight>, array<sum of weight>, array<count>) are merged, then merged again in a final merge that yields the prediction model (array<weight>).
    Problems observed (see the sketch below):
    - Bottleneck in the final merge: scalability is limited by the maximum fan-out of the final merge
    - Scalar aggregates computing a large single result are not suitable for shared-nothing (S/N) settings
    - Parallel aggregation over an aggregate tree to merge prediction models (as in Google Dremel) is not supported in Hadoop/MapReduce
    Even though MPP databases and Hive parallelize user-defined aggregates, the above problems prevent using them here.
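    For contrast with the next slide, the UDAF formulation reads roughly like this sketch, where trainLogisticUDAF is a hypothetical name: the entire model is the result of one scalar aggregate, so all partial states funnel into a single final merge.

      -- Hypothetical UDAF-style trainer: one scalar aggregate emits the
      -- whole model at once, serializing the final merge
      SELECT trainLogisticUDAF(features, label) AS model   -- array<weight>
      FROM   train;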
21. Purely Relational Approach for Parallel Learning
    - Implemented the trainer as a set-returning function (UDTF) instead of a UDAF: a purely relational formulation that scales on MPP databases and on Hive/Hadoop
    - Key points: run trainers independently on the mappers, shuffle the emitted tuples by feature to the reducers, and aggregate the results there (parameter mixing; K. B. Hall et al., Proc. NIPS Workshop on Learning on Cores, Clusters, and Clouds, 2010)
    - Embarrassingly parallel, since the numbers of mappers and reducers are controllable
    Our solution for parallel machine learning on Hadoop/Hive: the trainer consumes tuples of <label, array<features>> and emits a relation of <feature, weight>, which is averaged per feature.

      SELECT feature,                 -- reducers perform model averaging in parallel
             avg(weight) AS weight
      FROM (
        SELECT trainLogistic(features, label, ..) AS (feature, weight)
        FROM train                    -- map-only task
      ) t
      GROUP BY feature;               -- shuffled to reducers
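    For reference, the train relation consumed above could be declared as follows in Hive; this layout is inferred from the slide's tuple notation and may differ from the paper's exact types:

      -- Assumed Hive layout: one row per training instance
      CREATE TABLE train (
        label    INT,         -- +1 / -1
        features ARRAY<INT>   -- feature ids of the instance
      );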
22. Experimental Evaluation
    1. Compared the performance of our batch learning scheme with state-of-the-art machine learning techniques, namely Bismarck and Vowpal Wabbit
    2. Ran an online prediction scenario to measure the latency and throughput of our incremental learning scheme: given a prediction model created from 80% of the training data by batch learning, the remaining 20% is supplied for incremental learning
    Dataset: the KDD Cup 2012, Track 2 dataset, one of the largest publicly available datasets for machine learning, provided by a commercial search engine provider
    - The task is predicting click-through rates of search engine ads
    - The training data is about 235 million records in 23 GB
    Experimental environment: 33 in-house commodity servers (32 slave nodes for Hadoop), each equipped with 8 processors and 24 GB of memory
23. Performance Evaluation of Batch Learning
    Our batch learning scheme on Hive is 5 and 7.65 times faster than Vowpal Wabbit and Bismarck, respectively. The AUC values (green bars in the chart) represent prediction accuracy.
    - Throughput: 2.3 million tuples/sec on 32 nodes
    - Latency: 96 sec for training 235 million records (23 GB)
    Caution: the detailed numbers and settings are in our paper.
24. Performance Analysis in the Evaluation
    - Low latency (5 s) in the incremental learner under moderate updates (70,000 tuples/sec)
    - 96 s for batch training
    - Excellent throughput (2.3 million tuples/sec) on 32 nodes
25. Performance Analysis in the Evaluation
    Non-trivial costs in model migration:
    - Sqoop required 3 min 32 s (212 s) to migrate the 80% prediction model, which contains about 1.56 million records (323 MB)
    - Converting the model to a dense format, which is suited to online learning/prediction on Postgres, required 58 seconds
26. Performance Analysis in the Evaluation
    Key observations:
    - "Data migration time > training time" justifies the rationale behind in-database analytics
    - The cost of moving data is critical for online prediction, just as it is in Big Data analysis
    - Model migration costs can be amortized with our approach
27. Conclusions
    - A DB-Hadoop hybrid architecture for online prediction in which the prediction model needs to be updated with low latency
    - A design principle for achieving scalable machine learning on Hadoop/Hive
    - Excellent throughput and scalability: our batch learning scheme on Hive is 5 and 7.65 times faster than Vowpal Wabbit and Bismarck, respectively
    - Acceptably small latency: possibly less than 5 s under moderate transactional updates
    Going hybrid brings low latency to Big Data analytics.
28. Directions for Future Work
    - Integrate online testing schemes (e.g., multi-armed bandits and A/B testing) into the prediction pipeline
    - Develop a scheme that selects the best prediction model among past models for each user in each session during online prediction
29. Backup slides
30. Directions for Future Work
    Take into consideration the common OLTP setting in which the database is partitioned across servers (a.k.a. database sharding).
31. Evaluation of Incremental Learning
    Given a prediction model created with 80% of the training data by batch learning, the remaining 20% is supplied for incremental learning.

      Built model with         Elapsed time (s)   Throughput (tuples/s)   AUC
      Batch only (80%)                    96.33            2,067,418.3    0.7177
      +0.1% updates (80.1%)                4.99               36,155.4    0.7197
      +1% updates (81%)                   25.96               69,812.8    0.7242
      +10% updates (90%)                 256.03               71,278.1    0.7291
      +20% updates (100%)                499.61               72,901.4    0.7349
      Batch only (100%)                  102.52            2,298,010.8    0.7356
32. Special Thanks
    Font: Lato by Łukasz Dziedzic
    Symbols by the Noun Project: Data Analysis by Brennan Novak, Elephant by Ted Mitchner, Scale by Laurent Patain, Heavy Load by Olivier Guin, Receipt by Benjamin Orlovski, Gauge by Márcio Duarte, Stopwatch by Ilsur Aptukov, Box by Travis J. Lee, Sprint Cycle by Jeremy J Bristol
    Dilbert characters by Scott Adams Inc. (10-12-10 and 7-29-12)