Introduction to Hivemall
- 1. Copyright ©2015 Treasure Data. All Rights Reserved.
Treasure Data Inc.
Research Engineer
Makoto YUI @myui
2015/05/14
TD tech talk #3 @Retty
http://myui.github.io/
20 min. Introduction to Hivemall
- 2. Who am I?
• 2015/04 Joined Treasure Data, Inc.
• 1st Research Engineer in Treasure Data
• My mission in TD is developing ML-as-a-Service (MLaaS)
• 2010/04-2015/03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan.
• Worked on a large-scale Machine Learning project and Parallel Databases
• 2009/03 Ph.D. in Computer Science from NAIST
• My research topic was building an XML native database and Parallel Database systems
• Super programmer award from the MITOU Foundation (a government-funded program for finding young and talented programmers)
• Super creators in Treasure Data: Sada Furuhashi, Keisuke Nishida
- 3. Figures of Imported Data in Treasure Data
[Chart: imported records by month, Aug-12 to Oct-14; y-axis in billion records, 0 to 12,000. Annotations: Service in; Series A Funding; Reached 100 customers; Selected as "Cool Vendor in Big Data" by Gartner; 5 trillion records; 10 trillion records]
Figures on Oct. 2014
• 400 thousand (40万) records imported every second
• 10+ trillion (10兆) records imported in total
• 12 billion (120億) records sent by an Ad-tech company
- 4. The latest numbers in Treasure Data
• 100+ customers in Japan
• 15 trillion stored records
• 4,000 nodes from which a single company sends data to us
• 500,000 records stored per second
- 5. Plan of the Talk
1. Brief introduction to Hivemall
2. How to use Hivemall
3. Real-time prediction w/ Hivemall and RDBMS
- 6. What is Hivemall
Scalable machine learning library built on top of Apache Hive, licensed under the Apache License v2
Software stack, top to bottom:
• Machine Learning: Hivemall
• Query Processing: Hive / PIG
• Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), MR v2
• Resource Management: Apache YARN
• Distributed File System: Hadoop HDFS
Check http://github.com/myui/hivemall
- 7. MapReduce and DAG engine
[Diagram: MapReduce chains map (M) and reduce (R) stages through HDFS, while a DAG engine (Tez/Spark) connects the stages directly]
No intermediate DFS reads/writes!
- 8. The key characteristic of Hivemall
Very easy to use; Machine Learning on SQL
Classification with Mahout: 100+ lines of code
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This SQL query automatically runs in parallel on Hadoop
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ APIs are very stable because of SQL abstraction
- 9. List of functions in Hivemall v0.3
• Classification (both binary- and multi-class)
✓ Perceptron
✓ Passive Aggressive (PA)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
• Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad
✓ AdaDELTA
• kNN and Recommendation
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search using K-NN
✓ Matrix Factorization
• Feature engineering
✓ Feature hashing
✓ Feature scaling (normalization, z-score)
✓ TF-IDF vectorizer
Treasure Data will support Hivemall v0.3.1 next week!
bit.ly/hivemall-mf
- 10. Hivemall on Apache Pig
• Contribution from Daniel Dai (Pig PMC) from Hortonworks
• To be supported from Pig 0.15
- 11. Plan of the Talk
1. Brief introduction to Hivemall
2. How to use Hivemall
3. Real-time prediction w/ Hivemall and RDBMS
- 12. How to use Hivemall
[Diagram: machine learning workflow. Training builds a prediction model from feature vectors and labels; prediction applies the model to a feature vector to produce a label. Highlighted step: Data preparation]
- 13. How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
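To make the delimiters in the table definition concrete, here is a minimal Python sketch that splits one text line the same way: tab between columns, comma between array items. The sample line and its "index:value" feature strings are hypothetical and not taken from the dataset.

```python
def parse_row(line: str):
    """Split one text row into (rowid, label, features).

    Columns are tab-separated and the features array is comma-separated,
    mirroring FIELDS TERMINATED BY '\\t' and COLLECTION ITEMS TERMINATED BY ",".
    """
    rowid, label, features = line.rstrip("\n").split("\t")
    return int(rowid), float(label), features.split(",")


# Hypothetical sample line; the "index:value" feature format is an assumption.
row = parse_row("1\t-3.58\t12:0.5,907:0.25\n")
```

Each element of the features list stays a string, matching the ARRAY<STRING> column type.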
- 14. How to use Hivemall
[Diagram: machine learning workflow. Training builds a prediction model from feature vectors and labels; prediction applies the model to a feature vector to produce a label. Highlighted step: Feature Engineering]
- 15. How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization
create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(label,${min_label},${max_label}) as label,
  features
from
  e2006tfidf_train;
Transforming a label value to a value between 0.0 and 1.0
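The min-max normalization applied by the view can be sketched in a few lines of Python. This is an illustration of the formula only, not Hivemall's implementation; the sample labels are hypothetical, and returning 0.5 for a zero-width range is an assumption of this sketch.

```python
def rescale(value: float, min_value: float, max_value: float) -> float:
    """Min-max normalization: map value from [min_value, max_value] to [0.0, 1.0]."""
    if max_value == min_value:
        # Degenerate range (assumption of this sketch): return the midpoint.
        return 0.5
    return (value - min_value) / (max_value - min_value)


# Hypothetical label values, used only for illustration.
labels = [-3.5, -2.0, -0.5]
lo, hi = min(labels), max(labels)
scaled = [rescale(v, lo, hi) for v in labels]
```

In the view, the minimum and maximum are passed in as the ${min_label} and ${max_label} variables, so they have to be computed over the training table beforehand.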
- 16. How to use Hivemall
[Diagram: machine learning workflow. Training builds a prediction model from feature vectors and labels; prediction applies the model to a feature vector to produce a label. Highlighted step: Training]
- 17. How to use Hivemall - Training
Training by logistic regression
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features,label,..) as (feature,weight) -- map-only task to learn a prediction model
  FROM train
) t
GROUP BY feature -- shuffle map outputs to reducers by feature
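The map and reduce steps of this query can be imitated in plain Python. The sketch below is a stand-in under stated assumptions, not Hivemall's code: train_partition is a hypothetical SGD trainer for logistic regression with binary features, and the two partitions are made-up data. Only the shape of the computation (independent per-partition training, then averaging per feature) follows the slide.

```python
import math
from collections import defaultdict


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def train_partition(rows, eta=0.1, epochs=20):
    """Map side: fit logistic regression by SGD on one data partition.

    rows is a list of (features, label) pairs; features is a list of feature
    names with an implicit value of 1.0 and label is 0 or 1.
    Returns (feature, weight) pairs, like the rows emitted by the inner query.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            predicted = sigmoid(sum(weights[f] for f in features))
            gradient = label - predicted
            for f in features:
                weights[f] += eta * gradient
    return list(weights.items())


def average_models(emitted):
    """Reduce side: GROUP BY feature and take avg(weight)."""
    grouped = defaultdict(list)
    for feature, weight in emitted:
        grouped[feature].append(weight)
    return {f: sum(ws) / len(ws) for f, ws in grouped.items()}


# Two hypothetical partitions, each handled by an independent "mapper".
partition_a = [(["sunny", "warm"], 1), (["rainy", "cold"], 0)]
partition_b = [(["sunny", "cold"], 1), (["rainy", "warm"], 0)]

emitted = train_partition(partition_a) + train_partition(partition_b)
model = average_models(emitted)
```

Because each partition is trained without talking to the others, the map side needs no coordination; the only shuffle is the GROUP BY on feature.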
- 18. How to use Hivemall - Training
Training of Confidence Weighted Classifier
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight -- vote to use negative or positive weights for avg
FROM (
  SELECT train_cw(features,label) as (feature,weight) -- training for the CW classifier
  FROM news20b_train
) t
GROUP BY feature
Example weights to vote on: +0.7, +0.3, +0.2, -0.1, +0.7
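One reading of the voting step, sketched in Python: count the positive and negative weights, then average only the side that wins the vote. This interpretation of the slide annotation is an assumption, as is the handling of ties and empty input.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption of this sketch: ties fall to the negative side.
    winners = positives if len(positives) > len(negatives) else negatives
    if not winners:
        return 0.0
    return sum(winners) / len(winners)


# The five weights shown on the slide: four positive votes, one negative.
slide_weights = [0.7, 0.3, 0.2, -0.1, 0.7]
result = voted_avg(slide_weights)
plain_avg = sum(slide_weights) / len(slide_weights)
```

With the slide's weights the single negative value is outvoted and dropped, so the voted average comes out higher than the plain average.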
- 19. Ensemble learning for stable prediction performance
Just stack prediction models by union all
create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from (
  select train_multiclass_cw(addBias(features),label) as (label,feature,weight)
  from news20mc_train_x3
  union all
  select train_multiclass_arow(addBias(features),label) as (label,feature,weight)
  from news20mc_train_x3
  union all
  select train_multiclass_scw(addBias(features),label) as (label,feature,weight)
  from news20mc_train_x3
) t
group by label, feature;
- 20. How to use Hivemall
[Diagram: machine learning workflow. Training builds a prediction model from feature vectors and labels; prediction applies the model to a feature vector to produce a label. Highlighted step: Prediction]
- 21. How to use Hivemall - Prediction
CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid
Prediction is done by LEFT OUTER JOIN between test data and prediction model
No need to load the entire model into memory
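The join-based prediction can be traced by hand with a small Python sketch. The model weights and test rows below are hypothetical; a dictionary lookup stands in for the join, and a feature missing from the model contributes nothing to the sum, as a NULL weight would.

```python
import math
from collections import defaultdict


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical model table: feature -> weight (plays the role of lr_model).
lr_model = {"f1": 1.2, "f2": -0.7, "f3": 0.4}

# Hypothetical exploded test table: one (rowid, feature) pair per row.
testing_exploded = [(1, "f1"), (1, "f3"), (2, "f2"), (2, "unseen")]


def predict(test_rows, model):
    """Sum the matching weights per rowid, then squash with sigmoid."""
    sums = defaultdict(float)
    for rowid, feature in test_rows:
        # LEFT OUTER JOIN: an unmatched feature has a NULL weight,
        # which sum() skips, so it adds nothing here.
        sums[rowid] += model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in sums.items()}


lr_predict = predict(testing_exploded, lr_model)
```

Each test row touches only the weights of its own features, which is why the full model never has to sit in memory on one node.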
- 22. Plan of the Talk
1. Brief introduction to Hivemall
2. How to use Hivemall
3. Real-time prediction w/ Hivemall and RDBMS
- 23. Type/Purpose Matrix of Machine Learning
Online Prediction, Online Learning:
• Algorithm Trade (HFT)
• Twitter real-time analysis
Online Prediction, Offline Learning:
• Ad-tech (e.g., CTR/CVR prediction)
• Real-time recommendation
Offline Prediction, Online Learning:
• no/few needs?
Offline Prediction, Offline Learning:
• Daily/weekly batch systems
• Business Analytics/Reporting
- 24. How to use Hivemall
[Diagram: machine learning workflow. Batch Training on Hadoop builds a prediction model from feature vectors and labels; the prediction model is exported; Online Prediction on RDBMS applies it to a feature vector to produce a label]
- 25. Export Prediction Model to an RDBMS
hive> desc news20b_cw_model1;
feature int
weight double
Sample rows:
103 -0.4896543622016907
104 -0.0955817922949791
105 0.12560302019119263
106 0.09214721620082855
TD export to any RDBMS
Periodic export is very easy in Treasure Data
- 26. Real-time Prediction on MySQL
#2 Preparing a Test data table
hive> desc testing_exploded;
feature string
value float
Alternatively, you can define the test target as a SQL view
#3 Online prediction on MySQL
SELECT
  sigmoid(sum(t.value * m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  prediction_model m ON (t.feature = m.feature)
SIGMOID(x) = 1.0 / (1.0 + exp(-x))
[Diagram: the prediction model maps a feature vector to a label]
Index lookups are very efficient in RDBMSs
http://bit.ly/hivemall-rtp
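The online prediction query can be tried end to end with Python's built-in sqlite3 module. SQLite is used here only as a stand-in for MySQL, sigmoid is registered as a user-defined function from the formula above, the model rows reuse the sample weights from the export slide, and the test vector is hypothetical. Both feature columns are integers in this sketch, which is a simplification.

```python
import math
import sqlite3


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


conn = sqlite3.connect(":memory:")
# SQLite has no built-in sigmoid, so register the formula as a SQL function.
conn.create_function("sigmoid", 1, sigmoid)

conn.executescript("""
CREATE TABLE prediction_model (feature INTEGER PRIMARY KEY, weight REAL);
CREATE TABLE testing_exploded (feature INTEGER, value REAL);
""")

# Exported model rows; the primary key gives an index for the join lookups.
conn.executemany(
    "INSERT INTO prediction_model VALUES (?, ?)",
    [
        (103, -0.4896543622016907),
        (104, -0.0955817922949791),
        (105, 0.12560302019119263),
        (106, 0.09214721620082855),
    ],
)

# Hypothetical feature vector of one request; feature 999 is not in the model.
conn.executemany(
    "INSERT INTO testing_exploded VALUES (?, ?)",
    [(103, 1.0), (105, 2.0), (999, 1.0)],
)

(prob,) = conn.execute("""
SELECT sigmoid(sum(t.value * m.weight)) AS prob
FROM testing_exploded t
LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
""").fetchone()
```

The unmatched feature produces a NULL product that sum() ignores, so only the two known features contribute to the probability.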
- 27. Cost of Amazon Machine Learning
• Data Analysis and Model Building Fees: $0.42/Instance per Hour
• Batch Prediction: $0.1/1000 requests
• Real-time Prediction: $0.0001 per request
Amazon-ML is suspected to be based on Vowpal Wabbit (single process)
Pay-per-request is apparently not suitable for doing prediction for each web request (e.g. online CTR prediction)
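A quick back-of-the-envelope calculation shows why per-request pricing adds up. The price is the real-time figure from the slide; the traffic level of 1,000 requests per second and the 30-day month are hypothetical assumptions chosen only for illustration.

```python
from decimal import Decimal

# Real-time prediction price from the slide, in USD per request.
REALTIME_PRICE_PER_REQUEST = Decimal("0.0001")


def realtime_cost(requests: int) -> Decimal:
    """Cost in USD of serving the given number of real-time predictions."""
    return REALTIME_PRICE_PER_REQUEST * requests


# Hypothetical traffic: 1,000 prediction requests per second.
requests_per_day = 1_000 * 60 * 60 * 24
daily_cost = realtime_cost(requests_per_day)
monthly_cost = daily_cost * 30  # assuming a 30-day month
```

Under these assumptions the bill grows linearly with web traffic, which is the concern the slide raises for per-request CTR prediction.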
- 28. Real-time Prediction on Treasure Data
[Diagram: Run batch training job periodically -> Periodic export -> Real-time prediction on an RDBMS]
- 29. Beyond Query-as-a-Service!
We ❤️ Open-source! We Invented ..
We are Hiring!