Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
20 min. Introduction to Hivemall
Makoto YUI @myui
Research Engineer, Treasure Data Inc.
2015/05/14, TD tech talk #3 @Retty
http://myui.github.io/

Who am I?
• 2015/04 Joined Treasure Data, Inc.
• 1st Research Engineer in Treasure Data
• My mission in TD is developing ML-as-a-Service (MLaaS)
• 2010/04-2015/03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan
• Worked on a large-scale Machine Learning project and Parallel Databases
• 2009/03 Ph.D. in Computer Science from NAIST
  • My research topic was building an XML native database and Parallel Database systems
• Super programmer award from the MITOU Foundation (a government-funded program for finding young and talented programmers)
  • Super creators in Treasure Data: Sada Furuhashi, Keisuke Nishida

Figures of Imported Data in Treasure Data
[Chart: imported records over time, Aug-12 to Oct-14. Y-axis: Billion records (Unit), 0 to 12,000. Annotations: Service in; Series A Funding; Reached 100 customers; Selected as "Cool Vendor in Big Data" by Gartner; 5 trillion records; 10 trillion records.]
Figures on Oct. 2014
• 400,000 records imported every second
• 10+ trillion records imported in total
• 12 billion records sent by an ad-tech company

The latest numbers in Treasure Data
• 100+ customers in Japan
• 15 trillion stored records
• 4,000 nodes from which a single company sends data to us
• 500,000 records stored per second

Plan of the Talk
1. Brief introduction to Hivemall
2. How to use Hivemall
3. Real-time prediction w/ Hivemall and RDBMS

What is Hivemall
A scalable machine learning library built on top of Apache Hive, licensed under the Apache License v2
Software stack, from top to bottom:
• Machine Learning: Hivemall
• Query Processing: Hive / Pig
• Parallel Data Processing Framework: MapReduce (MRv1), MR v2, Apache Tez (DAG processing)
• Resource Management: Apache YARN
• Distributed File System: Hadoop HDFS
Check http://github.com/myui/hivemall

MapReduce and DAG engine
[Figure: side-by-side dataflows of map (M) and reduce (R) tasks. Left panel: MapReduce, with HDFS between chained jobs. Right panel: DAG engine (Tez/Spark), with HDFS only at the ends.]
DAG engine (Tez/Spark): No intermediate DFS reads/writes!

The key characteristic of Hivemall
Very easy to use; Machine Learning on SQL
Classification with Mahout: 100+ lines of code
This SQL query automatically runs in parallel on Hadoop:
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight  -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t  -- map-only task
GROUP BY feature;  -- shuffled to reducers
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ APIs are very stable because of SQL abstraction

List of functions in Hivemall v0.3
• Classification (both binary- and multi-class)
  ✓ Perceptron
  ✓ Passive Aggressive (PA)
  ✓ Confidence Weighted (CW)
  ✓ Adaptive Regularization of Weight Vectors (AROW)
  ✓ Soft Confidence Weighted (SCW)
  ✓ AdaGrad+RDA
• Regression
  ✓ Logistic Regression (SGD)
  ✓ PA Regression
  ✓ AROW Regression
  ✓ AdaGrad
  ✓ AdaDELTA
• kNN and Recommendation
  ✓ Minhash and b-Bit Minhash (LSH variant)
  ✓ Similarity Search using K-NN
  ✓ Matrix Factorization
• Feature engineering
  ✓ Feature hashing
  ✓ Feature scaling (normalization, z-score)
  ✓ TF-IDF vectorizer
Treasure Data will support Hivemall v0.3.1 in the next week!
bit.ly/hivemall-mf

Hivemall on Apache Pig
• Contribution from Daniel Dai (Pig PMC) from Hortonworks
• To be supported from Pig 0.15

Plan of the Talk
1. Brief introduction to Hivemall
2. How to use Hivemall
3. Real-time prediction w/ Hivemall and RDBMS

How to use Hivemall: Data preparation
[Figure: workflow. Training: Label + Feature Vector → Machine Learning → Prediction Model. Prediction: Feature Vector + Prediction Model → Label. Highlighted step: Data preparation.]

How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
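To make the storage format concrete, here is a minimal Python sketch of how one line of such a text file splits into the three columns: tabs separate the fields and commas separate the items of the features array, as declared in the table definition. The sample line and its "index:value" feature notation are illustrative assumptions, not data taken from the E2006-tfidf dataset.

```python
from typing import List, NamedTuple


class TrainRow(NamedTuple):
    rowid: int
    label: float
    features: List[str]


def parse_row(line: str) -> TrainRow:
    """Split one text line using the delimiters from the table definition."""
    rowid, label, features = line.rstrip("\n").split("\t")
    return TrainRow(int(rowid), float(label), features.split(","))


# Hypothetical sample line (assumption); real rows live under the table's LOCATION.
sample = "1\t-3.5\t101:0.25,205:0.5,4021:0.125"
row = parse_row(sample)
print(row)
```

Hive does this splitting itself when it reads the external table; the sketch only shows what the two delimiters mean.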

How to use Hivemall: Feature Engineering
[Figure: workflow. Training: Label + Feature Vector → Machine Learning → Prediction Model. Prediction: Feature Vector + Prediction Model → Label. Highlighted step: Feature Engineering.]

How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization
Transforming a label value to a value between 0.0 and 1.0
create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
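As a rough illustration of what min-max normalization computes, the following Python sketch maps a value from the range [min, max] onto [0.0, 1.0]. The function body is the textbook formula, not code taken from Hivemall; the label range, the sample labels, and the fallback for a zero-width range are assumptions made for the example.

```python
def rescale(value: float, min_value: float, max_value: float) -> float:
    """Min-max normalization: map [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption: the fallback for a zero-width range is our own choice.
        return 0.5
    return (value - min_value) / (max_value - min_value)


# Hypothetical label range standing in for ${min_label} and ${max_label}.
min_label, max_label = -8.0, 2.0
labels = [-8.0, -3.0, 2.0]
scaled = [rescale(v, min_label, max_label) for v in labels]
print(scaled)
```

In the view, the two bounds are passed in as Hive variables, so they have to be computed from the training data beforehand.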

How to use Hivemall: Training
[Figure: workflow. Training: Label + Feature Vector → Machine Learning → Prediction Model. Prediction: Feature Vector + Prediction Model → Label. Highlighted step: Training.]

How to use Hivemall - Training
Training by logistic regression
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight  -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features,label,..) as (feature,weight)  -- map-only task to learn a prediction model
  FROM train
) t
GROUP BY feature  -- shuffle map-outputs to reducers by feature
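The averaging step of this query can be sketched in plain Python. Each inner list below stands for the (feature, weight) pairs emitted by one map task, and the function plays the role of GROUP BY feature with avg(weight). The feature names and weights are made up for illustration, and the sketch ignores everything Hadoop does around the computation, such as shuffling and parallelism.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def average_models(
    mapper_outputs: Iterable[List[Tuple[str, float]]]
) -> Dict[str, float]:
    """Average the weights that independent learners emitted per feature."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for pairs in mapper_outputs:
        for feature, weight in pairs:
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Hypothetical outputs of two map tasks (assumption).
mapper_outputs = [
    [("f1", 0.5), ("f2", -1.0)],
    [("f1", 1.5), ("f3", 0.25)],
]
model = average_models(mapper_outputs)
print(model)
```

A feature seen by only one map task keeps that task's weight, since the average is taken over the learners that actually emitted it.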

How to use Hivemall - Training
Training of Confidence Weighted Classifier
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight  -- vote to use negative or positive weights for avg
FROM
  (SELECT
     train_cw(features,label) as (feature,weight)  -- training for the CW classifier
   FROM
     news20b_train
  ) t
GROUP BY feature
Vote to use negative or positive weights for avg, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
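The voting idea can be sketched as follows: count the positive and the negative weights for a feature, then average only the side that wins the vote. This is an illustration of the description on the slide, not Hivemall's implementation. How ties and zero weights are handled is an assumption made for this sketch.

```python
from typing import Iterable


def voted_avg(weights: Iterable[float]) -> float:
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0.0]
    negatives = [w for w in weights if w < 0.0]  # zeros are ignored (assumption)
    if not positives and not negatives:
        return 0.0
    # Assumption: a tie falls to the negative side.
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)


# The example weights from the slide: four positive votes against one negative.
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
print(result)
```

With these weights the positive side wins, so the outlier -0.1 is dropped and the result is the mean of the four positive weights, roughly 0.475.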

Ensemble learning for stable prediction performance
Just stack prediction models by union all
create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from
  (select
     train_multiclass_cw(addBias(features),label) as (label,feature,weight)
   from
     news20mc_train_x3
   union all
   select
     train_multiclass_arow(addBias(features),label) as (label,feature,weight)
   from
     news20mc_train_x3
   union all
   select
     train_multiclass_scw(addBias(features),label) as (label,feature,weight)
   from
     news20mc_train_x3
  ) t
group by label, feature;

How to use Hivemall: Prediction
[Figure: workflow. Training: Label + Feature Vector → Machine Learning → Prediction Model. Prediction: Feature Vector + Prediction Model → Label. Highlighted step: Prediction.]

How to use Hivemall - Prediction
Prediction is done by LEFT OUTER JOIN between test data and prediction model
No need to load the entire model into memory
CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid
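What the join computes can be sketched in a few lines of Python: for every test row, look up the weight of each of its features, sum the weights, and pass the sum through the sigmoid. The model weights and test rows below are made up. Note that this sketch holds the model in a dict for brevity, whereas the point of the SQL version is that the join avoids loading the model into memory. It also treats a feature missing from the model as weight 0.0, where SQL would yield NULL and skip it in the sum.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict(
    rows: Iterable[Tuple[int, str]], model: Dict[str, float]
) -> Dict[int, float]:
    """Mimic the GROUP BY rowid over the joined (rowid, feature) pairs."""
    sums: Dict[int, float] = defaultdict(float)
    for rowid, feature in rows:
        sums[rowid] += model.get(feature, 0.0)  # unmatched feature adds nothing
    return {rowid: sigmoid(total) for rowid, total in sums.items()}


# Hypothetical model and exploded test rows (assumptions).
model = {"f1": 0.5, "f2": -0.5}
testing_exploded = [(1, "f1"), (1, "f9"), (2, "f1"), (2, "f2")]
probs = predict(testing_exploded, model)
print(probs)
```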

Plan of the Talk
1. Brief introduction to Hivemall
2. How to use Hivemall
3. Real-time prediction w/ Hivemall and RDBMS

Type/Purpose Matrix of Machine Learning
• Online Prediction with Online Learning: Algorithm Trade (HFT); Twitter real-time analysis
• Online Prediction with Offline Learning: Ad-tech (e.g., CTR/CVR prediction); Real-time recommendation
• Offline Prediction with Online Learning: no/few needs?
• Offline Prediction with Offline Learning: Daily/weekly batch systems; Business Analytics/Reporting

How to use Hivemall
[Figure: workflow. Batch Training on Hadoop: Label + Feature Vector → Machine Learning → Prediction Model. Export prediction model. Online Prediction on RDBMS: Feature Vector + Prediction Model → Label.]

Export Prediction Model to a RDBMS
hive> desc news20b_cw_model1;
feature int
weight double
Sample rows (feature, weight):
103 -0.4896543622016907
104 -0.0955817922949791
105 0.12560302019119263
106 0.09214721620082855
TD export → Any RDBMS
Periodical export is very easy in Treasure Data

Real-time Prediction on MySQL
#2 Preparing a Test data table
hive> desc testing_exploded;
feature string
value float
Alternatively, you can define a SQL view for the testing target
#3 Online prediction on MySQL
SELECT
  sigmoid(sum(t.value * m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  prediction_model m ON (t.feature = m.feature)
SIGMOID(x) = 1.0 / (1.0 + exp(-x))
Index lookups are very efficient in RDBMSs
http://bit.ly/hivemall-rtp
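The online prediction query can be tried end to end with an in-memory SQLite database from Python. SQLite is used here only as a stand-in for MySQL so that the sketch is self-contained. The sigmoid function is registered from Python because it is not a built-in there, the model weights are the sample rows shown on the export slide, and the test features and values are made up for illustration.

```python
import math
import sqlite3


def sigmoid(x):
    """SIGMOID(x) = 1.0 / (1.0 + exp(-x)); NULL in, NULL out."""
    if x is None:
        return None
    return 1.0 / (1.0 + math.exp(-x))


conn = sqlite3.connect(":memory:")
conn.create_function("sigmoid", 1, sigmoid)

# The primary key gives the index that makes the per-feature lookups cheap.
conn.executescript("""
CREATE TABLE prediction_model (feature TEXT PRIMARY KEY, weight REAL);
CREATE TABLE testing_exploded (feature TEXT, value REAL);
""")
conn.executemany(
    "INSERT INTO prediction_model VALUES (?, ?)",
    [("103", -0.4896543622016907), ("104", -0.0955817922949791),
     ("105", 0.12560302019119263), ("106", 0.09214721620082855)],
)
# Hypothetical test row (assumption); feature 999 is absent from the model.
conn.executemany(
    "INSERT INTO testing_exploded VALUES (?, ?)",
    [("103", 1.0), ("105", 2.0), ("999", 1.0)],
)

(prob,) = conn.execute("""
SELECT sigmoid(sum(t.value * m.weight)) AS prob
FROM testing_exploded t LEFT OUTER JOIN
     prediction_model m ON (t.feature = m.feature)
""").fetchone()
print(prob)
conn.close()
```

The unmatched feature produces a NULL product, which sum() skips, so it does not affect the prediction.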

Cost of Amazon Machine Learning
• Data Analysis and Model Building Fees: $0.42/Instance per Hour
• Batch Prediction: $0.1/1000 requests
• Real-time Prediction: $0.0001 per request
Pay-per-request is apparently not suitable for doing prediction for each web request (e.g. online CTR prediction)
Amazon-ML is suspected to be based on Vowpal Wabbit (single process)
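A quick back-of-the-envelope calculation shows why per-request pricing adds up when every web request triggers a prediction. The price is the real-time figure quoted above; the traffic levels are hypothetical and chosen only to show the scale.

```python
REALTIME_PRICE_PER_REQUEST = 0.0001  # USD, the real-time price quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60


def daily_realtime_cost(requests_per_second: float) -> float:
    """Cost in USD of one day of real-time predictions at a steady rate."""
    return REALTIME_PRICE_PER_REQUEST * requests_per_second * SECONDS_PER_DAY


# Hypothetical traffic levels (assumption).
for rps in (100, 1_000, 10_000):
    print(f"{rps:>6} req/s -> ${daily_realtime_cost(rps):,.2f} per day")
```

At a steady 1,000 requests per second this comes to about $8,640 per day.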

Real-time Prediction on Treasure Data
Run batch training job periodically → Periodical export → Real-time prediction on a RDBMS

Beyond Query-as-a-Service!
We ❤️ Open-source! We invented ..
We are Hiring!

VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICSUNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
 

Hivemall Talk at TD tech talk #3

  • 1. 20 min. Introduction to Hivemall. Makoto YUI (@myui), Research Engineer, Treasure Data Inc. TD tech talk #3 @Retty, 2015/05/14. http://myui.github.io/ Copyright ©2015 Treasure Data. All Rights Reserved.
  • 2. Who am I?
      - 2015/04: Joined Treasure Data, Inc. as its 1st Research Engineer. My mission in TD is developing ML-as-a-Service (MLaaS).
      - 2010/04 to 2015/03: Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan. Worked on a large-scale machine learning project and parallel databases.
      - 2009/03: Ph.D. in Computer Science from NAIST. My research topic was building an XML native database and parallel database systems.
      - Super programmer award from the MITOU Foundation (a government-funded program for finding young and talented programmers). Super creators in Treasure Data: Sada Furuhashi, Keisuke Nishida.
  • 3. Figures of Imported Data in Treasure Data
      [Chart: imported records in billions (axis 0 to 12,000), monthly from Aug 2012 to Oct 2014. Annotations: Service in; Series A Funding; Reached 100 customers; Selected as "Cool Vendor in Big Data" by Gartner; 5 trillion records; 10 trillion records.]
      Figures as of Oct. 2014:
      - 400,000 (40万) records imported every second
      - 10+ trillion (10兆) records imported in total
      - 12 billion (120億) records sent by an ad-tech company
  • 4. The latest numbers in Treasure Data
      - 100+ customers in Japan
      - 15 trillion stored records
      - 4,000 nodes: a single company sends data to us from 4,000 nodes
      - 500,000 records stored per second
  • 5. Plan of the Talk: (1) Brief introduction to Hivemall; (2) How to use Hivemall; (3) Real-time prediction w/ Hivemall and RDBMS
  • 6. What is Hivemall
      A scalable machine learning library built on top of Apache Hive, licensed under the Apache License v2. Check http://github.com/myui/hivemall
      [Stack diagram, top layer first:]
      - Machine Learning: Hivemall
      - Query Processing: Hive / Pig
      - Parallel Data Processing Framework: MapReduce (MRv1); Apache Tez (DAG processing); MR v2
      - Resource Management: Apache YARN
      - Distributed File System: Hadoop HDFS
  • 7. MapReduce and DAG engine
      [Diagram: map (M) and reduce (R) stages with HDFS reads/writes, shown for MapReduce and for a DAG engine (Tez/Spark).]
      DAG engine (Tez/Spark): no intermediate DFS reads/writes!
  • 8. The key characteristic of Hivemall: very easy to use; machine learning on SQL
      Classification with Mahout: 100+ lines of code. In Hivemall:
      CREATE TABLE lr_model AS
      SELECT
        feature, -- reducers perform model averaging in parallel
        avg(weight) as weight
      FROM (
        SELECT logress(features,label,..) as (feature,weight)
        FROM train
      ) t -- map-only task
      GROUP BY feature; -- shuffled to reducers
      This SQL query automatically runs in parallel on Hadoop.
      - Machine learning made easy for SQL developers (ML for the rest of us)
      - APIs are very stable because of the SQL abstraction
  • 9. List of functions in Hivemall v0.3
      - Classification (both binary- and multi-class): Perceptron; Passive Aggressive (PA); Confidence Weighted (CW); Adaptive Regularization of Weight Vectors (AROW); Soft Confidence Weighted (SCW); AdaGrad+RDA
      - Regression: Logistic Regression (SGD); PA Regression; AROW Regression; AdaGrad; AdaDELTA
      - kNN and Recommendation: Minhash and b-Bit Minhash (LSH variant); Similarity Search using K-NN; Matrix Factorization
      - Feature engineering: Feature hashing; Feature scaling (normalization, z-score); TF-IDF vectorizer
      Treasure Data will support Hivemall v0.3.1 next week! bit.ly/hivemall-mf
  • 10. Hivemall on Apache Pig
      - Contribution from Daniel Dai (Pig PMC) of Hortonworks
      - To be supported from Pig 0.15
  • 11. Plan of the Talk: (1) Brief introduction to Hivemall; (2) How to use Hivemall; (3) Real-time prediction w/ Hivemall and RDBMS
  • 12. How to use Hivemall: Data preparation
      [Workflow diagram: Training turns feature vectors and labels into a prediction model via machine learning; Prediction applies the prediction model to a feature vector to produce a label. Highlighted step: Data preparation.]
  • 13. How to use Hivemall - Data preparation
      Define a Hive table for training/testing data:
      CREATE EXTERNAL TABLE e2006tfidf_train (
        rowid int,
        label float,
        features ARRAY<STRING>
      )
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\t'
        COLLECTION ITEMS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/dataset/E2006-tfidf/train';
  • 14. How to use Hivemall: Feature Engineering
      [Same workflow diagram as slide 12. Highlighted step: Feature Engineering.]
  • 15. How to use Hivemall - Feature Engineering
      Applying a Min-Max Feature Normalization: transforming a label value to a value between 0.0 and 1.0.
      create view e2006tfidf_train_scaled as
      select
        rowid,
        rescale(target,${min_label},${max_label}) as label,
        features
      from e2006tfidf_train;
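The min-max normalization on this slide can be illustrated outside Hive. The following is a minimal Python sketch of the standard min-max formula, not Hivemall's actual `rescale` implementation; the handling of a zero-width range is an assumption made only for this example.

```python
def rescale(value: float, min_value: float, max_value: float) -> float:
    """Min-max normalization: map value from [min_value, max_value] to [0.0, 1.0]."""
    if max_value == min_value:
        # Zero-width range: returning 0.5 is an assumption of this sketch,
        # not documented Hivemall behavior.
        return 0.5
    return (value - min_value) / (max_value - min_value)


# Hypothetical label values standing in for the label column.
labels = [-3.0, -1.0, 1.0]
scaled = [rescale(v, min(labels), max(labels)) for v in labels]
```

Here `min(labels)` and `max(labels)` play the role of the `${min_label}` and `${max_label}` variables substituted into the view definition.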
  • 16. How to use Hivemall: Training
      [Same workflow diagram as slide 12. Highlighted step: Training.]
  • 17. How to use Hivemall - Training
      Training by logistic regression:
      CREATE TABLE lr_model AS
      SELECT
        feature,
        avg(weight) as weight
      FROM (
        SELECT logress(features,label,..) as (feature,weight)
        FROM train
      ) t
      GROUP BY feature
      - The inner SELECT is a map-only task that learns a prediction model.
      - GROUP BY shuffles the map outputs to reducers by feature.
      - Reducers perform model averaging in parallel.
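The model-averaging step performed by `avg(weight) ... GROUP BY feature` can be sketched in plain Python. This only illustrates the averaging; it does not reproduce `logress` training, and the mapper outputs below are made-up values.

```python
from collections import defaultdict


def average_weights(mapper_outputs):
    """Average the weight of each feature across all mappers that emitted it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rows in mapper_outputs:          # one list of (feature, weight) per mapper
        for feature, weight in rows:     # "shuffle" by feature
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Hypothetical (feature, weight) pairs emitted by two map tasks.
mapper_outputs = [
    [("a", 0.5), ("b", -1.0)],
    [("a", 1.5)],
]
model = average_weights(mapper_outputs)
```

Each mapper learns on its own split of the training data, so a feature seen by several mappers ends up with the mean of their independently learned weights.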
  • 18. How to use Hivemall - Training
      Training of the Confidence Weighted (CW) classifier:
      CREATE TABLE news20b_cw_model1 AS
      SELECT
        feature,
        voted_avg(weight) as weight
      FROM (
        SELECT train_cw(features,label) as (feature,weight)
        FROM news20b_train
      ) t
      GROUP BY feature
      voted_avg votes on whether to use the negative or the positive weights for the average, e.g., +0.7, +0.3, +0.2, -0.1, +0.7.
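One way to read the voting rule on this slide is sketched below in Python. It assumes that the majority sign wins and that only weights of the winning sign are averaged; the tie-breaking rule and the treatment of zero weights are assumptions of this sketch, not taken from the slides.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins a majority vote."""
    if not weights:
        raise ValueError("voted_avg needs at least one weight")
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w <= 0]   # zero counted as negative (assumption)
    # Ties go to the negative side in this sketch (assumption).
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)


# The example weights from the slide: four positive votes, one negative.
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```

For the slide's example the positive weights win the vote, so the single negative weight is dropped and the result is the mean of the four positive values, roughly 0.475.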
  • 19. Ensemble learning for stable prediction performance
      Just stack prediction models by union all:
      create table news20mc_ensemble_model1 as
      select
        label,
        cast(feature as int) as feature,
        cast(voted_avg(weight) as float) as weight
      from (
        select train_multiclass_cw(addBias(features),label) as (label,feature,weight)
        from news20mc_train_x3
        union all
        select train_multiclass_arow(addBias(features),label) as (label,feature,weight)
        from news20mc_train_x3
        union all
        select train_multiclass_scw(addBias(features),label) as (label,feature,weight)
        from news20mc_train_x3
      ) t
      group by label, feature;
  • 20. How to use Hivemall: Prediction
      [Same workflow diagram as slide 12. Highlighted step: Prediction.]
  • 21. How to use Hivemall - Prediction
      Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model. There is no need to load the entire model into memory.
      CREATE TABLE lr_predict as
      SELECT
        t.rowid,
        sigmoid(sum(m.weight)) as prob
      FROM testing_exploded t
        LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
      GROUP BY t.rowid
  • 22. Plan of the Talk: (1) Brief introduction to Hivemall; (2) How to use Hivemall; (3) Real-time prediction w/ Hivemall and RDBMS
  • 23. Type/Purpose Matrix of Machine Learning
      - Online learning, online prediction: algorithmic trading (HFT); Twitter real-time analysis
      - Offline learning, online prediction: ad-tech (e.g., CTR/CVR prediction); real-time recommendation
      - Online learning, offline prediction: no/few needs?
      - Offline learning, offline prediction: daily/weekly batch systems; business analytics/reporting
  • 24. How to use Hivemall: Export prediction model
      [Workflow diagram: batch training on Hadoop turns feature vectors and labels into a prediction model; the model is exported; online prediction on an RDBMS applies it to a feature vector to produce a label.]
  • 25. Export Prediction Model to an RDBMS
      hive> desc news20b_cw_model1;
      feature int
      weight double
      Sample rows (feature, weight):
      103 -0.4896543622016907
      104 -0.0955817922949791
      105 0.12560302019119263
      106 0.09214721620082855
      TD export works with any RDBMS. Periodical export is very easy in Treasure Data.
  • 26. Real-time Prediction on MySQL
      #2 Preparing a test data table
      hive> desc testing_exploded;
      feature string
      value float
      Alternatively, you can define a SQL view for the testing target.
      #3 Online prediction on MySQL
      SELECT
        sigmoid(sum(t.value * m.weight)) as prob
      FROM testing_exploded t
        LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
      where SIGMOID(x) = 1.0 / (1.0 + exp(-x)).
      Index lookups are very efficient in RDBMSs. http://bit.ly/hivemall-rtp
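The online-prediction query can be tried end to end without a MySQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption of this example) and registers `sigmoid` as a user-defined SQL function. The table and column names follow the slides; the test rows are made up, and the model weights are taken from the sample rows on slide 25.

```python
import math
import sqlite3


def sigmoid(x):
    # SUM over an all-NULL join result is NULL, which arrives here as None.
    if x is None:
        return None
    return 1.0 / (1.0 + math.exp(-x))


def predict(model_rows, test_rows):
    """Load a model and one test instance into SQLite and run the slide's query."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("sigmoid", 1, sigmoid)
    # The primary key gives the index lookups the slide refers to.
    conn.execute("CREATE TABLE prediction_model (feature TEXT PRIMARY KEY, weight REAL)")
    conn.execute("CREATE TABLE testing_exploded (feature TEXT, value REAL)")
    conn.executemany("INSERT INTO prediction_model VALUES (?, ?)", model_rows)
    conn.executemany("INSERT INTO testing_exploded VALUES (?, ?)", test_rows)
    (prob,) = conn.execute(
        "SELECT sigmoid(sum(t.value * m.weight)) AS prob "
        "FROM testing_exploded t LEFT OUTER JOIN prediction_model m "
        "ON (t.feature = m.feature)"
    ).fetchone()
    conn.close()
    return prob


model_rows = [("103", -0.4896543622016907), ("105", 0.12560302019119263)]
test_rows = [("103", 1.0), ("105", 2.0), ("999", 1.0)]  # "999" is not in the model
prob = predict(model_rows, test_rows)
```

Features missing from the model join to NULL weights and are skipped by `sum`, so an unseen feature such as "999" simply contributes nothing to the score.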
  • 27. Cost of Amazon Machine Learning
      - Data Analysis and Model Building Fees: $0.42/instance per hour
      - Batch Prediction: $0.1/1000 requests
      - Real-time Prediction: $0.0001 per request
      Pay-per-request is apparently not suitable for doing prediction for each web request (e.g., online CTR prediction).
      Amazon ML is suspected to be based on Vowpal Wabbit (single process).
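The pay-per-request concern becomes concrete with a little arithmetic. The sketch below uses the per-request fees quoted on the slide; the traffic volume of 100 million requests per day is a hypothetical figure chosen only for illustration.

```python
from decimal import Decimal

REALTIME_FEE_PER_REQUEST = Decimal("0.0001")  # USD, from the slide
BATCH_FEE_PER_1000 = Decimal("0.1")           # USD per 1000 requests, from the slide


def realtime_cost(requests: int) -> Decimal:
    """Real-time prediction fee in USD for the given number of requests."""
    return REALTIME_FEE_PER_REQUEST * requests


def batch_cost(requests: int) -> Decimal:
    """Batch prediction fee in USD for the given number of requests."""
    return BATCH_FEE_PER_1000 * requests / 1000


daily_requests = 100_000_000  # hypothetical ad-serving traffic, not from the slides
daily_realtime = realtime_cost(daily_requests)
daily_batch = batch_cost(daily_requests)
```

At that hypothetical volume both fees come to 10,000 USD per day, before any model-building instance hours, which is the kind of per-web-request cost the slide argues against.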
  • 28. Real-time Prediction on Treasure Data
      [Diagram: run the batch training job periodically; periodical export; real-time prediction on an RDBMS.]
  • 29. Beyond Query-as-a-Service! We ❤️ Open-source! We invented .. We are Hiring!