SlideShare a Scribd company logo
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
Treasure  Data  Inc.
Research  Engineer
Makoto  YUI  @myui
2015/05/14
TD  tech  talk  #3  @Retty 1
http://myui.github.io/
20  min.  Introduction  to  Hivemall
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
Ø2015/04  Joined  Treasure  Data,  Inc.
Ø1st Research  Engineer  in  Treasure  Data
ØMy  mission  in  TD  is  developing  ML-­‐as-­‐a-­‐Service  (MLaaS)  
Ø2010/04-­‐2015/03  Senior  Researcher  at  National  Institute  
of  Advanced  Industrial  Science  and  Technology,  Japan.  
ØWorked  on  a  large-­‐scale  Machine  Learning  project  and  Parallel  
Databases  
Ø2009/03  Ph.D.  in  Computer  Science  from  NAIST
Ø My  research  topic  was  about  building  XML  native  database  and  
Parallel  Database  systems
ØSuper  programmer  award  from  the  MITOU  Foundation  
(a  Government  founded  program  for  finding  young  and  
talented  programmers)
Ø Super  creators  in  Treasure  Data:  Sada Furuhashi,  Keisuke  Nishida
2
Who  am    I  ?
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
3
0
2000
4000
6000
8000
10000
12000
Aug-­‐12Sep-­‐12Oct-­‐12Nov-­‐12Dec-­‐12
Jan-­‐13Feb-­‐13M
ar-­‐13Apr-­‐13M
ay-­‐13Jun-­‐13
Jul-­‐13Aug-­‐13Sep-­‐13Oct-­‐13Nov-­‐13Dec-­‐13
Jan-­‐14Feb-­‐14M
ar-­‐14Apr-­‐14M
ay-­‐14Jun-­‐14
Jul-­‐14Aug-­‐14Sep-­‐14Oct-­‐14
Billion  records  (Unit)
Service  in
Series  A  Funding
Reached  100  customers
Selected  as  “Cool  Vendor  
in  Big  Data”  by  Gartner
10  trillion
records  
5  trillion  records
Figures on Oct. 2014
4 hundred thousand (40万) records Imported for each SECOND!!
10+ trillion (10兆) records Total number of imported records
12 billion (120億) records # records sent by an Ad-tech company
Figures  of  Imported  Data  in  Treasure  Data
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
The  latest  numbers  in  Treasure  Data
100+
Customers
In Japan
15 trillion
# of
stored records
4,000
A single company
sends data to us
from 4,000 nodes
500,000
# of records
stored per a second
4
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
Plan  of  the  Talk
1. Brief  introduction  to  Hivemall
2. How  to  use  Hivemall
3. Real-­‐time  prediction  w/  Hivemall  and  RDBMS
5
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
What  is  Hivemall
Scalable  machine  learning  library  built  on  the  top  of  
Apache  Hive,  licensed  under  the  Apache  License  v2
Hadoop  HDFS
MapReduce
(MRv1)
Hive /  PIG
Hivemall
Apache  YARN
Apache  Tez
DAG processing
MR v2
Machine  Learning
Check  http://github.com/myui/hivemall
6
Query  Processing
Parallel  Data  
Processing  Framework
Resource  Management
Distributed  File  System
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
R
M MM
M
HDFS
HDFS
M M M
R
M M M
R
HDFS
M MM
M M
HDFS
R
MapReduce  and  DAG  engine
MapReduce   DAG  engine
Tez/Spark
No  intermediate  DFS  reads/writes!
7
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
Very  easy  to  use;  Machine  Learning  on  SQL
The  key  characteristic  of  Hivemall
100+  lines
of  code
Classification  with  Mahout
CREATE  TABLE  lr_model AS
SELECT
feature,  -­‐-­‐ reducers  perform  model  averaging  in  
parallel
avg(weight)  as  weight
FROM  (
SELECT  logress(features,label,..)  as  (feature,weight)
FROM  train
)  t  -­‐-­‐ map-­‐only  task
GROUP  BY  feature;  -­‐-­‐ shuffled  to  reducers
ü Machine  Learning  made  easy  for  SQL  
developers  (ML  for  the  rest  of  us)
ü APIs  are  very  stable  because  of  SQL  
abstraction
This  SQL  query  automatically  runs  in  parallel
on  Hadoop  
8
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
List  of  functions  in  Hivemall  v0.3
9
• Classification  (both  
binary-­‐ and  multi-­‐class)
ü Perceptron
ü Passive  Aggressive  (PA)
ü Confidence  Weighted  (CW)
ü Adaptive  Regularization  of  
Weight  Vectors  (AROW)
ü Soft  Confidence  Weighted  (SCW)
ü AdaGrad+RDA
• Regression
ü Logistic  Regression  (SGD)
ü PA  Regression
ü AROW  Regression
ü AdaGrad
ü AdaDELTA
• kNN and  Recommendation
ü Minhash and  b-­‐Bit  Minhash
(LSH  variant)
ü Similarity  Search  using  K-­‐NN
ü Matrix  Factorization
• Feature  engineering
ü Feature  hashing
ü Feature  scaling
(normalization,  z-­‐score)  
ü TF-­‐IDF  vectorizer
Treasure  Data  will  support  Hivemall
v0.3.1  in  the  next  week!  
bit.ly/hivemall-­‐mf
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
• Contribution  from  Daniel  Dai  (Pig  PMC)  from  
Hortonworks
• To  be  supported  from  Pig  0.15
10
Hivemall  on  Apache  Pig
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
Plan  of  the  Talk
1. Brief  introduction  to  Hivemall
2. How  to  use  Hivemall
3. Real-­‐time  prediction  w/  Hivemall  and  RDBMS
11
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
How  to  use  Hivemall
Machine
Learning
Training
Prediction
Prediction
Model
Label
Feature  Vector
Feature  Vector
Label
Data  preparation
12
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
Create external table e2006tfidf_train (
rowid int,
label float,
features ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '¥t'
COLLECTION ITEMS TERMINATED BY ",“
STORED AS TEXTFILE LOCATION '/dataset/E2006-
tfidf/train';
How  to  use  Hivemall  -­‐ Data  preparation
Define  a  Hive  table  for  training/testing  data
13
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
How  to  use  Hivemall
Machine
Learning
Training
Prediction
Prediction
Model
Label
Feature  Vector
Feature  Vector
Label
Feature  Engineering
14
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
create view e2006tfidf_train_scaled
as
select
rowid,
rescale(target,${min_label},${max_label})
as label,
features
from
e2006tfidf_train;
Applying a Min-Max Feature Normalization
How  to  use  Hivemall  -­‐ Feature  Engineering
Transforming  a  label  value  
to  a  value  between  0.0  and  1.0
15
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
How  to  use  Hivemall
Machine
Learning
Training
Prediction
Prediction
Model
Label
Feature  Vector
Feature  Vector
Label
Training
16
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
How  to  use  Hivemall  -­‐ Training
CREATE TABLE lr_model AS
SELECT
feature,
avg(weight) as weight
FROM (
SELECT logress(features,label,..)
as (feature,weight)
FROM train
) t
GROUP BY feature
Training  by  logistic  regression
map-­‐only  task  to  learn  a  prediction  model
Shuffle  map-­‐outputs  to  reduces  by  feature
Reducers  perform  model  averaging  
in  parallel
17
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
How  to  use  Hivemall  -­‐ Training
CREATE TABLE news20b_cw_model1 AS
SELECT
feature,
voted_avg(weight) as weight
FROM
(SELECT
train_cw(features,label)
as (feature,weight)
FROM
news20b_train
) t
GROUP BY feature
Training  of  Confidence  Weighted  Classifier
Vote  to  use  negative  or  positive  
weights  for  avg
+0.7,  +0.3,  +0.2,  -­‐0.1,  +0.7
Training  for  the  CW  classifier
18
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
create table news20mc_ensemble_model1 as
select
label,
cast(feature as int) as feature,
cast(voted_avg(weight) as float) as weight
from
(select
train_multiclass_cw(addBias(features),label)
as (label,feature,weight)
from
news20mc_train_x3
union all
select
train_multiclass_arow(addBias(features),label)
as (label,feature,weight)
from
news20mc_train_x3
union all
select
train_multiclass_scw(addBias(features),label)
as (label,feature,weight)
from
news20mc_train_x3
) t
group by label, feature;
Ensemble  learning  for  stable  prediction  performance
Just  stack  prediction  models  
by  union  all
19
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
How  to  use  Hivemall
Machine
Learning
Training
Prediction
Prediction
Model
Label
Feature  Vector
Feature  Vector
Label
Prediction
20
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
How  to  use  Hivemall  -­‐ Prediction
CREATE TABLE lr_predict
as
SELECT
t.rowid,
sigmoid(sum(m.weight)) as prob
FROM
testing_exploded t LEFT OUTER JOIN
lr_model m ON (t.feature = m.feature)
GROUP BY
t.rowid
Prediction  is  done  by  LEFT  OUTER  JOIN
between  test  data  and  prediction  model
No  need  to  load  the  entire  model  into  memory
21
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
Plan  of  the  Talk
1. Brief  introduction  to  Hivemall
2. How  to  use  Hivemall
3. Real-­‐time  prediction  w/  Hivemall  and  RDBMS
22
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
Type/Purpose  Matrix  of  Machine  Learning
23
Online
Learning
Offline
Learning
Online
Prediction
• Algorithm Trade  (HFT)
• Twitter  real-­‐time  
analysis
• Ad-­‐tech (e.g.,  CTR/CVR  
prediction)
• Real-­‐time  
recommendation
Offline
Prediction
no/fewneeds?
• Daily/weeklybatch  
systems
• Business
Analytics/Reporting
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
How  to  use  Hivemall
Machine
Learning
Batch Training on Hadoop
Online Prediction on RDBMS
Prediction
Model
Label
Feature  Vector
Feature  Vector
Label
Export  
prediction  model
24
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
Export  Prediction  Model  to  a  RDBMS
25
hive> desc news20b_cw_model1;
feature int
weight double
Any  RDBMS
TD  export
Periodical  export  is  very easy
in  Treasure  Data
103 -0.4896543622016907
104 -0.0955817922949791
105 0.12560302019119263
106 0.09214721620082855
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
26
hive>  desc  testing_exploded;                                                    
feature                                  string  
value                                      float
Real-­‐time  Prediction  on  MySQL
#2  Preparing  a  Test  data  table
SIGMOID(x) =  1.0  /  (1.0  +  exp(-­‐x))
Prediction
Model
Label
Feature  Vector
SELECT    
sigmoid(sum(t.value   *  m.weight))  as  prob
FROM
testing_exploded   t  LEFT  OUTER  JOIN  
prediction_model   m  ON  (t.feature  =  m.feature)
#3  Online  prediction  on  MySQL  
You  can  alternatively  use  SQL  view
defining  for  testing  target
Index  lookups  are  very
efficient  in  RDBMSs
http://bit.ly/hivemall-­‐rtp
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
Cost  of  Amazon  Machine  Learning
Amazon-­‐ML  is  suspected  to  be  based  on  Vowpal Wabbit
(single  process)  
27
Data  Analysis  and  Model  Building  Fees
$0.42/Instance  per  Hour
Batch  Prediction
$0.1/1000 requests
Real-­‐time  Prediction
$0.0001  per  a  request
Pay-­‐per-­‐request    is  apparently  not  suitable  for  doing  prediction  for  
each  web  request  (e.g.  online  CTR  prediction)
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
28
Real-­‐time  Prediction  on  Treasure  Data
Run  batch  training
job  periodically
Real-­‐time  prediction
on  a  RDBMS
Periodical
export
Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
29
Beyond  Query-­‐as-­‐a-­‐Service!
We  ❤️  Open-­‐source!  We  Invented  ..
We  are  Hiring!

More Related Content

What's hot

EBPF and Linux Networking
EBPF and Linux NetworkingEBPF and Linux Networking
EBPF and Linux Networking
PLUMgrid
 
eBPF Trace from Kernel to Userspace
eBPF Trace from Kernel to UserspaceeBPF Trace from Kernel to Userspace
eBPF Trace from Kernel to Userspace
SUSE Labs Taipei
 
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
Henning Jacobs
 
라이브 서비스를 위한 게임 서버 구성
라이브 서비스를 위한 게임 서버 구성라이브 서비스를 위한 게임 서버 구성
라이브 서비스를 위한 게임 서버 구성
Hyunjik Bae
 
Room 1 - 1 - Benoit TELLIER - On premise email inbound service with Apache James
Room 1 - 1 - Benoit TELLIER - On premise email inbound service with Apache JamesRoom 1 - 1 - Benoit TELLIER - On premise email inbound service with Apache James
Room 1 - 1 - Benoit TELLIER - On premise email inbound service with Apache James
Vietnam Open Infrastructure User Group
 
Fundamentals of Quantum Computing
Fundamentals of Quantum ComputingFundamentals of Quantum Computing
Fundamentals of Quantum Computing
achakracu
 
Quantum Cryptography
Quantum CryptographyQuantum Cryptography
Quantum Cryptography
pixiejen
 
CNN 초보자가 만드는 초보자 가이드 (VGG 약간 포함)
CNN 초보자가 만드는 초보자 가이드 (VGG 약간 포함)CNN 초보자가 만드는 초보자 가이드 (VGG 약간 포함)
CNN 초보자가 만드는 초보자 가이드 (VGG 약간 포함)
Lee Seungeun
 
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
Thomas Graf
 
시즌 2: 멀티쓰레드 프로그래밍이 왜이리 힘드나요?
시즌 2: 멀티쓰레드 프로그래밍이 왜이리 힘드나요?시즌 2: 멀티쓰레드 프로그래밍이 왜이리 힘드나요?
시즌 2: 멀티쓰레드 프로그래밍이 왜이리 힘드나요?
내훈 정
 
Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101
Huy Vo
 
Driving containerd operations with gRPC
Driving containerd operations with gRPCDriving containerd operations with gRPC
Driving containerd operations with gRPC
Docker, Inc.
 
Cilium - Fast IPv6 Container Networking with BPF and XDP
Cilium - Fast IPv6 Container Networking with BPF and XDPCilium - Fast IPv6 Container Networking with BPF and XDP
Cilium - Fast IPv6 Container Networking with BPF and XDP
Thomas Graf
 
Observer pattern with Stl, boost and qt
Observer pattern with Stl, boost and qtObserver pattern with Stl, boost and qt
Observer pattern with Stl, boost and qt
Daniel Eriksson
 
ClearPass 6.4.0 Release Notes
ClearPass 6.4.0 Release NotesClearPass 6.4.0 Release Notes
ClearPass 6.4.0 Release Notes
Aruba, a Hewlett Packard Enterprise company
 
5分でわかるWebRTCの仕組み - html5minutes vol.01
5分でわかるWebRTCの仕組み - html5minutes vol.015分でわかるWebRTCの仕組み - html5minutes vol.01
5分でわかるWebRTCの仕組み - html5minutes vol.01
西畑 一馬
 
Performance Tuning EC2 Instances
Performance Tuning EC2 InstancesPerformance Tuning EC2 Instances
Performance Tuning EC2 Instances
Brendan Gregg
 
[NDC 14] 가죽 장화를 먹게 해달라니 - &lt;야생의>의 자유도 높은 아이템 시스템 구현
[NDC 14] 가죽 장화를 먹게 해달라니 - &lt;야생의>의 자유도 높은 아이템 시스템 구현[NDC 14] 가죽 장화를 먹게 해달라니 - &lt;야생의>의 자유도 높은 아이템 시스템 구현
[NDC 14] 가죽 장화를 먹게 해달라니 - &lt;야생의>의 자유도 높은 아이템 시스템 구현Hoyoung Choi
 
Rancher and Kubernetes Best Practices
Rancher and  Kubernetes Best PracticesRancher and  Kubernetes Best Practices
Rancher and Kubernetes Best Practices
Avinash Patil
 
OpenStack networking
OpenStack networkingOpenStack networking
OpenStack networking
Sim Janghoon
 

What's hot (20)

EBPF and Linux Networking
EBPF and Linux NetworkingEBPF and Linux Networking
EBPF and Linux Networking
 
eBPF Trace from Kernel to Userspace
eBPF Trace from Kernel to UserspaceeBPF Trace from Kernel to Userspace
eBPF Trace from Kernel to Userspace
 
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
 
라이브 서비스를 위한 게임 서버 구성
라이브 서비스를 위한 게임 서버 구성라이브 서비스를 위한 게임 서버 구성
라이브 서비스를 위한 게임 서버 구성
 
Room 1 - 1 - Benoit TELLIER - On premise email inbound service with Apache James
Room 1 - 1 - Benoit TELLIER - On premise email inbound service with Apache JamesRoom 1 - 1 - Benoit TELLIER - On premise email inbound service with Apache James
Room 1 - 1 - Benoit TELLIER - On premise email inbound service with Apache James
 
Fundamentals of Quantum Computing
Fundamentals of Quantum ComputingFundamentals of Quantum Computing
Fundamentals of Quantum Computing
 
Quantum Cryptography
Quantum CryptographyQuantum Cryptography
Quantum Cryptography
 
CNN 초보자가 만드는 초보자 가이드 (VGG 약간 포함)
CNN 초보자가 만드는 초보자 가이드 (VGG 약간 포함)CNN 초보자가 만드는 초보자 가이드 (VGG 약간 포함)
CNN 초보자가 만드는 초보자 가이드 (VGG 약간 포함)
 
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
 
시즌 2: 멀티쓰레드 프로그래밍이 왜이리 힘드나요?
시즌 2: 멀티쓰레드 프로그래밍이 왜이리 힘드나요?시즌 2: 멀티쓰레드 프로그래밍이 왜이리 힘드나요?
시즌 2: 멀티쓰레드 프로그래밍이 왜이리 힘드나요?
 
Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101
 
Driving containerd operations with gRPC
Driving containerd operations with gRPCDriving containerd operations with gRPC
Driving containerd operations with gRPC
 
Cilium - Fast IPv6 Container Networking with BPF and XDP
Cilium - Fast IPv6 Container Networking with BPF and XDPCilium - Fast IPv6 Container Networking with BPF and XDP
Cilium - Fast IPv6 Container Networking with BPF and XDP
 
Observer pattern with Stl, boost and qt
Observer pattern with Stl, boost and qtObserver pattern with Stl, boost and qt
Observer pattern with Stl, boost and qt
 
ClearPass 6.4.0 Release Notes
ClearPass 6.4.0 Release NotesClearPass 6.4.0 Release Notes
ClearPass 6.4.0 Release Notes
 
5分でわかるWebRTCの仕組み - html5minutes vol.01
5分でわかるWebRTCの仕組み - html5minutes vol.015分でわかるWebRTCの仕組み - html5minutes vol.01
5分でわかるWebRTCの仕組み - html5minutes vol.01
 
Performance Tuning EC2 Instances
Performance Tuning EC2 InstancesPerformance Tuning EC2 Instances
Performance Tuning EC2 Instances
 
[NDC 14] 가죽 장화를 먹게 해달라니 - &lt;야생의>의 자유도 높은 아이템 시스템 구현
[NDC 14] 가죽 장화를 먹게 해달라니 - &lt;야생의>의 자유도 높은 아이템 시스템 구현[NDC 14] 가죽 장화를 먹게 해달라니 - &lt;야생의>의 자유도 높은 아이템 시스템 구현
[NDC 14] 가죽 장화를 먹게 해달라니 - &lt;야생의>의 자유도 높은 아이템 시스템 구현
 
Rancher and Kubernetes Best Practices
Rancher and  Kubernetes Best PracticesRancher and  Kubernetes Best Practices
Rancher and Kubernetes Best Practices
 
OpenStack networking
OpenStack networkingOpenStack networking
OpenStack networking
 

Similar to Introduction to Hivemall

Hivemall tech talk at Redwood, CA
Hivemall tech talk at Redwood, CAHivemall tech talk at Redwood, CA
Hivemall tech talk at Redwood, CA
Makoto Yui
 
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
huguk
 
Managing Machine Learning workflows on Treasure Data
Managing Machine Learning workflows on Treasure DataManaging Machine Learning workflows on Treasure Data
Managing Machine Learning workflows on Treasure Data
Aki Ariga
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Srivatsan Ramanujam
 
Db tech show - hivemall
Db tech show - hivemallDb tech show - hivemall
Db tech show - hivemall
Makoto Yui
 
Talk about Hivemall at Data Scientist Organization on 2015/09/17
Talk about Hivemall at Data Scientist Organization on 2015/09/17Talk about Hivemall at Data Scientist Organization on 2015/09/17
Talk about Hivemall at Data Scientist Organization on 2015/09/17
Makoto Yui
 
Hivemall meets Digdag @Hackertackle 2018-02-17
Hivemall meets Digdag @Hackertackle 2018-02-17Hivemall meets Digdag @Hackertackle 2018-02-17
Hivemall meets Digdag @Hackertackle 2018-02-17
Makoto Yui
 
Apache Hivemall and my OSS experience
Apache Hivemall and my OSS experienceApache Hivemall and my OSS experience
Apache Hivemall and my OSS experience
Makoto Yui
 
Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)
Jim Dowling
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
Srivatsan Ramanujam
 
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsData Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Esther Vasiete
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Skillspeed
 
Dancing with the Elephant
Dancing with the ElephantDancing with the Elephant
Dancing with the Elephant
DataWorks Summit
 
Jfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocksJfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocks
Jim Dowling
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworks
IJDKP
 
Mammothdb - Public VC Pitchdeck!
Mammothdb - Public VC Pitchdeck!Mammothdb - Public VC Pitchdeck!
Mammothdb - Public VC Pitchdeck!
Steve Keil
 
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETL
Adam Muise
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET Journal
 
The sensor data challenge - Innovations (not only) for the Internet of Things
The sensor data challenge - Innovations (not only) for the Internet of ThingsThe sensor data challenge - Innovations (not only) for the Internet of Things
The sensor data challenge - Innovations (not only) for the Internet of Things
Stephan Reimann
 

Similar to Introduction to Hivemall (20)

Hivemall tech talk at Redwood, CA
Hivemall tech talk at Redwood, CAHivemall tech talk at Redwood, CA
Hivemall tech talk at Redwood, CA
 
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
 
Managing Machine Learning workflows on Treasure Data
Managing Machine Learning workflows on Treasure DataManaging Machine Learning workflows on Treasure Data
Managing Machine Learning workflows on Treasure Data
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
Db tech show - hivemall
Db tech show - hivemallDb tech show - hivemall
Db tech show - hivemall
 
Talk about Hivemall at Data Scientist Organization on 2015/09/17
Talk about Hivemall at Data Scientist Organization on 2015/09/17Talk about Hivemall at Data Scientist Organization on 2015/09/17
Talk about Hivemall at Data Scientist Organization on 2015/09/17
 
Hivemall meets Digdag @Hackertackle 2018-02-17
Hivemall meets Digdag @Hackertackle 2018-02-17Hivemall meets Digdag @Hackertackle 2018-02-17
Hivemall meets Digdag @Hackertackle 2018-02-17
 
Apache Hivemall and my OSS experience
Apache Hivemall and my OSS experienceApache Hivemall and my OSS experience
Apache Hivemall and my OSS experience
 
Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
 
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsData Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
 
Dancing with the Elephant
Dancing with the ElephantDancing with the Elephant
Dancing with the Elephant
 
Jfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocksJfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocks
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworks
 
Mammothdb - Public VC Pitchdeck!
Mammothdb - Public VC Pitchdeck!Mammothdb - Public VC Pitchdeck!
Mammothdb - Public VC Pitchdeck!
 
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETL
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOP
 
The sensor data challenge - Innovations (not only) for the Internet of Things
The sensor data challenge - Innovations (not only) for the Internet of ThingsThe sensor data challenge - Innovations (not only) for the Internet of Things
The sensor data challenge - Innovations (not only) for the Internet of Things
 

More from Treasure Data, Inc.

GDPR: A Practical Guide for Marketers
GDPR: A Practical Guide for MarketersGDPR: A Practical Guide for Marketers
GDPR: A Practical Guide for Marketers
Treasure Data, Inc.
 
AR and VR by the Numbers: A Data First Approach to the Technology and Market
AR and VR by the Numbers: A Data First Approach to the Technology and MarketAR and VR by the Numbers: A Data First Approach to the Technology and Market
AR and VR by the Numbers: A Data First Approach to the Technology and Market
Treasure Data, Inc.
 
Introduction to Customer Data Platforms
Introduction to Customer Data PlatformsIntroduction to Customer Data Platforms
Introduction to Customer Data Platforms
Treasure Data, Inc.
 
Hands On: Javascript SDK
Hands On: Javascript SDKHands On: Javascript SDK
Hands On: Javascript SDK
Treasure Data, Inc.
 
Hands-On: Managing Slowly Changing Dimensions Using TD Workflow
Hands-On: Managing Slowly Changing Dimensions Using TD WorkflowHands-On: Managing Slowly Changing Dimensions Using TD Workflow
Hands-On: Managing Slowly Changing Dimensions Using TD Workflow
Treasure Data, Inc.
 
Brand Analytics Management: Measuring CLV Across Platforms, Devices and Apps
Brand Analytics Management: Measuring CLV Across Platforms, Devices and AppsBrand Analytics Management: Measuring CLV Across Platforms, Devices and Apps
Brand Analytics Management: Measuring CLV Across Platforms, Devices and Apps
Treasure Data, Inc.
 
How to Power Your Customer Experience with Data
How to Power Your Customer Experience with DataHow to Power Your Customer Experience with Data
How to Power Your Customer Experience with Data
Treasure Data, Inc.
 
Why Your VR Game is Virtually Useless Without Data
Why Your VR Game is Virtually Useless Without DataWhy Your VR Game is Virtually Useless Without Data
Why Your VR Game is Virtually Useless Without Data
Treasure Data, Inc.
 
Connecting the Customer Data Dots
Connecting the Customer Data DotsConnecting the Customer Data Dots
Connecting the Customer Data Dots
Treasure Data, Inc.
 
Harnessing Data for Better Customer Experience and Company Success
Harnessing Data for Better Customer Experience and Company SuccessHarnessing Data for Better Customer Experience and Company Success
Harnessing Data for Better Customer Experience and Company Success
Treasure Data, Inc.
 
Packaging Ecosystems -Monki Gras 2017
Packaging Ecosystems -Monki Gras 2017Packaging Ecosystems -Monki Gras 2017
Packaging Ecosystems -Monki Gras 2017
Treasure Data, Inc.
 
글로벌 사례로 보는 데이터로 돈 버는 법 - 트레저데이터 (Treasure Data)
글로벌 사례로 보는 데이터로 돈 버는 법 - 트레저데이터 (Treasure Data)글로벌 사례로 보는 데이터로 돈 버는 법 - 트레저데이터 (Treasure Data)
글로벌 사례로 보는 데이터로 돈 버는 법 - 트레저데이터 (Treasure Data)
Treasure Data, Inc.
 
Keynote - Fluentd meetup v14
Keynote - Fluentd meetup v14Keynote - Fluentd meetup v14
Keynote - Fluentd meetup v14
Treasure Data, Inc.
 
Introduction to New features and Use cases of Hivemall
Introduction to New features and Use cases of HivemallIntroduction to New features and Use cases of Hivemall
Introduction to New features and Use cases of Hivemall
Treasure Data, Inc.
 
Scalable Hadoop in the cloud
Scalable Hadoop in the cloudScalable Hadoop in the cloud
Scalable Hadoop in the cloud
Treasure Data, Inc.
 
Using Embulk at Treasure Data
Using Embulk at Treasure DataUsing Embulk at Treasure Data
Using Embulk at Treasure Data
Treasure Data, Inc.
 
Scaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big DataScaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big Data
Treasure Data, Inc.
 
Treasure Data: Move your data from MySQL to Redshift with (not much more tha...
Treasure Data:  Move your data from MySQL to Redshift with (not much more tha...Treasure Data:  Move your data from MySQL to Redshift with (not much more tha...
Treasure Data: Move your data from MySQL to Redshift with (not much more tha...
Treasure Data, Inc.
 
Treasure Data From MySQL to Redshift
Treasure Data  From MySQL to RedshiftTreasure Data  From MySQL to Redshift
Treasure Data From MySQL to Redshift
Treasure Data, Inc.
 
Unifying Events and Logs into the Cloud
Unifying Events and Logs into the CloudUnifying Events and Logs into the Cloud
Unifying Events and Logs into the Cloud
Treasure Data, Inc.
 

More from Treasure Data, Inc. (20)

GDPR: A Practical Guide for Marketers
GDPR: A Practical Guide for MarketersGDPR: A Practical Guide for Marketers
GDPR: A Practical Guide for Marketers
 
AR and VR by the Numbers: A Data First Approach to the Technology and Market
AR and VR by the Numbers: A Data First Approach to the Technology and MarketAR and VR by the Numbers: A Data First Approach to the Technology and Market
AR and VR by the Numbers: A Data First Approach to the Technology and Market
 
Introduction to Customer Data Platforms
Introduction to Customer Data PlatformsIntroduction to Customer Data Platforms
Introduction to Customer Data Platforms
 
Hands On: Javascript SDK
Hands On: Javascript SDKHands On: Javascript SDK
Hands On: Javascript SDK
 
Hands-On: Managing Slowly Changing Dimensions Using TD Workflow
Hands-On: Managing Slowly Changing Dimensions Using TD WorkflowHands-On: Managing Slowly Changing Dimensions Using TD Workflow
Hands-On: Managing Slowly Changing Dimensions Using TD Workflow
 
Brand Analytics Management: Measuring CLV Across Platforms, Devices and Apps
Brand Analytics Management: Measuring CLV Across Platforms, Devices and AppsBrand Analytics Management: Measuring CLV Across Platforms, Devices and Apps
Brand Analytics Management: Measuring CLV Across Platforms, Devices and Apps
 
How to Power Your Customer Experience with Data
How to Power Your Customer Experience with DataHow to Power Your Customer Experience with Data
How to Power Your Customer Experience with Data
 
Why Your VR Game is Virtually Useless Without Data
Why Your VR Game is Virtually Useless Without DataWhy Your VR Game is Virtually Useless Without Data
Why Your VR Game is Virtually Useless Without Data
 
Connecting the Customer Data Dots
Connecting the Customer Data DotsConnecting the Customer Data Dots
Connecting the Customer Data Dots
 
Harnessing Data for Better Customer Experience and Company Success
Harnessing Data for Better Customer Experience and Company SuccessHarnessing Data for Better Customer Experience and Company Success
Harnessing Data for Better Customer Experience and Company Success
 
Packaging Ecosystems -Monki Gras 2017
Packaging Ecosystems -Monki Gras 2017Packaging Ecosystems -Monki Gras 2017
Packaging Ecosystems -Monki Gras 2017
 
글로벌 사례로 보는 데이터로 돈 버는 법 - 트레저데이터 (Treasure Data)
글로벌 사례로 보는 데이터로 돈 버는 법 - 트레저데이터 (Treasure Data)글로벌 사례로 보는 데이터로 돈 버는 법 - 트레저데이터 (Treasure Data)
글로벌 사례로 보는 데이터로 돈 버는 법 - 트레저데이터 (Treasure Data)
 
Keynote - Fluentd meetup v14
Keynote - Fluentd meetup v14Keynote - Fluentd meetup v14
Keynote - Fluentd meetup v14
 
Introduction to New features and Use cases of Hivemall
Introduction to New features and Use cases of HivemallIntroduction to New features and Use cases of Hivemall
Introduction to New features and Use cases of Hivemall
 
Scalable Hadoop in the cloud
Scalable Hadoop in the cloudScalable Hadoop in the cloud
Scalable Hadoop in the cloud
 
Using Embulk at Treasure Data
Using Embulk at Treasure DataUsing Embulk at Treasure Data
Using Embulk at Treasure Data
 
Scaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big DataScaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big Data
 
Treasure Data: Move your data from MySQL to Redshift with (not much more tha...
Treasure Data:  Move your data from MySQL to Redshift with (not much more tha...Treasure Data:  Move your data from MySQL to Redshift with (not much more tha...
Treasure Data: Move your data from MySQL to Redshift with (not much more tha...
 
Treasure Data From MySQL to Redshift
Treasure Data  From MySQL to RedshiftTreasure Data  From MySQL to Redshift
Treasure Data From MySQL to Redshift
 
Unifying Events and Logs into the Cloud
Unifying Events and Logs into the CloudUnifying Events and Logs into the Cloud
Unifying Events and Logs into the Cloud
 

Recently uploaded

AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
Paris Salesforce Developer Group
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
upoux
 
Supermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdfSupermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdf
Kamal Acharya
 
Height and depth gauge linear metrology.pdf
Height and depth gauge linear metrology.pdfHeight and depth gauge linear metrology.pdf
Height and depth gauge linear metrology.pdf
q30122000
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
VANDANAMOHANGOUDA
 
Butterfly Valves Manufacturer (LBF Series).pdf
Butterfly Valves Manufacturer (LBF Series).pdfButterfly Valves Manufacturer (LBF Series).pdf
Butterfly Valves Manufacturer (LBF Series).pdf
Lubi Valves
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
uqyfuc
 
Blood finder application project report (1).pdf
Blood finder application project report (1).pdfBlood finder application project report (1).pdf
Blood finder application project report (1).pdf
Kamal Acharya
 
Presentation on Food Delivery Systems
Presentation on Food Delivery SystemsPresentation on Food Delivery Systems
Presentation on Food Delivery Systems
Abdullah Al Noman
 
openshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoinopenshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoin
snaprevwdev
 
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
PriyankaKilaniya
 
Levelised Cost of Hydrogen (LCOH) Calculator Manual
Levelised Cost of Hydrogen  (LCOH) Calculator ManualLevelised Cost of Hydrogen  (LCOH) Calculator Manual
Levelised Cost of Hydrogen (LCOH) Calculator Manual
Massimo Talia
 
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
ydzowc
 
Assistant Engineer (Chemical) Interview Questions.pdf
Assistant Engineer (Chemical) Interview Questions.pdfAssistant Engineer (Chemical) Interview Questions.pdf
Assistant Engineer (Chemical) Interview Questions.pdf
Seetal Daas
 
AI in customer support Use cases solutions development and implementation.pdf
AI in customer support Use cases solutions development and implementation.pdfAI in customer support Use cases solutions development and implementation.pdf
AI in customer support Use cases solutions development and implementation.pdf
mahaffeycheryld
 
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICSUNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
vmspraneeth
 
Null Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAMNull Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAM
Divyanshu
 
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptxSENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
b0754201
 
Beckhoff Programmable Logic Control Overview Presentation
Beckhoff Programmable Logic Control Overview PresentationBeckhoff Programmable Logic Control Overview Presentation
Beckhoff Programmable Logic Control Overview Presentation
VanTuDuong1
 
Applications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdfApplications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdf
Atif Razi
 

Recently uploaded (20)

AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
 
Supermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdfSupermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdf
 
Height and depth gauge linear metrology.pdf
Height and depth gauge linear metrology.pdfHeight and depth gauge linear metrology.pdf
Height and depth gauge linear metrology.pdf
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
 
Butterfly Valves Manufacturer (LBF Series).pdf
Butterfly Valves Manufacturer (LBF Series).pdfButterfly Valves Manufacturer (LBF Series).pdf
Butterfly Valves Manufacturer (LBF Series).pdf
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
 
Blood finder application project report (1).pdf
Blood finder application project report (1).pdfBlood finder application project report (1).pdf
Blood finder application project report (1).pdf
 
Presentation on Food Delivery Systems
Presentation on Food Delivery SystemsPresentation on Food Delivery Systems
Presentation on Food Delivery Systems
 
openshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoinopenshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoin
 
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
 
Levelised Cost of Hydrogen (LCOH) Calculator Manual
Levelised Cost of Hydrogen  (LCOH) Calculator ManualLevelised Cost of Hydrogen  (LCOH) Calculator Manual
Levelised Cost of Hydrogen (LCOH) Calculator Manual
 
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
 
Assistant Engineer (Chemical) Interview Questions.pdf
Assistant Engineer (Chemical) Interview Questions.pdfAssistant Engineer (Chemical) Interview Questions.pdf
Assistant Engineer (Chemical) Interview Questions.pdf
 
AI in customer support Use cases solutions development and implementation.pdf
AI in customer support Use cases solutions development and implementation.pdfAI in customer support Use cases solutions development and implementation.pdf
AI in customer support Use cases solutions development and implementation.pdf
 
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICSUNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
 
Null Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAMNull Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAM
 
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptxSENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
 
Beckhoff Programmable Logic Control Overview Presentation
Beckhoff Programmable Logic Control Overview PresentationBeckhoff Programmable Logic Control Overview Presentation
Beckhoff Programmable Logic Control Overview Presentation
 
Applications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdfApplications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdf
 

Introduction to Hivemall

  • 1. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. Treasure  Data  Inc. Research  Engineer Makoto  YUI  @myui 2015/05/14 TD  tech  talk  #3  @Retty 1 http://myui.github.io/ 20  min.  Introduction  to  Hivemall
  • 2. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. Ø2015/04  Joined  Treasure  Data,  Inc. Ø1st Research  Engineer  in  Treasure  Data ØMy  mission  in  TD  is  developing  ML-­‐as-­‐a-­‐Service  (MLaaS)   Ø2010/04-­‐2015/03  Senior  Researcher  at  National  Institute   of  Advanced  Industrial  Science  and  Technology,  Japan.   ØWorked  on  a  large-­‐scale  Machine  Learning  project  and  Parallel   Databases   Ø2009/03  Ph.D.  in  Computer  Science  from  NAIST Ø My  research  topic  was  about  building  XML  native  database  and   Parallel  Database  systems ØSuper  programmer  award  from  the  MITOU  Foundation   (a  Government  founded  program  for  finding  young  and   talented  programmers) Ø Super  creators  in  Treasure  Data:  Sada Furuhashi,  Keisuke  Nishida 2 Who  am    I  ?
  • 3. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. 3 0 2000 4000 6000 8000 10000 12000 Aug-­‐12Sep-­‐12Oct-­‐12Nov-­‐12Dec-­‐12 Jan-­‐13Feb-­‐13M ar-­‐13Apr-­‐13M ay-­‐13Jun-­‐13 Jul-­‐13Aug-­‐13Sep-­‐13Oct-­‐13Nov-­‐13Dec-­‐13 Jan-­‐14Feb-­‐14M ar-­‐14Apr-­‐14M ay-­‐14Jun-­‐14 Jul-­‐14Aug-­‐14Sep-­‐14Oct-­‐14 Billion  records  (Unit) Service  in Series  A  Funding Reached  100  customers Selected  as  “Cool  Vendor   in  Big  Data”  by  Gartner 10  trillion records   5  trillion  records Figures on Oct. 2014 4 hundred thousand (40万) records Imported for each SECOND!! 10+ trillion (10兆) records Total number of imported records 12 billion (120億) records # records sent by an Ad-tech company Figures  of  Imported  Data  in  Treasure  Data
  • 4. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. The  latest  numbers  in  Treasure  Data 100+ Customers In Japan 15 trillion # of stored records 4,000 A single company sends data to us from 4,000 nodes 500,000 # of records stored per a second 4
  • 5. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. Plan  of  the  Talk 1. Brief  introduction  to  Hivemall 2. How  to  use  Hivemall 3. Real-­‐time  prediction  w/  Hivemall  and  RDBMS 5
  • 6. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. What  is  Hivemall Scalable  machine  learning  library  built  on  the  top  of   Apache  Hive,  licensed  under  the  Apache  License  v2 Hadoop  HDFS MapReduce (MRv1) Hive /  PIG Hivemall Apache  YARN Apache  Tez DAG processing MR v2 Machine  Learning Check  http://github.com/myui/hivemall 6 Query  Processing Parallel  Data   Processing  Framework Resource  Management Distributed  File  System
  • 7. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. R M MM M HDFS HDFS M M M R M M M R HDFS M MM M M HDFS R MapReduce  and  DAG  engine MapReduce   DAG  engine Tez/Spark No  intermediate  DFS  reads/writes! 7
  • 8. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. Very  easy  to  use;  Machine  Learning  on  SQL The  key  characteristic  of  Hivemall 100+  lines of  code Classification  with  Mahout CREATE  TABLE  lr_model AS SELECT feature,  -­‐-­‐ reducers  perform  model  averaging  in   parallel avg(weight)  as  weight FROM  ( SELECT  logress(features,label,..)  as  (feature,weight) FROM  train )  t  -­‐-­‐ map-­‐only  task GROUP  BY  feature;  -­‐-­‐ shuffled  to  reducers ü Machine  Learning  made  easy  for  SQL   developers  (ML  for  the  rest  of  us) ü APIs  are  very  stable  because  of  SQL   abstraction This  SQL  query  automatically  runs  in  parallel on  Hadoop   8
  • 9. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. List  of  functions  in  Hivemall  v0.3 9 • Classification  (both   binary-­‐ and  multi-­‐class) ü Perceptron ü Passive  Aggressive  (PA) ü Confidence  Weighted  (CW) ü Adaptive  Regularization  of   Weight  Vectors  (AROW) ü Soft  Confidence  Weighted  (SCW) ü AdaGrad+RDA • Regression ü Logistic  Regression  (SGD) ü PA  Regression ü AROW  Regression ü AdaGrad ü AdaDELTA • kNN and  Recommendation ü Minhash and  b-­‐Bit  Minhash (LSH  variant) ü Similarity  Search  using  K-­‐NN ü Matrix  Factorization • Feature  engineering ü Feature  hashing ü Feature  scaling (normalization,  z-­‐score)   ü TF-­‐IDF  vectorizer Treasure  Data  will  support  Hivemall v0.3.1  in  the  next  week!   bit.ly/hivemall-­‐mf
  • 10. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. • Contribution  from  Daniel  Dai  (Pig  PMC)  from   Hortonworks • To  be  supported  from  Pig  0.15 10 Hivemall  on  Apache  Pig
  • 11. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. Plan  of  the  Talk 1. Brief  introduction  to  Hivemall 2. How  to  use  Hivemall 3. Real-­‐time  prediction  w/  Hivemall  and  RDBMS 11
  • 12. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. How  to  use  Hivemall Machine Learning Training Prediction Prediction Model Label Feature  Vector Feature  Vector Label Data  preparation 12
  • 13. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. Create external table e2006tfidf_train ( rowid int, label float, features ARRAY<STRING> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '¥t' COLLECTION ITEMS TERMINATED BY ",“ STORED AS TEXTFILE LOCATION '/dataset/E2006- tfidf/train'; How  to  use  Hivemall  -­‐ Data  preparation Define  a  Hive  table  for  training/testing  data 13
  • 14. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. How  to  use  Hivemall Machine Learning Training Prediction Prediction Model Label Feature  Vector Feature  Vector Label Feature  Engineering 14
  • 15. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. create view e2006tfidf_train_scaled as select rowid, rescale(target,${min_label},${max_label}) as label, features from e2006tfidf_train; Applying a Min-Max Feature Normalization How  to  use  Hivemall  -­‐ Feature  Engineering Transforming  a  label  value   to  a  value  between  0.0  and  1.0 15
  • 16. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. How  to  use  Hivemall Machine Learning Training Prediction Prediction Model Label Feature  Vector Feature  Vector Label Training 16
  • 17. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. How  to  use  Hivemall  -­‐ Training CREATE TABLE lr_model AS SELECT feature, avg(weight) as weight FROM ( SELECT logress(features,label,..) as (feature,weight) FROM train ) t GROUP BY feature Training  by  logistic  regression map-­‐only  task  to  learn  a  prediction  model Shuffle  map-­‐outputs  to  reduces  by  feature Reducers  perform  model  averaging   in  parallel 17
  • 18. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. How  to  use  Hivemall  -­‐ Training CREATE TABLE news20b_cw_model1 AS SELECT feature, voted_avg(weight) as weight FROM (SELECT train_cw(features,label) as (feature,weight) FROM news20b_train ) t GROUP BY feature Training  of  Confidence  Weighted  Classifier Vote  to  use  negative  or  positive   weights  for  avg +0.7,  +0.3,  +0.2,  -­‐0.1,  +0.7 Training  for  the  CW  classifier 18
  • 19. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. create table news20mc_ensemble_model1 as select label, cast(feature as int) as feature, cast(voted_avg(weight) as float) as weight from (select train_multiclass_cw(addBias(features),label) as (label,feature,weight) from news20mc_train_x3 union all select train_multiclass_arow(addBias(features),label) as (label,feature,weight) from news20mc_train_x3 union all select train_multiclass_scw(addBias(features),label) as (label,feature,weight) from news20mc_train_x3 ) t group by label, feature; Ensemble  learning  for  stable  prediction  performance Just  stack  prediction  models   by  union  all 19
  • 20. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. How  to  use  Hivemall Machine Learning Training Prediction Prediction Model Label Feature  Vector Feature  Vector Label Prediction 20
  • 21. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. How  to  use  Hivemall  -­‐ Prediction CREATE TABLE lr_predict as SELECT t.rowid, sigmoid(sum(m.weight)) as prob FROM testing_exploded t LEFT OUTER JOIN lr_model m ON (t.feature = m.feature) GROUP BY t.rowid Prediction  is  done  by  LEFT  OUTER  JOIN between  test  data  and  prediction  model No  need  to  load  the  entire  model  into  memory 21
  • 22. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. Plan  of  the  Talk 1. Brief  introduction  to  Hivemall 2. How  to  use  Hivemall 3. Real-­‐time  prediction  w/  Hivemall  and  RDBMS 22
  • 23. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. Type/Purpose  Matrix  of  Machine  Learning 23 Online Learning Offline Learning Online Prediction • Algorithm Trade  (HFT) • Twitter  real-­‐time   analysis • Ad-­‐tech (e.g.,  CTR/CVR   prediction) • Real-­‐time   recommendation Offline Prediction no/fewneeds? • Daily/weeklybatch   systems • Business Analytics/Reporting
  • 24. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. How  to  use  Hivemall Machine Learning Batch Training on Hadoop Online Prediction on RDBMS Prediction Model Label Feature  Vector Feature  Vector Label Export   prediction  model 24
  • 25. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. Export  Prediction  Model  to  a  RDBMS 25 hive> desc news20b_cw_model1; feature int weight double Any  RDBMS TD  export Periodical  export  is  very easy in  Treasure  Data 103 -0.4896543622016907 104 -0.0955817922949791 105 0.12560302019119263 106 0.09214721620082855
  • 26. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. 26 hive>  desc  testing_exploded;                                                     feature                                  string   value                                      float Real-­‐time  Prediction  on  MySQL #2  Preparing  a  Test  data  table SIGMOID(x) =  1.0  /  (1.0  +  exp(-­‐x)) Prediction Model Label Feature  Vector SELECT     sigmoid(sum(t.value   *  m.weight))  as  prob FROM testing_exploded   t  LEFT  OUTER  JOIN   prediction_model   m  ON  (t.feature  =  m.feature) #3  Online  prediction  on  MySQL   You  can  alternatively  use  SQL  view defining  for  testing  target Index  lookups  are  very efficient  in  RDBMSs http://bit.ly/hivemall-­‐rtp
  • 27. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. Cost  of  Amazon  Machine  Learning Amazon-­‐ML  is  suspected  to  be  based  on  Vowpal Wabbit (single  process)   27 Data  Analysis  and  Model  Building  Fees $0.42/Instance  per  Hour Batch  Prediction $0.1/1000 requests Real-­‐time  Prediction $0.0001  per  a  request Pay-­‐per-­‐request    is  apparently  not  suitable  for  doing  prediction  for   each  web  request  (e.g.  online  CTR  prediction)
  • 28. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. 28 Real-­‐time  Prediction  on  Treasure  Data Run  batch  training job  periodically Real-­‐time  prediction on  a  RDBMS Periodical export
  • 29. Copyright  ©2015  Treasure  Data.    All  Rights  Reserved. 29 Beyond  Query-­‐as-­‐a-­‐Service! We  ❤️  Open-­‐source!  We  Invented  .. We  are  Hiring!