HadoopCon'16, Taipei @myui

Hivemall: Machine Learning Library for Apache Hive/Spark
Makoto YUI (油井誠) @myui, Research Engineer
<myui@treasure-data.com>
2016/09/09 HadoopCon 16, Taipei

A little about me
➢ 2015.04~ Research Engineer at Treasure Data, Inc.
• My mission is to develop ML-as-a-Service in a Hadoop-as-a-service company
➢ 2010.04-2015.03 Senior Researcher at the National Institute of Advanced Industrial Science and Technology (産業技術総合研究所), Japan
• Developed Hivemall as a personal research project
➢ 2009.03 Ph.D. in Computer Science from NAIST
• Majored in parallel data processing, not ML, at the time
➢ Visiting scholar at CWI, Amsterdam and Univ. Edinburgh

Treasure Data
Hiro Yoshikawa, CEO: Open source business veteran
Kaz Ota, CTO: Founder of the world's largest Hadoop group
Sada Furuhashi, Chief Architect: Invented Fluentd, MessagePack
2012: Founded in Mountain View, CA
2013: New office in Tokyo, Japan
2015: New office in Seoul, Korea
TODAY: 100+ Employees, 30M+ funding
Investors
Jerry Yang: Yahoo! Founder
Bill Tai: Angel Investor
Yukihiro Matsumoto: Ruby Inventor
Sierra Ventures - Tim Guleri: Enterprise Software
Scale Ventures - Andy Vitus: B2B SaaS

We Open-source! TD invented ..
• Streaming log collector
• Bulk data import/export
• Efficient binary serialization
• Streaming query processor
• Machine learning on Hadoop
• Workflow engine (Beta): digdag.io

Our technology users
Point: Microsoft Operations Management Suite and Google Cloud Platform (Kubernetes) use Fluentd for log collection

Treasure Data’s Solution

Big Data Stats in TD

Ad-tech: Agency / Trading Desk, DMP / DSP, Ad-Network
IoT: Mitsubishi Heavy Industries (三菱重工)
Other domains: EC, Media, Game/SNS, Gaming, e-Commerce, Internet Service, Retail, Finance, Technology, Telecommunication, Maker (manufacturing)
Our Customers

Agenda
1. What is Hivemall (introduction)
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall
5. Future roadmap

What is Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2
https://github.com/myui/hivemall

Hivemall's Technology Stack
• Machine Learning: Hivemall, MLlib
• Query Processing: Hive, Pig, SparkSQL
• Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
• Resource Management: Apache YARN, MESOS
• Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3

Hivemall’s Vision: ML on SQL
Classification with Mahout
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This SQL query automatically runs in parallel on Hadoop
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and stable APIs w/ SQL abstraction

List of supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
Notes
• SCW is a good first choice; try RandomForest if SCW does not work
• Logistic regression is good for getting the probability of a positive class
• Factorization Machines work well where features are sparse and categorical

List of Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
The each_top_k function of Hivemall is useful for recommending top-k items
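As a rough illustration of what "top-k items per user" means, here is a minimal plain-Python sketch. It is not Hivemall's each_top_k implementation or its signature; the function name top_k_per_group and the sample scores are hypothetical and only show the per-group selection idea.

```python
import heapq
from collections import defaultdict


def top_k_per_group(rows, k):
    """Keep the k highest-scored items for each group.

    rows: iterable of (group, item, score) tuples (hypothetical layout).
    Returns {group: [(score, item), ...]} with the best item first.
    """
    groups = defaultdict(list)
    for group, item, score in rows:
        groups[group].append((score, item))
    # heapq.nlargest returns the k largest entries in descending order
    return {g: heapq.nlargest(k, scored) for g, scored in groups.items()}


# Made-up recommendation scores per (user, item)
scores = [
    ("user1", "itemA", 0.9), ("user1", "itemB", 0.4), ("user1", "itemC", 0.7),
    ("user2", "itemA", 0.2), ("user2", "itemD", 0.8),
]
top2 = top_k_per_group(scores, 2)
```

In a recommendation pipeline the scores would typically come from a similarity search or a matrix factorization model; here they are hard-coded.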

Other Supported Algorithms
Anomaly Detection
✓ Local Outlier Factor (LOF)
Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier
NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)

Industry use cases of Hivemall
• CTR prediction of Ad click logs
• Algorithm: Logistic regression
• Freakout Inc., Smartnews, and more
• Gender prediction of Ad click logs
• Algorithm: Classification
• Scaleout Inc.

• Item/User recommendation
• Algorithm: Recommendation
• Wish.com, GMO pepabo
Problem (minne.com): Recommendation based on hot items is hard in a hand-crafted product market because each creator sells only a few items, which soon go out of stock

• Value prediction of real estate
• Algorithm: Regression
• Livesense

• User score calculation
• Algorithm: Regression
• Klout (klout.com): influencer marketing
bit.ly/klout-hivemall

Churn Detection of Monthly Payment Service
OISIX, a leading food delivery service company in Japan, used Hivemall's Logistic Regression to get churn probability
The churn rate dropped by almost half after giving gift points to customers predicted to leave :-)

Agenda: 2. Why Hivemall (motivations etc.)

Motivation – Why a new ML framework?
Machine Learning frameworks out there that run with Hadoop:
• Mahout?
• Vowpal Wabbit? (w/ Hadoop streaming)
• Spark MLlib?
• 0xdata H2O?
• Cloudera Oryx?
Quick Poll: How many people in this room are using them?

How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS:
Raw Data (HDFS / S3) → Extract-Transform-Load → Feature Vector file → Machine Learning
Example feature vector: height:173cm, weight:60kg, age:34, gender: man, …

Need to do expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory)

The machine learning tools do not scale
Have to learn R/Python APIs

Hivemall’s Vision: ML on SQL (again)
Classification with Mahout
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This SQL query automatically runs in parallel on Hadoop
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and stable APIs w/ SQL abstraction

Hivemall on Apache Spark
Installation is very easy as follows:
$ spark-shell --packages maropu:hivemall-spark:0.0.6

Agenda: 3. Hivemall Internals

How Hivemall works in training
Machine learning algorithms are implemented as User-Defined Table-generating Functions (UDTFs). A UDTF is a function that returns a relation.
Training table: tuples of <label, array<features>>, e.g. +1, <1,2> .. +1, <1,7,9> and -1, <1,3,9> .. +1, <3,8>
Flow: training table → train UDTFs (with param-mix) → tuples of <feature, weights> → shuffle by feature → prediction model, a relation of <feature, weights>
● The resulting prediction model is a relation of features and their weights
● The numbers of mappers and reducers are configurable
Parallelism is Powerful
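The training flow (parallel train tasks emit feature/weight rows, rows are shuffled by feature, reducers average the weights) can be sketched in plain Python as follows. This is only an illustration of the averaging step under simplifying assumptions: the helper name average_by_feature and the sample weights are made up, and nothing here is Hivemall code.

```python
from collections import defaultdict


def average_by_feature(mapper_outputs):
    """Average the weights emitted for each feature across all mappers.

    mapper_outputs: list of lists of (feature, weight) rows, one list per
    mapper (hypothetical layout standing in for the UDTF output relation).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rows in mapper_outputs:
        for feature, weight in rows:
            # "Shuffle by feature": rows with the same feature meet here
            sums[feature] += weight
            counts[feature] += 1
    # Equivalent of avg(weight) ... GROUP BY feature
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Two mappers that each learned weights on their own data split
mapper_outputs = [
    [(1, 0.5), (2, -0.25)],
    [(1, 1.5), (3, 2.0)],
]
model = average_by_feature(mapper_outputs)
```

The result is again a relation of feature and weight, which is why the prediction model can be stored as an ordinary table.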

Parameter averaging (bagging)
Training table: tuples of <label, features>, split across train tasks, e.g. +1, <1,2> .. +1, <1,7,9> | -1, <1,3,9> .. +1, <3,8> | -1, <2,7,9> .. +1, <3,8>
Each train task learns an array<weight>; the arrays are combined by MIX

Alternative Approach in Hivemall
Hivemall provides the amplify UDTF to emulate the effect of multiple training iterations without running several MapReduce steps
SET hivevar:xtimes=3;
CREATE VIEW training_x3
as
SELECT
*
FROM (
SELECT
amplify(${xtimes}, *) as (rowid, label, features)
FROM
training
) t
CLUSTER BY rand()
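To make the effect of the view above concrete, the following plain-Python sketch repeats every training row xtimes and then shuffles the result, so a single training pass sees each example several times in random order. It only mimics the idea; the generator name and sample rows are hypothetical, and the shuffle is a stand-in for CLUSTER BY rand(), not a model of how Hive distributes rows.

```python
import random


def amplify(xtimes, rows):
    """Yield every input row xtimes times (illustrative stand-in only)."""
    for row in rows:
        for _ in range(xtimes):
            yield row


# Hypothetical (rowid, label, features) rows
training = [
    (1, 1.0, ["1", "2"]),
    (2, -1.0, ["1", "3", "9"]),
]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
training_x3 = list(amplify(3, training))
rng.shuffle(training_x3)  # random order, loosely like CLUSTER BY rand()
```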

Agenda: 4. How to use Hivemall

How to use Hivemall
Training: Label + Feature Vector → Machine Learning → Prediction Model
Prediction: Feature Vector → Prediction Model → Label
Current step: Data preparation

How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  target float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';

How to use Hivemall
Current step: Feature Engineering

How to use Hivemall - Feature Engineering
Applying min-max normalization: transforming a label value to a value between 0.0 and 1.0
create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(target,${min_label},${max_label}) as label,
  features
from
  e2006tfidf_train;
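The arithmetic behind min-max normalization is small enough to show directly. The sketch below is a plain-Python version of the standard formula (value - min) / (max - min); it is not the Hivemall UDF itself, and returning 0.5 when min equals max is an assumption made here to avoid division by zero.

```python
def rescale(value, min_value, max_value):
    """Min-max normalize value into the range [0.0, 1.0]."""
    if min_value == max_value:
        # Assumption: pick the midpoint when the range is empty
        return 0.5
    return (value - min_value) / (max_value - min_value)


# Hypothetical label range, playing the role of ${min_label} and ${max_label}
min_label, max_label = -7.0, 1.0
scaled = [rescale(v, min_label, max_label) for v in (-7.0, -3.0, 1.0)]
```

In the view above, the minimum and maximum would normally be computed beforehand over the training table and passed in as Hive variables.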

How to use Hivemall
Current step: Training

How to use Hivemall - Training
Training by logistic regression
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features,label,..)
    as (feature,weight)
  FROM train
) t -- map-only task to learn a prediction model
GROUP BY feature -- shuffle map outputs to reducers by feature

How to use Hivemall - Training
CREATE TABLE news20b_cw_model1 AS
SELECT
feature,
voted_avg(weight) as weight
FROM
(SELECT
train_cw(features,label)
as (feature,weight)
FROM
news20b_train
) t
GROUP BY feature
Training of the Confidence Weighted (CW) classifier
voted_avg votes on whether to use the negative or the positive weights for the average, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
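A minimal plain-Python sketch of the voting idea, using the example weights from the slide: count the positive and the negative weights, then average only the side that has more votes. This is an illustration rather than Hivemall's voted_avg code; in particular, how ties and zero weights are handled here is an assumption.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0  # assumption: no non-zero weights means a zero weight
    # Assumption: a tie goes to the negative side
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)


# Weights learned for one feature by five parallel trainers
example = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```

With the slide's example, four of the five weights are positive, so the single negative weight is ignored and the result is the mean of the four positive ones (about 0.475).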

How to use Hivemall
Current step: Prediction

How to use Hivemall - Prediction
CREATE TABLE lr_predict
as
SELECT
t.rowid,
sigmoid(sum(m.weight)) as prob
FROM
testing_exploded t LEFT OUTER JOIN
lr_model m ON (t.feature = m.feature)
GROUP BY
t.rowid
Prediction is done by LEFT OUTER JOIN
between test data and prediction model
No need to load the entire model into memory
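The arithmetic of the prediction query can be sketched in plain Python: for each test row, sum the model weights of its features and pass the sum through the sigmoid. This sketch follows the query as shown on the slide (a plain sum of weights, with no feature values) and uses an in-memory dict purely for brevity, whereas the Hive join avoids loading the model into memory. The names and sample data are hypothetical.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(testing_exploded, model):
    """testing_exploded: (rowid, feature) pairs, one per feature of a row.
    model: {feature: weight}, standing in for the lr_model table."""
    totals = defaultdict(float)
    for rowid, feature in testing_exploded:
        # Like the LEFT OUTER JOIN, features missing from the model
        # contribute nothing. Simplification: a row with no matching
        # feature at all gets a sum of 0.0 here rather than NULL.
        totals[rowid] += model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in totals.items()}


model = {"a": 2.0, "b": -2.0}
testing_exploded = [(1, "a"), (1, "b"), (2, "a"), (2, "unseen")]
probs = predict(testing_exploded, model)
```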

Real-time prediction
Batch training on Hadoop: Label + Feature Vector → Machine Learning → Prediction Model
Export the prediction model
Online prediction on RDBMS: Feature Vector → Prediction Model → Label
bit.ly/hivemall-rtp

RandomForest in Hivemall
Ensemble of Decision Trees

Training of RandomForest

Prediction of RandomForest

Agenda: 5. Future roadmap

Future of Hivemall
Hivemall will become Apache Hivemall (?)
The vote is still in progress, though..

Apache Incubation status

Initial committers
• Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT>
  ➢ Hivemall on Apache Spark
• Daniel Dai <Hortonworks>
  ➢ Hivemall on Apache Pig
  ➢ Apache Pig PMC member
• Tsuyoshi Ozawa <NTT>
  ➢ Apache Hadoop PMC member
• Kai Sasaki <Treasure Data>

Project mentors (Champion and Nominated Mentors)
• Reynold Xin <Databricks, ASF member>: Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member>: Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member>: Apache Spark PMC member
• Roman Shaposhnik <Pivotal, ASF member>: Apache Bigtop/Incubator PMC member

Roadmap
• Possibly enter Apache Incubator soon
• IP clearance and project/repository site setup
• Contribution guidelines
• Create a "who uses Hivemall" list
• More documentation! (Sept to Nov)
• Initial Apache release will be in Dec (or late Nov?)

Coming New Features - already merged in master
✓ Hivemall on Spark 2.0 w/ DataFrame support
✓ XGBoost support
Please refer to bit.ly/hivemall-xgboost for details

✓ ChangeFinder
• Efficient algorithm for finding change points and outliers in time series data
J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.


✓ Various Evaluation Metrics
• PR #326

Other new features in progress
• v0.5-beta{1,2} release (Oct-Nov)
✓ One-hot encoding
✓ Field-aware Factorization Machines
✓ Kernelized Passive Aggressive
✓ Generalized Linear Model
✓ Optimizer framework including ADAM
✓ L1/L2 regularization
✓ Gradient Tree Boosting
✓ Online LDA

Conclusion and Takeaway
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
Hivemall's Positioning
➢ For SQL users who need ML
➢ For those already using Hive
➢ Designed with ease of use and scalability in mind
Does not require coding, packaging, compiling, or introducing a new programming language or APIs.
We welcome your contributions to Apache Hivemall :-)

Any feature requests or questions?
#hivemall
