2nd Hivemall meetup 20151020

Introduction to Hivemall
and its new features in v0.4
Research Engineer
Makoto YUI @myui
2015/10/20 Hivemall meetup #2
Tweet w/ #hivemallmtup
http://eventdots.jp/event/571107

Who am I?
➢ 2015.04 Joined Treasure Data, Inc.
  1st Research Engineer in Treasure Data
  My mission in TD is developing ML-as-a-Service
➢ 2010.04-2015.03 Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan
  Worked on a large-scale Machine Learning project and Parallel Databases
➢ 2009.03 Ph.D. in Computer Science from NAIST
➢ Super programmer award from the MITOU Foundation

Agenda
1. What is Hivemall
2. How to use Hivemall
3. New Features in Hivemall v0.4
   1. Random Forest
   2. Factorization Machine
4. Development Roadmap of Hivemall

What is Hivemall
Scalable machine learning library built as a collection of
Hive UDFs, licensed under the Apache License v2
https://github.com/myui/hivemall

What is Hivemall
Scalable machine learning library built as a collection of
Hive UDFs, licensed under the Apache License v2
[Stack diagram, top to bottom]
Machine Learning: Hivemall
Query Processing: Hive / PIG
Parallel Data Processing Framework: MapReduce (MR v1); Apache Tez (DAG processing) and MR v2
Resource Management: Apache YARN
Distributed File System: Hadoop HDFS

Hivemall’s Vision: ML on SQL
Classification with Mahout
CREATE TABLE lr_model AS
SELECT
feature,
avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
SELECT logress(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓Machine Learning made easy for SQL
developers (ML for the rest of us)
✓Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in
parallel on Hadoop

List of Features in Hivemall v0.3.2
Classification (both binary- and multi-class)
✓ Perceptron
✓ Passive Aggressive (PA)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad
✓ AdaDELTA
kNN and Recommendation
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search using K-NN (Euclid/Cosine/Jaccard/Angular)
✓ Matrix Factorization
Feature engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion
Anomaly Detection
✓ Local Outlier Factor
Treasure Data supports Hivemall v0.3.2-3

Industry use cases of Hivemall
➢ CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc. and more
➢ Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
➢ Churn Detection
  • Algorithm: Regression
  • OISIX and more
➢ Item/User recommendation
  • Algorithm: Recommendation (Matrix Factorization / kNN)
  • Adtech Companies, ISP portal, and more
➢ Value prediction of real estate
  • Algorithm: Regression
  • Livesense

How to use Hivemall
[Workflow diagram. Training: labels and feature vectors are fed to Machine Learning, which produces a Prediction Model. Prediction: the Prediction Model maps a feature vector to a label. Highlighted step: Data preparation]

How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
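The table definition implies a plain-text layout: one row per line, columns separated by tabs, and the items of the features array separated by commas. The following is a minimal Python sketch of that layout, assuming each feature item is a "name:value" string as in the sample rows shown later in this deck; parse_row and split_feature are hypothetical helper names, not part of Hive or Hivemall.

```python
def parse_row(line):
    """Split one text row into (rowid, label, features), mirroring the DDL columns."""
    rowid, label, features = line.rstrip("\n").split("\t")
    return int(rowid), float(label), features.split(",")


def split_feature(item):
    """Split a 'name:value' feature string (assumed format) into its two parts."""
    name, _, value = item.rpartition(":")
    return name, float(value)


row = parse_row("1\t-3.5\t1:0.2,7:0.9\n")
first_feature = split_feature(row[2][0])
```

Here row holds the three columns of the table, and first_feature is the name and value of the first array element.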

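How to use Hivemall
[Workflow diagram. Training: labels and feature vectors are fed to Machine Learning, which produces a Prediction Model. Prediction: the Prediction Model maps a feature vector to a label. Highlighted step: Feature Engineering]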

How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization
create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(label,${min_label},${max_label}) as label, -- transforms the label to a value between 0.0 and 1.0
  features
from
  e2006tfidf_train;
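As a rough illustration of what min-max normalization computes, here is a small Python sketch. It assumes the usual formula (value - min) / (max - min); the error raised for a degenerate range is a choice of this sketch, not documented Hivemall behavior.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map [min_value, max_value] onto [0.0, 1.0]."""
    if max_value <= min_value:
        # Degenerate range; how a real implementation handles this is not shown here.
        raise ValueError("max_value must be greater than min_value")
    return (value - min_value) / (max_value - min_value)


scaled = rescale(5.0, 0.0, 10.0)
```

The minimum label maps to 0.0, the maximum to 1.0, and everything in between scales linearly.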

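How to use Hivemall
[Workflow diagram. Training: labels and feature vectors are fed to Machine Learning, which produces a Prediction Model. Prediction: the Prediction Model maps a feature vector to a label. Highlighted step: Training]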

How to use Hivemall - Training
CREATE TABLE lr_model AS
SELECT
feature,
avg(weight) as weight
FROM (
SELECT logress(features,label,..)
as (feature,weight)
FROM train
) t
GROUP BY feature
Training by logistic regression
• The inner query is a map-only task that learns a prediction model
• Map outputs are shuffled to reducers by feature
• Reducers perform model averaging in parallel
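The map, shuffle, and reduce steps described here can be sketched in plain Python. This is only an illustration under stated assumptions: train_partition is a hypothetical stand-in for what each mapper does (plain SGD logistic regression, not Hivemall's actual logress implementation), and average_models plays the role of avg(weight) ... GROUP BY feature.

```python
import math
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_partition(rows, eta=0.1, epochs=5):
    """Mapper stand-in: SGD logistic regression over one data partition.

    rows is a list of ({feature: value}, label) pairs with label in {0, 1}.
    Returns the learned {feature: weight} pairs for this partition.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(weights[f] * v for f, v in features.items())
            error = label - sigmoid(z)
            for f, v in features.items():
                weights[f] += eta * error * v
    return dict(weights)


def average_models(models):
    """Reducer stand-in: average the weights emitted for each feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model in models:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}


partitions = [
    [({"x": 1.0}, 1), ({"y": 1.0}, 0)],
    [({"x": 1.0}, 1)],
]
model = average_models([train_partition(p) for p in partitions])
```

Each partition is trained independently, which is why the inner query can run as a map-only task; only the small (feature, weight) pairs need to be shuffled.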

How to use Hivemall - Training
CREATE TABLE news20b_cw_model1 AS
SELECT
feature,
voted_avg(weight) as weight
FROM
(SELECT
train_cw(features,label)
as (feature,weight)
FROM
news20b_train
) t
GROUP BY feature
Training of the Confidence Weighted (CW) classifier
voted_avg takes a vote on whether to use the negative or the positive weights for the average,
e.g., for +0.7, +0.3, +0.2, -0.1, +0.7 the positive weights win the vote
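The voting idea can be sketched in Python as follows. This is an illustration of the description on this slide, not Hivemall's source: the function below is a hypothetical re-implementation, and how it treats ties and zero weights is an assumption.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0
    # Assumption of this sketch: a tie falls to the negative side.
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)


result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])  # positives win 4 to 1, result is about 0.475
```

Compared with a plain average, the single dissenting weight (-0.1) does not pull the result toward zero.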

create table news20mc_ensemble_model1 as
select
label,
cast(feature as int) as feature,
cast(voted_avg(weight)as float) as weight
from
(select
train_multiclass_cw(addBias(features),label)
as (label,feature,weight)
from
news20mc_train_x3
union all
select
train_multiclass_arow(addBias(features),label)
from
news20mc_train_x3
union all
select
train_multiclass_scw(addBias(features),label)
from
news20mc_train_x3
) t
group by label,feature;
Ensemble learning for stable prediction performance
Just stack prediction models by union all

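How to use Hivemall
[Workflow diagram. Training: labels and feature vectors are fed to Machine Learning, which produces a Prediction Model. Prediction: the Prediction Model maps a feature vector to a label. Highlighted step: Prediction]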

How to use Hivemall - Prediction
CREATE TABLE lr_predict
as
SELECT
t.rowid,
sigmoid(sum(m.weight)) as prob
FROM
testing_exploded t LEFT OUTER JOIN
lr_model m ON (t.feature = m.feature)
GROUP BY
t.rowid
Prediction is done by LEFT OUTER JOIN
between test data and prediction model
No need to load the entire model into memory
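The join-based prediction can be mimicked in a few lines of Python. This is a sketch under assumptions: the test data is already exploded into (rowid, feature) pairs as in the testing_exploded table, and a dict lookup stands in for the join against lr_model. In the real query the model stays a table, which is why it never has to fit in memory.

```python
import math
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(testing_exploded, lr_model):
    """testing_exploded: iterable of (rowid, feature); lr_model: {feature: weight}."""
    totals = defaultdict(float)
    for rowid, feature in testing_exploded:
        # LEFT OUTER JOIN semantics: a feature missing from the model yields NULL,
        # which sum() skips, so it adds nothing to the row's total.
        totals[rowid] += lr_model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in totals.items()}


probs = predict(
    [(1, "a"), (1, "b"), (2, "a"), (2, "unseen")],
    {"a": 2.0, "b": -2.0},
)
```

One difference from SQL: if none of a row's features exist in the model, sum() returns NULL in Hive, whereas this sketch returns 0.5.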

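How to use Hivemall
[Workflow diagram. Batch Training on Hadoop: labels and feature vectors are fed to Machine Learning, which produces a Prediction Model. The prediction model is exported. Online Prediction on RDBMS: the exported model maps a feature vector to a label.]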

Online Prediction on MySQL (RDBMS)
Quick (msec) response on an RDBMS
by adding an index to the feature column
bit.ly/hivemall-mysql
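A rough sketch of this idea, using Python's built-in sqlite3 as a stand-in for MySQL. The table layout mirrors lr_model from the training slides; the index name, the sample weights, and the predict_online helper are hypothetical and only illustrate how an indexed lookup can serve a single prediction.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lr_model (feature TEXT, weight REAL)")
# The index on the feature column is what keeps per-request lookups fast.
conn.execute("CREATE INDEX idx_lr_model_feature ON lr_model (feature)")
conn.executemany(
    "INSERT INTO lr_model VALUES (?, ?)",
    [("a", 1.5), ("b", -0.5), ("c", 2.0)],
)


def predict_online(conn, features):
    """Sum the weights of the given features and apply the sigmoid."""
    placeholders = ",".join("?" for _ in features)
    query = (
        "SELECT COALESCE(SUM(weight), 0.0) FROM lr_model "
        f"WHERE feature IN ({placeholders})"
    )
    (total,) = conn.execute(query, list(features)).fetchone()
    return 1.0 / (1.0 + math.exp(-total))


prob = predict_online(conn, ["a", "b", "unknown"])
```

Only the rows for the requested features are read, so the cost of one prediction depends on the number of features in the request rather than on the model size.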

Agenda
3. New Features in Hivemall v0.4
   1. Random Forest
   2. Factorization Machine

Features to be supported in Hivemall v0.4
1. RandomForest
• classification, regression
• Based on Smile github.com/haifengl/smile
2. Factorization Machine
• classification, regression (factorization)
Planned to release v0.4 in Oct.
Factorization Machines are often used by winners of data science
competitions (Criteo/Avazu CTR prediction)

RandomForest in Hivemall v0.4
Ensemble of Decision Trees
Already available on a development (smile) branch,
and its usage is explained in the project wiki
Bagging

Training of RandomForest

Out-of-bag tests and Variable Importance

Prediction of RandomForest

RandomForest
DEMO
http://bit.ly/hivemall-rf

Factorization Machine
Matrix Factorization

Context information (e.g., time)
can be considered
Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf

Factorization Model with degree=2 (2-way interaction)
[Equation labels: Global Bias; Regression coefficient of the j-th variable; Pairwise Interaction; Factorization]
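The equation itself did not survive as text. For reference, the degree-2 model in the Rendle paper cited above has the following form; the symbol names used here (w_0, w_j, v_j, k, n) follow that paper's conventions and are not recovered from the slide.

```latex
\hat{y}(\mathbf{x})
  = \underbrace{w_0}_{\text{global bias}}
  + \underbrace{\sum_{j=1}^{n} w_j x_j}_{\text{regression coefficients}}
  + \underbrace{\sum_{j=1}^{n} \sum_{j'=j+1}^{n}
      \langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle \, x_j x_{j'}}_{\text{pairwise interactions}},
\qquad
\langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle = \sum_{f=1}^{k} v_{j,f} \, v_{j',f}
```

The factorization lies in the last term: instead of learning one free weight per feature pair, each feature j gets a k-dimensional latent vector, and the interaction weight is the inner product of two such vectors.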

≈ Polynomial Regression + Factorization
For a feature [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
bit.ly/hivemall-poly
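The relation between the two models can be sketched in Python. poly2 reproduces the degree-2 expansion of the example above, and fm_predict is a naive illustration of a degree-2 factorization machine score in which each pairwise weight is the dot product of two latent vectors. Both function names and the toy parameters are hypothetical; this is not Hivemall's implementation.

```python
from itertools import combinations, combinations_with_replacement


def poly2(features):
    """Degree-2 polynomial expansion: [1, a, b, a^2, ab, b^2] for [a, b]."""
    expanded = [1.0] + list(features)
    expanded += [a * b for a, b in combinations_with_replacement(features, 2)]
    return expanded


def fm_predict(x, w0, w, v):
    """Naive degree-2 FM score: bias + linear terms + factorized pairwise terms."""
    linear = sum(wj * xj for wj, xj in zip(w, x))
    pairwise = sum(
        sum(a * b for a, b in zip(v[i], v[j])) * x[i] * x[j]
        for i, j in combinations(range(len(x)), 2)
    )
    return w0 + linear + pairwise


expanded = poly2([2.0, 3.0])
score = fm_predict([1.0, 2.0], w0=0.5, w=[1.0, 1.0], v=[[1.0, 0.0], [1.0, 1.0]])
```

Polynomial regression would learn a separate weight for every expanded term, while the factorization machine only covers the cross terms (no squares) and derives their weights from the latent vectors.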

DEMO

Agenda
4. Development Roadmap of Hivemall

Features to be supported in Hivemall v0.4.1
1. Gradient Tree Boosting
• classification, regression
2. Field-aware Factorization Machine
• classification, regression (factorization)
• The existing implementation, i.e., LibFFM, can only be
applied to classification
Planned to release v0.4.1 in Nov/Dec.

Gradient Tree Boosting (or Gradient Boosting Trees)
RF ≈ Bagging + Decision Trees
parallel execution of decision trees
GBT ≈ Boosting + Decision Trees
Sequential execution of decision trees
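To make the sequential nature of boosting concrete, here is a toy Python sketch of gradient boosting for squared loss on a single numeric feature, using depth-1 trees (stumps). It is a teaching example under simplifying assumptions, not the planned Hivemall implementation; all names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (one threshold, two leaf means) by least squares."""
    best = None
    # Skip the largest value: using it as a threshold would leave the right leaf empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, rounds=10, learning_rate=0.5):
    """Each round fits a stump to the residuals left by all previous rounds."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 2.0, 4.0, 5.0, 5.0]
mean_y = sum(ys) / len(ys)
boosted = gradient_boost(xs, ys)
```

Round t cannot start before round t-1 has produced its predictions, which is the sequential dependency noted above. In a random forest, each tree is trained on its own bootstrap sample without looking at the others, so the trees can be built in parallel.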

Gradient Tree Boosting

Features to be supported in Hivemall v0.4.2
1. Online LDA
• topic modeling, clustering
2. Mix server on Apache YARN
• Service for parameter sharing among workers
• working w/ @maropu
Planned to release v0.4.2 in Dec/Jan.

What’s Mix Server?
An external service through which distributed training processes
share parameters in the middle of training
[Diagram labels: classifiers send model updates (Async add) to the Mix servers over a
Non-blocking Channel (single shared TCP connection w/ TCP keepalive);
the target Mix server is chosen by hash(feature) % N;
each Mix server keeps an AVG/Argmin KLD accumulator; Piggy back if …]
Computation/training is not blocked
Taking advantage of asynchronous non-blocking I/O
is the core idea behind Hivemall’s MIX protocol
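The routing and accumulation labels in the diagram can be illustrated with a small Python sketch. It is a deliberately simplified, synchronous toy: the class and function names are hypothetical, CRC32 is an arbitrary stand-in for the hash function, and only the averaging accumulator is shown. It does not reproduce the actual MIX protocol or its asynchronous I/O.

```python
import zlib
from collections import defaultdict


class MixServer:
    """Toy accumulator that keeps a running average per feature."""

    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def add(self, feature, weight):
        """Fold one worker's weight in and return the current mixed value."""
        self.sums[feature] += weight
        self.counts[feature] += 1
        return self.sums[feature] / self.counts[feature]


def route(feature, n_servers):
    """hash(feature) % N: every worker sends a given feature to the same server."""
    return zlib.crc32(feature.encode("utf-8")) % n_servers


servers = [MixServer() for _ in range(3)]
target = servers[route("price", len(servers))]
target.add("price", 1.0)              # update from worker 1
mixed = target.add("price", 3.0)      # update from worker 2
```

Because routing depends only on the feature, updates for one feature from all workers meet on the same server, where they can be averaged without any coordination between servers.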

create table kdd10a_pa1_model1 as
select
feature,
cast(voted_avg(weight) as float) as weight
from
(select
train_pa1(addBias(features),label,"-mix host01,host02,host03")
as (feature,weight)
from
kdd10a_train_x3
) t
group by feature;
How to use Mix Server

Conclusion and Takeaway
Hivemall provides a collection of machine
learning algorithms as Hive UDFs/UDTFs
The latest version of Hivemall is available on
Treasure Data and used by several companies,
including OISIX, Livesense, Scaleout, and Freakout.
New features in v0.4
• Random Forest
• Factorization Machine
More will follow in v0.4.1
Next Actions
• Propose Hivemall to Apache Incubator
• New Hivemall Logo

Beyond Query-as-a-Service!
We Open-source! We invented ..
We are hiring machine learning engineers!

Additional slides

Recommendation
Rating prediction of a Matrix
Can be applied for user/Item Recommendation

Factorize a matrix
into a product of matrices
with k latent factors

Criteria of Biased MF
[Equation labels: Mean Rating; Bias for each user/item; Factorization; Regularization]
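The equation on this slide was lost in extraction. As a reference point, the commonly used formulation of biased matrix factorization is given below; the symbols (mu, b_u, b_i, p_u, q_i, lambda) are the conventional ones and are assumptions here, not text recovered from the slide.

```latex
\hat{r}_{u,i} = \underbrace{\mu}_{\text{mean rating}}
  + \underbrace{b_u + b_i}_{\text{user/item bias}}
  + \underbrace{\mathbf{p}_u^{\top} \mathbf{q}_i}_{\text{factorization}}

\min_{b,\,\mathbf{p},\,\mathbf{q}} \sum_{(u,i)}
  \left( r_{u,i} - \hat{r}_{u,i} \right)^2
  + \underbrace{\lambda \left( b_u^2 + b_i^2
      + \lVert \mathbf{p}_u \rVert^2 + \lVert \mathbf{q}_i \rVert^2 \right)}_{\text{regularization}}
```

The sum runs over the observed (user, item) ratings only, and p_u and q_i are the k-dimensional latent factor vectors of user u and item i.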

Training of Matrix Factorization
Support iterative training using local disk cache

Prediction of Matrix Factorization

Comparison to Spark MLlib
➢ Algorithm is different
  Spark: ALS-WR (considers regularization)
  Hivemall: Biased-MF (considers regularization and biases)
➢ Usability
  Spark: 100+ lines of Scala code
  Hivemall: SQL (arguably easier to use)
➢ Prediction Accuracy
  Almost the same on the MovieLens 10M dataset

Unsupervised Learning: Anomaly Detection
Sensor data etc.
Anomaly detection runs on a series of SQL queries
[Sample table, columns: rowid, features. Row 1: ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.…"]; remaining rows truncated]

Anomalies in a Sensor Data
Source: https://codeiq.jp/q/207

Image Source: https://en.wikipedia.org/wiki/Local_outlier_factor
Local Outlier Factor (LOF)
Basic idea of LOF: comparing the local density of a
point with the densities of its neighbors
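The density comparison can be made concrete with a compact Python sketch of the standard LOF definition (k-distance, reachability distance, local reachability density). It is a brute-force illustration that assumes no duplicate points and ignores distance ties; it is unrelated to how Hivemall expresses LOF in SQL. math.dist requires Python 3.8 or later.

```python
import math


def lof_scores(points, k):
    """Return the Local Outlier Factor of every point (brute force, O(n^2))."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])                 # k nearest neighbours of point i
        k_distance.append(dist[i][order[k - 1]])    # distance to the k-th neighbour

    def local_reachability_density(i):
        # Reachability distance of i from j: max(k-distance(j), d(i, j)).
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF: average density of the neighbours divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
```

Points inside the unit-square cluster get a score close to 1 (their density matches that of their neighbours), while the isolated point (5, 5) gets a score well above 1.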

DEMO: Local Outlier Factor
