Hivemall dbtechshowcase 20160713 #dbts2016

Machine Learning Made Easy
by using Hivemall
Research Engineer
Makoto YUI @myui
<myui@treasure-data.com>
bit.ly/hivemall
2016/07/13 DB tech showcase
1

➢2015/04 Joined Treasure Data, Inc.
➢1st Research Engineer in Treasure Data
➢My mission in TD is developing ML-as-a-Service
(MLaaS)
➢2010/04-2015/03 Senior Researcher at National
Institute of Advanced Industrial Science and
Technology, Japan.
➢Worked on a large-scale Machine Learning project
and Parallel Databases
➢2009/03 Ph.D. in Computer Science from NAIST
➢XML native database and Parallel Database systems
Who am I ?
2

Treasure Data Cloud Service
[Figure: architecture overview. Data sources: app log, sensor, Apache log, CRM, ERP, RDBMS. Data collectors: Treasure Agent, Embulk, Mobile SDK, JS SDK, embedded. Query engines: Hive (batch), Presto (ad hoc), plus machine learning. Outputs: PUSH, API, ODBC, JDBC to SQL Server, BI tools, data analysis, and external integrations.]
900,000 records stored per sec.
3

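Data Imported to Treasure Data
[Chart: records imported, in billions (y-axis 0 to 12,000), annotated with milestones: service launch, Series A funding, 100 customers, named a Gartner "Cool Vendor in Big Data", 5 trillion records, 10 trillion records]
Treasure Data in numbers (October 2014):
400,000 records imported per second
More than 10 trillion records imported in total
12 billion records sent per day by a single ad-tech customer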
4

1. What is Hivemall (short intro.)
2. Why Hivemall (motivations etc.)
3. How to use Hivemall
Agenda
5

What is Hivemall
Scalable machine learning library built as a collection of Hive
UDFs, licensed under the Apache License v2
[Figure: software stack, bottom to top]
Distributed File System: Hadoop HDFS
Resource Management: Apache YARN, MESOS
Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
Query Processing: Hive, Pig, SparkSQL
Machine Learning: Hivemall, MLlib
6

Won IDG’s InfoWorld 2014
Bossie Awards 2014: The best open source big data tools
InfoWorld's top picks in distributed data processing, data analytics, machine
learning, NoSQL databases, and the Hadoop ecosystem
(awarded along w/ Spark, Tez, Jupyter notebook, Pandas, Impala, Kafka)
bit.ly/hivemall-award
7

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1,
PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of
Weight Vectors (AROW)
✓ Soft Confidence Weighted
(SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓Logistic Regression (SGD)
✓PA Regression
✓AROW Regression
✓AdaGrad(logistic loss)
✓AdaDELTA (logistic loss)
✓Factorization Machines
✓RandomForest Regression
List of supported Algorithms
8

List of supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1,
PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of
Weight Vectors (AROW)
✓ Soft Confidence Weighted
(SCW)
✓ AdaGrad+RDA
✓ RandomForest Classification
Regression
✓Logistic Regression (SGD)
✓AdaGrad(logistic loss)
✓AdaDELTA (logistic loss)
✓PA Regression
✓AROW Regression
✓Factorization Machines
✓RandomForest Regression
SCW is a good first choice
Try RandomForest if SCW does
not work
Logistic regression is good for
estimating the probability of the
positive class
Factorization Machines work well
when features are sparse and
categorical
9

List of Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash
(LSH variant)
✓ Similarity Search on Vector
Space
(Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
(regression)
each_top_k function of Hivemall is
useful for recommending top-k items
10
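The idea behind recommending top-k items can be sketched in plain Python: for each group (for example a user), keep only the k highest-scoring items. This is only an illustration of the concept; the function name, arguments, and sample rows below are hypothetical and do not reflect the actual each_top_k signature.

```python
import heapq
from collections import defaultdict

def top_k_per_group(rows, k):
    """rows: iterable of (group, item, score).
    Returns {group: [(rank, item, score), ...]} with the highest score first."""
    grouped = defaultdict(list)
    for group, item, score in rows:
        grouped[group].append((score, item))
    result = {}
    for group, scored in grouped.items():
        best = heapq.nlargest(k, scored)  # k largest by score (ties broken by item)
        result[group] = [
            (rank, item, score) for rank, (score, item) in enumerate(best, start=1)
        ]
    return result

# Hypothetical (user, item, score) rows, e.g. similarity or predicted rating
rows = [
    ("user1", "itemA", 0.9),
    ("user1", "itemB", 0.4),
    ("user1", "itemC", 0.7),
    ("user2", "itemA", 0.2),
    ("user2", "itemD", 0.8),
]
top2 = top_k_per_group(rows, k=2)
```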

Other Supported Algorithms
Anomaly Detection
✓ Local Outlier Factor (LoF)
Feature Engineering
✓Feature Hashing
✓Feature Scaling
(normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion
(Feature Pairing)
✓ Amplifier
NLP
✓Basic English text Tokenizer
✓Japanese Tokenizer
(Kuromoji)
11
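Feature hashing, listed above, maps feature names to a fixed number of integer buckets so that the feature space has a bounded size. A minimal Python sketch follows, assuming CRC32 as the hash and 2**20 buckets; both are choices made for this illustration, not necessarily what Hivemall uses.

```python
import zlib

def hash_feature(name, num_buckets=2**20):
    """Map a feature name to a bucket index in [0, num_buckets)."""
    return zlib.crc32(name.encode("utf-8")) % num_buckets

def hash_features(features, num_buckets=2**20):
    """features: {name: value}. Returns {bucket_index: summed value}."""
    hashed = {}
    for name, value in features.items():
        idx = hash_feature(name, num_buckets)
        # Features that collide end up sharing one bucket
        hashed[idx] = hashed.get(idx, 0.0) + value
    return hashed

# Hypothetical feature names, loosely based on the feature vector shown later in the deck
vector = hash_features({"gender#man": 1.0, "age": 34.0, "height": 173.0})
```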

➢ CTR prediction of Ad click logs
• Freakout Inc., Fan Communications, and more
• Replaced Spark MLlib w/ Hivemall at company X
Industry use cases of Hivemall
http://www.slideshare.net/masakazusano75/sano-hmm-20150512
12

➢ Gender prediction of Ad click logs
• Scaleout Inc. and Fan Communications
http://eventdots.jp/eventreport/458208
13

➢ Value prediction of real estate
• Livesense
http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall
14

Source: http://itnp.net/article/2016/02/18/2286.html
15

➢ Churn Detection
• OISIX
http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix
16

17
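Churn prediction for a membership service
Problem
• Subscription purchases by 100,000 members drive the company's overall
revenue and profit, but there was no way to identify members at risk of
cancelling in advance or to prevent them from leaving
Approach
• Used machine learning on the past month of data to build a list of
customers likely to cancel within the next month
• Specifically, each step (training table creation -> normalization ->
model building -> logistic regression) was implemented simply as
queries using TD + Hivemall
Results
• Machine learning without statistical expertise
• Granting points to members on the churn prediction list halved the
churn rate
• Surfaced campaigns and events that carry churn risk, and made it
possible to identify behavior characteristic of members who stay
• Also enabled indirect service improvements, such as changing the UI
according to the level of risk
Data used for prediction (Web, Mobile)
• Attribute information, behavior logs, complaint information, referral
source, service usage information
Direct measures: granting points, care calls
Indirect measures: guiding users to a successful experience, UI changes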

➢ Recommendation
• Portal site
18

1. What is Hivemall (short intro.)
2. Why Hivemall (motivations etc.)
Agenda
19

Why Hivemall
1. In my experience working on ML, I used Hive
for preprocessing and Python (scikit-learn etc.)
for ML. This was INEFFICIENT and ANNOYING.
Also, Python is not as scalable as Hive.
2. Why not run ML algorithms inside Hive? Fewer
components to manage and more scalable.
That’s why I built Hivemall.
20

How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS
[Figure: raw data (HDFS, S3) -> Extract-Transform-Load -> feature vector file (height:173cm, weight:60kg, age:34, gender: man, …) -> Machine Learning]
21

[Figure: raw data (HDFS, S3) -> feature vector file (height:173cm, weight:60kg, age:34, gender: man, …) -> Machine Learning]
Need to do expensive
data preprocessing
(joins, filtering, and formatting of
data that does not fit in memory)
22

[Figure: raw data (HDFS, S3) -> feature vector file (height:173cm, weight:60kg, age:34, gender: man, …)]
Does not scale
Have to learn R/Python APIs
23

How I used to do ML before Hivemall
[Figure: raw data (HDFS, S3) -> feature vector (height:173cm, weight:60kg, age:34, gender: man, …)]
Does not meet my needs
in terms of scalability, ML algorithms, and usability
I ❤ scalable
SQL query
24

Framework                              User interface
Mahout                                 Java API programming
Spark MLlib/MLI                        Scala API programming, Scala shell (REPL)
H2O                                    R programming, GUI
Cloudera Oryx                          HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming)    C++ API programming, command line
Survey on existing ML frameworks
Existing distributed machine learning frameworks
are NOT easy to use
25

Hivemall’s Vision: ML on SQL
Classification with Mahout
CREATE TABLE lr_model AS
SELECT
  feature,
  -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓Machine Learning made easy for SQL
developers (ML for the rest of us)
✓Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in
parallel on Hadoop
26

Hivemall on Apache Spark
Installation is very easy as follows:
$ spark-shell --packages maropu:hivemall-spark:0.0.6
27

1. What is Hivemall
2. Why Hivemall
Agenda
28

How to use Hivemall
[Figure: ML workflow. Training: feature vectors and labels -> machine learning -> prediction model. Prediction: feature vector -> prediction model -> label.]
Current step: Data preparation
29

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
How to use Hivemall - Data preparation
Define a Hive table for training/testing data
30

How to use Hivemall
[Figure: ML workflow. Training: feature vectors and labels -> machine learning -> prediction model. Prediction: feature vector -> prediction model -> label.]
Current step: Feature Engineering
31

CREATE VIEW e2006tfidf_train_scaled
AS
SELECT
  rowid,
  -- the column is named label in the e2006tfidf_train table definition
  rescale(label,${min_label},${max_label}) as label,
  features
FROM
  e2006tfidf_train;
Applying a Min-Max Feature Normalization
How to use Hivemall - Feature Engineering
Transforming a label value
to a value between 0.0 and 1.0
32
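Min-max normalization maps a value v to (v - min) / (max - min), so the smallest value becomes 0.0 and the largest 1.0. A minimal Python sketch of that formula follows; the label values are hypothetical, and returning 0.5 for a degenerate range is a choice made for this sketch only.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization of a single value into [0.0, 1.0]."""
    if max_value == min_value:
        return 0.5  # degenerate range; arbitrary choice for this sketch
    return (value - min_value) / (max_value - min_value)

# Hypothetical label values; min and max would normally be computed
# over the whole training table first
labels = [-7.9, -3.6, -0.5]
min_label, max_label = min(labels), max(labels)
scaled = [rescale(v, min_label, max_label) for v in labels]
```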

How to use Hivemall
[Figure: ML workflow. Training: feature vectors and labels -> machine learning -> prediction model. Prediction: feature vector -> prediction model -> label.]
Current step: Training
33

How to use Hivemall - Training
CREATE TABLE lr_model AS
SELECT
feature,
avg(weight) as weight
FROM (
SELECT logress(features,label,..)
as (feature,weight)
FROM train
) t
GROUP BY feature
Training by logistic regression
Map-only task to learn a prediction model
Shuffle map outputs to reducers by feature
Reducers perform model averaging
in parallel
34
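The training flow above (each mapper learns a model on its own partition, then weights are averaged per feature) can be sketched in plain Python. This is an illustration under simple assumptions: plain SGD with a fixed learning rate, labels in {0, 1}, and a tiny hypothetical dataset. It is not the logress implementation.

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_partition(rows, eta=0.1, epochs=20):
    """SGD for logistic regression on one partition (the map-only step).
    rows: list of (features, label), features = {name: value}, label in {0, 1}."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            score = sum(weights[f] * v for f, v in features.items())
            gradient = label - sigmoid(score)
            for f, v in features.items():
                weights[f] += eta * gradient * v
    return dict(weights)

def average_models(models):
    """Average each feature's weight over the models that emitted it
    (the GROUP BY feature / avg(weight) step)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model in models:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}

# Two hypothetical partitions of the training data
partitions = [
    [({"a": 1.0, "bias": 1.0}, 1), ({"b": 1.0, "bias": 1.0}, 0)],
    [({"a": 1.0, "bias": 1.0}, 1), ({"b": 1.0, "bias": 1.0}, 0)],
]
model = average_models([train_partition(p) for p in partitions])
```

Feature "a" only appears with the positive label, so its averaged weight ends up positive; "b" ends up negative.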

How to use Hivemall - Training
CREATE TABLE news20b_cw_model1 AS
SELECT
feature,
voted_avg(weight) as weight
FROM
(SELECT
train_cw(features,label)
as (feature,weight)
FROM
news20b_train
) t
GROUP BY feature
Training of Confidence Weighted Classifier
Vote to use negative or positive
weights for avg
+0.7, +0.3, +0.2, -0.1, +0.7
Training for the CW classifier
35
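One reading of the voting step above: for each feature, count how many of the per-mapper weights are positive and how many are negative, then average only the weights on the winning side. The Python sketch below follows that reading; the tie-breaking rule and the handling of zero weights are assumptions, and the actual voted_avg may differ in such details.

```python
def voted_avg(weights):
    """Average the weights whose sign wins a majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0  # assumption: nothing to vote on
    # Assumption: ties go to the positive side
    winners = positives if len(positives) >= len(negatives) else negatives
    return sum(winners) / len(winners)

# Weights from the slide: four positive votes, one negative vote
w = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])  # approximately 0.475
```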

How to use Hivemall
[Figure: ML workflow. Training: feature vectors and labels -> machine learning -> prediction model. Prediction: feature vector -> prediction model -> label.]
Current step: Prediction
36

How to use Hivemall - Prediction
CREATE TABLE lr_predict
as
SELECT
t.rowid,
sigmoid(sum(m.weight)) as prob
FROM
testing_exploded t LEFT OUTER JOIN
lr_model m ON (t.feature = m.feature)
GROUP BY
t.rowid
Prediction is done by LEFT OUTER JOIN
between test data and prediction model
No need to load the entire model into memory
37
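The join-based prediction above can be mimicked in plain Python to show the semantics: each (rowid, feature) pair looks up its weight in the model, features missing from the model contribute nothing, and the summed weights per row go through the sigmoid. The data below is hypothetical. Unlike the Hive query, this sketch holds the model in an in-memory dict, so it only illustrates the logic, not the scalability.

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(testing_exploded, model):
    """testing_exploded: iterable of (rowid, feature); model: {feature: weight}.
    Returns {rowid: probability}."""
    totals = defaultdict(float)
    for rowid, feature in testing_exploded:
        weight = model.get(feature)  # None plays the role of NULL in the outer join
        # sum() in SQL skips NULLs; here a missing weight adds 0.0.
        # (A row with no matching feature yields 0.5 here, NULL in SQL.)
        totals[rowid] += weight if weight is not None else 0.0
    return {rowid: sigmoid(total) for rowid, total in totals.items()}

# Hypothetical model and exploded test rows
model = {"a": 1.0, "b": -2.0}
rows = [(1, "a"), (1, "unseen"), (2, "b")]
probs = predict(rows, model)
```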

Real-time prediction
[Figure: batch training on Hadoop (feature vectors and labels -> machine learning -> prediction model); the prediction model is exported for online prediction on an RDBMS (feature vector -> prediction model -> label)]
bit.ly/hivemall-rtp
bit.ly/hivemall-rtp
38

Export Prediction Model to an RDBMS
Any RDBMS
TD export
Periodical export is very easy
in Treasure Data
Prediction model (feature, weight):
103 -0.4896543622016907
104 -0.0955817922949791
105 0.12560302019119263
106 0.09214721620082855
39

Real-time Prediction on MySQL
SIGMOID(x) = 1.0 / (1.0 + exp(-x))
[Figure: feature vector -> prediction model -> label]
SELECT
sigmoid(sum(t.value * m.weight)) as prob
FROM
testing_exploded t LEFT OUTER JOIN
prediction_model m ON (t.feature = m.feature)
Online prediction on MySQL
Index lookups are very
efficient in RDBMSs
40
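The online prediction query above can be tried locally with Python's built-in sqlite3 module. Note the substitution: SQLite stands in for MySQL here, and sigmoid is registered as a user-defined function because it is not built in. The table names follow the slide; the model rows are the exported weights shown earlier, and the feature values are hypothetical.

```python
import math
import sqlite3

def sigmoid(x):
    """SIGMOID(x) = 1.0 / (1.0 + exp(-x)); NULL in, NULL out."""
    if x is None:
        return None
    return 1.0 / (1.0 + math.exp(-x))

def predict(model_rows, feature_rows):
    """model_rows: (feature, weight) pairs; feature_rows: (feature, value) pairs
    for one test example. Returns the predicted probability."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("sigmoid", 1, sigmoid)
    # The primary key gives an index on feature for the join lookups
    conn.execute(
        "CREATE TABLE prediction_model (feature INTEGER PRIMARY KEY, weight REAL)"
    )
    conn.execute("CREATE TABLE testing_exploded (feature INTEGER, value REAL)")
    conn.executemany("INSERT INTO prediction_model VALUES (?, ?)", model_rows)
    conn.executemany("INSERT INTO testing_exploded VALUES (?, ?)", feature_rows)
    (prob,) = conn.execute(
        "SELECT sigmoid(sum(t.value * m.weight)) AS prob "
        "FROM testing_exploded t LEFT OUTER JOIN prediction_model m "
        "ON (t.feature = m.feature)"
    ).fetchone()
    conn.close()
    return prob

model_rows = [
    (103, -0.4896543622016907),
    (104, -0.0955817922949791),
    (105, 0.12560302019119263),
    (106, 0.09214721620082855),
]
# Hypothetical test example; feature 999 is not in the model and is skipped by sum()
feature_rows = [(103, 1.0), (105, 2.0), (999, 1.0)]
prob = predict(model_rows, feature_rows)
```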

RandomForest in Hivemall
Ensemble of Decision Trees
41

44
https://console.treasuredata.com/jobs/75633717

Conclusion
Hivemall provides a collection of machine
learning algorithms as Hive UDFs/UDTFs
➢ For SQL users who need ML
➢ For those already using Hive
➢ Designed with ease of use and scalability in mind
Does not require coding, packaging, compiling, or
introducing a new programming language or APIs.
Hivemall’s Positioning
Treasure Data provides ML-as-a-Service using
the latest version of Hivemall
45

We support machine learning in the cloud
Any feature requests or questions?
46
