Introduction to Apache Hivemall v0.5.0

Introduction to Apache Hivemall v0.5.0:
Machine Learning on Hive/Spark
Makoto YUI @myui
ApacheCon North America 2018
Takashi Yamamuro @maropu
@ApacheHivemall
1). Principal Engineer,
2). Research Engineer,

Plan of the talk
1. Introduction to Hivemall
Background, quick walk-through of features and usage,
what's new in v0.5.0, and future roadmap
2. Hivemall on Spark
New top-k join enhancement, and a plan
for supporting Spark 2.3 and feature selection
Slides available: bit.ly/hivemall-apachecon18

We made the first Apache release,
v0.5.0, on Mar 3rd, 2018!
hivemall.incubator.apache.org
We plan to start voting for the 2nd Apache release (v0.5.2)
next month (Oct 2018).

What’s new in v0.5.0?
Anomaly/Change Point Detection
Algorithm: ChangeFinder, SST
Topic Modeling (Soft Clustering)
Algorithm: LDA, pLSA
Hivemall on Spark 2.0/2.1/2.2
SparkSQL/DataFrame support,
Top-k data processing

Background
Suppose …
Running machine learning on massive data stored in a data warehouse
Make It!

Running machine learning on massive data stored in a data warehouse
Concerns: Scalability? Data movement? Tool?

Approach #1: Typical Data Scientist’s Solution
Data warehouse
Data preprocessing
Machine Learning
Small data?

Approach #2: Data Engineer’s Solution
Data warehouse
Data preprocessing
Machine Learning

Q: Is DataFrame a great idea
for data (pre-)processing?

Q: Do you like it?
(for production-ready data preprocessing)
□ Yes
□ No
□ Maybe
I like it for simple
data processing

Q: Do you really like it?
(for messy real-world data preprocessing)
□ Yes
□ No
□ Maybe

Real-world ML pipelines (could be more complex)
Join
Extract Feature
Datasource #1
Datasource #2
Datasource #3
Extract Feature
Feature Scaling
Feature Hashing
Feature Engineering
Feature Selection
Train by Logistic Regression
Train by RandomForest
Train by Factorization Machines
Ensemble
Evaluate
Predict

Q: Have you ever seen or written
hundreds to thousands of lines of
preprocessing code in DataFrame?
SQL queries of hundreds of lines for data preprocessing are common.

Q: Is it fun to play with?
(Scala/Python coding for trivial things)
Do you write test code?
In my personal opinion, notebook code is error-prone for production use

My Suggestion
Data warehouse
Data preprocessing
Machine Learning
+ Scalability
+ Durability/Stability
+ Functionalities (UDFs, JSON, Windowing functions)
Push more work back to the DB where the data resides
(including some ML logic)
One size does not fit all though ...

Machine Learning
in SQL queries

BigQuery ML at Google Cloud Next 2018
https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html

Could I use ML-in-SQL in my cluster?

Open-source Machine Learning Solution
for SQL-on-Hadoop
https://hivemall.apache.org (incubating)

What is Apache Hivemall
Scalable machine learning library built
as a collection of Hive UDFs
Multi/Cross platform, Versatile, Scalable, Ease-of-use

Hivemall is easy and scalable …
Ease-of-use: ML made easy for SQL developers
Scalable: Born to be parallel and scalable
100+ lines
of code
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This query automatically runs in parallel on Hadoop
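The map/reduce split in the query above can be imitated in plain Python. This is a minimal sketch of the idea only, not Hivemall's implementation: `train_partition` is a hypothetical stand-in for the map-side trainer (one SGD pass of logistic regression over a single split), and `average_models` plays the role of `GROUP BY feature` with `avg(weight)`. The data, learning rate, and function names are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def train_partition(rows, eta=0.1):
    """Hypothetical map-side trainer: one SGD pass of logistic
    regression over a single data split."""
    weights = defaultdict(float)
    for features, label in rows:  # features: {name: value}, label: 0 or 1
        margin = sum(weights[f] * v for f, v in features.items())
        predicted = 1.0 / (1.0 + math.exp(-margin))
        for f, v in features.items():
            weights[f] += eta * (label - predicted) * v
    return list(weights.items())  # emitted as (feature, weight) rows

def average_models(partial_models):
    """Reducer side: the equivalent of GROUP BY feature, avg(weight)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for model in partial_models:
        for feature, weight in model:
            totals[feature] += weight
            counts[feature] += 1
    return {f: totals[f] / counts[f] for f in totals}

# Two splits stand in for two map tasks (made-up data).
splits = [
    [({"bias": 1.0, "x": 2.0}, 1), ({"bias": 1.0, "x": -1.0}, 0)],
    [({"bias": 1.0, "x": 1.5}, 1), ({"bias": 1.0, "y": 1.0}, 0)],
]
lr_model = average_models(train_partition(split) for split in splits)
```

Each split is trained independently, so the training step needs no coordination; only the small (feature, weight) rows are shuffled to the averaging step.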

Hivemall is a multi/cross-platform ML library
HiveQL, SparkSQL/DataFrame API, Pig Latin
Prediction models built by Hive can be used from Spark, and conversely,
prediction models built by Spark can be used from Hive

Hivemall’s Technology Stack
Machine Learning: Hivemall, MLlib
Query Processing: Hive, Pig, SparkSQL
Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
Resource Management: Apache YARN, MESOS
Distributed File System: Hadoop HDFS
Cloud Storage: Amazon S3

Hivemall on Apache Hive

Hivemall on Apache Spark Dataframe

Hivemall on SparkSQL

Hivemall on Apache Pig

Online Prediction by Apache Streaming

List of Supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice
Try RandomForest if SCW does not work
Logistic regression is good for getting a
probability of a positive class
Factorization Machines work well where
features are sparse and categorical
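The remark about logistic regression yielding a probability can be made concrete: the weighted sum of feature values is passed through the sigmoid function. The sketch below assumes a hypothetical model held as a Python dict of (feature, weight) pairs; the feature names and weights are invented for illustration.

```python
import math

def predict_probability(model, features):
    """Sigmoid of the weighted sum of feature values.
    Features absent from the model contribute nothing."""
    score = sum(model.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical model rows (feature -> weight).
model = {"bias": -1.0, "price": 0.5, "clicked_before": 2.0}

# score = -1.0 + 0.5 * 2.0 + 2.0 * 1.0 = 2.0
p = predict_probability(model, {"bias": 1.0, "price": 2.0, "clicked_before": 1.0})
```

A score of zero maps to a probability of 0.5, so the sign of the weighted sum decides which class is more likely.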

Generic Classifier/Regressor
OLD Style / New Style from v0.5.0

Generic Classifier/Regressor
Available Loss functions
For Binary Classification:
• HingeLoss
• LogLoss (synonym: logistic)
• SquaredHingeLoss
• ModifiedHuberLoss
For Regression:
• Squared Loss
• Quantile Loss
• Epsilon Insensitive Loss
• Squared Epsilon Insensitive Loss
• Huber Loss
Optimizer
• SGD
• AdaGrad
• AdaDelta
• ADAM
Regularization
• L1
• L2
• ElasticNet
• RDA
Other options
• Iteration support
• mini-batch
• Early stopping
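For reference, several of the loss functions named for the generic classifier/regressor have standard textbook definitions, sketched below. These are the common forms, written under the assumption that classification labels are -1/+1 and that `p` is the raw model score; Hivemall's own implementation may use different scaling constants or defaults, and the function names and default parameters here are illustrative only.

```python
import math

# Binary classification: y is the true label in {-1, +1}, p is the raw score.
def hinge_loss(p, y):
    return max(0.0, 1.0 - y * p)

def log_loss(p, y):
    return math.log1p(math.exp(-y * p))

def squared_hinge_loss(p, y):
    return max(0.0, 1.0 - y * p) ** 2

def modified_huber_loss(p, y):
    z = y * p
    return max(0.0, 1.0 - z) ** 2 if z >= -1.0 else -4.0 * z

# Regression: y is the target value, p is the prediction.
def quantile_loss(p, y, tau=0.5):
    e = y - p
    return tau * e if e > 0 else (tau - 1.0) * e

def epsilon_insensitive_loss(p, y, epsilon=0.1):
    return max(0.0, abs(y - p) - epsilon)

def huber_loss(p, y, delta=1.0):
    r = abs(y - p)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
```

The classification losses all depend only on the margin `y * p`, which is why they can share one optimizer and one regularizer.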

RandomForest in Hivemall
Ensemble of Decision Trees

Training of RandomForest
Good news: Sparse vector input (LIBSVM
format) is supported since v0.5.0, in
addition to dense vector input.

Prediction of RandomForest

Decision Tree Visualization

XGBoost support in Hivemall (beta version)
SELECT train_xgboost_classifier(features, label) as (model_id, model)
FROM training_data;

SELECT rowid, AVG(predicted) as predicted
FROM (
  -- predict with each model
  SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
  -- join each test record with each model
  FROM xgboost_models CROSS JOIN test_data_with_id
) t
GROUP BY rowid;

Supported Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
The each_top_k function of Hivemall is useful for
recommending top-k items

Other Supported Algorithms
Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ Feature Binning
✓ TF-IDF vectorizer
✓ Polynomial Expansion
✓ Amplifier
NLP
✓ Basic English text Tokenizer
✓ English/Japanese/Chinese Tokenizer
Evaluation metrics
✓ AUC, nDCG, logloss, precision, recall@K, etc.

Feature Engineering – Feature Hashing

Feature Engineering – Feature Binning
Maps quantitative variables to a fixed number of
bins based on quantiles/distribution
Example: map ages into 3 bins
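Quantile-based binning of ages into 3 bins can be sketched as follows. This is an illustration of the technique only: `build_quantile_bins` and `assign_bin` are hypothetical helper names (not Hivemall UDFs), the ages are made up, and the split points use a simple nearest-rank rule rather than any interpolation.

```python
from bisect import bisect_right

def build_quantile_bins(values, num_bins):
    """Return num_bins - 1 split points taken at evenly spaced
    quantiles (nearest-rank, a simplification)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // num_bins] for i in range(1, num_bins)]

def assign_bin(value, split_points):
    """Index of the bin the value falls into (0 .. len(split_points))."""
    return bisect_right(split_points, value)

ages = [18, 22, 25, 31, 38, 44, 52, 60, 71]
splits = build_quantile_bins(ages, 3)           # [31, 52]
binned = [assign_bin(a, splits) for a in ages]  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Because the split points come from the data's own quantiles, each bin receives roughly the same number of rows even when the raw values are skewed.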


Evaluation Metrics

Other Supported Features
Anomaly Detection
✓ Local Outlier Factor (LoF)
✓ ChangeFinder
Clustering / Topic models
✓ Online mini-batch LDA
✓ Online mini-batch PLSA
Change Point Detection
✓ ChangeFinder
✓ Singular Spectrum Transformation

Anomaly/Change-point Detection by ChangeFinder
Efficient algorithm for finding change points and outliers in
time-series data
J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series", IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.

Take this…

…and do this!

Change-point detection by Singular Spectrum Transformation
• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.

Online mini-batch LDA

Probabilistic Latent Semantic Analysis - training

Probabilistic Latent Semantic Analysis - predict

What’s new in the coming v0.5.2
✓ Spark 2.3 support
✓ Merged Brickhouse UDFs
✓ Field-aware Factorization Machines
State-of-the-art method for CTR prediction, an algorithm often used in Kaggle
Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin, "Field-aware Factorization Machines for CTR Prediction", Proc. RecSys, 2016.
✓ SLIM recommendation
Very promising algorithm for top-k recommendation
Xia Ning and George Karypis, "SLIM: Sparse Linear Methods for Top-N Recommender Systems", Proc. ICDM, 2011.

Future work for v0.6 and later
✓ Word2Vec support
✓ Multi-class Logistic Regression
✓ More efficient XGBoost support
✓ LightGBM support
✓ Gradient Boosting
✓ Kafka KSQL UDF porting
PR#91
PR#116

Copyright©2018 NTT corp. All Rights Reserved.

// Downloads Spark v2.3 and launches a spark-shell with Hivemall

K-length priority queue
Computes top-K rows by using a priority queue
Only joins top-K rows
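The idea of keeping a K-length priority queue and joining only the surviving rows can be sketched in plain Python. This illustrates the general technique, not the Spark implementation described in the talk: `top_k_per_group`, the `users` lookup table, and the scored rows are hypothetical and exist only for this example.

```python
import heapq
from collections import defaultdict

def top_k_per_group(rows, k):
    """rows: iterable of (group, score, payload).
    Keeps at most k highest-scoring rows per group in a bounded min-heap."""
    heaps = defaultdict(list)
    for seq, (group, score, payload) in enumerate(rows):
        heap = heaps[group]
        entry = (score, seq, payload)  # seq breaks ties before payload is compared
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif score > heap[0][0]:       # better than the current k-th best
            heapq.heapreplace(heap, entry)
    return {g: sorted(h, reverse=True) for g, h in heaps.items()}

users = {"u1": "Alice", "u2": "Bob"}  # hypothetical table to join against
scored = [("u1", 0.9, "itemA"), ("u1", 0.2, "itemB"), ("u1", 0.7, "itemC"),
          ("u2", 0.4, "itemA"), ("u2", 0.8, "itemD")]

top2 = top_k_per_group(scored, k=2)
# Only the surviving top-K rows are joined with the users table.
joined = [(users[g], payload, score)
          for g, rows in sorted(top2.items())
          for score, _, payload in rows]
```

Memory per group stays bounded by K, and the join touches K rows per group instead of every scored row.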

Data Extraction (e.g., by SQL) Feature Selection (e.g., by scikit-learn)
Selected Features
Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice
about Joins before Feature Selection, Proceedings of SIGMOD, 2016.

Data Extraction + Feature Selection
Join Pruning by Data Statistics

Conclusion and Takeaway
Hivemall is a multi/cross-platform ML library
providing a collection of machine learning algorithms as Hive UDFs/UDTFs
HiveQL, SparkSQL/DataFrame API, Pig Latin
The 2nd Apache release (v0.5.2) will appear soon!
We welcome your contributions to Apache Hivemall :-)

Thank you! Questions?
Mentors wanted!
