Introduction to Apache Hivemall v0.5.0:
Machine Learning on Hive/Spark
Makoto YUI @myui
ApacheCon North America 2018
Takashi Yamamuro @maropu
@ApacheHivemall
1). Principal Engineer,
2). Research Engineer,
Plan of the talk
1. Introduction to Hivemall
Background, a quick walk-through of features and usage, what's new in v0.5.0, and the future roadmap
2. Hivemall on Spark
New top-k join enhancement, and feature plans for supporting Spark 2.3 and feature selection
Slide available: bit.ly/hivemall-apachecon18
We made our first Apache release, v0.5.0, on Mar 3rd, 2018!
hivemall.incubator.apache.org
We plan to start voting on the 2nd Apache release (v0.5.2) next month (Oct 2018).
What's new in v0.5.0?
Anomaly/Change Point Detection
Algorithm: ChangeFinder, SST
Topic Modeling (Soft Clustering)
Algorithm: LDA, pLSA
Hivemall on Spark 2.0/2.1/2.2
SparkSQL/Dataframe support, Top-k data processing
Background
Suppose …
Running machine learning on massive data stored in a data warehouse
Make it!
Running machine learning on massive data stored in a data warehouse
Concerns: Scalability? Data movement? Tools?
Approach #1: Typical Data Scientist's Solution
Data warehouse
Data preprocessing
Machine Learning
Small data?
Approach #2: Data Engineer's Solution
Data warehouse
Data preprocessing
Machine Learning
Q: Is Dataframe a great idea for data (pre-)processing?
Q: Do you like it? (for production-ready data preprocessing)
• Yes
• No
• Maybe
I like it for simple data processing
Q: Do you really like it? (for messy real-world data preprocessing)
• Yes
• No
• Maybe
Real-world ML pipelines (could be more complex)
Data sources: Datasource #1, Datasource #2, Datasource #3
Preprocessing: Join, Extract Feature, Feature Scaling, Feature Hashing, Feature Engineering, Feature Selection
Training: Logistic Regression, RandomForest, Factorization Machines
Afterwards: Ensemble, Evaluate, Predict
Q: Have you ever seen or written hundreds to thousands of lines of preprocessing in Dataframe?
SQL queries running to hundreds of lines for data preprocessing are common.
Q: Is it fun to play with?
(Scala/Python coding for trivial things)
Do you write test code?
In my opinion, notebook code is error-prone for production use.
My Suggestion
Data warehouse
Data preprocessing
Machine Learning
+ Scalability
+ Durability/Stability
+ Functionalities (UDFs, JSON, windowing functions)
Push more work back to the DB where the data resides (including some ML logic)
One size does not fit all, though ...
Machine Learning in SQL queries
BigQuery ML at Google Cloud Next 2018
https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html
Could I use ML-in-SQL in my cluster?
Open-source Machine Learning Solution for SQL-on-Hadoop
https://hivemall.apache.org (incubating)
What is Apache Hivemall
Scalable machine learning library built as a collection of Hive UDFs
Multi/Cross-platform, Versatile, Scalable, Ease-of-use
Hivemall is easy and scalable …
Ease-of-use: ML made easy for SQL developers
Scalable: Born to be parallel and scalable
100+ lines of code
CREATE TABLE lr_model AS
SELECT
feature, -- reducers perform model averaging in parallel
avg(weight) as weight
FROM (
SELECT logress(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This query automatically runs in parallel on Hadoop
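The query above trains a model per map task and lets the reducers average the weights per feature. A minimal sketch of that idea in plain Python (not HiveQL, so it runs without a cluster) is shown below. The function names `train_on_partition` and `average_models`, the learning rate, and the toy data are assumptions made for illustration; they are not Hivemall APIs, and the real `logress` UDTF may differ in its update rule.

```python
import math
from collections import defaultdict


def train_on_partition(rows, eta=0.1, epochs=20):
    """Map side (hypothetical stand-in for a training UDTF): fit logistic
    regression with plain SGD on one data partition.

    rows is a list of (features, label) pairs, where features is a list of
    feature names with an implicit value of 1.0 and label is 0 or 1.
    Returns (feature, weight) pairs, like the rows a training UDTF emits.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            score = sum(weights[f] for f in features)
            predicted = 1.0 / (1.0 + math.exp(-score))
            gradient = predicted - label
            for f in features:
                weights[f] -= eta * gradient
    return list(weights.items())


def average_models(partial_models):
    """Reduce side: the equivalent of GROUP BY feature with avg(weight)."""
    grouped = defaultdict(list)
    for model in partial_models:
        for feature, weight in model:
            grouped[feature].append(weight)
    return {f: sum(ws) / len(ws) for f, ws in grouped.items()}


# Two partitions play the role of two map tasks.
partitions = [
    [(["sunny", "warm"], 1), (["rainy", "cold"], 0)],
    [(["sunny", "cold"], 1), (["rainy", "warm"], 0)],
]
model = average_models(train_on_partition(p) for p in partitions)
```

Because each partition is trained independently, the map side needs no coordination, and the only shuffle is the per-feature grouping of weights.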
Hivemall is multi/cross-platform …
Hivemall is a multi/cross-platform ML library: HiveQL, SparkSQL/Dataframe API, Pig Latin
Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive.
Hivemall's Technology Stack
Machine Learning: Hivemall, MLlib
Query Processing: Hive, Pig, SparkSQL
Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
Resource Management: Apache YARN, MESOS
Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3
Hivemall on Apache Hive
Hivemall on Apache Spark Dataframe
Hivemall on SparkSQL
Hivemall on Apache Pig
Online Prediction by Apache Streaming
List of Supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice. Try RandomForest if SCW does not work.
Logistic regression is good for getting the probability of a positive class.
Factorization Machines are good where features are sparse and categorical.
Generic Classifier/Regressor
Old style vs. new style (from v0.5.0)
Generic Classifier/Regressor
Available loss functions
For Binary Classification:
• HingeLoss
• LogLoss (synonym: logistic)
• SquaredHingeLoss
• ModifiedHuberLoss
For Regression:
• Squared Loss
• Quantile Loss
• Epsilon Insensitive Loss
• Squared Epsilon Insensitive Loss
• Huber Loss
Optimizer
• SGD
• AdaGrad
• AdaDelta
• ADAM
Regularization
• L1
• L2
• ElasticNet
• RDA
Other options
• Iteration support
• mini-batch
• Early stopping
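For readers unfamiliar with the loss functions listed for the generic classifier and regressor, the sketch below writes out their common textbook definitions in Python. These are assumptions for illustration only: the function names, the default values of `tau`, `epsilon`, and `delta`, and the constant factors (for example the 1/2 in the squared loss) are not taken from Hivemall and may differ from its implementation.

```python
import math


# Classification losses: p is the raw score, y is the label in {-1, +1}.
def hinge_loss(p, y):
    return max(0.0, 1.0 - y * p)


def log_loss(p, y):
    return math.log1p(math.exp(-y * p))


def squared_hinge_loss(p, y):
    return hinge_loss(p, y) ** 2


def modified_huber_loss(p, y):
    # Quadratic near the margin, linear for badly misclassified points.
    z = y * p
    return hinge_loss(p, y) ** 2 if z >= -1.0 else -4.0 * z


# Regression losses: p is the prediction, y is the real-valued target.
def squared_loss(p, y):
    return 0.5 * (p - y) ** 2


def quantile_loss(p, y, tau=0.5):
    # Penalizes under-prediction by tau and over-prediction by (1 - tau).
    e = y - p
    return tau * e if e > 0 else (tau - 1.0) * e


def epsilon_insensitive_loss(p, y, epsilon=0.1):
    # Errors smaller than epsilon cost nothing.
    return max(0.0, abs(y - p) - epsilon)


def squared_epsilon_insensitive_loss(p, y, epsilon=0.1):
    return epsilon_insensitive_loss(p, y, epsilon) ** 2


def huber_loss(p, y, delta=1.0):
    # Quadratic for small errors, linear beyond delta.
    e = abs(y - p)
    return 0.5 * e * e if e <= delta else delta * (e - 0.5 * delta)
```

The linear tails of the modified Huber and Huber losses are what make them less sensitive to outliers than their squared counterparts.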
RandomForest in Hivemall
Ensemble of Decision Trees
Training of RandomForest
Good news: Sparse vector input (LIBSVM format) is supported since v0.5.0, in addition to dense vector input.
Prediction of RandomForest
Decision Tree Visualization
XGBoost support in Hivemall (beta version)
SELECT train_xgboost_classifier(features, label) as (model_id, model)
FROM training_data;

SELECT rowid, AVG(predicted) as predicted
FROM (
  -- predict with each model
  SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
  -- join each test record with each model
  FROM xgboost_models CROSS JOIN test_data_with_id
) t
GROUP BY rowid;
Supported Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
The each_top_k function of Hivemall is useful for recommending top-k items.
Other Supported Algorithms
Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ Feature Binning
✓ TF-IDF vectorizer
✓ Polynomial Expansion
✓ Amplifier
NLP
✓ Basic English text Tokenizer
✓ English/Japanese/Chinese Tokenizer
Evaluation metrics
✓ AUC, nDCG, logloss, precision/recall@K, etc.
Feature Engineering – Feature Hashing
Feature Engineering – Feature Binning
Maps quantitative variables to a fixed number of bins based on quantiles/distribution
Example: map ages into 3 bins
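Quantile-based binning, as described on the feature binning slide, can be sketched in a few lines of Python. This is an illustration of the general technique, not Hivemall's implementation: the helper names `quantile_cut_points` and `assign_bin` and the sample ages are hypothetical, and the interpolation rule (the standard library's "inclusive" method) is an assumption.

```python
import bisect
import statistics


def quantile_cut_points(values, num_bins):
    """Return the num_bins - 1 interior cut points that split the values
    into quantile-based bins."""
    return statistics.quantiles(values, n=num_bins, method="inclusive")


def assign_bin(value, cut_points):
    """Map a value to a bin index in the range 0 .. len(cut_points)."""
    return bisect.bisect_right(cut_points, value)


# Map ages into 3 bins.
ages = [18, 22, 25, 31, 38, 44, 52, 60, 71]
cuts = quantile_cut_points(ages, 3)
binned = [assign_bin(age, cuts) for age in ages]
```

Computing the cut points once and reusing them keeps the bin boundaries identical between training and prediction data.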
Evaluation Metrics
Other Supported Features
Anomaly Detection
✓ Local Outlier Factor (LOF)
✓ ChangeFinder
Clustering / Topic models
✓ Online mini-batch LDA
✓ Online mini-batch PLSA
Change Point Detection
✓ ChangeFinder
✓ Singular Spectrum Transformation
Anomaly/Change-point Detection by ChangeFinder
Efficient algorithm for finding change points and outliers in time-series data
J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series", IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
Take this… and do this!
Change-point detection by Singular Spectrum Transformation
• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
Online mini-batch LDA
Probabilistic Latent Semantic Analysis - training
Probabilistic Latent Semantic Analysis - predict
What's new in the coming v0.5.2
✓ Spark 2.3 support
✓ Merged Brickhouse UDFs
✓ Field-aware Factorization Machines: state-of-the-art method for CTR prediction, often used in Kaggle competitions
✓ SLIM recommendation: very promising algorithm for top-k recommendation
Xia Ning and George Karypis, "SLIM: Sparse Linear Methods for Top-N Recommender Systems", Proc. ICDM, 2011.
Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin, "Field-aware Factorization Machines for CTR Prediction", Proc. RecSys, 2016.
Future work for v0.6 and later
✓ Word2Vec support
✓ Multi-class Logistic Regression
✓ More efficient XGBoost support
✓ LightGBM support
✓ Gradient Boosting
✓ Kafka KSQL UDF porting
PR#91
PR#116
Copyright©2018 NTT corp. All Rights Reserved.
// Downloads Spark v2.3 and launches a spark-shell with Hivemall
K-length
priority queue
Computes top-K rows
by using a priority queue
Only joins top-K rows
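The surviving slide text describes keeping a K-length priority queue per group and joining only the top-K rows. A minimal sketch of that idea in Python follows. It is an assumption about the general technique, not the Hivemall on Spark implementation: the function name `top_k_per_group`, the tuple layout, the sample rows, and the `item_names` lookup table are all hypothetical.

```python
import heapq
from collections import defaultdict


def top_k_per_group(rows, k):
    """Keep the k highest-scoring rows per group with a K-length min-heap.

    rows is an iterable of (group, score, payload) tuples. The heap root is
    always the weakest row kept so far, so each new row costs O(log k).
    """
    heaps = defaultdict(list)
    for seq, (group, score, payload) in enumerate(rows):
        heap = heaps[group]
        entry = (score, seq, payload)  # seq keeps payloads from being compared
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif score > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return {
        group: [(score, payload) for score, _, payload in sorted(heap, reverse=True)]
        for group, heap in heaps.items()
    }


rows = [
    ("u1", 0.9, "itemA"), ("u1", 0.2, "itemB"), ("u1", 0.7, "itemC"),
    ("u2", 0.4, "itemA"), ("u2", 0.8, "itemB"),
]
top = top_k_per_group(rows, 2)

# Only the surviving top-k rows are joined against the (hypothetical) item table.
item_names = {"itemA": "Alpha", "itemB": "Beta", "itemC": "Gamma"}
joined = [
    (group, score, item_names[item])
    for group, ranked in top.items()
    for score, item in ranked
]
```

Filtering before the join means the join touches at most K rows per group instead of every candidate row, which is the saving the slide points at.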
Data Extraction (e.g., by SQL) Feature Selection (e.g., by scikit-learn)
Selected Features
Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice
about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
Data Extraction + Feature Selection
Join Pruning by Data Statistics
Conclusion and Takeaway
Hivemall is a multi/cross-platform ML library providing a collection of machine learning algorithms as Hive UDFs/UDTFs
HiveQL, SparkSQL/Dataframe API, Pig Latin
The 2nd Apache release (v0.5.2) will appear soon!
We welcome your contributions to Apache Hivemall :-)
Thank you! Questions?
Mentors wanted!

More Related Content

What's hot

Evaluation of TPC-H on Spark and Spark SQL in ALOJA
Evaluation of TPC-H on Spark and Spark SQL in ALOJAEvaluation of TPC-H on Spark and Spark SQL in ALOJA
Evaluation of TPC-H on Spark and Spark SQL in ALOJA
DataWorks Summit
 
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance ManagementAn AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
Databricks
 

What's hot (20)

Evaluation of TPC-H on Spark and Spark SQL in ALOJA
Evaluation of TPC-H on Spark and Spark SQL in ALOJAEvaluation of TPC-H on Spark and Spark SQL in ALOJA
Evaluation of TPC-H on Spark and Spark SQL in ALOJA
 
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4jExtending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
 
Oracle Spatial Studio: Fast and Easy Spatial Analytics and Maps
Oracle Spatial Studio:  Fast and Easy Spatial Analytics and MapsOracle Spatial Studio:  Fast and Easy Spatial Analytics and Maps
Oracle Spatial Studio: Fast and Easy Spatial Analytics and Maps
 
Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!
 
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
 
Gain Insights with Graph Analytics
Gain Insights with Graph Analytics Gain Insights with Graph Analytics
Gain Insights with Graph Analytics
 
DMM.comラボはなぜSparkを採用したのか?レコメンドエンジン開発の裏側をお話します!
DMM.comラボはなぜSparkを採用したのか?レコメンドエンジン開発の裏側をお話します!DMM.comラボはなぜSparkを採用したのか?レコメンドエンジン開発の裏側をお話します!
DMM.comラボはなぜSparkを採用したのか?レコメンドエンジン開発の裏側をお話します!
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
 
PGQL: A Language for Graphs
PGQL: A Language for GraphsPGQL: A Language for Graphs
PGQL: A Language for Graphs
 
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBase
 
Flink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaFlink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at Alibaba
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
Guacamole Fiesta: What do avocados and databases have in common?
Guacamole Fiesta: What do avocados and databases have in common?Guacamole Fiesta: What do avocados and databases have in common?
Guacamole Fiesta: What do avocados and databases have in common?
 
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
 
Build Knowledge Graphs with Oracle RDF to Extract More Value from Your Data
Build Knowledge Graphs with Oracle RDF to Extract More Value from Your DataBuild Knowledge Graphs with Oracle RDF to Extract More Value from Your Data
Build Knowledge Graphs with Oracle RDF to Extract More Value from Your Data
 
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance ManagementAn AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
 

Similar to Introduction to Apache Hivemall v0.5.0

Advanced Analytics in Hadoop
Advanced Analytics in HadoopAdvanced Analytics in Hadoop
Advanced Analytics in Hadoop
AnalyticsWeek
 

Similar to Introduction to Apache Hivemall v0.5.0 (20)

Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
 
Idea behind Apache Hivemall
Idea behind Apache HivemallIdea behind Apache Hivemall
Idea behind Apache Hivemall
 
Diagnose Your Microservices
Diagnose Your MicroservicesDiagnose Your Microservices
Diagnose Your Microservices
 
Jfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocksJfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocks
 
AI at Scale
AI at ScaleAI at Scale
AI at Scale
 
GPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a ServiceGPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a Service
 
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
FPGA Acceleration of Apache Spark on AWS
FPGA Acceleration of Apache Spark on AWSFPGA Acceleration of Apache Spark on AWS
FPGA Acceleration of Apache Spark on AWS
 
Leveraging Apache Spark to Develop AI-Enabled Products and Services at Bosch
Leveraging Apache Spark to Develop AI-Enabled Products and Services at BoschLeveraging Apache Spark to Develop AI-Enabled Products and Services at Bosch
Leveraging Apache Spark to Develop AI-Enabled Products and Services at Bosch
 
From Hadoop to Enterprise Data Warehouse
From Hadoop to Enterprise Data WarehouseFrom Hadoop to Enterprise Data Warehouse
From Hadoop to Enterprise Data Warehouse
 
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
 
Advanced Analytics in Hadoop
Advanced Analytics in HadoopAdvanced Analytics in Hadoop
Advanced Analytics in Hadoop
 
The Oracle Autonomous Database
The Oracle Autonomous DatabaseThe Oracle Autonomous Database
The Oracle Autonomous Database
 
Sloupcové uložení dat a použití in-memory technologií u řešení Exadata
Sloupcové uložení dat a použití in-memory technologií u řešení ExadataSloupcové uložení dat a použití in-memory technologií u řešení Exadata
Sloupcové uložení dat a použití in-memory technologií u řešení Exadata
 
“Quantum” Performance Effects: beyond the Core
“Quantum” Performance Effects: beyond the Core“Quantum” Performance Effects: beyond the Core
“Quantum” Performance Effects: beyond the Core
 
Apache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApacheApache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to Apache
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)
 
Hyderabad Mar 2019 - Autonomous Database
Hyderabad Mar 2019 - Autonomous DatabaseHyderabad Mar 2019 - Autonomous Database
Hyderabad Mar 2019 - Autonomous Database
 

More from Makoto Yui

More from Makoto Yui (20)

Apache Hivemall and my OSS experience
Apache Hivemall and my OSS experienceApache Hivemall and my OSS experience
Apache Hivemall and my OSS experience
 
Introduction to Apache Hivemall v0.5.2 and v0.6
Introduction to Apache Hivemall v0.5.2 and v0.6Introduction to Apache Hivemall v0.5.2 and v0.6
Introduction to Apache Hivemall v0.5.2 and v0.6
 
What's new in Hivemall v0.5.0
What's new in Hivemall v0.5.0What's new in Hivemall v0.5.0
What's new in Hivemall v0.5.0
 
What's new in Apache Hivemall v0.5.0
What's new in Apache Hivemall v0.5.0What's new in Apache Hivemall v0.5.0
What's new in Apache Hivemall v0.5.0
 
Revisiting b+-trees
Revisiting b+-treesRevisiting b+-trees
Revisiting b+-trees
 
Incubating Apache Hivemall
Incubating Apache HivemallIncubating Apache Hivemall
Incubating Apache Hivemall
 
Apache Hivemall @ Apache BigData '17, Miami
Apache Hivemall @ Apache BigData '17, MiamiApache Hivemall @ Apache BigData '17, Miami
Apache Hivemall @ Apache BigData '17, Miami
 
機械学習のデータ並列処理@第7回BDI研究会
機械学習のデータ並列処理@第7回BDI研究会機械学習のデータ並列処理@第7回BDI研究会
機械学習のデータ並列処理@第7回BDI研究会
 
Podling Hivemall in the Apache Incubator
Podling Hivemall in the Apache IncubatorPodling Hivemall in the Apache Incubator
Podling Hivemall in the Apache Incubator
 
Dots20161029 myui
Dots20161029 myuiDots20161029 myui
Dots20161029 myui
 
Hadoopsummit16 myui
Hadoopsummit16 myuiHadoopsummit16 myui
Hadoopsummit16 myui
 
HadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myuiHadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myui
 
3rd Hivemall meetup
3rd Hivemall meetup3rd Hivemall meetup
3rd Hivemall meetup
 
Recommendation 101 using Hivemall
Recommendation 101 using HivemallRecommendation 101 using Hivemall
Recommendation 101 using Hivemall
 
Hivemall dbtechshowcase 20160713 #dbts2016
Hivemall dbtechshowcase 20160713 #dbts2016Hivemall dbtechshowcase 20160713 #dbts2016
Hivemall dbtechshowcase 20160713 #dbts2016
 
Introduction to Hivemall
Introduction to HivemallIntroduction to Hivemall
Introduction to Hivemall
 
Tdtechtalk20160425myui
Tdtechtalk20160425myuiTdtechtalk20160425myui
Tdtechtalk20160425myui
 
Tdtechtalk20160330myui
Tdtechtalk20160330myuiTdtechtalk20160330myui
Tdtechtalk20160330myui
 
Datascientistsymp1113
Datascientistsymp1113Datascientistsymp1113
Datascientistsymp1113
 
2nd Hivemall meetup 20151020
2nd Hivemall meetup 201510202nd Hivemall meetup 20151020
2nd Hivemall meetup 20151020
 

Recently uploaded

一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 

Recently uploaded (20)

Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Introduction to Apache Hivemall v0.5.0

  • 1. Introduction to Apache Hivemall v0.5.0: Machine Learning on Hive/Spark. Makoto YUI (@myui), Principal Engineer; Takashi Yamamuro (@maropu), Research Engineer. @ApacheHivemall. ApacheCon North America 2018.
  • 2. Plan of the talk. (1) Introduction to Hivemall: background, a quick walk-through of features and usage, what's new in v0.5.0, and the future roadmap. (2) Hivemall on Spark: the new top-k join enhancement, and a feature plan for supporting Spark 2.3 and feature selection. Slides available: bit.ly/hivemall-apachecon18
  • 3. We released the first Apache release, v0.5.0, on Mar 3rd, 2018! hivemall.incubator.apache.org. We plan to start voting for the 2nd Apache release (v0.5.2) next month (Oct 2018).
  • 4. What’s new in v0.5.0? Anomaly/Change Point Detection (algorithms: ChangeFinder, SST); Topic Modeling, i.e. soft clustering (algorithms: LDA, pLSA); Hivemall on Spark 2.0/2.1/2.2 (SparkSQL/DataFrame support, top-k data processing).
  • 5. Background. Suppose … running machine learning on massive data stored in a data warehouse. Make it!
  • 6. Running machine learning on massive data stored in a data warehouse. Concerns: Scalability? Data movement? Tool?
  • 7. Approach #1: Typical Data Scientist’s Solution (Data warehouse, Data preprocessing, Machine Learning). Small data?
  • 8. Approach #2: Data Engineer’s Solution (Data warehouse, Data preprocessing, Machine Learning).
  • 9. Q: Is DataFrame a great idea for data (pre-)processing?
  • 10. Q: Do you like it (for production-ready data preprocessing)? Yes / No / Maybe. I like it for simple data processing.
  • 11. Q: Do you really like it (for messy real-world data preprocessing)? Yes / No / Maybe.
  • 12. Real-world ML pipelines (could be more complex). Diagram stages: Datasource #1, #2, #3; Join; Extract Feature; Feature Scaling; Feature Hashing; Feature Engineering; Feature Selection; Train by Logistic Regression; Train by RandomForest; Train by Factorization Machines; Ensemble; Evaluate; Predict.
  • 13. Q: Have you ever seen or written hundreds to thousands of lines of preprocessing in DataFrame? Hundreds of lines of SQL queries for data preprocessing are commonly seen.
  • 14. Q: Is it fun to play with (Scala/Python coding for trivial things)? Do you write test code? In my personal opinion, notebook code is error-prone for production use.
  • 15. My suggestion: push more work back to the DB where the data resides (including some ML logic). Benefits: + Scalability, + Durability/Stability, + Functionality (UDFs, JSON, windowing functions). One size does not fit all, though ... (Diagram: Data warehouse, Data preprocessing, Machine Learning.)
  • 16. Machine Learning in SQL queries
  • 17. BigQuery ML at Google I/O 2018. https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html
  • 18. Could I use ML-in-SQL in my cluster?
  • 19. Open-source Machine Learning Solution for SQL-on-Hadoop. https://hivemall.apache.org (incubating)
  • 20. What is Apache Hivemall? A scalable machine learning library built as a collection of Hive UDFs. Key properties: Multi/Cross platform, Versatile, Scalable, Ease-of-use.
  • 21. Hivemall is easy and scalable … Ease-of-use: ML made easy for SQL developers. Scalable: born to be parallel and scalable. 100+ lines of code. This query automatically runs in parallel on Hadoop:
      CREATE TABLE lr_model AS
      SELECT
        feature, -- reducers perform model averaging in parallel
        avg(weight) as weight
      FROM (
        SELECT logress(features,label,..) as (feature,weight)
        FROM train
      ) t -- map-only task
      GROUP BY feature; -- shuffled to reducers
  • 22. Hivemall is a multi/cross-platform ML library: HiveQL, SparkSQL/DataFrame API, Pig Latin. Prediction models built by Hive can be used from Spark and, conversely, prediction models built by Spark can be used from Hive.
  • 23. Hivemall’s Technology Stack. Machine Learning: Hivemall, MLlib. Query Processing: Hive, Pig, SparkSQL. Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark. Resource Management: Apache YARN, MESOS. Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3.
  • 24. Hivemall on Apache Hive
  • 25. Hivemall on Apache Spark Dataframe
  • 26. Hivemall on SparkSQL
  • 27. Hivemall on Apache Pig
  • 28. Online Prediction by Apache Streaming
  • 29. List of Supported Algorithms. Classification: ✓ Perceptron, ✓ Passive Aggressive (PA, PA1, PA2), ✓ Confidence Weighted (CW), ✓ Adaptive Regularization of Weight Vectors (AROW), ✓ Soft Confidence Weighted (SCW), ✓ AdaGrad+RDA, ✓ Factorization Machines, ✓ RandomForest Classification. Regression: ✓ Logistic Regression (SGD), ✓ AdaGrad (logistic loss), ✓ AdaDELTA (logistic loss), ✓ PA Regression, ✓ AROW Regression, ✓ Factorization Machines, ✓ RandomForest Regression. Tips: SCW is a good first choice; try RandomForest if SCW does not work. Logistic regression is good for getting the probability of the positive class. Factorization Machines work well where features are sparse and categorical.
  • 30. Generic Classifier/Regressor. Old style; new style from v0.5.0.
  • 31. Generic Classifier/Regressor. Available loss functions for binary classification: HingeLoss, LogLoss (synonym: logistic), SquaredHingeLoss, ModifiedHuberLoss. Available loss functions for regression: Squared Loss, Quantile Loss, Epsilon Insensitive Loss, Squared Epsilon Insensitive Loss, Huber Loss. Optimizer: SGD, AdaGrad, AdaDelta, ADAM. Regularization: L1, L2, ElasticNet, RDA. Other options: iteration support, mini-batch, early stopping.
  • 32. RandomForest in Hivemall: Ensemble of Decision Trees
  • 33. Training of RandomForest. Good news: sparse vector input (LIBSVM format) is supported since v0.5.0, in addition to dense vector input.
  • 34. Prediction of RandomForest
  • 35. Decision Tree Visualization
  • 36. Decision Tree Visualization
  • 37. XGBoost support in Hivemall (beta version). Training:
      SELECT train_xgboost_classifier(features, label) as (model_id, model)
      FROM training_data;
    Prediction:
      SELECT rowid, AVG(predicted) as predicted
      FROM (
        -- predict with each model
        SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
        -- join each test record with each model
        FROM xgboost_models CROSS JOIN test_data_with_id
      ) t
      GROUP BY rowid;
  • 38. Supported Algorithms for Recommendation. K-Nearest Neighbor: ✓ Minhash and b-Bit Minhash (LSH variant), ✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular). Matrix Completion: ✓ Matrix Factorization, ✓ Factorization Machines (regression). The each_top_k function of Hivemall is useful for recommending top-k items.
  • 39. Other Supported Algorithms. Feature Engineering: ✓ Feature Hashing, ✓ Feature Scaling (normalization, z-score), ✓ Feature Binning, ✓ TF-IDF vectorizer, ✓ Polynomial Expansion, ✓ Amplifier. NLP: ✓ Basic English text Tokenizer, ✓ English/Japanese/Chinese Tokenizer. Evaluation metrics: ✓ AUC, nDCG, logloss, precision/recall@K, etc.
  • 40. Feature Engineering – Feature Hashing
  • 41. Feature Engineering – Feature Binning. Maps quantitative variables to a fixed number of bins based on quantiles/distribution. Example: map ages into 3 bins.
  • 42. Feature Engineering – Feature Binning
  • 44. Other Supported Features. Anomaly Detection: ✓ Local Outlier Factor (LoF), ✓ ChangeFinder. Clustering / Topic models: ✓ Online mini-batch LDA, ✓ Online mini-batch PLSA. Change Point Detection: ✓ ChangeFinder, ✓ Singular Spectrum Transformation.
  • 45. Anomaly/Change-point Detection by ChangeFinder: an efficient algorithm for finding change points and outliers in time-series data. J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series", IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
  • 46. Anomaly/Change-point Detection by ChangeFinder: take this…
  • 47. Anomaly/Change-point Detection by ChangeFinder: …and do this!
  • 48. Change-point detection by Singular Spectrum Transformation. References: T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005; T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
  • 49. Online mini-batch LDA
  • 50. Probabilistic Latent Semantic Analysis - training
  • 51. Probabilistic Latent Semantic Analysis - predict
  • 52. What’s new in the coming v0.5.2: ✓ Spark 2.3 support; ✓ Merged Brickhouse UDFs; ✓ Field-aware Factorization Machines, a state-of-the-art method for CTR prediction that is often used on Kaggle (Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin, "Field-aware Factorization Machines for CTR Prediction", Proc. RecSys, 2016); ✓ SLIM recommendation, a very promising algorithm for top-k recommendation (Xia Ning and George Karypis, "SLIM: Sparse Linear Methods for Top-N Recommender Systems", Proc. ICDM, 2011).
  • 53. Future work for v0.6 and later: ✓ Word2Vec support, ✓ Multi-class Logistic Regression, ✓ More efficient XGBoost support, ✓ LightGBM support, ✓ Gradient Boosting, ✓ Kafka KSQL UDF porting. PR#91, PR#116.
  • 54. Copyright©2018 NTT corp. All Rights Reserved.
  • 55-64. [Slide text not recoverable from the extraction.]
  • 65. // Downloads Spark v2.3 and launches a spark-shell with Hivemall [remaining slide text not recoverable from the extraction]
  • 66-74. [Slide text not recoverable from the extraction.]
  • 75. K-length priority queue: computes top-K rows by using a priority queue.
  • 76. K-length priority queue: computes top-K rows by using a priority queue; only joins top-K rows.
  • 77-80. [Slide text not recoverable from the extraction.]
  • 81. Data Extraction (e.g., by SQL), then Feature Selection (e.g., by scikit-learn), yielding Selected Features. Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, "To Join or Not to Join?: Thinking Twice about Joins before Feature Selection", Proceedings of SIGMOD, 2016.
  • 82. Data Extraction + Feature Selection: Join Pruning by Data Statistics. Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, "To Join or Not to Join?: Thinking Twice about Joins before Feature Selection", Proceedings of SIGMOD, 2016.
  • 83. Conclusion and Takeaway. Hivemall is a multi/cross-platform ML library (HiveQL, SparkSQL/DataFrame API, Pig Latin) providing a collection of machine learning algorithms as Hive UDFs/UDTFs. The 2nd Apache release (v0.5.2) will appear soon! We welcome your contributions to Apache Hivemall :-)
  • 84. Thank you! Questions? Mentors wanted!
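
The query on slide 21 trains in a map-only task and then lets reducers average the weights per feature. The sketch below mimics that train-then-average pattern in plain Python. It is only an illustration under stated assumptions, not Hivemall's implementation: the function names `train_partition` and `average_models`, the plain SGD update, the learning rate, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def train_partition(rows, eta=0.1, epochs=20):
    """Map side (assumed): fit a logistic regression on one partition with
    plain SGD and emit (feature, weight) pairs, one per feature seen."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(weights[name] * value for name, value in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for name, value in features.items():
                # Gradient step on the logistic loss for this single row.
                weights[name] += eta * (label - p) * value
    return list(weights.items())


def average_models(partition_outputs):
    """Reduce side: the equivalent of GROUP BY feature with avg(weight)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pairs in partition_outputs:
        for name, weight in pairs:
            sums[name] += weight
            counts[name] += 1
    return {name: sums[name] / counts[name] for name in sums}


# Two toy partitions; each row is (sparse feature dict, 0/1 label).
part_a = [({"bias": 1.0, "x": 2.0}, 1), ({"bias": 1.0, "x": -2.0}, 0)]
part_b = [({"bias": 1.0, "x": 1.5}, 1), ({"bias": 1.0, "x": -1.0}, 0)]
model = average_models([train_partition(part_a), train_partition(part_b)])
```

Each partition is trained independently, so the map side needs no coordination; only the small (feature, weight) pairs are shuffled for averaging.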