This document discusses Hivemall, an open source machine learning library for Apache Hive, Spark, and Pig. It provides concise summaries of Hivemall in 3 sentences or less:
Hivemall is a scalable machine learning library built as a collection of Hive UDFs that allows users to perform machine learning tasks like classification, regression, and recommendation using SQL queries. Hivemall supports many popular machine learning algorithms and can run in parallel on large datasets using Apache Spark, Hive, Pig, and other big data frameworks. The document outlines how to run a machine learning workflow with Hivemall on Spark, including loading data, building a model, and making predictions.
Azure Monitor & Application Insight to monitor Infrastructure & Application
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
1. Hivemall: Scalable machine learning
library for Apache Hive/Spark/Pig
1). Research Engineer, Treasure Data
Makoto YUI @myui
2016/10/26 Hadoop Summit '16, Tokyo
2). Research Engineer, NTT
Takashi Yamamuro @maropu
#hivemall
2. Plan of the talk
1. Introduction to Hivemall
2. Hivemall on Spark
2016/10/26 Hadoop Summit '16, Tokyo 2
3. 2016/10/26 Hadoop Summit '16, Tokyo 3
Hivemall enters Apache Incubator
on Sept 13, 2016 🎉
http://incubator.apache.org/projects/hivemall.html
@ApacheHivemall
4. • Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT>
Hivemall on Apache Spark
• Daniel Dai <Hortonworks>
Hivemall on Apache Pig
Apache Pig PMC member
• Tsuyoshi Ozawa <NTT>
Apache Hadoop PMC member
• Kai Sasaki <Treasure Data>
2016/10/26 Hadoop Summit '16, Tokyo 4
Initial committers
5. Champion
Nominated Mentors
2016/10/26 Hadoop Summit '16, Tokyo 5
Project mentors
• Reynold Xin <Databricks, ASF member>
Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member>
Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member>
Apache Spark PMC member
• Roman Shaposhnik <Pivotal, ASF member>
Apache Bigtop/Incubator PMC member
6. What is Apache Hivemall
Scalable machine learning library built
as a collection of Hive UDFs
2016/10/26 Hadoop Summit '16, Tokyo
github.com/myui/hivemall
hivemall.incubator.apache.org
7. Hivemall’s Vision: ML on SQL
2016/10/26 Hadoop Summit '16, Tokyo 7
Classification with Mahout
CREATE TABLE lr_model AS
SELECT
feature, -- reducers perform model averaging in parallel
avg(weight) as weight
FROM (
SELECT logress(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in
parallel on Hadoop
8. Hadoop HDFS
MapReduce
(MRv1)
Hivemall
Apache YARN
Apache Tez
DAG processing
Machine Learning
Query Processing
Parallel Data
Processing Framework
Resource Management
Distributed File System
Cloud Storage
SparkSQL
Apache Spark
MESOS
Hive Pig
MLlib
Hivemall’s Technology Stack
2016/10/26 Hadoop Summit '16, Tokyo 8
Amazon S3
9. List of supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight
Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
2016/10/26 Hadoop Summit '16, Tokyo 9
Regression
✓Logistic Regression (SGD)
✓AdaGrad (logistic loss)
✓AdaDELTA (logistic loss)
✓PA Regression
✓AROW Regression
✓Factorization Machines
✓RandomForest Regression
SCW is a good first choice
Try RandomForest if SCW does not work
Logistic regression is good for getting a probability of
a positive class
Factorization Machines is good where features are
sparse and categorical ones
10. List of Algorithms for Recommendation
2016/10/26 Hadoop Summit '16, Tokyo 10
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash
(LSH variant)
✓ Similarity Search on Vector
Space
(Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines
(regression)
each_top_k function of Hivemall is useful for
recommending top-k items
11. 2016/10/26 Hadoop Summit '16, Tokyo 11
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
student class score
3 a 90
2 a 80
1 b 70
6 b 60
List top-2 students for each class
12. 2016/10/26 Hadoop Summit '16, Tokyo 12
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
List top-2 students for each class
SELECT * FROM (
SELECT
*,
rank() over (partition by class order by score desc)
as rank
FROM table
) t
WHERE rank <= 2
13. 2016/10/26 Hadoop Summit '16, Tokyo 13
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
List top-2 students for each class
SELECT
each_top_k(
2, class, score,
class, student
) as (rank, score, class, student)
FROM (
SELECT * FROM table
DISTRIBUTE BY class SORT BY class
) t
14. 2016/10/26 Hadoop Summit '16, Tokyo 14
Top-k query processing by RANK OVER()
partition by class
Node 1
Sort by class, score
rank over()
rank >= 2
15. 2016/10/26 Hadoop Summit '16, Tokyo 15
Top-k query processing by EACH_TOP_K
distributed by class
Node 1
Sort by class
each_top_k
OUTPUT only K
items
16. 2016/10/26 Hadoop Summit '16, Tokyo 16
Comparison between RANK and EACH_TOP_K
distributed by class
Sort by class
each_top_k
Sort by class, score
rank over()
rank >= 2
SORTING IS HEAVY
NEED TO
PROCESS ALL
OUTPUT only K
items
Each_top_k is very efficient where the number of class is large
Bounded Priority Queue
is utilized
17. Performance reported by TD customer
2016/10/26 Hadoop Summit '16, Tokyo 17
•1,000 students in each class
•20 million classes
RANK over() query does not finishes in 24 hours
EACH_TOP_K finishes in 2 hours
Refer for detail
https://speakerdeck.com/kaky0922/hivemall-meetup-20160908
19. Industry use cases of Hivemall
CTR prediction of Ad click logs
• Algorithm: Logistic regression
• Freakout Inc., Smartnews, and more
Gender prediction of Ad click logs
• Algorithm: Classification
• Scaleout Inc.
2016/10/26 Hadoop Summit '16, Tokyo 19
http://www.slideshare.net/eventdotsjp/hivemall
20. Industry use cases of Hivemall
Item/User recommendation
• Algorithm: Recommendation
• Wish.com, GMO pepabo
2016/10/26 Hadoop Summit '16, Tokyo 20
Problem: Recommendation using hot-item is hard in hand-crafted
product market because each creator sells few single items (will
soon become out-of-stock)
minne.com
21. Industry use cases of Hivemall
Value prediction of Real estates
• Algorithm: Regression
• Livesense
2016/10/26 Hadoop Summit '16, Tokyo 21
22. Industry use cases of Hivemall
User score calculation
• Algrorithm: Regression
• Klout
2016/10/26 Hadoop Summit '16, Tokyo 22
bit.ly/klout-hivemall
Influencer marketing
klout.com
24. Efficient algorithm for finding change point and outliers from
timeseries data
2016/10/26 Hadoop Summit '16, Tokyo 24
J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE transactions on
Knowledge and Data Engineering, pp.482-492, 2006.
Anomaly/Change-point Detection by ChangeFinder
25. Efficient algorithm for finding change point and outliers from
timeseries data
2016/10/26 Hadoop Summit '16, Tokyo 25
Anomaly/Change-point Detection by ChangeFinder
J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE transactions on
Knowledge and Data Engineering, pp.482-492, 2006.
26. 2016/10/26 Hadoop Summit '16, Tokyo 26
• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point
Correlations", Proc. SDM, 2005T.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
Change-point detection by Singular Spectrum Transformation
28. 2016/10/26 Hadoop Summit '16, Tokyo 28
Feature Engineering – Feature Binning
Maps quantitative variables to fixed number of
bins based on quantiles/distribution
Map Ages into 3 bins
31. 2016/10/26 Hadoop Summit '16, Tokyo 31
Feature Transformation – Onehot encoding
Maps a categorical variable to a
unique number starting from 1
32. Spark 2.0 support
XGBoost Integration
Field-aware Factorization Machines
Generalized Linear Model
• Optimizer framework including ADAM
• L1/L2 regularization
2016/10/26 Hadoop Summit '16, Tokyo 32
Other new features to come
56. Conclusion and Takeaway
Hivemall provides a collection of machine
learning algorithms as Hive UDFs/UDTFs
2016/10/26 Hadoop Summit '16, Tokyo 56
We welcome your contributions to Apache Hivemall