2. Hivemall entered Apache Incubator
on Sept 13, 2016
Since then, we invited 3 contributors as new committers (a
committer has been voted as PPMC). Currently, we are working
toward the first Apache release (v0.5.0).
hivemall.incubator.apache.org
22018/2/15 Plazma OSS day
3. What’s new in v0.5.0?
3
Anomaly/Change Point
Detection
Topic Modeling
(Soft Clustering)
Algorithm:
LDA, pLSA
Algorithm:
ChangeFinder, SST
Hivmall on Spark
2.0/2.1/2.1
SparkSQL/Dataframe support,
Top-k data processing
2018/2/15 Plazma OSS day
4. What is Apache Hivemall
Scalable machine learning library built
as a collection of Hive UDFs
Multi/Cross
platform
VersatileScalableEase-of-use
42018/2/15 Plazma OSS day
5. Hivemall is easy and scalable …
ML made easy for SQL developers
Born to be parallel and scalable
Ease-of-use
Scalable
100+ lines
of code
CREATE TABLE lr_model AS
SELECT
feature, -- reducers perform model averaging in parallel
avg(weight) as weight
FROM (
SELECT logress(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This query automatically runs in parallel on Hadoop
52018/2/15 Plazma OSS day
6. Hivemall is a multi/cross-platform ML library
HiveQL SparkSQL/Dataframe API Pig Latin
Hivemall is Multi/Cross platform ..
Multi/Cross
platform
prediction models built by Hive can be used from Spark, and conversely,
prediction models build by Spark can be used from Hive
62018/2/15 Plazma OSS day
7. Hadoop HDFS
MapReduce
(MRv1)
Hivemall
Apache YARN
Apache Tez
DAG processing
Machine Learning
Query Processing
Parallel Data
Processing Framework
Resource Management
Distributed File System
Cloud Storage
SparkSQL
Apache Spark
MESOS
Hive Pig
MLlib
Hivemall’s Technology Stack
Amazon S3
72018/2/15 Plazma OSS day
14. 14
•Squared Loss
•Quantile Loss
•Epsilon Insensitive Loss
•Squared Epsilon Insensitive
Loss
•Huber Loss
Generic Classifier/Regressor
Available Loss functions
•HingeLoss
•LogLoss (synonym: logistic)
•SquaredHingeLoss
•ModifiedHuberLoss
• L1
• L2
• ElasticNet
• RDA
Other options
For Binary Classification:
For Regression:
• SGD
• AdaGrad
• AdaDelta
• ADAM
Optimizer
• Iteration support
• mini-batch
• Early stopping
Regularization
2018/2/15 Plazma OSS day
15. Versatile
Hivemall is a Versatile library ..
ü Not only for Machine Learning
ü provides a bunch of generic utility functions
Each organization has own sets of
UDFs for data preprocessing
Don’t Repeat Yourself!
Don’t Repeat Yourself!
152018/2/15 Plazma OSS day
16. Hivemall generic functions
Array and Map Bit and compress String and NLP
Brickhouse UDFs are merged in v0.5.2 release.
We welcome contributing your generic UDFs to Hivemall
Geo Spatial
Top-k processing
> TF/IDF
> TILE
> MAP_URL
162018/2/15 Plazma OSS day
17. 2018/2/15 Plazma OSS day
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
List top-2 students for each class
SELECT * FROM (
SELECT
*,
rank() over (partition by class order by score desc)
as rank
FROM table
) t
WHERE rank <= 2
RANK over() query does not finishes in 24 hours L
where 20 million MOOCs classes and avg 1,000 students in each classes
17
18. 2018/2/15 Plazma OSS day
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
List top-2 students for each class
SELECT
each_top_k(
2, class, score,
class, student
) as (rank, score, class, student)
FROM (
SELECT * FROM table
DISTRIBUTE BY class SORT BY class
) t
EACH_TOP_K finishes in 2 hours J
18
21. 21
SELECT count(distinct id) FROM data
More useful functions (Sketch, NLP)
SELECT approx_count_distinct(id) FROM data
select tokenize_ja(“ ",
"normal", null, null, "https://s3.amazonaws.com/td-
hivemall/dist/kuromoji-user-dict-neologd.csv.gz");
[“ ”, "," "," "]
2018/2/15 Plazma OSS day
22. List of Supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight
Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓Logistic Regression (SGD)
✓AdaGrad (logistic loss)
✓AdaDELTA (logistic loss)
✓PA Regression
✓AROW Regression
✓Factorization Machines
✓RandomForest Regression
SCW is a good first choice
Try RandomForest if SCW does not
work
Logistic regression is good for getting a
probability of a positive class
Factorization Machines is good where
features are sparse and categorical ones
222018/2/15 Plazma OSS day
24. Training of RandomForest
24
Good news: Sparse Vector Input (Libsvm
format) is supported since v0.5.0 in
addition Dense Vector!
2018/2/15 Plazma OSS day
28. 28
SELECT train_xgboost_classifier(features, label) as (model_id, model)
FROM training_data
XGBoost support in Hivemall (beta version)
SELECT rowed, AVG(predicted) as predicted
FROM (
-- predict with each model
SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
-- join each test record with each model
FROM xgboost_models CROSS JOIN test_data_with_id
) t
GROUP BY rowid;
2018/2/15 Plazma OSS day
29. Supported Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash
(LSH variant)
✓ Similarity Search on Vector
Space
(Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines
(regression)
each_top_k function of Hivemall is useful for
recommending top-k items
292018/2/15 Plazma OSS day
32. Feature Engineering – Feature Binning
Maps quantitative variables to fixed number of
bins based on quantiles/distribution
Map Ages into 3 bins
322018/2/15 Plazma OSS day
35. Other Supported Features
Anomaly Detection
✓Local Outlier Factor (LoF)
✓ChangeFinder
Clustering / Topic models
✓Online mini-batch LDA
✓Online mini-batch PLSA
Change Point Detection
✓ChangeFinder
✓Singular Spectrum
Transformation
352018/2/15 Plazma OSS day
36. Efficient algorithm for finding change point and outliers from
time-series data
J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE transactions on
Knowledge and Data Engineering, pp.482-492, 2006.
Anomaly/Change-point Detection by ChangeFinder
362018/2/15 Plazma OSS day
39. Efficient algorithm for finding change point and outliers from
timeseries data
Anomaly/Change-point Detection by ChangeFinder
J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE transactions on
Knowledge and Data Engineering, pp.482-492, 2006.
392018/2/15 Plazma OSS day
40. • T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point
Correlations", Proc. SDM, 2005T.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
Change-point detection by Singular Spectrum Transformation
402018/2/15 Plazma OSS day
44. ü Word2Vec support
ü Multi-class Logistic Regression
ü Field-aware Factorization Machines
ü SLIM recommendation
ü More efficient XGBoost support
ü LightGBM support
ü DecisionTree prediction tracing
ü Gradient Boosting
Future work for v0.5.2 and later
44
PR#91
PR#116
PR#58
PR#111
2018/2/15 Plazma OSS day
45. ü Word2Vec support
ü Multi-class Logistic Regression
ü Field-aware Factorization Machines
ü SLIM recommendation
ü More efficient XGBoost support
ü LightGBM support
ü DecisionTree prediction tracing
ü Gradient Boosting
Future work for v0.5.2 and later
45
PR#91
PR#116
PR#58
PR#111
2018/2/15 Plazma OSS day
46. Conclusion and Takeaway
Hivemall is a multi/cross-platform ML library
providing a collection of machine learning algorithms as Hive UDFs/UDTFs
The first Apache release (v0.5.0) will appear soon!
We welcome your contributions to Apache Hivemall J
HiveQL SparkSQL/Dataframe API Pig Latin
462018/2/15 Plazma OSS day
47. Any feature request or questions?
BTW, we are hiring!
472018/2/15 Plazma OSS day