7. Hivemall’s Vision: ML on SQL
2016/10/26 Hadoop Summit '16, Tokyo 7
Classification with Mahout
CREATE TABLE lr_model AS
SELECT
feature, -- reducers perform model averaging in parallel
avg(weight) as weight
FROM (
SELECT logress(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in
parallel on Hadoop
9. List of supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight
Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
2016/10/26 Hadoop Summit '16, Tokyo 9
Regression
✓Logistic Regression (SGD)
✓AdaGrad (logistic loss)
✓AdaDELTA (logistic loss)
✓PA Regression
✓AROW Regression
✓Factorization Machines
✓RandomForest Regression
SCW is a good first choice
Try RandomForest if SCW does not
work
Logistic regression is good for getting a
probability of a positive class
Factorization Machines is good where
features are sparse and categorical ones
12. 2016/10/26 Hadoop Summit '16, Tokyo 12
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
List top-2 students for each class
SELECT * FROM (
SELECT
*,
rank() over (partition by class order by score desc)
as rank
FROM table
) t
WHERE rank <= 2
13. 2016/10/26 Hadoop Summit '16, Tokyo 13
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
List top-2 students for each class
SELECT
each_top_k(
2, class, score,
class, student
) as (rank, score, class, student)
FROM (
SELECT * FROM table
DISTRIBUTE BY class SORT BY class
) t
32. ü Spark 2.0 support
ü XGBoost Integration
ü Field-aware Factorization Machines
ü Generalized Linear Model
• Optimizer framework including ADAM
• L1/L2 regularization
2016/10/26 Hadoop Summit '16, Tokyo 32
Other new features to come