Introduction to Apache Hivemall
v0.5.2 and v0.6
Principal Engineer
Makoto YUI @myui
@ApacheHivemall
Hadoop Conf Japan - Mar 14, 2019
We Open-source!
Streaming log collector
Bulk data import/export
Efficient binary serialization
Machine learning on Hadoop
Workflow engine
Embedded version of Fluentd
Machine Learning
in SQL queries
BigQuery ML at Google I/O 2018
https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html
Could I use ML-in-SQL in my cluster?
Open-source Machine Learning Solution
for SQL-on-Hadoop
hivemall.apache.org (incubating)
Hivemall is a multi/cross-platform ML library
that provides a rich set of functions
Supported interfaces: HiveQL, SparkSQL/DataFrame API, Pig Latin
Hivemall on Apache Hive
Hivemall on Apache Spark Dataframe
Hivemall on SparkSQL
Hivemall on Apache Pig
Online Prediction by Apache Spark Streaming
New in v0.5.2 – Brickhouse UDFs
JSON
HyperLogLog
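To make the HyperLogLog item concrete, here is a minimal Python sketch of the general technique (approximate distinct counting with fixed memory). It is not the Brickhouse or Hivemall implementation; the class name, the SHA-1 based hash, and the default of 2^12 registers are all assumptions chosen for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog counter (hypothetical class, not a Hivemall API)."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Take a 64-bit hash of the item (SHA-1 is an arbitrary choice here).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)            # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw
```

Adding the same item twice leaves the registers unchanged, which is why such a sketch can be merged and re-aggregated cheaply in a distributed query.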
New in v0.5.2 – Field-aware Factorization Machines
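As a rough illustration of what a field-aware factorization machine computes at prediction time, the sketch below scores one example in plain Python. It follows the standard FFM formulation, in which each feature keeps one latent vector per field of the feature it interacts with. The function name and the dictionary-based data layout are assumptions for illustration and do not mirror Hivemall's internal model format.

```python
from itertools import combinations


def ffm_score(x, fields, w0, w, V):
    """Score one example with a field-aware factorization machine.

    x:      dict feature -> value
    fields: dict feature -> field id
    w0:     global bias
    w:      dict feature -> linear weight
    V:      dict (feature, field) -> latent vector (list of floats)
    """
    # Linear part: bias plus weighted feature values.
    score = w0 + sum(w.get(i, 0.0) * xi for i, xi in x.items())
    # Pairwise part: feature i uses its latent vector for j's field, and vice versa.
    for i, j in combinations(sorted(x), 2):
        vi = V[(i, fields[j])]
        vj = V[(j, fields[i])]
        dot = sum(a * b for a, b in zip(vi, vj))
        score += dot * x[i] * x[j]
    return score
```

For binary classification the raw score would typically be passed through a sigmoid; that step is omitted here.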
New in v0.5.2 – Okapi BM25 term weighting
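For readers unfamiliar with Okapi BM25, the following Python sketch computes the weight of a single term in a single document using the commonly cited formula. The function name, argument names, the defaults k1 = 1.2 and b = 0.75, and the "+1 inside the log" IDF variant are assumptions; Hivemall's own function may use different names, defaults, or IDF smoothing.

```python
import math


def bm25_term_weight(tf, doc_len, avg_doc_len, num_docs, docs_with_term,
                     k1=1.2, b=0.75):
    """Okapi BM25 weight of one term in one document (illustrative sketch)."""
    if tf <= 0:
        return 0.0
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(1.0 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Length normalization: long documents are penalized, controlled by b.
    norm = 1.0 - b + b * doc_len / avg_doc_len
    # Term-frequency saturation, controlled by k1.
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)
```

Compared with plain TF-IDF, the weight saturates as the term frequency grows and shrinks for documents longer than average.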
Plan for v0.6
Release in April-May, 2019
✓ New state-of-the-art optimizers like AdamHD (merged)
✓ Gradient boosting
✓ Stable XGBoost support
✓ More efficient sparse vector support in RandomForest
✓ Spark 2.4 support
XGBoost support in Hivemall (beta version)
-- train: emits (model_id, model) rows
SELECT train_xgboost_classifier(features, label) AS (model_id, model)
FROM training_data;

-- predict: average the per-model predictions for each row
SELECT rowid, AVG(predicted) AS predicted
FROM (
  -- predict with each model
  SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
  -- join each test record with each model
  FROM xgboost_models CROSS JOIN test_data_with_id
) t
GROUP BY rowid;
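The prediction query above pairs every test row with every trained model and then averages the per-model outputs for each row. The same logic can be sketched in plain Python as follows. The helper name `average_predictions` and the `predict` callback are hypothetical stand-ins and are not part of Hivemall or XGBoost.

```python
from collections import defaultdict


def average_predictions(models, rows, predict):
    """Average the predictions of several models for each row id.

    models:  iterable of model objects
    rows:    list of (rowid, features) pairs
    predict: callable (model, features) -> float, a stand-in for the real scorer
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    # Equivalent of CROSS JOIN: score every row with every model.
    for model in models:
        for rowid, features in rows:
            sums[rowid] += predict(model, features)
            counts[rowid] += 1
    # Equivalent of GROUP BY rowid with AVG(predicted).
    return {rowid: sums[rowid] / counts[rowid] for rowid in sums}
```

Because the cross join grows with the number of models times the number of test rows, this pattern is only practical when the set of models is small.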
Future work (v0.7 or later)
✓ Word2Vec support
✓ Multi-class Logistic Regression
✓ Hyperparameter tuning (e.g., grid search)
✓ YARN application/standalone Hivemall
PR#91
PR#116
We are hiring...
Engineer (Java/Scala/Ruby), Data Scientist, Sales Engineer, SRE, Support Engineer
