LT: Idea behind Apache Hivemall
Makoto YUI <myui@apache.org>
Principal Engineer,
ApacheCon North America 2018
Suppose …
Running machine learning on massive data stored in a data warehouse
Make It!
Concerns:
Running machine learning on massive data stored in a data warehouse
Scalability? Data movement? Tool?
Approach #1: Typical Data Scientist’s Solution
Data warehouse
Data preprocessing
Machine Learning
Small data?
Approach #2: Data Engineer’s Solution
Data warehouse
Data preprocessing
Machine Learning
Q: Is Dataframe a great idea for data (pre-)processing?
Q: Do you like it?
(for production-ready data preprocessing)
□ Yes
□ No
□ Maybe
I like it for simple data processing
Q: Do you really like it?
(for messy real-world data preprocessing)
□ Yes
□ No
□ Maybe
Real-world ML pipelines (could be more complex)
[Pipeline diagram; stage labels:]
Datasource #1, Datasource #2, Datasource #3
Extract Feature (×2), Join
Feature Scaling, Feature Hashing, Feature Engineering, Feature Selection
Train by Logistic Regression, Train by RandomForest, Train by Factorization Machines
Ensemble, Evaluate, Predict
Q: Have you ever seen or written hundreds to thousands of lines of preprocessing in Dataframe?
Q: Is it fun to play with?
(Scala/Python coding for trivial things)
Do you write test code?
In my personal opinion, notebook code is error-prone for production use
My Suggestion
Data warehouse
Data preprocessing
Machine Learning
+ Scalability
+ Durability/Stability
+ Functionality (UDFs, JSON, windowing functions)
Push more work back to the DB where the data resides (including some ML logic)
One size does not fit all, though ...
Machine Learning in SQL queries
BigQuery ML (announced at Google Cloud Next '18)
https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html
Could I use ML-in-SQL in my cluster?
Open-source Machine Learning Solution for SQL-on-Hadoop
https://hivemall.apache.org (incubating)
HiveQL, SparkSQL/Dataframe API, Pig Latin
Hivemall is a multi/cross-platform ML library that provides a rich set of functions
Thank you! Follow us @ApacheHivemall
Check out our talk tomorrow 16:40~ at ballroom J
Mentors wanted
CREATE TABLE model AS
SELECT
  feature,
  -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT train_classifier(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This query runs in parallel on a Hadoop/Spark cluster
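A rough way to read this query: each map task trains on its own split of the data and emits (feature, weight) rows, and the reducers then average the weights per feature. The Python sketch below mimics that flow on toy data. It is not Hivemall's implementation; the logistic-regression SGD inside train_shard, the helper names, and the toy shards are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_shard(rows, epochs=20, eta=0.1):
    """Hypothetical stand-in for one map task.

    rows: list of (features, label) where features maps feature name -> value
    and label is 0 or 1. Runs plain logistic-regression SGD on this shard only
    and returns the (feature, weight) rows a mapper would emit.
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(w[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for f, v in features.items():
                w[f] += eta * (label - p) * v
    return list(w.items())


def average_models(emitted):
    """Hypothetical stand-in for the reducers: GROUP BY feature, avg(weight)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for feature, weight in emitted:
        sums[feature] += weight
        counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}


def predict(model, features):
    """Score one row with the averaged model (sigmoid of the weighted sum)."""
    z = sum(model.get(f, 0.0) * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Toy data split into two shards, one per simulated map task (assumed data).
shards = [
    [({"bias": 1.0, "x": 2.0}, 1), ({"bias": 1.0, "x": -2.0}, 0)],
    [({"bias": 1.0, "x": 1.5}, 1), ({"bias": 1.0, "x": -1.0}, 0)],
]

# "Map" phase: every shard is trained independently, with no communication.
emitted = [row for shard in shards for row in train_shard(shard)]

# "Reduce" phase: one averaged weight per feature.
model = average_models(emitted)
```

Because the shards never talk to each other during training, the only data exchanged is the emitted (feature, weight) rows, which is what lets the SQL version run as a map-only subquery followed by a GROUP BY.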
Apache Hivemall Digdag (or Airflow)
digdag.io
Whatever you like