Idea behind Apache Hivemall

LT: Idea behind Apache Hivemall
Makoto YUI <myui@apache.org>
Principal Engineer,
ApacheCon North America 2018

Suppose …
Running machine learning on massive data stored in a data warehouse
Make it!

Running machine learning on massive data stored in a data warehouse
Concerns: Scalability? Data movement? Tool?

Approach #1: Typical Data Scientist’s Solution
Data warehouse
Data preprocessing
Machine Learning
Small data?

Approach #2: Data Engineer’s Solution
Data warehouse
Data preprocessing
Machine Learning

Q: Is Dataframe a great idea for data (pre-)processing?

Q: Do you like it?
(for production-ready data preprocessing)
[ ] Yes
[ ] No
[ ] Maybe
I like it for simple data processing

Q: Do you really like it?
(for messy real-world data preprocessing)
[ ] Yes
[ ] No
[ ] Maybe

Real-world ML pipelines (could be more complex)
[Pipeline diagram. Stages shown: Datasource #1, Datasource #2, Datasource #3; Join; Extract Feature; Feature Scaling; Feature Hashing; Feature Engineering; Feature Selection; Train by Logistic Regression; Train by RandomForest; Train by Factorization Machines; Ensemble; Evaluate; Predict]

Q: Have you ever seen or written hundreds to thousands of lines of preprocessing in Dataframe?

Q: Is it fun to play with?
(Scala/Python coding for trivial things)
Do you write test code?
IMPO, notebook code is error-prone for production use

My Suggestion
Data warehouse
Data preprocessing
Machine Learning
+ Scalability
+ Durability/Stability
+ Functionalities (UDFs, JSON, windowing functions)
Push more work back to the DB where the data resides (including some ML logic)
One size does not fit all, though ...

Machine Learning
in SQL queries

BigQuery ML at Google I/O 2018
https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html

Could I use ML-in-SQL in my cluster?

Open-source Machine Learning Solution
for SQL-on-Hadoop
https://hivemall.apache.org (incubating)

HiveQL | SparkSQL/Dataframe API | Pig Latin
Hivemall is a multi/cross-platform ML library that provides a rich set of functions

Thank you! Follow us @ApacheHivemall
Check out our talk tomorrow from 16:40 in Ballroom J
Mentors wanted

CREATE TABLE model AS
SELECT
  feature,
  avg(weight) AS weight -- reducers perform model averaging in parallel
FROM (
  SELECT train_classifier(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This query runs in parallel on a Hadoop/Spark cluster
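
The data flow of the query above can be mimicked outside SQL. The following is a minimal Python sketch, not Hivemall's implementation: train_partition is a hypothetical stand-in for train_classifier (here plain SGD logistic regression, with an assumed learning rate and epoch count), the two small partitions are made-up toy data, and average_models plays the role of GROUP BY feature with avg(weight).

```python
import math
from collections import defaultdict


def train_partition(rows, epochs=20, lr=0.1):
    """Hypothetical stand-in for one map-only training task.

    rows is a list of (features, label) pairs, where features maps a
    feature name to its value and label is 0 or 1. Returns the
    {feature: weight} pairs that this task would emit.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(weights[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability
            for f, v in features.items():
                weights[f] += lr * (label - p) * v  # SGD step
    return dict(weights)


def average_models(partial_models):
    """Mimics GROUP BY feature with avg(weight).

    Each feature is averaged only over the tasks that emitted it,
    as SQL avg() does over the rows present in each group.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for partial in partial_models:
        for feature, weight in partial.items():
            sums[feature] += weight
            counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}


# Toy data: each inner list is the slice of `train` seen by one task.
partitions = [
    [({"a": 1.0}, 1), ({"b": 1.0}, 0)],
    [({"a": 1.0, "c": 1.0}, 1), ({"b": 1.0}, 0)],
]

partial_models = [train_partition(rows) for rows in partitions]
model = average_models(partial_models)
print(sorted(model))
```

Each call to train_partition sees only its own slice of the data, which is why the step needs no shuffle. Only the small (feature, weight) pairs are grouped and averaged afterwards.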

Apache Hivemall Digdag (or Airflow)
digdag.io
Whatever you like
