Python Data Wrangling: Preparing for the Future | Wes McKinney
Given at PyCon HK on October 29, 2016. About open source work in progress to advance the Python pandas project internals and leverage synergies with other efforts in OSS data technology
PyMADlib - A Python wrapper for MADlib: in-database, parallel, machine learn... | Srivatsan Ramanujam
These are slides from my talk @ DataDay Texas, in Austin on 30 Mar 2013
(http://2013.datadaytexas.com/schedule)
Favorite and Fork PyMADlib on GitHub: https://github.com/gopivotal/pymadlib
MADlib: http://madlib.net
Strata NY 2016: The future of column-oriented data processing with Arrow and ... | Julien Le Dem
In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, like RDMA, SSDs, and nonvolatile memory.
Semantic Integration with Apache Jena and Stanbol | All Things Open
All Things Open 2014 - Day 1
Wednesday, October 22nd, 2014
Phillip Rhodes
Founder & President of Fogbeam Labs
Big Data
Semantic Integration with Apache Jena and Stanbol
Data Science at Scale on MPP databases - Use Cases & Open Source Tools | Esther Vasiete
Pivotal workshop slide deck for Structure Data 2016 held in San Francisco.
Abstract:
Learn how data scientists at Pivotal build machine learning models at massive scale on open source MPP databases like Greenplum and HAWQ (under Apache incubation) using in-database machine learning libraries like MADlib (under Apache incubation) and procedural languages like PL/Python and PL/R to take full advantage of the rich set of libraries in the open source community. This workshop will walk you through use cases in text analytics and image processing on MPP.
4th in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
Learn how to visualize graphs – a powerful, intuitive way to interact with data. Using open source tools like Cytoscape or third party tools, you have several choices on how to visualize and interact with graphs from Oracle Database and big data platforms. Albert Godfrind (EMEA Solutions Architect) and Gabriela Montiel-Moreno (Software Development Manager) share all you need to get started, with detailed demos using a banking customer data set.
In this paper we introduce the MADlib project, including the background that led to its beginnings, and the motivation for its open source nature. We provide an overview of the library’s architecture and design patterns, and provide a description of various statistical methods in that context.
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p... | Jean Ihm
2nd in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
With property graphs in Oracle Database, you can perform powerful analysis on big data such as social networks, financial transactions, sensor networks, and more.
To use property graphs, first, you’ll need a graph model. For a new user, modeling and generating a suitable graph for an application domain can be a challenge. This month, we’ll describe key steps required to construct a meaningful graph, and offer a few tips on validating the generated graph.
Albert Godfrind (EMEA Solutions Architect), Zhe Wu (Architect), and Jean Ihm (Product Manager) walk you through, and take your questions.
Build Knowledge Graphs with Oracle RDF to Extract More Value from Your Data | Jean Ihm
AnD Summit '19 slides - Souri Das, Matthew Perry, Melli Annamalai. This presentation covers knowledge graphs built using the RDF capabilities of Oracle Spatial and Graph. We will illustrate how to define a knowledge graph, create virtual or materialized graphs from existing data (relational tables, CSV files, etc.), derive new knowledge through logical inference, navigate and query graphs using W3C standards, analyze knowledge graphs with graph algorithms, and more. Real-world use cases from various industries will also be shared.
NumPy Roadmap presentation at NumFOCUS Forum | Ralf Gommers
This presentation is an attempt to summarize the NumPy roadmap and both technical and non-technical ideas for the next 1-2 years, for users who heavily rely on NumPy as well as for potential funders.
This presentation describes some of the open source AI projects we are working on at the Center for Open Source, Data and AI Technologies (CODAIT), including the Model Asset Exchange (MAX), Fabric for Deep Learning (FfDL), and Jupyter Enterprise Gateway.
Talk given at the first OmniSci user conference, where I discuss cooperating with open-source communities to ensure you get useful answers quickly from your data. I also get a chance to introduce OpenTeams in this talk and discuss how it can help companies cooperate with communities.
SpagoBI 5 Demo Day and Workshop: Technology Applications and Uses | SpagoWorld
These slides supported SpagoBI Labs' presentation of SpagoBI 5 ("Technology Applications and Uses" session), taking place in New York, NY on January 26th, and in Herndon, VA on January 28th, 2015. Further details on the event: http://bit.ly/1IzatIX
Apache AGE and the synergy effect in the combination of Postgres and NoSQL | EDB
In this session, we will introduce Apache AGE and the synergy that comes from combining Postgres with a NoSQL graph database. We will discuss the story and background of Apache AGE as an open-source project and introduce the challenges that AGE can solve for its users. We will also present the graph database as an extension to PostgreSQL: it supports all the functionality and features of PostgreSQL while offering a graph model in addition. Finally, we will discuss how users with a relational background who need a graph model on top of their existing relational model can adopt this extension with minimal effort, since existing data can be used without migration.
GraphPipe - Blazingly Fast Machine Learning Inference by Vish Abrams | Oracle Developers
GraphPipe is an open source protocol and collection of software designed to simplify machine learning model deployment and decouple it from framework-specific model implementations.
The common perception of applying deep learning is that you take an open source or research model, train it on raw data, and deploy the result as a fully self-contained artefact. The reality is far more complex.
For the training phase, users face an array of challenges, including handling varied deep learning frameworks, hardware requirements, and configurations, not to mention code quality, consistency, and packaging. For the deployment phase, they face another set of challenges, including custom requirements for data pre- and post-processing, inconsistencies across frameworks, and a lack of standardization in serving APIs.
The goal of the IBM Developer Model Asset eXchange (MAX) is to remove these barriers to entry for developers to obtain, train, and deploy open source deep learning models for their business applications. In building the exchange, we encountered all these challenges and more.
For the training phase, we leverage the Fabric for Deep Learning (FfDL), an open source project providing framework-independent training of deep learning models on Kubernetes. For the deployment phase, MAX provides standardized container-based, fully self-contained model artifacts encompassing the end-to-end deep learning predictive pipeline.
28 March 2024 - Codeless Generative AI Pipelines
https://www.meetup.com/futureofdata-princeton/events/299440871/
https://www.meetup.com/real-time-analytics-meetup-ny/events/299290822/
***** Note *****
The event is seat-limited, so please complete your registration here. Only people who complete the form will be able to attend.
-----------------------
We're excited to invite you to join us in-person, for a Real-Time Analytics exploration!
Join us for an evening of insights and networking as we delve into the OSS technologies shaping the field!
Agenda:
05:30-06:00: Pizza and friends
06:00-06:40: Codeless GenAI Pipelines with Flink, Kafka, NiFi
06:40-07:20: Real-Time Analytics in the Corporate World: How Apache Pinot® Powers Industry Leaders
07:20-07:30: Q&A
Codeless GenAI Pipelines with Flink, Kafka, NiFi | Tim Spann, Cloudera
Explore the power of real-time streaming with GenAI using Apache NiFi. Learn how NiFi simplifies data engineering workflows, allowing you to focus on creativity over technical complexities. I'll guide you through practical examples, showcasing NiFi's automation impact from ingestion to delivery. Whether you're a seasoned data engineer or new to GenAI, this talk offers valuable insights into optimizing workflows. Join us to unlock the potential of real-time streaming and witness how NiFi makes data engineering a breeze for GenAI applications!
Real-Time Analytics in the Corporate World: How Apache Pinot® Powers Industry Leaders | Viktor Gamov, StarTree
Explore how industry leaders like LinkedIn, Uber Eats, and Stripe are mastering real-time data with Viktor as your guide. Discover how Apache Pinot transforms data into actionable insights instantly. Viktor will showcase Pinot's features, including the Star-Tree Index, and explain why it's a game-changer in data strategy. This session is for everyone, from data geeks to business gurus, eager to uncover the future of tech. Join us and be wowed by the power of real-time analytics with Apache Pinot!
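The Star-Tree Index mentioned above works by pre-aggregating metrics across combinations of dimensions, so group-by queries can be answered from small pre-computed tables instead of scanning raw records. Here is a minimal, illustrative Python sketch of that rollup idea (not Pinot's actual data structure; the rows, dimensions, and metric names are made up):

```python
from itertools import combinations
from collections import defaultdict

def build_rollups(rows, dimensions, metric):
    """Pre-aggregate `metric` for every subset of `dimensions`.

    Returns {dims_tuple: {dim_values_tuple: aggregated_metric}}.
    """
    rollups = {}
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            agg = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in dims)
                agg[key] += row[metric]
            rollups[dims] = dict(agg)
    return rollups

rows = [
    {"country": "US", "browser": "chrome", "clicks": 3},
    {"country": "US", "browser": "safari", "clicks": 2},
    {"country": "MX", "browser": "chrome", "clicks": 5},
]
rollups = build_rollups(rows, ["country", "browser"], "clicks")

# "Total clicks per country" is now a lookup, not a scan:
print(rollups[("country",)][("US",)])   # 5.0
print(rollups[()][()])                  # 10.0 (grand total)
```

The trade-off is the same one a star-tree makes: extra storage for the rollups in exchange for group-by latency that no longer depends on the raw row count.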
-------
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera.
He works with Apache NiFi, Apache Kafka, Apache Pulsar, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more.
Presto: SQL-on-Anything. Netherlands Hadoop User Group Meetup | Wojciech Biela
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook. One key feature in Presto is the ability to query data where it lives via a uniform ANSI SQL interface. Presto’s connector architecture creates an abstraction layer for anything that can be represented in a columnar or row-like format, such as HDFS, Amazon S3, Azure Storage, NoSQL stores, relational databases, Kafka streams, and even proprietary data stores. Furthermore, a single Presto query can combine data from multiple sources, allowing for analytics across an entire organization.
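The connector abstraction described above can be sketched in a few lines. This is an illustrative toy, not Presto's real connector SPI: each "connector" exposes rows from a different backing store through one uniform interface, and the engine can join across sources in a single query.

```python
class Connector:
    """Uniform interface every data source implements (toy version)."""
    def scan(self):
        raise NotImplementedError

class InMemoryConnector(Connector):
    """Stands in for e.g. a relational database or a Kafka topic."""
    def __init__(self, rows):
        self.rows = rows
    def scan(self):
        yield from self.rows

def hash_join(left, right, key):
    """Tiny hash join, as the engine would run over two connectors."""
    index = {}
    for row in left.scan():
        index.setdefault(row[key], []).append(row)
    for row in right.scan():
        for match in index.get(row[key], []):
            yield {**match, **row}

# Two "sources" behind the same interface, joined by one query plan:
users = InMemoryConnector([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}])
events = InMemoryConnector([{"id": 1, "action": "login"}, {"id": 1, "action": "buy"}])
result = list(hash_join(users, events, "id"))
```

The point of the design is that `hash_join` never knows what kind of store produced the rows, which is what lets one query span HDFS, Kafka, and a relational database.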
Conf42 Python - Building Apache NiFi 2.0 Python Processors
https://www.conf42.com/Python_2024_Tim_Spann_apache_nifi_2_processors
Building Apache NiFi 2.0 Python Processors
Abstract
Let’s enhance real-time streaming pipelines with smart Python code. Adding code for vector databases and LLM.
Summary
Tim Spann: I'm going to be talking today about building Apache NiFi 2.0 Python processors. One of the main purposes of supporting Python in the streaming tool Apache NiFi is to interface with new machine learning, AI, and GenAI libraries. He says Python is a real game changer for Cloudera.
You're just going to add some metadata around it. It's a great way to pass a file along without changing it too substantially. We really need you to have Python 3.10 and, again, JDK 21 on your machine. You've got to be smart about how you use these models.
There are a ton of Python processors available. You can use them in multiple ways. We're still in the early days of Python processors, so now's the time to start putting yours out there. I'd love to see a lot of people write their own.
When we are parsing documents here, again, this is the Python one; I'm picking PDF. There are lots of different things you could do. If you're interested in writing your own Python code for Apache NiFi, definitely reach out. Thanks.
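As a rough illustration of the processor model the talk describes, here is a minimal FlowFileTransform-style processor that rewrites content and "adds some metadata around it". The real base classes live in the `nifiapi` package bundled with NiFi 2.0, so tiny stand-ins are defined here to keep the sketch self-contained; treat the class and method names as illustrative rather than a definitive copy of the API.

```python
class FlowFileTransformResult:          # stand-in for nifiapi's result class
    def __init__(self, relationship, contents=None, attributes=None):
        self.relationship = relationship
        self.contents = contents
        self.attributes = attributes or {}

class FlowFileTransform:                # stand-in for nifiapi's base class
    def __init__(self, **kwargs):
        pass

class UppercaseText(FlowFileTransform):
    """Uppercases flowfile content and records its length as an attribute --
    the 'pass the file along, just add metadata' pattern from the talk."""

    def transform(self, context, flowfile):
        text = flowfile.getContentsAsBytes().decode("utf-8")
        return FlowFileTransformResult(
            relationship="success",
            contents=text.upper(),
            attributes={"chars": str(len(text))},
        )

# Minimal fake flowfile so the processor can be exercised outside NiFi:
class FakeFlowFile:
    def __init__(self, data):
        self._data = data
    def getContentsAsBytes(self):
        return self._data

result = UppercaseText().transform(None, FakeFlowFile(b"hello nifi"))
```

Inside NiFi the framework, not your code, would construct the flowfile and route the result to the `success` relationship.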
SIM RTP Meeting - So Who's Using Open Source Anyway? | Alex Meadows
Open Source has been around for several decades now, but there is still a bit of mystery around what makes open source work and concern about using it in the enterprise. Open Source technologies are being widely used in many industries, including analytics, software development, social media, data center management, and more.
The discussion will be moderated by Julie Batchelor and panelists include:
* Todd Lewis, Open Source evangelist
* Jason Hibbets, Open Source Community Manager
* Jim Salter, Co-Owner and Chief Technology Officer at Openoid, LLC
* Alex Meadows, data scientist
An overview of Hadoop and data warehousing from technology and business viewpoints. The presentation also includes some of my personal observations and suggestions for people who want to join the Big Data field.
Lecture to the London S2DS students.
Some fun in highlighting that I'm their polar opposite (no schooling since 17, and focused on operations, not science).
Machine learning applications are typically stitched together from hopes and dreams, shell scripts, cron jobs, home-grown schedulers, snippets of configuration clipped from multiple blog posts, thousands of hard-coded business rules, a.k.a. "our SQL corpus," and a few lines of training and testing code. Organizing all the moving parts into something maintainable and supportive of ongoing development is a challenge most teams have on their TODO list, roadmap, or tech debt pile. Getting ahead of the day-to-day demands and settling into a sane architecture often seems like an unattainable goal. The past several years have seen an explosion of tool-building in the data engineering and analytics area, including in Apache projects spanning the areas of search and information retrieval, job orchestration, file and stream formats, and machine learning libraries. In this talk we will cover our product and development teams' choices of architecture and tools, from data ingestion and storage, through transformations and processing, to presentation of results and publishing to web services, reports, and applications.
Enabling Python to be a Better Big Data CitizenWes McKinney
These slides are from my talk at the NYC Python Meetup at ODSC Office NYC on February 17, 2016. It discusses Python's architectural challenges to interoperate with the Hadoop ecosystem and how a new project, Apache Arrow, will help.
Apache Hivemall is a scalable machine learning library for Apache Hive, Apache Spark, and Apache Pig.
Hivemall provides a number of machine learning functionalities across classification, regression, ensemble learning, and feature engineering through UDFs/UDAFs/UDTFs of Hive.
We released the first Apache release (v0.5.0-incubating) on Mar 5, 2018, and the project plans to release v0.5.2 in Q2 2018.
We will first give a quick walk-through of features, usages, what's new in v0.5.0, and future roadmaps of Apache Hivemall. Next, we will introduce Hivemall on Apache Spark in depth, such as DataFrame integration and Spark 2.3 support in Hivemall.
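Hivemall packages training as Hive UDFs/UDTFs that emit (feature, weight) rows into a table, and prediction becomes a join plus a weighted sum over that table. The pattern can be sketched in plain Python; this is illustrative only (not Hivemall's code, and the tiny two-example dataset and feature names are made up):

```python
import math

def train_logreg(examples, epochs=50, lr=0.1):
    """SGD logistic regression; returns {feature: weight}, i.e. the
    same shape as the (feature, weight) rows a training UDTF would emit."""
    weights = {}
    for _ in range(epochs):
        for features, label in examples:
            margin = sum(weights.get(f, 0.0) * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-margin))   # sigmoid
            grad = p - label
            for f, v in features.items():
                weights[f] = weights.get(f, 0.0) - lr * grad * v
    return weights

def predict(weights, features):
    """Prediction = join features with the weights table, sum, sigmoid."""
    margin = sum(weights.get(f, 0.0) * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-margin))

examples = [
    ({"bias": 1.0, "word:good": 1.0}, 1),
    ({"bias": 1.0, "word:bad": 1.0}, 0),
]
weights = train_logreg(examples)
```

Representing the model as feature/weight rows is what makes the approach fit Hive: both training output and scoring are expressible as ordinary tables and joins.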
Apache Hivemall and my OSS experience
1. Copyright 1995-2018 Arm Limited (or its affiliates). All rights reserved.
Principal Engineer
Makoto Yui @myui
Apache Hivemall and my OSS experience
2. Plan of the talk
1. Introduction to myself, my company, and my OSS experience (5m)
• Who am I? What interested me in open source?
• Treasure Data? How does my company cope with OSS?
2. Introduction to Apache Hivemall (15m)
• What’s Hivemall? - project overview
• Deep dive into the features of Hivemall
• Open-source use cases and internal use cases at Treasure Data
3. Apache Incubation process (5m)
• How did Hivemall get into the ASF incubator? Why ASF?
• What’s required, and what is the hardest part of it?
• What is the most overlooked part of the incubation process?
• Lessons learned from my (on-going) ASF incubation experience
3. About Me: Makoto Yui @myui
○ Leading the development of Apache Hivemall (incubating at ASF)
○ ML Engineer with DB system research background
● Developing ML features (and underlying systems) at a SaaS company
■ Joined Treasure Data in April 2015 (4 years ago)
■ Working at the Tokyo branch of a Silicon Valley company (acquired by Arm in July 2018)
● Ph.D. (CS) in 2009 at NAIST
■ Majored in parallel database systems and XML native database systems
(e.g., non-blocking lock-free DB buffer management at ICDE 2010)
● As a DB researcher
■ Postdoc at CWI (MonetDB team in CWI Amsterdam; columnar in-memory DB pioneer)
■ 5 years at AIST (National research institute in Tsukuba) as a Senior Researcher
● Past and current interests
■ Query+FP Language → Parallel DB → In-database Analytics (OLAP++) → Scalable Machine Learning (now) → ?
4. My OSS history
○ Big fan of OSS since I was an undergraduate student
● I was using Red Hat Linux on my laptop
○ Intern at a small startup
● FreeBSD 4.2+, PostgreSQL 6.4+ to 7.x, PHP 4, and plain old C
● PostgreSQL and GLib (not glib) were my favorite projects
○ XpSQL at Gborg (my first OSS project)
● Funded by a government fund for young software engineers
● My Bachelor thesis in 2003:
Building a multi-functional XML database environment using RDBMS
5. What interested you in open source?
○ The Linux movement when I was an undergraduate student
○ Interested in the well-designed code of Postgres
● ❤ Data structures and algorithms for big data
■ b+-trees were much more interesting to me than (in-memory) binary trees
○ Communication with other excellent engineers from other organizations
○ More interested in library development than application development
○ Good for career development (hard to find a job with no GitHub repos)
○ Why not OSS?
● Not so many excellent talents in a single organization for library development
● Developers generally prefer standard OSS libraries (avoiding vendor/company lock-in)
👍 for Open over Closed
6.
Arm Treasure Data
Company Profile
8.
Treasure Data Founders
Hironobu Yoshikawa
CEO & Co-Founder
Open source business veteran
Kazuki Ohta
CTO & Co-Founder
Founder of the world's largest Hadoop user group
Sadayuki Furuhashi
Engineer & Co-Founder
MessagePack, Fluentd Inventor
9.
Engineers are actively contributing to Presto, Hive, Hadoop, Rails, Ruby, and React, among others.
10.
We open-source! TD invented:
○ Fluentd - streaming log collector
○ Embulk - bulk data import/export
○ MessagePack - efficient binary serialization
○ Hivemall - machine learning on Hadoop
○ Digdag - workflow engine
○ Fluent Bit - embedded version of Fluentd
15.
Plan of the talk
1. Introduction to myself, company, and my OSS experience
2. Introduction to Apache Hivemall
3. Apache Incubation process
• What's Hivemall? - project overview
• Deep dive into the features of Hivemall
• Open-source use cases and internal use cases at Treasure Data
• Who am I? What interested me in open source?
• Treasure Data? How does my company cope with OSS?
• How did Hivemall get into the ASF incubator? Why ASF?
• What's required, and what is the hardest part of it?
• What is the most overlooked part of the incubation process?
• Lessons learned from my (ongoing) ASF incubation experience
16.
Project overview – Apache Hivemall
○ Scalable machine learning library for Apache Hive/Spark/Pig
○ Initially released in 2014 when I was a researcher at AIST
● InfoWorld Bossie Awards 2014: the best open source big data tools
● Talked at Hadoop Summit 2014 (got lots of attention)
● 500+ GitHub stars and 150+ forks
● 15 contributors before joining the ASF incubator
○ Incubating since Sept 2016
● Recruited mentors from Hortonworks, Databricks, Microsoft, and Pivotal
● Contributors from Treasure Data, NTT, and other individuals
○ Planning to graduate from the incubator in 2020
● Needs more ASF releases and external contributions (community growth)
17. BigQuery ML at Google I/O 2018
https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html
Hadoop Conf Japan - Mar 14, 2019
18.
Open-source Machine Learning Solution
for SQL-on-Hadoop
hivemall.apache.org (incubating)
28.
SELECT train_xgboost_classifier(features, label) AS (model_id, model)
FROM training_data

XGBoost support in Hivemall (beta version)

SELECT rowid, AVG(predicted) AS predicted
FROM (
  -- predict with each model
  SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
  -- join each test record with each model
  FROM xgboost_models CROSS JOIN test_data_with_id
) t
GROUP BY rowid;
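The prediction query fans every test row out across all trained models (CROSS JOIN), scores each pair, then averages per rowid (GROUP BY + AVG). The same ensemble-averaging shape in plain Python, with a purely hypothetical toy scorer standing in for the real XGBoost UDFs:

```python
from collections import defaultdict

# Toy stand-ins: each "model" is just a bias term here (illustrative only)
models = {"m1": 0.2, "m2": 0.4, "m3": 0.3}
test_data = {1: [1.0, 2.0], 2: [0.5, 0.1]}  # rowid -> features

def xgboost_predict(features, bias):
    # hypothetical scorer standing in for the real UDF
    return sum(features) + bias

# CROSS JOIN: every (rowid, model) pair gets a prediction ...
scores = defaultdict(list)
for rowid, features in test_data.items():
    for model_id, bias in models.items():
        scores[rowid].append(xgboost_predict(features, bias))

# ... then GROUP BY rowid with AVG(predicted)
predicted = {rowid: sum(s) / len(s) for rowid, s in scores.items()}
print(predicted)
```

Averaging over independently trained models is a simple bagging-style ensemble; the SQL expresses it declaratively, letting Hive parallelize the join and the aggregation.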
29. 2018/2/17 HackerTackle
Real-world ML pipelines (could be more complex):
Data source #1 / #2 / #3 → Extract Feature → Join →
Feature Engineering (Feature Scaling, Feature Hashing, Feature Selection) →
Train by Logistic Regression / RandomForest / Factorization Machines →
Ensemble → Evaluate → Predict
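Of the feature-engineering steps in that pipeline, feature hashing is the one that most often puzzles newcomers: it maps arbitrary (possibly unbounded) feature names into a fixed-size index space so model weights fit in a dense array. A minimal sketch, assuming a CRC32 hash and a 2^16 space (both illustrative choices, not Hivemall's actual defaults):

```python
import zlib

NUM_FEATURES = 2 ** 16  # illustrative hash-space size

def hash_feature(name):
    """Map an arbitrary feature name to a bounded integer index."""
    return zlib.crc32(name.encode()) % NUM_FEATURES

# Raw features can be categorical crosses, URLs, anything string-valued
raw = {"user:alice": 1.0, "page:/home": 1.0, "hour_of_day": 13.0}
hashed = {hash_feature(k): v for k, v in raw.items()}
print(hashed)  # sparse vector keyed by hashed indices
```

The trade-off is occasional hash collisions in exchange for constant memory, which is what makes the technique practical at Hadoop scale.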
37.
Plan of the talk
1. Introduction to myself, company, and my OSS experience
2. Introduction to Apache Hivemall
3. Apache Incubation process
• What's Hivemall? - project overview
• Deep dive into the features of Hivemall
• Open-source use cases and internal use cases at Treasure Data
• Who am I? What interested me in open source?
• Treasure Data? How does my company cope with OSS?
• How did Hivemall get into the ASF incubator? Why ASF?
• What's required, and what is the hardest part of it?
• What is the most overlooked part of the incubation process?
• Lessons learned from my (ongoing) ASF incubation experience
38.
○ Personal projects never live long :-(
○ Got a lot of attention at Hadoop Summit 2014
● A Hortonworks developer recommended that Hivemall join the ASF incubator
● An Apache Pig developer was evaluating Hivemall (recruited him as an initial project member)
○ Apache is a trusted brand for developers
● ASF's meritocracy model
● The Apache Way: open governance, community over code
○ ASF is a natural choice
● Hivemall runs on top of the ASF Hadoop ecosystem
How did Hivemall get into the Apache Incubator? Why ASF?
39.
○ Recruiting a champion/mentors
● Need to recruit ASF veterans who know the ASF incubation process well
■ Our CTO is a friend of Roman (a previous Incubator Chair) and introduced him
● Big companies hire ASF member(s) for incubating a project
● If your company has 2-3+ ASF members, the process will be smoother
https://wiki.apache.org/incubator/HivemallProposal
○ ASF members' assistance/votes are mandatory in the incubation process
● Release votes require three +1s from ASF members
● Project setup, mentor sign-off for project reports
● Toward the graduation process
Hardest part of the Apache Incubator
- Mentors can become unresponsive over time (e.g., due to job role changes)
- Volunteering is limited (without $$ possibilities)
- Most graduated projects are developed mainly by company-hired engineers (Cloudera/Hortonworks/IBM ...) with external contributions (small patches)
40.
○ Community building
● Not mandatory for Incubator projects, but expected to build an active community
● Usually, company-backed engineers work hard on developing core features
■ A <user@> mailing list is not required; <dev@> is the place for discussion in the Incubator
■ Not so many active developers
● Meetup(s)
■ Held 4 meetups in Tokyo in total
■ Location problems (better to have one in the US; lack of connections)
○ Overlooked costs of ASF incubation
● The release process is restricted by Incubator policy (e.g., votes and license inspection)
● Time spent on the incubation process
■ Incubation reports, a project status page, a project page
■ Artifact distribution procedures
Hardest part of the Apache Incubator (for us)
41.
Lesson learned: engineering trends change rapidly
Postmortem: Better to incubate early and graduate soon
2014 was the peak; 2016 was too late to join the ASF incubator
Apart from frameworks, a standalone library has a longer life cycle
42.
OSS projects (apart from Hivemall)
A full-fledged B+-tree written in pure Java (to appear)
https://github.com/myui/btree4j
The B+-tree is a widely known data structure, but there are no good OSS libraries for it. LSM-trees and Masstree are based on B+-trees.
Extracted as a library from my past work on an XML-native DB
Currently preparing a project page and a performance comparison