Copyright 1995-2018 Arm Limited (or its affiliates). All rights reserved.
Apache Hivemall and my OSS experience
Makoto Yui @myui
Principal Engineer
Plan of the talk
1. Introduction to myself, company, and my OSS experience (5m)
• Who am I? What interested me in open source?
• Treasure Data? How does my company deal with OSS?
2. Introduction to Apache Hivemall (15m)
• What’s Hivemall? - project overview
• Deep dive into the features of Hivemall
• Open-source use cases and internal use cases at Treasure Data
3. Apache Incubation process (5m)
• How did Hivemall get into the ASF Incubator? Why ASF?
• What’s required, and what is the hardest part of it?
• What is the most overlooked part of the incubation process?
• Lessons learned from my (ongoing) ASF incubation experience
About Me: Makoto Yui @myui
○ Leading the development of Apache Hivemall (incubating at ASF)
○ ML Engineer with DB system research background
● Developing ML features (and underlying systems) at a SaaS company
■ Joined Treasure Data in April 2015 (4 years ago)
■ Working at the Tokyo branch of a Silicon Valley company (acquired by Arm in July 2018)
● Ph.D. (CS) from NAIST in 2009
■ majored in Parallel Database Systems and XML native database systems
(e.g., non-blocking lock-free DB buffer management at ICDE 2010)
● As a DB researcher
■ Postdoc at CWI (MonetDB team in CWI Amsterdam; columnar in-memory DB pioneer)
■ 5 years at AIST (National research institute in Tsukuba) as a Senior Researcher
● Past and current interests
■ Query+FP Language → Parallel DB → In-database Analytics (OLAP++) → Scalable Machine Learning (now) → ?
My OSS history
○ Big fan of OSS since I was an undergraduate student
● I was using Red Hat Linux on my laptop
○ Intern at a small startup
● FreeBSD 4.2+, PostgreSQL 6.4+~7.x, PHP 4, and plain old C
● PostgreSQL and GLib (not glibc) were my favorite projects
○ XpSQL at Gborg (first OSS for me)
● Funded by a government fund for young software engineers
● My Bachelor thesis in 2003:
Building a multi-functional XML database environment using RDBMS
What interested you in open source?
○ The Linux movement when I was an undergraduate student
○ Interested in the well-designed code of Postgres
● ❤ Data structures and algorithms for big data
■ The B+-tree was much more interesting to me than the (in-memory) binary tree
○ Communication with excellent engineers from other organizations
○ More interested in library development than application development
○ Good for career development (hard to find jobs with no GitHub repos)
○ Why not OSS?
● Not so many excellent talents in a single organization for library development
● Developers generally prefer standard OSS libraries (avoiding vendor/company lock-in)
👍 for Open for Closed
Arm Treasure Data
Company Profile
THE TREASURE DATA JOURNEY
2011: Open Source Creator
• Founded at SV by OSS enthusiasts
• Fluentd founders: 2 million+ users
2012: Cloud Data Analytics Platform
• Data unified at scale
• Data analytics pipeline as a service
2016: Customer Data Platform
• Predictive and personalized marketing at-scale
• Enterprise focused
2018: Acquired by Arm
Treasure Data Founders
Hironobu Yoshikawa, CEO & Co-Founder: Open source business veteran
Kazuki Ohta, CTO & Co-Founder: Founder of world’s largest Hadoop Group
Sadayuki Furuhashi, Engineer & Co-Founder: MessagePack, Fluentd Inventor
Engineers are actively contributing to Presto, Hive, Hadoop, Rails, Ruby, React among others.
We Open-source! TD invented ..
• Streaming log collector
• Bulk data import/export
• Efficient binary serialization
• Machine learning on Hadoop
• Workflow Engine
• Embedded version of Fluentd
Customer Data Platform (CDP) as a Service
A Customer Data Platform is a marketer-controlled integrated customer
database that can support coordinated programs across multiple channels.
Data Collection | Insights, Segmentation, Syndication | Campaign Execution
AUDIENCE BUILDER AND SEGMENTATION FOR DIGITAL MARKETERS
AUDIENCE BUILDER | SEGMENTATION & ACTIVATION
PREDICTIVE SEGMENTATION
Atop the foundation of unified customer data, you can leverage our machine learning technology + experts to build Predictive Customer Scoring, identifying high-value prospects at scale based on algorithms.
Powered by
2. Introduction to Apache Hivemall
Project overview – Apache Hivemall
○ Scalable Machine Learning library for Apache Hive/Spark/Pig
○ Initially released in 2014 when I was a researcher at AIST
● InfoWorld Bossie Awards 2014: The best open source big data tools
● Gave a talk at Hadoop Summit 2014 (got lots of attention)
● 500+ GitHub stars and 150+ forks
● 15 contributors before joining the ASF Incubator
○ Incubating since Sept 2016
● Recruited mentors from Hortonworks, Databricks, Microsoft, and Pivotal
● Contributors from Treasure Data, NTT, and other individuals
○ Planning to graduate from the Incubator in 2020
● Needs more ASF releases and external contributions (community growth)
BigQuery ML at Google I/O 2018
https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html
Hadoop Conf Japan - Mar 14, 2019
Open-source Machine Learning Solution
for SQL-on-Hadoop
hivemall.apache.org (incubating)
HiveQL
SparkSQL/DataFrame API
Pig Latin
Hivemall is a multi/cross-platform ML library that provides a rich set of functions
Hivemall on Apache Hive
Hivemall on Apache Spark DataFrame
Hivemall on SparkSQL
Hivemall on Apache Pig
Online Prediction by Apache Streaming
New in v0.5.2 – Brickhouse UDFs
JSON
HyperLogLog
Field-aware Factorization Machines
Okapi BM25 term weighting
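Okapi BM25 weights a term by its frequency in a document, damped by document length, multiplied by an inverse document frequency. A minimal Python sketch of the textbook formula is below. The function name `bm25_weight`, its parameter names, and the defaults `k1=1.2` and `b=0.75` are illustrative assumptions, not the signature of Hivemall's UDF, and the IDF shown is one common smoothed variant.

```python
import math


def bm25_weight(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Okapi BM25 weight of one term in one document (textbook form).

    All names and defaults here are assumptions made for illustration.
    """
    # Smoothed inverse document frequency; stays positive for frequent terms.
    idf = math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # Term-frequency saturation with document-length normalization.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


if __name__ == "__main__":
    # A term occurring 3 times in an average-length document,
    # present in 10 of 1000 documents.
    print(bm25_weight(tf=3, doc_len=100, avg_doc_len=100, num_docs=1000, doc_freq=10))
```

The weight grows with term frequency but saturates, and a longer-than-average document gets a lower weight for the same term count.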
XGBoost support in Hivemall (beta version)
-- train models
SELECT train_xgboost_classifier(features, label) AS (model_id, model)
FROM training_data;

-- average the per-model predictions for each test row
SELECT rowid, AVG(predicted) AS predicted
FROM (
  -- predict with each model
  SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
  -- join each test record with each model
  FROM xgboost_models CROSS JOIN test_data_with_id
) t
GROUP BY rowid;
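The prediction query cross-joins every test row with every model and then averages the per-model scores by rowid. A rough Python sketch of that averaging step follows. `average_predictions` and the toy lambda "models" are hypothetical stand-ins for illustration; nothing here calls XGBoost or Hivemall.

```python
from collections import defaultdict
from itertools import product


def average_predictions(models, test_rows):
    """Score every (model, row) pair, then average the scores per rowid.

    `models` is a list of callables taking a feature list (an assumption
    made for this sketch); `test_rows` is a list of (rowid, features).
    """
    scores = defaultdict(list)
    # Equivalent of the CROSS JOIN: each test row meets each model once.
    for model, (rowid, features) in product(models, test_rows):
        scores[rowid].append(model(features))
    # Equivalent of GROUP BY rowid with AVG(predicted).
    return {rowid: sum(vals) / len(vals) for rowid, vals in scores.items()}


if __name__ == "__main__":
    # Toy "models": plain functions from a feature list to a score.
    models = [lambda f: 0.2 * f[0], lambda f: 0.4 * f[0]]
    test_rows = [(1, [1.0]), (2, [2.0])]
    print(average_predictions(models, test_rows))
```

Averaging is only one way to combine models; it is shown here because it is what the `AVG(predicted)` in the query does.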
2018/2/17 HackerTackle
Real-world ML pipelines (could be more complex)
• Datasource #1, Datasource #2, Datasource #3
• Join, Extract Feature
• Feature Engineering: Feature Scaling, Feature Hashing, Feature Selection
• Train by Logistic Regression, RandomForest, Factorization Machines
• Ensemble, Evaluate, Predict
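Two of the feature-engineering steps named in this pipeline, feature scaling and feature hashing, are simple enough to sketch. The Python below shows min-max rescaling and the hashing trick. The function names, the "index:value" output format, and the use of CRC32 as the hash are assumptions chosen for illustration and are not Hivemall's implementation.

```python
import zlib


def rescale(value, min_value, max_value):
    """Min-max scaling of a value into the range [0, 1]."""
    if max_value == min_value:
        return 0.0  # degenerate range: avoid division by zero
    return (value - min_value) / (max_value - min_value)


def hash_feature(name, num_buckets=2 ** 20):
    """Hashing trick: map a feature name to a bucket index in [0, num_buckets)."""
    # CRC32 is deterministic across runs, unlike Python's built-in hash().
    return zlib.crc32(name.encode("utf-8")) % num_buckets


def hash_features(features, num_buckets=2 ** 20):
    """Turn {"name": value} pairs into a list of "index:value" strings."""
    return [f"{hash_feature(name, num_buckets)}:{value}"
            for name, value in features.items()]


if __name__ == "__main__":
    age = rescale(30.0, 0.0, 120.0)
    print(age)
    print(hash_features({"age": age, "country=JP": 1.0}, num_buckets=1024))
```

Hashing bounds the feature space to a fixed number of buckets at the cost of occasional collisions, so the bucket count trades memory against accuracy.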
Hivemall Digdag
Machine Learning Workflow using Digdag
Meal-kit company: Churn prediction
• Web, Mobile
• Source data: User attributes, Inflow sources, Activity logs, Complaints, Active services
• Machine Learning
• Direct campaign, Indirect approach
• Coupon, Reward, Follow-up call, UI renewal, Best practice
• A/B
Insurance company: Call center optimization
• Web, Mobile
• User attributes, Inflow sources, Activity logs, Interest insurances, Contact histories, Insurance applications, Maturity date
• Machine Learning
• Closing probability list: 89%, 42%, 12%
• Increase sales commission!
Retailer: Inventory optimization
Deterministic distribution by heuristics
Other Industry use cases of Hivemall
Klout – influencer marketing
bit.ly/klout-hivemall
bit.ly/2whJCQj
T-mobile.au
3. Apache Incubation process
How did Hivemall get into the Apache Incubator? Why ASF?
○ Personal projects never live long :-(
○ Got a lot of attention at Hadoop Summit 2014
● A Hortonworks developer recommended that Hivemall join the ASF Incubator
● An Apache Pig developer was evaluating Hivemall (asked him to join as an initial project member)
○ Apache is a trusted brand for developers
● ASF’s meritocracy model
● Apache way: open governance, community over code
○ ASF is a natural choice
● Hivemall runs on top of the ASF Hadoop ecosystem
Hardest part for Apache Incubator
○ Recruiting Champion/mentors
● Need to recruit an ASF veteran who knows the ASF incubation process well
■ Our CTO is a friend of Roman (the previous Incubator chair) and introduced him
● Big companies hire ASF member(s) to incubate a project
● If your company has 2-3+ ASF members, the process will be smoother
https://wiki.apache.org/incubator/HivemallProposal
○ An ASF member’s assistance/vote is mandatory in the incubation process
● Release votes require three +1 votes from ASF members
● Project setup, mentor sign-off for project reports
● Steps toward graduation
- Mentors can become unresponsive over time (e.g., due to a job role change)
- Volunteering is limited (without $$ possibilities)
- Most graduated projects are developed mainly by company-hired engineers (Cloudera/Hortonworks/IBM ...) with external contributions (small patches)
Hardest part for Apache Incubator (for us)
○ Community building
● Not mandatory for Incubator projects, but they are expected to build an active community
● Usually, company-backed engineers work hard on developing core features
■ A <user@> mailing list is not required; <dev@> is the place for discussion in the Incubator
■ Not so many active developers
● Meetup(s)
■ Held 4 meetups in Tokyo in total
■ Location problems (better to have one in the US; lack of connections)
○ Overlooked cost of ASF incubation
● The release process is restricted by Incubator policy (e.g., votes and license inspection)
● Time spent on the incubation process
■ Incubation reports, a project status page, project page
■ Artifact distribution procedures
Lesson learned: Engineering trends change rapidly
Postmortem: Better to incubate early and graduate soon
2004 was the peak; 2016 was too late to join the ASF Incubator
Compared with frameworks, standalone libraries have a longer life cycle
OSS project (apart from Hivemall)
Full-fledged B+-tree written in pure Java (to appear)
https://github.com/myui/btree4j
The B+-tree is a widely known data structure, but there is no good OSS library for it. LSM-tree and Masstree are based on the B+-tree
Extracted as a library from my past work on an XML native DB
Currently preparing a project page and a performance comparison
Thank You!
Danke!
Merci!
谢谢!
Gracias!
Kiitos!
Copyright 1995-2018 Arm Limited (or its affiliates). All rights reserved.

More Related Content

What's hot

The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
Julien Le Dem
 
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsData Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Esther Vasiete
 
AgensGraph: a Multi-model Graph Database based on PostgreSql
AgensGraph: a Multi-model Graph Database based on PostgreSqlAgensGraph: a Multi-model Graph Database based on PostgreSql
AgensGraph: a Multi-model Graph Database based on PostgreSql
Kisung Kim
 
Berlin buzzwords 2018 TensorFlow on Hops
Berlin buzzwords 2018 TensorFlow on HopsBerlin buzzwords 2018 TensorFlow on Hops
Berlin buzzwords 2018 TensorFlow on Hops
Jim Dowling
 
Apache Drill (ver. 0.2)
Apache Drill (ver. 0.2)Apache Drill (ver. 0.2)
Apache Drill (ver. 0.2)
Camuel Gilyadov
 
How To Visualize Graphs
How To Visualize GraphsHow To Visualize Graphs
How To Visualize Graphs
Jean Ihm
 
The MADlib Analytics Library
The MADlib Analytics Library The MADlib Analytics Library
The MADlib Analytics Library
EMC
 
DBpedia Japanese
DBpedia JapaneseDBpedia Japanese
DBpedia Japanese
Fumihiro Kato
 
The Bitter Lesson of ML Pipelines
The Bitter Lesson of ML Pipelines The Bitter Lesson of ML Pipelines
The Bitter Lesson of ML Pipelines
Jim Dowling
 
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
Jean Ihm
 
Build Knowledge Graphs with Oracle RDF to Extract More Value from Your Data
Build Knowledge Graphs with Oracle RDF to Extract More Value from Your DataBuild Knowledge Graphs with Oracle RDF to Extract More Value from Your Data
Build Knowledge Graphs with Oracle RDF to Extract More Value from Your Data
Jean Ihm
 
Talk about Hivemall at Data Scientist Organization on 2015/09/17
Talk about Hivemall at Data Scientist Organization on 2015/09/17Talk about Hivemall at Data Scientist Organization on 2015/09/17
Talk about Hivemall at Data Scientist Organization on 2015/09/17
Makoto Yui
 
NumPy Roadmap presentation at NumFOCUS Forum
NumPy Roadmap presentation at NumFOCUS ForumNumPy Roadmap presentation at NumFOCUS Forum
NumPy Roadmap presentation at NumFOCUS Forum
Ralf Gommers
 
Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)
Jim Dowling
 
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
DataWorks Summit/Hadoop Summit
 
Jfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocksJfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocks
Jim Dowling
 
DIY Analytics with Apache Spark
DIY Analytics with Apache SparkDIY Analytics with Apache Spark
DIY Analytics with Apache Spark
Adam Roberts
 
Data Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsData Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsWes McKinney
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQL
DataWorks Summit
 

What's hot (20)

The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsData Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
 
AgensGraph: a Multi-model Graph Database based on PostgreSql
AgensGraph: a Multi-model Graph Database based on PostgreSqlAgensGraph: a Multi-model Graph Database based on PostgreSql
AgensGraph: a Multi-model Graph Database based on PostgreSql
 
Berlin buzzwords 2018 TensorFlow on Hops
Berlin buzzwords 2018 TensorFlow on HopsBerlin buzzwords 2018 TensorFlow on Hops
Berlin buzzwords 2018 TensorFlow on Hops
 
Apache Drill (ver. 0.2)
Apache Drill (ver. 0.2)Apache Drill (ver. 0.2)
Apache Drill (ver. 0.2)
 
How To Visualize Graphs
How To Visualize GraphsHow To Visualize Graphs
How To Visualize Graphs
 
The MADlib Analytics Library
The MADlib Analytics Library The MADlib Analytics Library
The MADlib Analytics Library
 
DBpedia Japanese
DBpedia JapaneseDBpedia Japanese
DBpedia Japanese
 
The Bitter Lesson of ML Pipelines
The Bitter Lesson of ML Pipelines The Bitter Lesson of ML Pipelines
The Bitter Lesson of ML Pipelines
 
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
 
Build Knowledge Graphs with Oracle RDF to Extract More Value from Your Data
Build Knowledge Graphs with Oracle RDF to Extract More Value from Your DataBuild Knowledge Graphs with Oracle RDF to Extract More Value from Your Data
Build Knowledge Graphs with Oracle RDF to Extract More Value from Your Data
 
Talk about Hivemall at Data Scientist Organization on 2015/09/17
Talk about Hivemall at Data Scientist Organization on 2015/09/17Talk about Hivemall at Data Scientist Organization on 2015/09/17
Talk about Hivemall at Data Scientist Organization on 2015/09/17
 
NumPy Roadmap presentation at NumFOCUS Forum
NumPy Roadmap presentation at NumFOCUS ForumNumPy Roadmap presentation at NumFOCUS Forum
NumPy Roadmap presentation at NumFOCUS Forum
 
Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)
 
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
 
Jfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocksJfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocks
 
Numba lightning
Numba lightningNumba lightning
Numba lightning
 
DIY Analytics with Apache Spark
DIY Analytics with Apache SparkDIY Analytics with Apache Spark
DIY Analytics with Apache Spark
 
Data Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsData Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodels
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQL
 

Similar to Apache Hivemall and my OSS experience

Managing Machine Learning workflows on Treasure Data
Managing Machine Learning workflows on Treasure DataManaging Machine Learning workflows on Treasure Data
Managing Machine Learning workflows on Treasure Data
Aki Ariga
 
Oracle Data Science Platform
Oracle Data Science PlatformOracle Data Science Platform
Oracle Data Science Platform
Oracle Developers
 
Open Source AI - News and examples
Open Source AI - News and examplesOpen Source AI - News and examples
Open Source AI - News and examples
Luciano Resende
 
Keynote at Converge 2019
Keynote at Converge 2019Keynote at Converge 2019
Keynote at Converge 2019
Travis Oliphant
 
SpagoBI 5 Demo Day and Workshop : Technology Applications and Uses
SpagoBI 5 Demo Day and Workshop : Technology Applications and UsesSpagoBI 5 Demo Day and Workshop : Technology Applications and Uses
SpagoBI 5 Demo Day and Workshop : Technology Applications and Uses
SpagoWorld
 
Apache AGE and the synergy effect in the combination of Postgres and NoSQL
 Apache AGE and the synergy effect in the combination of Postgres and NoSQL Apache AGE and the synergy effect in the combination of Postgres and NoSQL
Apache AGE and the synergy effect in the combination of Postgres and NoSQL
EDB
 
20181019 code.talks graph_analytics_k_patenge
20181019 code.talks graph_analytics_k_patenge20181019 code.talks graph_analytics_k_patenge
20181019 code.talks graph_analytics_k_patenge
Karin Patenge
 
GraphPipe - Blazingly Fast Machine Learning Inference by Vish Abrams
GraphPipe - Blazingly Fast Machine Learning Inference by Vish AbramsGraphPipe - Blazingly Fast Machine Learning Inference by Vish Abrams
GraphPipe - Blazingly Fast Machine Learning Inference by Vish Abrams
Oracle Developers
 
Db tech show - hivemall
Db tech show - hivemallDb tech show - hivemall
Db tech show - hivemallMakoto Yui
 
Seminole County Teach In 2017: Crooms Acadamy of Information Technology
Seminole County Teach In 2017: Crooms Acadamy of Information TechnologySeminole County Teach In 2017: Crooms Acadamy of Information Technology
Seminole County Teach In 2017: Crooms Acadamy of Information Technology
Ed Burns
 
IBM Developer Model Asset eXchange
IBM Developer Model Asset eXchangeIBM Developer Model Asset eXchange
IBM Developer Model Asset eXchange
Nick Pentreath
 
Hadoopsummit16 myui
Hadoopsummit16 myuiHadoopsummit16 myui
Hadoopsummit16 myui
Makoto Yui
 
28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines
Timothy Spann
 
Presto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
Presto: SQL-on-Anything. Netherlands Hadoop User Group MeetupPresto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
Presto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
Wojciech Biela
 
Conf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python ProcessorsConf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python Processors
Timothy Spann
 
SIM RTP Meeting - So Who's Using Open Source Anyway?
SIM RTP Meeting - So Who's Using Open Source Anyway?SIM RTP Meeting - So Who's Using Open Source Anyway?
SIM RTP Meeting - So Who's Using Open Source Anyway?
Alex Meadows
 
From Hadoop to Enterprise Data Warehouse
From Hadoop to Enterprise Data WarehouseFrom Hadoop to Enterprise Data Warehouse
From Hadoop to Enterprise Data Warehouse
Bui Ha
 
S2DS London 2015 - Hadoop Real World
S2DS London 2015 - Hadoop Real WorldS2DS London 2015 - Hadoop Real World
S2DS London 2015 - Hadoop Real World
Sean Roberts
 
Maintainable Machine Learning Products
Maintainable Machine Learning ProductsMaintainable Machine Learning Products
Maintainable Machine Learning Products
Andrew Musselman
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
Wes McKinney
 

Similar to Apache Hivemall and my OSS experience (20)

Managing Machine Learning workflows on Treasure Data
Managing Machine Learning workflows on Treasure DataManaging Machine Learning workflows on Treasure Data
Managing Machine Learning workflows on Treasure Data
 
Oracle Data Science Platform
Oracle Data Science PlatformOracle Data Science Platform
Oracle Data Science Platform
 
Open Source AI - News and examples
Open Source AI - News and examplesOpen Source AI - News and examples
Open Source AI - News and examples
 
Keynote at Converge 2019
Keynote at Converge 2019Keynote at Converge 2019
Keynote at Converge 2019
 
SpagoBI 5 Demo Day and Workshop : Technology Applications and Uses
SpagoBI 5 Demo Day and Workshop : Technology Applications and UsesSpagoBI 5 Demo Day and Workshop : Technology Applications and Uses
SpagoBI 5 Demo Day and Workshop : Technology Applications and Uses
 
Apache AGE and the synergy effect in the combination of Postgres and NoSQL
 Apache AGE and the synergy effect in the combination of Postgres and NoSQL Apache AGE and the synergy effect in the combination of Postgres and NoSQL
Apache AGE and the synergy effect in the combination of Postgres and NoSQL
 
20181019 code.talks graph_analytics_k_patenge
20181019 code.talks graph_analytics_k_patenge20181019 code.talks graph_analytics_k_patenge
20181019 code.talks graph_analytics_k_patenge
 
GraphPipe - Blazingly Fast Machine Learning Inference by Vish Abrams
GraphPipe - Blazingly Fast Machine Learning Inference by Vish AbramsGraphPipe - Blazingly Fast Machine Learning Inference by Vish Abrams
GraphPipe - Blazingly Fast Machine Learning Inference by Vish Abrams
 
Db tech show - hivemall
Db tech show - hivemallDb tech show - hivemall
Db tech show - hivemall
 
Seminole County Teach In 2017: Crooms Acadamy of Information Technology
Seminole County Teach In 2017: Crooms Acadamy of Information TechnologySeminole County Teach In 2017: Crooms Acadamy of Information Technology
Seminole County Teach In 2017: Crooms Acadamy of Information Technology
 
IBM Developer Model Asset eXchange
IBM Developer Model Asset eXchangeIBM Developer Model Asset eXchange
IBM Developer Model Asset eXchange
 
Hadoopsummit16 myui
Hadoopsummit16 myuiHadoopsummit16 myui
Hadoopsummit16 myui
 
28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines
 
Presto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
Presto: SQL-on-Anything. Netherlands Hadoop User Group MeetupPresto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
Presto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
 
Conf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python ProcessorsConf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python Processors
 
SIM RTP Meeting - So Who's Using Open Source Anyway?
SIM RTP Meeting - So Who's Using Open Source Anyway?SIM RTP Meeting - So Who's Using Open Source Anyway?
SIM RTP Meeting - So Who's Using Open Source Anyway?
 
From Hadoop to Enterprise Data Warehouse
From Hadoop to Enterprise Data WarehouseFrom Hadoop to Enterprise Data Warehouse
From Hadoop to Enterprise Data Warehouse
 
S2DS London 2015 - Hadoop Real World
S2DS London 2015 - Hadoop Real WorldS2DS London 2015 - Hadoop Real World
S2DS London 2015 - Hadoop Real World
 
Maintainable Machine Learning Products
Maintainable Machine Learning ProductsMaintainable Machine Learning Products
Maintainable Machine Learning Products
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 

More from Makoto Yui

Introduction to Apache Hivemall v0.5.2 and v0.6
Introduction to Apache Hivemall v0.5.2 and v0.6Introduction to Apache Hivemall v0.5.2 and v0.6
Introduction to Apache Hivemall v0.5.2 and v0.6
Makoto Yui
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
Makoto Yui
 
Idea behind Apache Hivemall
Idea behind Apache HivemallIdea behind Apache Hivemall
Idea behind Apache Hivemall
Makoto Yui
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
Makoto Yui
 
What's new in Hivemall v0.5.0
What's new in Hivemall v0.5.0What's new in Hivemall v0.5.0
What's new in Hivemall v0.5.0
Makoto Yui
 
What's new in Apache Hivemall v0.5.0
What's new in Apache Hivemall v0.5.0What's new in Apache Hivemall v0.5.0
What's new in Apache Hivemall v0.5.0
Makoto Yui
 
Revisiting b+-trees
Revisiting b+-treesRevisiting b+-trees
Revisiting b+-trees
Makoto Yui
 
Incubating Apache Hivemall
Incubating Apache HivemallIncubating Apache Hivemall
Incubating Apache Hivemall
Makoto Yui
 
Hivemall meets Digdag @Hackertackle 2018-02-17
Hivemall meets Digdag @Hackertackle 2018-02-17Hivemall meets Digdag @Hackertackle 2018-02-17
Hivemall meets Digdag @Hackertackle 2018-02-17
Makoto Yui
 
Apache Hivemall @ Apache BigData '17, Miami
Apache Hivemall @ Apache BigData '17, MiamiApache Hivemall @ Apache BigData '17, Miami
Apache Hivemall @ Apache BigData '17, Miami
Makoto Yui
 
機械学習のデータ並列処理@第7回BDI研究会
機械学習のデータ並列処理@第7回BDI研究会機械学習のデータ並列処理@第7回BDI研究会
機械学習のデータ並列処理@第7回BDI研究会
Makoto Yui
 
Podling Hivemall in the Apache Incubator
Podling Hivemall in the Apache IncubatorPodling Hivemall in the Apache Incubator
Podling Hivemall in the Apache Incubator
Makoto Yui
 
Dots20161029 myui
Dots20161029 myuiDots20161029 myui
Dots20161029 myui
Makoto Yui
 
HadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myuiHadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myui
Makoto Yui
 
3rd Hivemall meetup
3rd Hivemall meetup3rd Hivemall meetup
3rd Hivemall meetup
Makoto Yui
 
Recommendation 101 using Hivemall
Recommendation 101 using HivemallRecommendation 101 using Hivemall
Recommendation 101 using Hivemall
Makoto Yui
 
Hivemall dbtechshowcase 20160713 #dbts2016
Hivemall dbtechshowcase 20160713 #dbts2016Hivemall dbtechshowcase 20160713 #dbts2016
Hivemall dbtechshowcase 20160713 #dbts2016
Makoto Yui
 
Introduction to Hivemall
Introduction to HivemallIntroduction to Hivemall
Introduction to Hivemall
Makoto Yui
 
Tdtechtalk20160425myui
Tdtechtalk20160425myuiTdtechtalk20160425myui
Tdtechtalk20160425myui
Makoto Yui
 
Tdtechtalk20160330myui
Tdtechtalk20160330myuiTdtechtalk20160330myui
Tdtechtalk20160330myui
Makoto Yui
 

Apache Hivemall and my OSS experience

  • 1. Apache Hivemall and my OSS experience. Makoto Yui @myui, Principal Engineer. Copyright 1995-2018 Arm Limited (or its affiliates). All rights reserved.
  • 2. Plan of the talk
      1. Introduction to myself, company, and my OSS experience (5m)
         • Who am I? What interested me in open source?
         • Treasure Data? How does my company cope with OSS?
      2. Introduction to Apache Hivemall (15m)
         • What’s Hivemall? - project overview
         • Deep dive into the features of Hivemall
         • Open-source use cases and internal use cases at Treasure Data
      3. Apache Incubation process (5m)
         • How did Hivemall get into the ASF incubator? Why ASF?
         • What’s required, and what is the hardest part of it?
         • What is the most overlooked part of the incubation process?
         • Lessons learned from my (ongoing) ASF incubation experience
  • 3. About Me: Makoto Yui @myui
      ○ Leading the development of Apache Hivemall (incubating at ASF)
      ○ ML Engineer with DB system research background
        ● Developing ML features (and underlying systems) at a SaaS company
          ■ Joined Treasure Data in April 2015 (4 years ago)
          ■ Working at the Tokyo branch of a Silicon Valley company (acquired by Arm in July 2018)
        ● Ph.D. (CS) in 2009 at NAIST
          ■ Majored in Parallel Database Systems and XML native database systems (e.g., non-blocking lock-free DB buffer management at ICDE 2010)
        ● As a DB researcher
          ■ Postdoc at CWI (MonetDB team in CWI Amsterdam; columnar in-memory DB pioneer)
          ■ 5 years at AIST (national research institute in Tsukuba) as a Senior Researcher
        ● Past and current interests
          ■ Query+FP Language → Parallel DB → In-database Analytics (OLAP++) → Scalable Machine Learning (now) → ?
  • 4. My OSS history
      ○ Big fan of OSS since I was an undergraduate student
        ● I was using Red Hat Linux on my laptop
      ○ Intern at a small startup
        ● FreeBSD 4.2+, PostgreSQL 6.4+~7.x, PHP 4, and plain old C
        ● PostgreSQL and GLib (not glibc) were my favorite projects
      ○ XpSQL at GBorg (first OSS for me)
        ● Funded by a government fund for young software engineers
        ● My Bachelor thesis in 2003: Building a multi-functional XML database environment using RDBMS
  • 5. What interested you in open source?
      ○ The Linux movement when I was an undergraduate student
      ○ Interested in the well-designed code of Postgres
        ● ❤ Data structures and algorithms for big data
          ■ The B+-tree was much more interesting to me than the (in-memory) binary tree
      ○ Communication with excellent engineers from other organizations
      ○ More interested in library development than application development
      ○ Good for career development (hard to find jobs with no GitHub repos)
      ○ Why not OSS?
        ● Not so many excellent talents in a single organization for library development
        ● Developers generally prefer standard OSS libraries (avoiding vendor/company lock-in)
      👍 for Open / for Closed
  • 6. Arm Treasure Data Company Profile
  • 7. THE TREASURE DATA JOURNEY
      ○ 2011: Open Source Creator
        ● Founded in Silicon Valley by OSS enthusiasts
        ● Fluentd founders: 2 million+ users
      ○ 2012: Cloud Data Analytics Platform
        ● Data unified at scale
        ● Data analytics pipeline as a service
      ○ 2016: Customer Data Platform
        ● Predictive and personalized marketing at scale
        ● Enterprise focused
      ○ 2018: Acquired by Arm
  • 8. Treasure Data Founders
      ○ Hironobu Yoshikawa, CEO & Co-Founder: open source business veteran
      ○ Kazuki Ohta, CTO & Co-Founder: founder of the world’s largest Hadoop group
      ○ Sadayuki Furuhashi, Engineer & Co-Founder: MessagePack and Fluentd inventor
  • 9. Engineers are actively contributing to Presto, Hive, Hadoop, Rails, Ruby, and React, among others.
  • 10. We Open-source! TD invented ..
      ○ Streaming log collector
      ○ Bulk data import/export
      ○ Efficient binary serialization
      ○ Machine learning on Hadoop
      ○ Workflow engine
      ○ Embedded version of Fluentd
  • 11.
  • 12. Customer Data Platform (CDP) as a Service
      A Customer Data Platform is a marketer-controlled integrated customer database that can support coordinated programs across multiple channels.
      Stages: Data Collection; Insights, Segmentation, Syndication; Campaign Execution
  • 13. AUDIENCE BUILDER AND SEGMENTATION FOR DIGITAL MARKETERS: AUDIENCE BUILDER; SEGMENTATION & ACTIVATION
  • 14. PREDICTIVE SEGMENTATION: Atop the foundation of unified customer data, you can leverage our machine learning technology + experts to build Predictive Customer Scoring, identifying high-value prospects at scale based on algorithms.
  • 15. Plan of the talk
      1. Introduction to myself, company, and my OSS experience
         • Who am I? What interested me in open source?
         • Treasure Data? How does my company cope with OSS?
      2. Introduction to Apache Hivemall
         • What’s Hivemall? - project overview
         • Deep dive into the features of Hivemall
         • Open-source use cases and internal use cases at Treasure Data
      3. Apache Incubation process
         • How did Hivemall get into the ASF incubator? Why ASF?
         • What’s required, and what is the hardest part of it?
         • What is the most overlooked part of the incubation process?
         • Lessons learned from my (ongoing) ASF incubation experience
  • 16. Project overview – Apache Hivemall
      ○ Scalable Machine Learning library for Apache Hive/Spark/Pig
      ○ Initially released in 2014 when I was a researcher at AIST
        ● InfoWorld Bossie Awards 2014: The best open source big data tools
        ● Talked at Hadoop Summit 2014 (got lots of attention)
        ● 500+ GitHub stars and 150+ forks
        ● 15 contributors before joining the ASF incubator
      ○ Incubating since Sept 2016
        ● Recruited mentors from Hortonworks, Databricks, Microsoft, and Pivotal
        ● Contributors from Treasure Data, NTT, and other individuals
      ○ Planning to graduate from the incubator in 2020
        ● Needs more ASF releases and external contributions (community growth)
  • 17. BigQuery ML at Google I/O 2018: https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html (Hadoop Conf Japan - Mar 14, 2019)
  • 18. Open-source Machine Learning Solution for SQL-on-Hadoop: hivemall.apache.org (incubating)
  • 19. Hivemall is a multi/cross-platform ML library that provides a rich set of functions: HiveQL; SparkSQL/DataFrame API; Pig Latin
  • 20. Hivemall on Apache Hive
  • 21. Hivemall on Apache Spark DataFrame
  • 22. Hivemall on SparkSQL
  • 23. Hivemall on Apache Pig
  • 24. Online Prediction by Apache Streaming
  • 25. New in v0.5.2 – Brickhouse UDFs: JSON, HyperLogLog
  • 26. Field-aware Factorization Machines
  • 27. Okapi BM25 term weighting
  • 28. XGBoost support in Hivemall (beta version)
      Training:
        SELECT train_xgboost_classifier(features, label) as (model_id, model)
        FROM training_data
      Prediction:
        SELECT rowid, AVG(predicted) as predicted
        FROM (
          -- predict with each model
          SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
          -- join each test record with each model
          FROM xgboost_models CROSS JOIN test_data_with_id
        ) t
        GROUP BY rowid;
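The prediction query on slide 28 scores every test row with every trained model and then averages the scores per `rowid`. As a plain-Python illustration of that aggregation step only, here is a minimal sketch. It is not Hivemall's implementation: the function name `average_predictions` and the sample scores are hypothetical, chosen just to show what `AVG(predicted) ... GROUP BY rowid` computes.

```python
from collections import defaultdict


def average_predictions(rows):
    """Average per-model scores for each row id.

    rows: iterable of (rowid, predicted) pairs, one pair per
    (model, test record) combination, like the output of the inner query.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for rowid, predicted in rows:
        totals[rowid] += predicted
        counts[rowid] += 1
    # Equivalent of: SELECT rowid, AVG(predicted) ... GROUP BY rowid
    return {rowid: totals[rowid] / counts[rowid] for rowid in totals}


# Hypothetical scores from two models for two test rows.
per_model_scores = [
    (1, 0.5), (2, 0.25),   # scores from the first model
    (1, 1.0), (2, 0.75),   # scores from the second model
]
averaged = average_predictions(per_model_scores)
print(averaged)
```

Averaging is only one way to combine several models; the slide's query uses it because each call to the training function can yield more than one model row.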
  • 29. Real-world ML pipelines (could be more complex) (2018/2/17 HackerTackle)
      Diagram stages: Datasource #1, Datasource #2, Datasource #3; Join; Extract Feature; Feature Engineering; Feature Scaling; Feature Hashing; Feature Selection; Train by Logistic Regression; Train by RandomForest; Train by Factorization Machines; Ensemble; Evaluate; Predict
  • 33. Meal-kit company: Churn prediction (© 2018 Arm Limited)
      ○ Source data: User attributes, Inflow sources, Activity logs, Complaints, Active services
      ○ Diagram labels: Web, Mobile, Machine Learning, Direct campaign, Indirect approach, Coupon, Reward, Follow-up call, UI renewal, Best practice, A/B
  • 34. Insurance company: Call center optimization
      ○ Inputs: Web, Mobile, User attributes, Inflow sources, Activity logs, Interest insurances, Contact histories, Insurance applications, Maturity date
      ○ Machine Learning output: Closing probability list (89%, 42%, 12%)
      ○ Increase sales commission!
  • 35. Retailer: Inventory optimization. Deterministic distribution by heuristics
  • 36. Other industry use cases of Hivemall: Klout – influencer marketing (bit.ly/klout-hivemall); bit.ly/2whJCQj; T-mobile.au
  • 37. Plan of the talk
      1. Introduction to myself, company, and my OSS experience
         • Who am I? What interested me in open source?
         • Treasure Data? How does my company cope with OSS?
      2. Introduction to Apache Hivemall
         • What’s Hivemall? - project overview
         • Deep dive into the features of Hivemall
         • Open-source use cases and internal use cases at Treasure Data
      3. Apache Incubation process
         • How did Hivemall get into the ASF incubator? Why ASF?
         • What’s required, and what is the hardest part of it?
         • What is the most overlooked part of the incubation process?
         • Lessons learned from my (ongoing) ASF incubation experience
  • 38. How did Hivemall get into the Apache Incubator? Why ASF?
      ○ Personal projects never live long :-(
      ○ Got a lot of attention at Hadoop Summit 2014
        ● A Hortonworks developer recommended that Hivemall join the ASF incubator
        ● An Apache Pig developer was evaluating Hivemall (asked him to join as an initial project member)
      ○ Apache is a trusted brand for developers
        ● ASF’s meritocracy model
        ● Apache way: open governance, community over code
      ○ ASF is a natural choice
        ● Hivemall runs on top of the ASF Hadoop ecosystem
  • 39. Hardest part for Apache Incubator
      ○ Recruiting a Champion/mentors
        ● Need to recruit an ASF veteran who knows the ASF incubation process well
          ■ Our CTO is a friend of Roman (the previous Incubator Chair) and introduced him
        ● Big companies hire ASF member(s) for incubating a project
        ● If your company has 2-3+ ASF members, the process will be smoother
        https://wiki.apache.org/incubator/HivemallProposal
      ○ ASF members’ assistance/votes are mandatory in the incubation process
        ● Release votes require three +1s from ASF members
        ● Project setup, mentor sign-off for project reports
        ● Toward the graduation process
      - Mentors can become unresponsive over time (e.g., due to job role change)
      - Volunteering is limited (without $$ possibilities)
      - Most graduated projects are developed mainly by company-hired engineers (Cloudera/Hortonworks/IBM ...) with external contributions (small patches)
  • 40. Hardest part for Apache Incubator (for us)
      ○ Community building
        ● Not mandatory for Incubator projects, but they are expected to build an active community
        ● Usually, company-backed engineers work hard on developing core features
          ■ A <user@> mailing list is not required. <dev@> is the place for discussion in the Incubator
          ■ Not so many active developers
        ● Meetup(s)
          ■ Held 4 meetups in Tokyo in total
          ■ Location problems (better to have one in the US; lack of connections)
      ○ Overlooked cost of ASF incubation
        ● The release process is restricted by incubator policy (e.g., votes and license inspection)
        ● Time spent on the incubation process
          ■ Incubation report, a project status page, project page
          ■ Artifact distribution procedures
  • 41. Lesson learned: Engineering trends change rapidly
      ○ Postmortem: Better to incubate early and graduate soon
      ○ 2004 was peak.. 2016 was too late to join the ASF incubator
      ○ Apart from frameworks, standalone libraries have a longer life cycle
  • 42. OSS project (apart from Hivemall)
      ○ Full-fledged B+-tree written in pure Java (to appear): https://github.com/myui/btree4j
      ○ The B+-tree is a widely known data structure, but there is no good OSS library for it
      ○ LSM-tree and Mass-tree are based on the B+-tree
      ○ Extracted as a library from my past work on an XML native DB
      ○ Currently preparing a project page and a performance comparison
  • 43. Thank You! Danke! Merci! 谢谢! Gracias! Kiitos!