Podling Hivemall in the Apache Incubator

Research Engineer
Makoto YUI @myui
<myui@treasure-data.com>
2016/11/08 Apache Hadoop Meetup at CWT 2016

Hivemall entered Apache Incubator
on Sept 13, 2016 🎉
hivemall.incubator.apache.org
@ApacheHivemall

Initial committers
• Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT>
  ➢ Hivemall on Apache Spark
• Daniel Dai <Hortonworks>
  ➢ Hivemall on Apache Pig
  ➢ Apache Pig PMC member
• Tsuyoshi Ozawa <NTT>
  ➢ Apache Hadoop PMC member
• Kai Sasaki <Treasure Data>

Project mentors (Champion and Nominated Mentors)
• Reynold Xin <Databricks, ASF member>
Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member>
Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member>
Apache Spark PMC member
• Roman Shaposhnik <Pivotal, ASF member>
Apache Bigtop/Incubator PMC member

What is Apache Hivemall
Scalable machine learning library
built as a collection of Hive UDFs
Multi/Cross-platform, Versatile, Scalable, Ease-of-use

Hivemall is easy and scalable …
Classification with Mahout
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
Ease-of-use: ML made easy for SQL developers
Scalable: Born to be parallel and scalable
This SQL query automatically runs in parallel on a Hadoop cluster
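
The map/reduce split in the query above can be sketched outside Hive. The following is a minimal Python illustration, not Hivemall's implementation: `train_partition` is a hypothetical stand-in for the map-only `logress` call (plain SGD for logistic regression is assumed here), and `average_models` mirrors `GROUP BY feature` with `avg(weight)`. The partitions, epochs, and learning rate are made up for the example.

```python
import math
from collections import defaultdict

def train_partition(rows, epochs=50, eta=0.1):
    """Map side (hypothetical stand-in for logress): SGD over one partition."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(weights[f] * v for f, v in features.items())
            pred = 1.0 / (1.0 + math.exp(-z))
            for f, v in features.items():
                weights[f] += eta * (label - pred) * v
    return list(weights.items())  # emitted as (feature, weight) pairs

def average_models(emitted):
    """Reduce side: group the emitted pairs by feature and average the weights."""
    sums, counts = defaultdict(float), defaultdict(int)
    for feature, weight in emitted:
        sums[feature] += weight
        counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}

# Two made-up partitions; each would be handled by a separate mapper.
partitions = [
    [({"x": 1.0, "bias": 1.0}, 1), ({"x": -1.0, "bias": 1.0}, 0)],
    [({"x": 2.0, "bias": 1.0}, 1), ({"x": -2.0, "bias": 1.0}, 0)],
]
emitted = [pair for part in partitions for pair in train_partition(part)]
lr_model = average_models(emitted)  # feature -> averaged weight
```

Each partition is trained independently, so the map phase needs no coordination; only the small (feature, weight) pairs are shuffled to the reducers.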

Hivemall is a multi/cross-platform ML library
HiveQL, SparkSQL/DataFrame API, Pig Latin
Prediction models built by Hive can be used from Spark, and
conversely, prediction models built by Spark can be used from Hive
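
One way to see why a model can move between platforms: the training query shown earlier stores the model as plain (feature, weight) rows, so any engine that can join and sum can score with it. The Python sketch below assumes a logistic regression model of that shape; the `predict` helper and the weights are hypothetical and are not part of Hivemall's API.

```python
import math

def predict(model, features):
    """Join the row's features with the model, sum weight * value, apply sigmoid."""
    z = sum(model[f] * v for f, v in features.items() if f in model)
    return 1.0 / (1.0 + math.exp(-z))

# Made-up model table: feature -> weight, as it could be read from any platform.
lr_model = {"x": 1.5, "bias": -0.2}

# Features missing from the model (here "unseen") contribute nothing.
prob = predict(lr_model, {"x": 1.0, "bias": 1.0, "unseen": 3.0})
```

Nothing in the scoring step depends on which engine produced the weights, which is the property the slide describes.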

Hivemall on Apache Hive

Hivemall on Apache Spark DataFrame

Hivemall on SparkSQL

Hivemall on Apache Pig

Versatile
Hivemall is a versatile library
✓ Hivemall is not only for machine learning
✓ Hivemall provides a bunch of generic utility functions (e.g., top-k, NLP)
Each organization has its own set of UDFs for data preprocessing!
Don’t Repeat Yourself!
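
As an illustration of the kind of generic utility meant by "top-k", here is a small Python sketch of per-group top-k selection with a bounded heap. It shows the general technique only; the function name, row layout, and data are assumptions and do not reflect Hivemall's actual UDF signature.

```python
import heapq
from collections import defaultdict

def top_k_per_group(rows, k):
    """Keep the k highest-scoring (score, item) pairs for each group."""
    heaps = defaultdict(list)
    for group, score, item in rows:
        heap = heaps[group]
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        else:
            # Push the new pair, then drop the smallest so the heap stays at size k.
            heapq.heappushpop(heap, (score, item))
    return {g: sorted(h, reverse=True) for g, h in heaps.items()}

# Made-up rows: (group, score, item)
rows = [("a", 3, "x"), ("a", 1, "y"), ("a", 2, "z"), ("b", 5, "w")]
top2 = top_k_per_group(rows, 2)
```

Memory stays bounded at k entries per group, which is why this pattern is a common preprocessing helper and worth sharing instead of rewriting in each organization.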

Conclusion and Takeaway
Hivemall is a machine learning library that is multi/cross-platform, versatile, scalable, and easy to use.
We welcome your contributions to Apache Hivemall :-)
hivemall.incubator.apache.org
