Podling Hivemall in the Apache Incubator
Research Engineer
Makoto YUI @myui
<myui@treasure-data.com>
2016/11/08 Apache Hadoop Meetup at CWT 2016
Hivemall entered the Apache Incubator on Sept 13, 2016 🎉
hivemall.incubator.apache.org
@ApacheHivemall
Initial committers
• Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT>
  - Hivemall on Apache Spark
• Daniel Dai <Hortonworks>
  - Hivemall on Apache Pig
  - Apache Pig PMC member
• Tsuyoshi Ozawa <NTT>
  - Apache Hadoop PMC member
• Kai Sasaki <Treasure Data>
Project mentors
Champion
Nominated Mentors
• Reynold Xin <Databricks, ASF member>
  - Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member>
  - Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member>
  - Apache Spark PMC member
• Roman Shaposhnik <Pivotal, ASF member>
  - Apache Bigtop/Incubator PMC member
What is Apache Hivemall
Scalable machine learning library built as a collection of Hive UDFs
• Multi/Cross platform
• Versatile
• Scalable
• Ease-of-use
Hivemall is easy and scalable …
Classification with Mahout
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) AS weight
FROM (
  SELECT logress(features, label, ..) AS (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
Ease-of-use: ML made easy for SQL developers
Scalable: Born to be parallel and scalable
This SQL query automatically runs in parallel on a Hadoop cluster
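The comments in the query describe a two-stage flow: each map-only task trains on its own split and emits (feature, weight) pairs, and the reducers average the weights per feature. A minimal Python sketch of that flow follows, assuming plain SGD logistic regression as a stand-in for logress. The function names, learning rate, epoch count, and toy data are hypothetical and are not part of Hivemall.

```python
import math
from collections import defaultdict


def train_partition(rows, eta=0.1, epochs=20):
    """Map-only step (hypothetical stand-in for logress): run SGD logistic
    regression over one data partition and emit (feature, weight) pairs."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            margin = sum(weights[f] * v for f, v in features.items())
            predicted = 1.0 / (1.0 + math.exp(-margin))
            error = label - predicted
            for f, v in features.items():
                weights[f] += eta * error * v
    return list(weights.items())


def average_weights(emitted):
    """Reduce step: the equivalent of GROUP BY feature with avg(weight)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for feature, weight in emitted:
        totals[feature] += weight
        counts[feature] += 1
    return {f: totals[f] / counts[f] for f in totals}


# Toy data: two partitions, each a list of (sparse features, 0/1 label).
partitions = [
    [({"x": 1.0, "bias": 1.0}, 1), ({"x": -1.0, "bias": 1.0}, 0)],
    [({"x": 2.0, "bias": 1.0}, 1), ({"x": -2.0, "bias": 1.0}, 0)],
]

# Each partition is trained independently, as separate map tasks would be.
emitted = [pair for part in partitions for pair in train_partition(part)]

# The averaged weights play the role of the lr_model table.
lr_model = average_weights(emitted)
```

Because each partition is trained independently, the first stage needs no coordination between tasks; only the small (feature, weight) pairs are shuffled to the averaging stage.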
Hivemall is Multi/Cross platform ..
Hivemall is a multi/cross-platform ML library
• HiveQL
• SparkSQL/DataFrame API
• Pig Latin
Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive
Hivemall on Apache Hive
Hivemall on Apache Spark DataFrame
Hivemall on SparkSQL
Hivemall on Apache Pig
Hivemall is a Versatile library ..
• Hivemall is not only for Machine Learning
• Hivemall provides a bunch of generic utility functions (e.g., top-k, NLP)
Each organization has its own set of UDFs for data preprocessing!
Don’t Repeat Yourself!
Conclusion and Takeaway
Hivemall is a machine learning library that is …
• Multi/Cross platform
• Versatile
• Scalable
• Ease-of-use
We welcome your contributions to Apache Hivemall :)
hivemall.incubator.apache.org
