Apache Hivemall: Scalable machine learning
library for Apache Hive/Spark/Pig
2017/5/16 Apache BigData North America '17, Miami
1). Makoto YUI @myui, Research Engineer, Treasure Data
2). Takeshi Yamamuro @maropu, Research Engineer, NTT
@ApacheHivemall
Plan of the talk
1. Introduction to Hivemall
2. Hivemall on Spark
Hivemall entered the Apache Incubator
on Sept 13, 2016 🎉
Since then, we have invited two contributors as new committers (one
committer has also been voted in as a PPMC member). We are currently
working toward the first Apache release in Q2 2017.
hivemall.incubator.apache.org
What is Apache Hivemall
A scalable machine learning library built
as a collection of Hive UDFs, designed around four pillars:
Ease-of-use, Scalability, Versatility, and Multi/Cross-platform support.

Hivemall is easy and scalable …
• ML made easy for SQL developers (Ease-of-use)
• Born to be parallel and scalable (Scalable)
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

This query automatically runs in parallel on Hadoop.
Hivemall is a multi/cross-platform ML library:
the same functions work from HiveQL, SparkSQL/DataFrame API, and Pig Latin.
Prediction models built by Hive can be used from Spark, and conversely,
prediction models built by Spark can be used from Hive.
Hivemall's Technology Stack
• Machine Learning: Hivemall, MLlib
• Query Processing: Hive, Pig, SparkSQL
• Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
• Resource Management: Apache YARN, MESOS
• Distributed File System: Hadoop HDFS
• Cloud Storage: Amazon S3
Hivemall on Apache Hive
Hivemall on Apache Spark DataFrame
Hivemall on SparkSQL
Hivemall on Apache Pig
Online Prediction by Apache Streaming
Hivemall is a Versatile library ..
✓ Not only for Machine Learning
✓ Provides a bunch of generic utility functions
Each organization has its own set of
UDFs for data preprocessing.
Don't Repeat Yourself!
Hivemall generic functions:
Array and Map, Bit and compress, String and NLP,
Geo Spatial, and Top-k processing
> TF/IDF
> TILE
> MAP_URL
We welcome contributions of your generic UDFs to Hivemall!
Map tiling functions

tile(lat, lon, zoom) = xtile(lon, zoom) + ytile(lat, zoom) * 2^zoom
(where the zoom level n yields 2^n tiles per axis)

(Example: the same map rendered at zoom = 10 and zoom = 15.)
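As a rough sketch of what these functions compute, here is the standard Web-Mercator ("slippy map") tile numbering that the formula above describes; this is an illustration, not Hivemall's exact code:

import scala.math._

// Standard slippy-map tile numbering, matching the formula above
def xtile(lon: Double, zoom: Int): Int =
  floor((lon + 180.0) / 360.0 * (1 << zoom)).toInt

def ytile(lat: Double, zoom: Int): Int = {
  val latRad = toRadians(lat)
  floor((1.0 - log(tan(latRad) + 1.0 / cos(latRad)) / Pi) / 2.0 * (1 << zoom)).toInt
}

def tile(lat: Double, lon: Double, zoom: Int): Long =
  xtile(lon, zoom) + ytile(lat, zoom).toLong * (1L << zoom)

// Miami (25.76, -80.19) at zoom 10 -> xtile = 283, ytile = 436
println(tile(25.76, -80.19, 10))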
Top-k query processing
List the top-2 students for each class

Input:
student | class | score
1       | b     | 70
2       | a     | 80
3       | a     | 90
4       | b     | 50
5       | a     | 70
6       | b     | 60

Expected output:
student | class | score
3       | a     | 90
2       | a     | 80
1       | b     | 70
6       | b     | 60
Top-k query processing
List the top-2 students for each class with RANK OVER():

SELECT * FROM (
  SELECT
    *,
    rank() over (partition by class order by score desc)
      as rank
  FROM table
) t
WHERE rank <= 2

The RANK OVER() query does not finish within 24 hours ☹
on a dataset of 20 million MOOC classes with an average of
1,000 students per class.
Top-k query processing
List the top-2 students for each class with EACH_TOP_K:

SELECT
  each_top_k(
    2, class, score,
    class, student
  ) as (rank, score, class, student)
FROM (
  SELECT * FROM table
  DISTRIBUTE BY class SORT BY class
) t

EACH_TOP_K finishes in 2 hours ☺
Top-k query processing by RANK OVER():
rows are partitioned by class, each node sorts by (class, score),
computes rank over(), and filters to rank <= 2.

Top-k query processing by EACH_TOP_K:
rows are distributed by class, each node sorts by class only,
and each_top_k outputs only K items per class.
Comparison between RANK and EACH_TOP_K:
RANK requires a sort by (class, score) (SORTING IS HEAVY) and must
process all rows before filtering, whereas EACH_TOP_K sorts by class
only and outputs just K items per group.
each_top_k is very efficient when the number of classes is large:
a bounded priority queue is utilized internally.
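A minimal sketch of the bounded-priority-queue idea (an illustration, not Hivemall's actual implementation): keep at most K rows per group, evicting the current minimum whenever a higher-scoring row arrives.

import scala.collection.mutable

class BoundedPriorityQueue[A](k: Int)(implicit ord: Ordering[A]) {
  // min-heap: the head is always the smallest of the K retained items
  private val queue = mutable.PriorityQueue.empty[A](ord.reverse)
  def offer(a: A): Unit =
    if (queue.size < k) queue.enqueue(a)
    else if (ord.gt(a, queue.head)) { queue.dequeue(); queue.enqueue(a) }
  def toSortedSeq: Seq[A] = queue.toSeq.sorted(ord.reverse)
}

val top2 = new BoundedPriorityQueue[Int](2)
Seq(70, 80, 90, 50).foreach(top2.offer)   // scores within one class
println(top2.toSortedSeq)                  // List(90, 80)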
List of Supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight
  Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification

Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression

SCW is a good first choice;
try RandomForest if SCW does not work.
Logistic regression is good for getting the
probability of the positive class.
Factorization Machines are good when
features are sparse and categorical.
RandomForest in Hivemall
Ensemble of Decision Trees
Training of RandomForest
Prediction of RandomForest
(a rough sketch follows below)
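A rough sketch of the training/prediction flow, assuming the function names documented in the Hivemall user guide (train_randomforest_classifier, tree_predict, rf_ensemble); exact signatures vary by Hivemall version, so treat this as an assumption-laden sketch rather than the slides' code:

// Training: grow an ensemble of decision trees (sketch; signatures
// per the Hivemall user guide, version-dependent)
val modelDf = spark.sql("""
  SELECT train_randomforest_classifier(features, label)
  FROM train
""")
modelDf.createOrReplaceTempView("model")

// Prediction: apply each tree to each row, then ensemble the votes
val predictedDf = spark.sql("""
  SELECT rowid, rf_ensemble(predicted) AS result
  FROM (
    SELECT t.rowid,
           tree_predict(m.model_id, m.model, t.features, true) AS predicted
    FROM model m
    LEFT OUTER JOIN test t
  ) p
  GROUP BY rowid
""")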
Supported Algorithms for Recommendation

K-Nearest Neighbor
✓ MinHash and b-Bit MinHash
  (LSH variant)
✓ Similarity Search on Vector Space
  (Euclidean/Cosine/Jaccard/Angular)

Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines
  (regression)

The each_top_k function of Hivemall is useful for
recommending the top-k items.
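For illustration, a minimal sketch of cosine similarity on sparse feature vectors, the kind of vector-space similarity listed above (plain Scala, not Hivemall's implementation):

def dot(a: Map[String, Double], b: Map[String, Double]): Double =
  a.iterator.map { case (k, v) => v * b.getOrElse(k, 0.0) }.sum

def cosine(a: Map[String, Double], b: Map[String, Double]): Double =
  dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

val u = Map("action" -> 1.0, "sci-fi" -> 2.0)
val v = Map("sci-fi" -> 1.0, "drama" -> 3.0)
println(cosine(u, v)) // ~0.28; rank candidates by this score, then take top-k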
Other Supported Algorithms

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling
  (normalization, z-score)
✓ Feature Binning
✓ TF-IDF Vectorizer
✓ Polynomial Expansion
✓ Amplifier

NLP
✓ Basic English text tokenizer
✓ Japanese tokenizer

Evaluation metrics
✓ AUC, nDCG, logloss, precision,
  recall@K, etc.
Feature Engineering – Feature Hashing
Hashes feature names into a fixed-size index space.
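A minimal sketch of the idea (the bucket count below is an assumption for illustration, not necessarily Hivemall's default):

import scala.util.hashing.MurmurHash3

val numFeatures = 1 << 24  // assumed bucket count for illustration

// Map an arbitrary feature name to a stable index in [0, numFeatures)
def hashFeature(feature: String): Int =
  java.lang.Math.floorMod(MurmurHash3.stringHash(feature), numFeatures)

println(hashFeature("user#4505"))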
Feature Engineering – Feature Binning
Maps quantitative variables into a fixed number of
bins based on quantiles of the distribution,
e.g., mapping ages into 3 bins.
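A minimal sketch of quantile-based binning (plain Scala; three equal-frequency bins for the age example above):

val ages = Seq(18, 22, 25, 31, 35, 40, 45, 52, 60).sorted
val numBins = 3

// Bin boundaries at the 1/3 and 2/3 quantiles of the observed data
val boundaries = (1 until numBins).map(i => ages((i * ages.size) / numBins))

def bin(age: Int): Int = boundaries.count(_ <= age)  // 0, 1, or 2

println(boundaries)               // Vector(31, 45)
println(Seq(20, 33, 50).map(bin)) // List(0, 1, 2)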
Feature Selection – Signal-to-Noise Ratio
(ranks features by how well their per-class means separate
relative to their standard deviations)
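A minimal sketch of the classical signal-to-noise-ratio criterion for a two-class problem (an illustration, not Hivemall's exact code):

// SNR = |mean1 - mean2| / (stddev1 + stddev2); larger = more discriminative
def snr(classA: Seq[Double], classB: Seq[Double]): Double = {
  def mean(xs: Seq[Double]) = xs.sum / xs.size
  def stddev(xs: Seq[Double]) = {
    val m = mean(xs)
    math.sqrt(xs.map(x => (x - m) * (x - m)).sum / xs.size)
  }
  math.abs(mean(classA) - mean(classB)) / (stddev(classA) + stddev(classB))
}

println(snr(Seq(1.0, 1.2, 0.9), Seq(3.0, 3.1, 2.8))) // high score: good feature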
Evaluation Metrics

Other Supported Features
Anomaly Detection
✓ Local Outlier Factor (LOF)
✓ ChangeFinder

Clustering / Topic models
✓ Online mini-batch LDA
✓ Online mini-batch PLSA

Change Point Detection
✓ ChangeFinder
✓ Singular Spectrum Transformation
Anomaly/Change-point Detection by ChangeFinder
An efficient algorithm for finding change points and outliers in
time-series data: take a raw time series ("Take this…") and flag its
outliers and change points ("…and do this!").
J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series,"
IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
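A toy illustration of ChangeFinder's two-stage idea (a running Gaussian stands in for the SDAR time-series model of the paper; a sketch, not Hivemall's implementation): an online model scores each point as an outlier, the scores are smoothed, and a second model scores the smoothed series so that sustained shifts stand out as change points.

class RunningGaussian(decay: Double) {
  private var mean = 0.0
  private var variance = 1.0
  def score(x: Double): Double = {
    // negative log-likelihood of x under the current model
    val s = 0.5 * math.log(2 * math.Pi * variance) +
      (x - mean) * (x - mean) / (2 * variance)
    mean = (1 - decay) * mean + decay * x              // online update
    variance = (1 - decay) * variance + decay * (x - mean) * (x - mean)
    s
  }
}

def changeFinder(xs: Seq[Double], decay: Double = 0.05,
                 window: Int = 5): Seq[Double] = {
  val stage1 = new RunningGaussian(decay)
  val outlierScores = xs.map(stage1.score)             // stage 1: outlier scores
  val smoothed = outlierScores.sliding(window)
    .map(w => w.sum / w.size).toSeq                    // smoothing
  val stage2 = new RunningGaussian(decay)
  smoothed.map(stage2.score)                           // stage 2: change-point scores
}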
Change-point Detection by Singular Spectrum Transformation
• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point
  Correlations," Proc. SDM, 2005.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning," Proc. SDM, 2007.
Online mini-batch LDA

Other new features in development
✓ XGBoost Integration
✓ Sparse vector support in RandomForests
✓ Docker support (for evaluation)
✓ Field-aware Factorization Machines
✓ Generalized Linear Model
  • Optimizer framework including ADAM (a toy sketch of the
    Adam update follows below)
  • L1/L2 regularization
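A toy sketch of the Adam update rule mentioned above, for a single scalar parameter (standard bias-corrected moment estimates; an illustration, not Hivemall's optimizer code):

def adamStep(theta: Double, grad: Double, m: Double, v: Double, t: Int,
             alpha: Double = 0.001, beta1: Double = 0.9,
             beta2: Double = 0.999, eps: Double = 1e-8): (Double, Double, Double) = {
  val mNext = beta1 * m + (1 - beta1) * grad          // first-moment estimate
  val vNext = beta2 * v + (1 - beta2) * grad * grad   // second-moment estimate
  val mHat = mNext / (1 - math.pow(beta1, t))         // bias correction
  val vHat = vNext / (1 - math.pow(beta2, t))
  (theta - alpha * mHat / (math.sqrt(vHat) + eps), mNext, vNext)
}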
Hivemall on Spark

Who am I?
• Takeshi Yamamuro
• NTT corp. in Japan
• OSS activities
  • Apache Hivemall PPMC
  • Apache Spark contributor
What's Spark?
• A distributed data analytics engine, generalizing MapReduce
• 1. Unified Engine
  • supports end-to-end applications, e.g., MLlib and Streaming
• 2. High-level APIs
  • easy-to-use, rich optimization
• 3. Integrates broadly
  • storage systems, libraries, ...
What's Hivemall on Spark?
• A Hivemall wrapper for Spark
  • wrapper implementations for DataFrame/SQL
  • plus some utilities that make it easy to use in Spark
• The wrapper lets you...
  • run most Hivemall functions in Spark
  • try examples easily on your laptop
  • get better performance for some functions in Spark
Why Hivemall on Spark?
• Hivemall already has many fascinating ML
  algorithms and useful utilities
• There are high barriers to adding new algorithms to MLlib itself
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
Current Status
• Most Hivemall functions are supported in Spark
  v2.0 and v2.1

- To compile for Spark v2.1:
$ git clone https://github.com/apache/incubator-hivemall
$ cd incubator-hivemall
$ mvn package -Pspark-2.1 -DskipTests
$ ls target/*spark*
target/hivemall-spark-2.1_2.11-XXX-with-dependencies.jar
...

- To compile for Spark v2.0, use -Pspark-2.0 instead
  (producing target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar).
4 Step Example
• 1. Fetch training and test data
• 2. Load the data in Spark
• 3. Build a model
• 4. Do predictions
1. Fetch training and test data
• E2006 tfidf regression dataset
  • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf

$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
2. Load data in Spark

// Download Spark v2.1 and launch a spark-shell with Hivemall
$ <HIVEMALL_HOME>/bin/spark-shell

// Create a DataFrame from the bzip'd libsvm-formatted file
scala> val trainDf = spark.read.format("libsvm").load("E2006.train.bz2")
scala> trainDf.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
2. Load data in Spark (Detailed)

A row of the libsvm-formatted input looks like:
0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335…

trainDf is split into partitions (Partition1 … PartitionN) and loaded
in parallel, because bzip2 is a splittable compression format.
3. Build a model - DataFrame

scala> :paste
val modelDf = trainDf.train_logregr($"features", $"label")
  .groupBy("feature")
  .agg("weight" -> "avg")
3. Build a model - SQL

scala> trainDf.createOrReplaceTempView("TrainTable")
scala> :paste
val modelDf = sql("""
  | SELECT feature, AVG(weight) AS weight
  | FROM (
  |   SELECT train_logregr(features, label)
  |     AS (feature, weight)
  |   FROM TrainTable
  | ) t
  | GROUP BY feature
""".stripMargin)
4. Do predictions - DataFrame

scala> :paste
// testDf is loaded from E2006.test.bz2 in the same way as trainDf
val df = testDf.select(rowid(), $"features")
  .explode_vector($"features")
  .cache

// Do predictions
df.join(modelDf, df("feature") === modelDf("feature"), "LEFT_OUTER")
  .groupBy("rowid")
  .agg(sigmoid(sum($"weight" * $"value")))
4. Do predictions - SQL

scala> modelDf.createOrReplaceTempView("ModelTable")
scala> df.createOrReplaceTempView("TestTable")
scala> :paste
sql("""
  | SELECT t.rowid, sigmoid(sum(t.value * m.weight)) AS predicted
  | FROM TestTable t
  | LEFT OUTER JOIN ModelTable m
  |   ON t.feature = m.feature
  | GROUP BY t.rowid
""".stripMargin)
Add New Features in Spark
• Top-K Join
  • Join two relations and compute top-K for each group

Using Spark vanilla APIs:

scala> :paste
val topkDf = leftDf.join(rightDf, "group" :: Nil, "INNER")
  .select(leftDf("group"), (leftDf("x") + rightDf("y")).as("score"))
  .withColumn(
    "rank",
    rank().over(partitionBy($"group").orderBy($"score".desc))
  )
  .where($"rank" <= topK)

1. Join leftDf with rightDf
2. Sort the data in each group
3. Filter the top-K rows
The same Top-K Join with a fused API implemented in Hivemall:

scala> :paste
val topkDf = leftDf.top_k_join(
  lit(topK), rightDf, leftDf("group") === rightDf("group"),
  (leftDf("x") + rightDf("y")).as("score")
)
How does top_k_join work?
leftDf and rightDf rows are matched per group; for each group the
top-K rows are computed with a K-length bounded priority queue (see
the sketch after the EACH_TOP_K comparison above), and only those
top-K rows are actually joined.
Codegen'd top_k_join for fast processing
• Spark internally generates Java code from a built
  physical plan, and compiles/executes it
  (the Catalyst planner turns a physical plan into codegen'd Java)

scala> topkDf.explain
== Physical Plan ==
*ShuffledHashJoinTopK 1, [group#10], [group#27]
:- Exchange hashpartitioning(group#10, 200)
:  +- LocalTableScan [group#10, x#11]
+- Exchange hashpartitioning(group#27, 200)
   +- LocalTableScan [group#27, y#28]

A '*' at the head of a plan node means it is codegen'd.
Benchmark Result: top_k_join is ~13 times faster than the vanilla APIs!!
Add New Features in Spark
• Support more useful functions
  • Spark itself only implements naive and basic functions, in
    terms of usability and maintainability
  • e.g.,
    • flatten - flattens a nested schema into a flat one
    • from_csv/to_csv - interconversion between CSV strings and structured data with schemas
  • See more in the Hivemall user guide:
    https://hivemall.incubator.apache.org/userguide/spark/misc/misc.html
Improve Some Functions in Spark
• Spark has overheads when calling Hive UD*Fs,
  and Hivemall heavily depends on them
• ex.1) Compute a sigmoid function with a plain Spark UDF:

scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))
scala> val sparkUdf = functions.udf(sigmoidFunc)
scala> df.select(sparkUdf($"value"))
• The same computation via the Hivemall-provided function:

scala> val hiveUdf = HivemallOps.sigmoid
scala> df.select(hiveUdf($"value"))
• ex.2) Compute top-k for each key group with vanilla window functions:

scala> :paste
df.withColumn(
  "rank",
  rank().over(Window.partitionBy($"key").orderBy($"score".desc))
)
.where($"rank" <= topK)
• Fixed the overhead issue for each_top_k
  • See pr#353: "Implement EachTopK as a generator expression" in Spark

scala> df.each_top_k(topK, "key", "score", "value")
each_top_k is ~4 times faster than rank()!!
Support 3rd-Party Libraries in Spark
• XGBoost support is under development
  • a fast implementation of gradient tree boosting,
    widely used in Kaggle competitions
• This integration will let you...
  • load built models and predict in parallel
  • build multiple models in parallel
    for cross-validation
Conclusion and Takeaway
Hivemall is a multi/cross-platform ML library
providing a collection of machine learning algorithms as Hive UDFs/UDTFs,
usable from HiveQL, SparkSQL/DataFrame API, and Pig Latin.
We welcome your contributions to Apache Hivemall ☺
Any feature requests or questions?