Introduction to New Features and Use Cases of Hivemall
Research	Engineer
Makoto	YUI	@myui
<myui@treasure-data.com>
2016/03/30
Treasure	Data	Techtalk	
http://eventdots.jp/event/583226
Who am I?
Ø 2015.04 Joined Treasure Data, Inc.
1st Research Engineer at Treasure Data
Ø 2010.04-2015.03 Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan
Ø 2009.03 Ph.D. in Computer Science from NAIST
Ø Head of the TD mountaineering club
Ø 3 members (1 of them inactive)
Announcement
We finally replaced the logo of Hivemall :-)
Story of Hivemall Logo
Logos of Hadoop-related Products
Hadoop, Hive, Hivemall
We ♥ Open Source
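Treasure Data supports ML-as-a-Service
[Figure: Treasure Data service overview. Labels: data sources (app logs, sensors, web logs, CRM, ERP, RDBMS); Treasure Data Collectors (Treasure Agent, Embulk, mobile SDK, JS SDK, embedded); batch analysis, ad hoc analysis, Machine Learning; API, ODBC, JDBC, PUSH; analytics tool integration; data visualization and sharing; integration with other products (SQL Server).]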
Agenda
1. Introduction	to	Hivemall
2. Industrial	use	cases
3. How	to	use	Hivemall
4. Development	roadmap
What	is	Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2
https://github.com/myui/hivemall
What	is	Hivemall
Software stack (top to bottom):
Machine Learning: Hivemall
Query Processing: Hive / Pig
Parallel Data Processing Framework: MapReduce (MR v1), MR v2, Apache Tez (DAG processing)
Resource Management: Apache YARN
Distributed File System: Hadoop HDFS
Hivemall’s Vision:	ML	on	SQL
Classification	with	Mahout
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop
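The query on this slide trains one model per map task and lets the reducers average the weights per feature. A minimal Python sketch of that idea follows. It is not Hivemall's implementation: `train_shard`, `average_models`, the learning rate, and the toy shards are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def train_shard(rows, eta=0.1, epochs=20):
    """Logistic regression by plain SGD on one shard (stands in for one map task)."""
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(w[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for f, v in features.items():
                w[f] += eta * (label - p) * v  # gradient step on the log-loss
    return dict(w)

def average_models(models):
    """Average weights per feature, mirroring avg(weight) ... GROUP BY feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model in models:
        for f, weight in model.items():
            sums[f] += weight
            counts[f] += 1
    return {f: sums[f] / counts[f] for f in sums}

# Hypothetical training data split into two shards: (features, label) pairs.
shards = [
    [({"a": 1.0, "bias": 1.0}, 1), ({"b": 1.0, "bias": 1.0}, 0)],
    [({"a": 1.0, "bias": 1.0}, 1), ({"b": 2.0, "bias": 1.0}, 0)],
]
model = average_models(train_shard(s) for s in shards)
```

Each shard is trained independently, so the expensive step needs no coordination; only the small (feature, weight) pairs are shuffled and averaged.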
List	of	Features	in	Hivemall	v0.3.x
Classification (both binary- and multi-class)
✓ Perceptron
✓ Passive Aggressive (PA)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad
✓ AdaDELTA
kNN and Recommendation
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search using K-NN (Euclid/Cosine/Jaccard/Angular)
✓ Matrix Factorization
Feature engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion
Anomaly Detection
✓ Local Outlier Factor
Top-k query processing
Features supported in Hivemall v0.4.0
1. RandomForest
• classification, regression
2. Factorization Machine
• classification, regression (factorization)
Features supported in Hivemall v0.4.1-alpha
1. NLP Tokenizer (morphological analysis)
• Kuromoji
2. Mini-batch Gradient Descent
3. RandomForest scalability improvements
Treasure Data is operating Hivemall v0.4.1-alpha.6, so the above features are already supported.
2. Industrial use cases
Industry use cases of Hivemall
Ø CTR prediction of Ad click logs
• Freakout Inc. and more
• Replaced Spark MLlib w/ Hivemall at company X
http://www.slideshare.net/masakazusano75/sano-hmm-20150512
Industry use cases of Hivemall
Ø Gender prediction of Ad click logs
• Scaleout Inc.
http://eventdots.jp/eventreport/458208
Industry use cases of Hivemall
Ø Value prediction of real estate
• Livesense
http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall
Source: http://itnp.net/article/2016/02/18/2286.html
Industry use cases of Hivemall
Ø Churn Detection
• OISIX
http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix
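Churn prediction for a membership service
• Subscription purchases by 100,000 members drive the company's overall revenue and profit, but there were no measures to identify members at risk of churning in advance and prevent it
• Machine learning without statistical expertise
• Granting points to customers on the churn prediction list halved the churn rate
• Surfaced the campaigns and events that carry churn risk, and made it possible to understand the characteristic behavior of non-churners
• Also achieved indirect service improvements, such as changing the UI according to the degree of risk
• Machine learning builds a list of customers likely to churn within the next month, based on the data of the past month
• Specifically, each step (create training table -> normalize -> build model -> logistic regression) is implemented simply as queries using TD + Hivemall
Data used for prediction (Web, Mobile): attribute information, behavior logs, complaint information, acquisition source, service usage information
Direct measures: granting points, care calls
Indirect measures: guiding users to a successful experience, UI changes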
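The normalize -> model -> logistic regression steps of this use case can be sketched in a few lines of Python. This is an illustration only, not the queries OISIX ran: the customer IDs, the two feature columns, the weights, and the 0.5 threshold are all hypothetical, and the weights are supplied by hand instead of being learned.

```python
import math

def min_max_scale(rows):
    """Rescale every column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]

def churn_probability(x, weights, bias):
    """Logistic regression prediction: sigmoid of the weighted sum."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def churn_list(customers, weights, bias, threshold=0.5):
    """Return customer IDs at or above the threshold, highest risk first."""
    ids = list(customers)
    scaled = min_max_scale([customers[c] for c in ids])
    scored = [(churn_probability(x, weights, bias), c) for c, x in zip(ids, scaled)]
    return [c for p, c in sorted(scored, reverse=True) if p >= threshold]

# Hypothetical features per customer: [complaints, orders in the past month].
customers = {"c1": [0.0, 10.0], "c2": [5.0, 0.0], "c3": [10.0, 5.0]}
at_risk = churn_list(customers, weights=[2.0, -1.0], bias=0.0)
```

The resulting list is what a direct measure such as granting points would be applied to.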
Industry use cases of Hivemall
Ø Recommendation
• Portal site
3. How to use Hivemall
RandomForest in Hivemall v0.4
Ensemble of Decision Trees
Training of RandomForest
Prediction of RandomForest
Out-of-bag tests and Variable Importance
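The ensemble and out-of-bag ideas named on these slides can be illustrated with a small Python sketch. It is a simplified stand-in, not Hivemall's RandomForest: the base learners are one-split decision stumps instead of full trees, there is no per-split feature subsampling, and all function names and the toy data are assumptions.

```python
import random
from collections import Counter

def fit_stump(rows):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for j in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[j]
            left = [y for xi, y in rows if xi[j] <= t]   # never empty: x itself is here
            right = [y for xi, y in rows if xi[j] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    return best[1:]

def stump_predict(stump, x):
    j, t, l_lab, r_lab = stump
    return l_lab if x[j] <= t else r_lab

def train_forest(rows, n_trees=15, seed=0):
    """Train each stump on a bootstrap sample; remember which rows it saw."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append((fit_stump([rows[i] for i in idx]), set(idx)))
    return forest

def predict(forest, x):
    """Majority vote over all stumps."""
    votes = Counter(stump_predict(s, x) for s, _ in forest)
    return votes.most_common(1)[0][0]

def oob_error(forest, rows):
    """Error rate where each row is judged only by stumps that never saw it."""
    wrong = total = 0
    for i, (x, y) in enumerate(rows):
        votes = Counter(stump_predict(s, x) for s, used in forest if i not in used)
        if votes:
            total += 1
            wrong += votes.most_common(1)[0][0] != y
    return wrong / total if total else 0.0

# Hypothetical data: the class is decided by the first feature.
rows = [([float(v), 1.0], 0) for v in (0, 1, 2, 3)]
rows += [([float(v), 1.0], 1) for v in (6, 7, 8, 9)]
forest = train_forest(rows)
```

Because every learner is trained on a resample, the rows it did not see give an error estimate without a separate validation set.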
Recommendation
Rating prediction of a matrix
Can be applied to user/item recommendation
Matrix Factorization
Factorize a matrix into a product of matrices having k latent factors
Training of Matrix Factorization
Supports iterative training using a local disk cache
Prediction of Matrix Factorization
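As a rough illustration of factorizing a rating matrix into k latent factors, the sketch below trains user and item vectors with plain SGD and predicts a rating as their dot product. It is an assumption-laden toy, not Hivemall's trainer: there are no bias terms, and the function names, hyperparameters, and ratings are made up.

```python
import random

def train_mf(ratings, k=2, eta=0.02, lam=0.01, epochs=500, seed=0):
    """Learn k-dimensional user (P) and item (Q) factors from (user, item, rating)."""
    rng = random.Random(seed)
    P, Q = {}, {}
    for u, i, _ in ratings:
        P.setdefault(u, [rng.uniform(-0.1, 0.1) for _ in range(k)])
        Q.setdefault(i, [rng.uniform(-0.1, 0.1) for _ in range(k)])
    for _ in range(epochs):  # iterative training: several passes over the same data
        for u, i, r in ratings:
            pu, qi = P[u], Q[i]
            err = r - sum(a * b for a, b in zip(pu, qi))
            for f in range(k):
                pf, qf = pu[f], qi[f]
                pu[f] += eta * (err * qf - lam * pf)
                qi[f] += eta * (err * pf - lam * qf)
    return P, Q

def predict_rating(P, Q, u, i):
    """Predicted rating = dot product of the user and item factors."""
    return sum(a * b for a, b in zip(P[u], Q[i]))

def squared_error(P, Q, ratings):
    return sum((r - predict_rating(P, Q, u, i)) ** 2 for u, i, r in ratings)

# Hypothetical ratings.
ratings = [
    ("u1", "i1", 5.0), ("u1", "i2", 3.0), ("u2", "i1", 4.0),
    ("u2", "i3", 1.0), ("u3", "i2", 2.0), ("u3", "i3", 5.0),
]
P, Q = train_mf(ratings)
```

Once trained, `predict_rating` can score user/item pairs that never appeared in the training data, which is what makes the factors usable for recommendation.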
Factorization Machines
Compared with Matrix Factorization, context information (e.g., time) can also be considered
Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf
Training data for Factorization Machines
Each feature takes a LibSVM-like format: <feature[:weight]>
Training of Factorization Machines
Prediction of Factorization Machines
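A short Python sketch of the second-order Factorization Machine prediction from the cited Rendle paper, reading features in the `<feature[:weight]>` form mentioned above. The parameter values and the names `parse_feature` and `fm_predict` are hypothetical and are not Hivemall functions; a missing weight is assumed to mean 1.0.

```python
def parse_feature(token):
    """Parse '<feature[:weight]>'; a missing weight defaults to 1.0."""
    name, sep, weight = token.rpartition(":")
    if not sep:
        return token, 1.0
    return name, float(weight)

def fm_predict(tokens, w0, w, V):
    """y = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j.

    The pairwise term uses the linear-time identity
    0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2).
    """
    x = dict(parse_feature(t) for t in tokens)
    y = w0 + sum(w.get(f, 0.0) * v for f, v in x.items())
    k = len(next(iter(V.values())))
    for f_idx in range(k):
        terms = [V[f][f_idx] * v for f, v in x.items() if f in V]
        y += 0.5 * (sum(terms) ** 2 - sum(t * t for t in terms))
    return y

# Hypothetical model parameters: global bias, linear weights, 2-dim factors.
w0 = 0.1
w = {"a": 0.2, "b": -0.1, "c": 0.3}
V = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.0, 2.0]}
y = fm_predict(["a", "b:2", "c:0.5"], w0, w, V)
```

Because every feature, including context such as time, gets its own factor vector, interactions between any pair of features are modeled without enumerating the pairs explicitly.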
Feature Engineering functions
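Feature hashing and feature scaling, two of the feature engineering functions listed earlier in the deck, can be sketched as follows. These Python helpers only illustrate the techniques: the names and the bucket count are assumptions, and MD5 is used purely as a convenient deterministic hash, not as the hash function Hivemall uses.

```python
import hashlib

def hash_feature(name, num_buckets=2 ** 24):
    """Map a feature name to a fixed-size integer index (feature hashing)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

def rescale(value, lo, hi):
    """Min-max normalization to [0, 1]."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0

def zscore(value, mean, stddev):
    """Standardize to zero mean and unit variance."""
    return (value - mean) / stddev if stddev > 0 else 0.0

index = hash_feature("user_agent=mobile")
scaled = rescale(5.0, 0.0, 10.0)
standardized = zscore(12.0, 10.0, 2.0)
```

Hashing bounds the model size regardless of how many distinct feature names appear, at the cost of occasional collisions.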
4. Development roadmap
Features to be supported in Hivemall v0.4.1
1. NLP Tokenizer (morphological analysis)
• Kuromoji integration was requested by Company R
2. Mini-batch Gradient Descent
3. RandomForest scalability improvements
4. Recommendation for Implicit Feedback Dataset
• Useful where only positive feedback is available
• BPR: Bayesian Personalized Ranking from Implicit Feedback, Proc. UAI, 2009.
Planned to release v0.4.1 in April.
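For positive-only feedback, BPR learns from triples (user, liked item, other item) and pushes the liked item's score above the other's. The Python sketch below shows that update on matrix factorization factors. It is a toy under stated assumptions: it loops over every non-interacted item instead of sampling triples as the BPR paper does, and the names, hyperparameters, and data are hypothetical.

```python
import math
import random

def train_bpr(positives, items, k=2, eta=0.05, lam=0.001, epochs=300, seed=0):
    """Gradient ascent on ln sigmoid(x_ui - x_uj) for each (u, liked i, other j)."""
    rng = random.Random(seed)
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(k)] for u in positives}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(k)] for i in items}
    for _ in range(epochs):
        for u, liked in positives.items():
            for i in liked:
                for j in items:
                    if j in liked:
                        continue  # only items without feedback act as negatives
                    pu, qi, qj = P[u], Q[i], Q[j]
                    x = sum(a * (b - c) for a, b, c in zip(pu, qi, qj))
                    g = 1.0 / (1.0 + math.exp(x))  # equals sigmoid(-x)
                    for f in range(k):
                        pf, qif, qjf = pu[f], qi[f], qj[f]
                        pu[f] += eta * (g * (qif - qjf) - lam * pf)
                        qi[f] += eta * (g * pf - lam * qif)
                        qj[f] += eta * (-g * pf - lam * qjf)
    return P, Q

def score(P, Q, u, i):
    return sum(a * b for a, b in zip(P[u], Q[i]))

# Hypothetical implicit feedback: each user interacted with one item.
positives = {"u1": {"a"}, "u2": {"b"}}
P, Q = train_bpr(positives, items=["a", "b"])
```

Only the relative order of scores matters here, so no explicit ratings are needed.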
Features to be supported in Hivemall v0.4.2
1. Gradient Tree Boosting
• classification, regression
• based on Smile
https://github.com/haifengl/smile/
2. Field-aware Factorization Machine
• classification, regression (factorization)
Planned to release v0.4.2 in June.
Features to be supported in Hivemall v0.5
1. Mix server on Apache YARN
• Service for parameter sharing among workers
[Figure: data-parallel training. Partitioned training examples are fed to learners 1 to N, which exchange parameters while training the model.]
2. Online LDA
• topic modeling, clustering
3. XGBoost integration
4. Generalized Linear Model
• Ridge/Elastic net/Lasso regularization
• Supports various loss functions
5. Alternating Direction Method of Multipliers (ADMM) convex optimization
6. t-SNE dimension reduction
Analytics Workflow
Machine learning workflows can be simplified using our new workflow engine, named Digdag
+main:
  +prepare:
    _parallel: true
    +train:
      td>: ./tasks/train_join.sql
    +test:
      td>: ./tasks/test_join.sql
  +quantify:
    td>: ./tasks/train_quantify.sql
  +model_test_quantify:
    _parallel: true
    +model:
      td>: ./tasks/make_model.sql
    +test_quantify:
      td>: ./tasks/test_quantify.sql
  +pred:
    td>: ./tasks/prediction.sql
CLI version will be released soon. Stay tuned!
Conclusion and Takeaway
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
Hivemall's Positioning
• For SQL users that need ML
• Ease of use and scalability in mind
Major development leaps in v0.4
• Random Forest
• Factorization Machine
More will follow in v0.4.1 and later
Treasure Data provides ML-as-a-Service using Hivemall
Blog	article	about	Hivemall
http://blog-jp.treasuredata.com/
A series on solving Kaggle problems using TD, Hivemall, Jupyter, and Pandas-TD
We support machine learning in the cloud
Any feature requests or questions?
