Tdtechtalk20160330myui

Talk about Hivemall at Treasure Data Tech Talk on 2016/03/30
http://eventdots.jp/event/583226

Published in: Data & Analytics

  1. Introduction to New Features and Use Cases of Hivemall. Makoto YUI (@myui, <myui@treasure-data.com>), Research Engineer. 2016/03/30, Treasure Data Techtalk. http://eventdots.jp/event/583226
  2. Who am I? 2015.04: Joined Treasure Data, Inc. as its 1st Research Engineer. 2010.04-2015.03: Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan. 2009.03: Ph.D. in Computer Science from NAIST. Head of the TD mountaineering club (3 members, 1 of them a member in name only).
  4. Announcement: We finally replaced the logo of Hivemall :-)
  5. Story of Hivemall Logo
  7. Logos of Hadoop-related Products: Hadoop
  8. Logos of Hadoop-related Products: Hadoop, Hive
  9. Logos of Hadoop-related Products: Hadoop, Hive, Hivemall
  11. We Open Source
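  12. Treasure Data supports ML-as-a-Service (architecture diagram). Labels: integration with other products; SQL Server; CRM; RDBMS; app logs; sensors; web logs; ERP; batch analysis; ad hoc analysis; Machine Learning; API; ODBC; JDBC; PUSH; Treasure Agent; integration with analysis tools; data visualization and sharing; Treasure Data Collectors (embedded, Embulk, mobile SDK, JS SDK).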
  13. Agenda: 1. Introduction to Hivemall; 2. Industrial use cases; 3. How to use Hivemall; 4. Development roadmap
  14. What is Hivemall: Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2. https://github.com/myui/hivemall
  15. What is Hivemall (stack diagram, top to bottom). Machine Learning: Hivemall. Query Processing: Hive / Pig. Parallel Data Processing Framework: MapReduce (MR v1), Apache Tez (DAG processing), MR v2. Resource Management: Apache YARN. Distributed File System: Hadoop HDFS.
  16. Hivemall’s Vision: ML on SQL. Classification with Mahout
      CREATE TABLE lr_model AS
      SELECT
        feature, -- reducers perform model averaging in parallel
        avg(weight) as weight
      FROM (
        SELECT logress(features,label,..) as (feature,weight)
        FROM train
      ) t -- map-only task
      GROUP BY feature; -- shuffled to reducers
      ✓ Machine Learning made easy for SQL developers (ML for the rest of us)
      ✓ Interactive and Stable APIs w/ SQL abstraction
      This SQL query automatically runs in parallel on Hadoop.
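The query on slide 16 trains one model per map task and lets the reducers average the weights per feature. A minimal Python sketch of that idea follows. It is not Hivemall's implementation: `train_logress`, `average_models`, the learning rate, and the toy shards are all hypothetical names and values chosen for illustration.

```python
import math
from collections import defaultdict

def train_logress(rows, eta=0.1, epochs=20):
    """Train logistic regression with plain SGD on one data shard.

    rows: list of (features, label); features maps name -> value, label is 0 or 1.
    Returns a dict feature -> weight (the role of one map-only task).
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(w[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for f, v in features.items():
                w[f] += eta * (label - p) * v
    return dict(w)

def average_models(models):
    """Mimic `avg(weight) ... GROUP BY feature` over the per-shard models."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model in models:
        for f, weight in model.items():
            sums[f] += weight
            counts[f] += 1
    return {f: sums[f] / counts[f] for f in sums}

# Two toy shards standing in for the input splits of two map tasks.
shards = [
    [({"bias": 1.0, "x": 2.0}, 1), ({"bias": 1.0, "x": -2.0}, 0)],
    [({"bias": 1.0, "x": 1.5}, 1), ({"bias": 1.0, "x": -1.0}, 0)],
]
model = average_models([train_logress(s) for s in shards])
```

As with SQL `avg`, a feature seen by only some shards is averaged over just those shards.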
  17. List of Features in Hivemall v0.3.x
      Classification (both binary- and multi-class): Perceptron; Passive Aggressive (PA); Confidence Weighted (CW); Adaptive Regularization of Weight Vectors (AROW); Soft Confidence Weighted (SCW); AdaGrad+RDA
      Regression: Logistic Regression (SGD); PA Regression; AROW Regression; AdaGrad; AdaDELTA
      kNN and Recommendation: Minhash and b-Bit Minhash (LSH variant); Similarity Search using K-NN (Euclid/Cosine/Jaccard/Angular); Matrix Factorization
      Feature engineering: Feature Hashing; Feature Scaling (normalization, z-score); TF-IDF vectorizer; Polynomial Expansion
      Anomaly Detection: Local Outlier Factor
      Top-k query processing
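The two feature scaling variants named on slide 17 (normalization and z-score) can be sketched in plain Python as below. The helper names, their signatures, and the handling of constant columns are assumptions made for this sketch, not a description of Hivemall's UDFs.

```python
import statistics

def rescale(value, lo, hi):
    """Min-max normalization to [0, 1]; a constant column maps to 0.0 (assumption)."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)

def zscore(value, mean, stddev):
    """Standardize to zero mean and unit variance; zero stddev maps to 0.0 (assumption)."""
    if stddev == 0:
        return 0.0
    return (value - mean) / stddev

ages = [20.0, 30.0, 40.0]  # toy column
scaled = [rescale(a, min(ages), max(ages)) for a in ages]
mu, sd = statistics.mean(ages), statistics.pstdev(ages)
standardized = [zscore(a, mu, sd) for a in ages]
```

In a SQL setting the minimum, maximum, mean, and standard deviation would typically be computed once by an aggregate query and then joined back to every row.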
  18. Features supported in Hivemall v0.4.0: 1. RandomForest (classification, regression); 2. Factorization Machine (classification, regression (factorization))
  19. Features supported in Hivemall v0.4.1-alpha: 1. NLP Tokenizer (morphological analysis) using Kuromoji; 2. Mini-batch Gradient Descent; 3. RandomForest scalability improvements. Treasure Data is operating Hivemall v0.4.1-alpha.6, so the above features are already supported.
  20. Agenda: 1. Introduction to Hivemall; 2. Industrial use cases; 3. How to use Hivemall; 4. Development roadmap
  21. Industry use cases of Hivemall: CTR prediction of Ad click logs. Freakout Inc. and more. Replaced Spark MLlib w/ Hivemall at company X. http://www.slideshare.net/masakazusano75/sano-hmm-20150512
  22. Industry use cases of Hivemall: Gender prediction of Ad click logs. Scaleout Inc. http://eventdots.jp/eventreport/458208
  23. Industry use cases of Hivemall: Value prediction of real estate. Livesense. http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall
  24. Industry use cases of Hivemall. Source: http://itnp.net/article/2016/02/18/2286.html
  25. Industry use cases of Hivemall: Churn Detection. OISIX. http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix
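  26. Churn prediction for a membership service
      • Recurring purchases by 100,000 members drive the company's overall sales and profit, but there was no way to identify members at risk of churning in advance and prevent it
      • Machine learning without statistical expertise
      • Granting points to members on the churn prediction list halved the churn rate
      • Surfaced the campaigns and events that carry churn risk, and also made it possible to understand the characteristic behavior of members who do not churn
      • Also achieved indirect service improvements, such as changing the UI according to the level of risk
      • Machine learning on the past month of data produces a list of customers likely to churn in the coming month
      • Concretely, each step (build the training table -> normalize -> build the training model -> logistic regression) is implemented simply as queries using TD + Hivemall
      Diagram labels: Web; Mobile. Data used for prediction: attribute data, behavior logs, complaint data, acquisition source, service usage data. Direct measures: granting points, care calls. Indirect measures: guiding members toward a successful experience, UI changes.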
  27. Industry use cases of Hivemall: Recommendation. Portal site
  28. Agenda: 1. Introduction to Hivemall; 2. Industrial use cases; 3. How to use Hivemall; 4. Development roadmap
  29. RandomForest in Hivemall v0.4: Ensemble of Decision Trees
  31. Training of RandomForest
  32. Prediction of RandomForest
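Because the slide images for RandomForest did not survive, here is a small Python sketch of what "ensemble of decision trees" means at prediction time: every tree votes and the majority label wins. The one-split `stump` trees and the `forest_predict` helper are hypothetical stand-ins, not Hivemall's model format or functions.

```python
from collections import Counter

def stump(feature_index, threshold, left_label, right_label):
    """A one-split decision tree, used here as a stand-in for a trained tree."""
    def predict(x):
        return left_label if x[feature_index] <= threshold else right_label
    return predict

def forest_predict(trees, x):
    """Majority vote over all trees; also returns the winning vote share."""
    votes = Counter(tree(x) for tree in trees)
    label, count = votes.most_common(1)[0]
    return label, count / len(trees)

trees = [
    stump(0, 2.5, "A", "B"),
    stump(1, 1.0, "A", "B"),
    stump(0, 3.5, "A", "B"),
]
label, share = forest_predict(trees, [3.0, 0.5])
```

For regression forests the vote would be replaced by an average of the per-tree predictions.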
  33. Out-of-bag tests and Variable Importance
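An out-of-bag (OOB) test scores each training sample using only the trees whose bootstrap sample did not contain it. The sketch below shows that bookkeeping in Python under assumed inputs: `in_bag` and `tree_predictions` are hypothetical precomputed tables, and variable importance is not covered.

```python
def oob_error(in_bag, tree_predictions, labels):
    """Out-of-bag error rate.

    in_bag[t][i] is True if sample i was drawn into tree t's bootstrap sample.
    tree_predictions[t][i] is tree t's predicted label for sample i.
    """
    wrong = evaluated = 0
    for i, truth in enumerate(labels):
        votes = {}
        for t in range(len(in_bag)):
            if not in_bag[t][i]:  # only trees that never saw sample i may vote
                pred = tree_predictions[t][i]
                votes[pred] = votes.get(pred, 0) + 1
        if not votes:
            continue  # sample was in every bag, so it has no OOB estimate
        evaluated += 1
        if max(votes, key=votes.get) != truth:
            wrong += 1
    return wrong / evaluated if evaluated else float("nan")

in_bag = [
    [True, False, True, False],
    [False, True, True, False],
    [True, True, True, False],
]
tree_predictions = [
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
]
labels = [1, 0, 1, 1]
error = oob_error(in_bag, tree_predictions, labels)
```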
  35. Recommendation: Rating prediction of a matrix. Can be applied to user/item recommendation.
  36. Matrix Factorization: Factorize a matrix into a product of matrices with k latent factors
  37. Training of Matrix Factorization: Supports iterative training using a local disk cache
  38. Prediction of Matrix Factorization
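To make slides 36 to 38 concrete, the following Python sketch factorizes a tiny rating matrix into user and item matrices with k latent factors using SGD, then predicts a rating as a dot product. It is a textbook variant under assumed hyperparameters (no bias terms, made-up ratings), not the training routine Hivemall ships.

```python
import math
import random

def predict(P, Q, u, i):
    """Predicted rating: dot product of the user and item latent vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def rmse(P, Q, ratings):
    se = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(se / len(ratings))

def train_mf(ratings, n_users, n_items, k=2, eta=0.05, lam=0.01, epochs=200, seed=0):
    """SGD over observed (user, item, rating) triples with L2 regularization."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += eta * (err * qi - lam * pu)
                Q[i][f] += eta * (err * pu - lam * qi)
    return P, Q

# Toy data: (user, item, rating); unobserved cells are what gets predicted.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P0, Q0 = train_mf(ratings, 3, 3, epochs=0)  # untrained baseline
P, Q = train_mf(ratings, 3, 3)
unseen = predict(P, Q, 0, 2)  # rating user 0 never gave to item 2
```

Each epoch is one pass over the observed ratings, which is why iterative training benefits from caching the input locally, as slide 37 notes.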
  39. Factorization Machines: Matrix Factorization
  40. Factorization Machines: Context information (e.g., time) can be considered. Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf
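The second-order factorization machine from the cited Rendle paper scores a feature vector with a bias, linear weights, and pairwise interactions whose strength is the dot product of per-feature latent vectors. The Python sketch below shows the direct pairwise form next to the linear-time reformulation. The variable names and toy parameters are assumptions for illustration only.

```python
def fm_predict_naive(x, w0, w, V):
    """Direct form: w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j."""
    n = len(x)
    y = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(vi * vj for vi, vj in zip(V[i], V[j]))
            y += dot * x[i] * x[j]
    return y

def fm_predict(x, w0, w, V):
    """Same value in O(k * n): 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)."""
    n, k = len(x), len(V[0])
    y = w0 + sum(w[i] * x[i] for i in range(n))
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s2 = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        y += 0.5 * (s * s - s2)
    return y

x = [1.0, 0.0, 2.0]                         # e.g. user, item, and a context feature
w0, w = 0.5, [0.1, 0.2, 0.3]
V = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]   # one k=2 latent vector per feature
score = fm_predict(x, w0, w, V)
```

Extra columns of `x` can carry context such as time, which is how the model goes beyond plain user-by-item matrix factorization.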
  41. Training data for Factorization Machines: Each feature uses a LibSVM-like format, <feature[:weight]>
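A short Python sketch of reading the `<feature[:weight]>` format follows. The `parse_feature` helper and the sample feature names are hypothetical, and treating a missing weight as 1.0 is an assumption of this sketch.

```python
def parse_feature(token):
    """Split '<feature[:weight]>'; a missing weight defaults to 1.0 (assumption)."""
    name, sep, weight = token.rpartition(":")
    if not sep:
        return token, 1.0
    return name, float(weight)

# One training row: two categorical features and one quantitative feature.
row = ["user#5", "item#12", "hour:0.75"]
parsed = [parse_feature(t) for t in row]
```

Splitting on the last colon keeps feature names that themselves contain a colon intact.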
  42. Training of Factorization Machines
  43. Prediction of Factorization Machines
  44. Feature Engineering functions
  46. Agenda: 1. Introduction to Hivemall; 2. Industrial use cases; 3. How to use Hivemall; 4. Development roadmap
  47. Features to be supported in Hivemall v0.4.1: 1. NLP Tokenizer (morphological analysis): Kuromoji integration was requested by Company R; 2. Mini-batch Gradient Descent; 3. RandomForest scalability improvements; 4. Recommendation for Implicit Feedback Datasets: useful where only positive feedback is available (BPR: Bayesian Personalized Ranking from Implicit Feedback, Proc. UAI, 2009). Planned to release v0.4.1 in April.
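With only positive feedback, BPR learns from pairs: for a user u, an item i the user interacted with should score higher than an item j the user did not. A minimal Python sketch of the per-triple loss from the cited paper is below. The `score` helper and the toy latent vectors are assumptions, and regularization and the update step are omitted.

```python
import math

def score(user_vec, item_vec):
    """Preference score as a dot product of latent vectors (one common choice)."""
    return sum(a * b for a, b in zip(user_vec, item_vec))

def bpr_loss(x_ui, x_uj):
    """-ln(sigmoid(x_ui - x_uj)), written as ln(1 + exp(-(x_ui - x_uj)))."""
    return math.log1p(math.exp(-(x_ui - x_uj)))

user = [0.5, 1.0]
clicked = [1.0, 1.0]       # item with positive feedback
not_clicked = [0.2, -0.5]  # item without feedback, sampled as the negative
loss = bpr_loss(score(user, clicked), score(user, not_clicked))
```

The loss shrinks as the clicked item outranks the sampled one by a wider margin, so training optimizes the ranking rather than absolute ratings.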
  48. Features to be supported in Hivemall v0.4.2: 1. Gradient Tree Boosting (classifier, regression), based on Smile https://github.com/haifengl/smile/
  49. Features to be supported in Hivemall v0.4.2: 1. Gradient Tree Boosting (classifier, regression), based on Smile https://github.com/haifengl/smile/; 2. Field-aware Factorization Machine (classification, regression (factorization)). Planned to release v0.4.2 in June.
  50. Features to be supported in Hivemall v0.5: 1. Mix server on Apache YARN: service for parameter sharing among workers. Diagram labels: partitioned training examples; data parallel; learner 1, learner 2, ..., learner N; parameter exchange; training; model.
  51. Features to be supported in Hivemall v0.5: 1. Mix server on Apache YARN: service for parameter sharing among workers; 2. Online LDA (topic modeling, clustering); 3. XGBoost integration; 4. Generalized Linear Model (Ridge/Elastic net/Lasso regularization, supports various loss functions); 5. Alternating Direction Method of Multipliers (ADMM) convex optimization; 6. t-SNE dimension reduction
  52. Analytics Workflow: Machine learning workflows can be simplified using our new workflow engine, named Digdag
      +main:
        +prepare:
          _parallel: true
          +train:
            td>: ./tasks/train_join.sql
          +test:
            td>: ./tasks/test_join.sql
        +quantify:
          td>: ./tasks/train_quantify.sql
        +model_test_quantify:
          _parallel: true
          +model:
            td>: ./tasks/make_model.sql
          +test_quantify:
            td>: ./tasks/test_quantify.sql
        +pred:
          td>: ./tasks/prediction.sql
      CLI version will be released soon. Stay tuned!
  53. Conclusion and Takeaway. Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs. Hivemall’s positioning: for SQL users that need ML, with ease of use and scalability in mind. Treasure Data provides ML-as-a-Service using Hivemall. Major development leaps in v0.4: Random Forest, Factorization Machine. More will follow in v0.4.1 and later.
  54. Blog article about Hivemall: http://blog-jp.treasuredata.com/ A series on solving Kaggle challenges using TD, Hivemall, Jupyter, and Pandas-TD.
  55. We support machine learning in the Cloud. Any feature requests? Or questions?
