Introduction to Hivemall

Slide introducing Hivemall

Published in: Data & Analytics

  1. Hivemall: Scalable Machine Learning Library for Apache Hive
     Makoto YUI (@myui), Research Engineer, <myui@treasure-data.com>
     bit.ly/hivemall
  2. (no text on this slide)
  3. (no text on this slide)
  4. Treasure Data Cloud Service (architecture diagram)
     Diagram labels: External Integrations; SQL Server, CRM, RDBMS, App log, Sensor, Apache log, ERP; Hive, Batch, Adhoc, Presto; API, ODBC, JDBC, PUSH; Treasure Agent; BI tools, Data analysis; Treasure Data Collectors; Embedded, Embulk, Mobile SDK, JS SDK; Machine Learning.
     900,000 records stored per sec.
  5. Agenda
     1. What is Hivemall (short intro.)
     2. Why Hivemall (motivations etc.)
     3. Hivemall Internals
     4. How to use Hivemall
  6. What is Hivemall
     Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2.
     Stack diagram, by layer:
       Machine Learning: Hivemall, MLlib
       Query Processing: Hive, Pig, SparkSQL
       Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
       Resource Management: Apache YARN, MESOS
       Distributed File System: Hadoop HDFS
  7. Won IDG’s InfoWorld 2014 Bossie Awards: The best open source big data tools
     InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem (awarded along w/ Spark, Tez, Jupyter notebook, Pandas, Impala, Kafka).
     bit.ly/hivemall-award
  8. List of supported Algorithms
     Classification
       ✓ Perceptron
       ✓ Passive Aggressive (PA, PA1, PA2)
       ✓ Confidence Weighted (CW)
       ✓ Adaptive Regularization of Weight Vectors (AROW)
       ✓ Soft Confidence Weighted (SCW)
       ✓ AdaGrad+RDA
       ✓ Factorization Machines
       ✓ RandomForest Classification
     Regression
       ✓ Logistic Regression (SGD)
       ✓ PA Regression
       ✓ AROW Regression
       ✓ AdaGrad (logistic loss)
       ✓ AdaDELTA (logistic loss)
       ✓ Factorization Machines
       ✓ RandomForest Regression
  9. List of supported Algorithms
     Classification
       ✓ Perceptron
       ✓ Passive Aggressive (PA, PA1, PA2)
       ✓ Confidence Weighted (CW)
       ✓ Adaptive Regularization of Weight Vectors (AROW)
       ✓ Soft Confidence Weighted (SCW)
       ✓ AdaGrad+RDA
       ✓ Factorization Machines
       ✓ RandomForest Classification
     Regression
       ✓ Logistic Regression (SGD)
       ✓ AdaGrad (logistic loss)
       ✓ AdaDELTA (logistic loss)
       ✓ PA Regression
       ✓ AROW Regression
       ✓ Factorization Machines
       ✓ RandomForest Regression
     Tips
       SCW is a good first choice.
       Try RandomForest if SCW does not work.
       Logistic regression is good for getting the probability of the positive class.
       Factorization Machines are good where features are sparse and categorical.
  10. List of Algorithms for Recommendation
      K-Nearest Neighbor
        ✓ Minhash and b-Bit Minhash (LSH variant)
        ✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
      Matrix Completion
        ✓ Matrix Factorization
        ✓ Factorization Machines (regression)
      The each_top_k function of Hivemall is useful for recommending top-k items.
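To make the "top-k items" remark on slide 10 concrete, here is a minimal plain-Python sketch of the general idea: keep the k highest-scoring items per group (for example, per user). It is not Hivemall's `each_top_k` implementation; the function name `top_k_per_group`, the input layout, and the sample scores are assumptions made for illustration, and the real UDTF's arguments and output columns should be checked in the Hivemall documentation.

```python
import heapq
from collections import defaultdict


def top_k_per_group(rows, k):
    """Keep the k highest-scoring items for each group.

    rows: iterable of (group, score, item) tuples (hypothetical layout).
    Returns {group: [(rank, score, item), ...]} with rank starting at 1.
    """
    by_group = defaultdict(list)
    for group, score, item in rows:
        by_group[group].append((score, item))

    ranked = {}
    for group, scored in by_group.items():
        # nlargest returns the k best entries in descending score order.
        best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
        ranked[group] = [
            (rank, score, item)
            for rank, (score, item) in enumerate(best, start=1)
        ]
    return ranked


# Made-up predicted scores, for illustration only.
predicted = [
    ("user1", 0.9, "itemA"),
    ("user1", 0.4, "itemB"),
    ("user1", 0.7, "itemC"),
    ("user2", 0.2, "itemA"),
    ("user2", 0.8, "itemB"),
]
top2 = top_k_per_group(predicted, k=2)
```

In a recommendation setting the group would typically be the user and the score a predicted rating or similarity.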
  11. Other Supported Algorithms
      Anomaly Detection
        ✓ Local Outlier Factor (LoF)
      Feature Engineering
        ✓ Feature Hashing
        ✓ Feature Scaling (normalization, z-score)
        ✓ TF-IDF vectorizer
        ✓ Polynomial Expansion (Feature Pairing)
        ✓ Amplifier
      NLP
        ✓ Basic English text Tokenizer
        ✓ Japanese Tokenizer (Kuromoji)
  12. Industry use cases of Hivemall
      - CTR prediction of Ad click logs
        • Algorithm: Logistic regression
        • Freakout Inc. and more
      - Gender prediction of Ad click logs
        • Algorithm: Classification
        • Scaleout Inc.
      - Churn Detection
        • Algorithm: Regression
        • OISIX and more
      - Item/User recommendation
        • Algorithm: Recommendation (Matrix Factorization / kNN)
        • Adtech Companies, ISP portal, and more
      - Value prediction of Real estates
        • Algorithm: Regression
        • Livesense
  13. Agenda
      1. What is Hivemall (short intro.)
      2. Why Hivemall (motivations etc.)
      3. Hivemall Internals
      4. How to use Hivemall
  14. Why Hivemall
      1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python is not as scalable as Hive.
      2. Why not run ML algorithms inside Hive? Fewer components to manage and more scalable. That’s why I built Hivemall.
  15. How I used to do ML projects before Hivemall
      Given raw data stored on Hadoop HDFS.
      Diagram: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector file (height:173cm, weight:60kg, age:34, gender: man, …) → Machine Learning
  16. How I used to do ML projects before Hivemall
      Given raw data stored on Hadoop HDFS.
      Diagram: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector file (height:173cm, weight:60kg, age:34, gender: man, …) → Machine Learning
      Need to do expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory).
  17. How I used to do ML projects before Hivemall
      Given raw data stored on Hadoop HDFS.
      Diagram: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector file (height:173cm, weight:60kg, age:34, gender: man, …)
      Does not scale.
      Have to learn R/Python APIs.
  18. How I used to do ML before Hivemall
      Given raw data stored on Hadoop HDFS.
      Diagram: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector (height:173cm, weight:60kg, age:34, gender: man, …)
      Does not meet my needs in terms of scalability, ML algorithms, and usability.
      I ❤ scalable SQL query
  19. Survey on existing ML frameworks
      Framework                             User interface
      Mahout                                Java API programming
      Spark MLlib/MLI                       Scala API programming, Scala Shell (REPL)
      H2O                                   R programming, GUI
      Cloudera Oryx                         HTTP REST API programming
      Vowpal Wabbit (w/ Hadoop streaming)   C++ API programming, Command Line
      Existing distributed machine learning frameworks are NOT easy to use.
  20. Hivemall’s Vision: ML on SQL
      Classification with Mahout
        CREATE TABLE lr_model AS
        SELECT
          feature, -- reducers perform model averaging in parallel
          avg(weight) as weight
        FROM (
          SELECT logress(features,label,..) as (feature,weight)
          FROM train
        ) t -- map-only task
        GROUP BY feature; -- shuffled to reducers
      ✓ Machine Learning made easy for SQL developers (ML for the rest of us)
      ✓ Interactive and Stable APIs w/ SQL abstraction
      This SQL query automatically runs in parallel on Hadoop.
  21. Hivemall on Apache Spark
      Installation is very easy as follows:
        $ spark-shell --packages maropu:hivemall-spark:0.0.6
  22. Agenda
      1. What is Hivemall
      2. Why Hivemall
      3. Hivemall Internals
      4. How to use Hivemall
  23. How Hivemall works in training
      Machine learning algorithms are implemented as User-Defined Table-generating Functions (UDTFs). A UDTF is a function that returns a relation.
      Diagram: Training table of tuples <label, array<features>> (e.g. +1, <1,2>; +1, <1,7,9>; -1, <1,3,9>; +1, <3,8>) → train UDTFs → tuples <feature, weights> → shuffle by feature → param-mix → prediction model, a relation <feature, weights>
      ● The resulting prediction model is a relation of features and their weights
      ● The numbers of mappers and reducers are configurable
      Parallelism is Powerful
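The training flow on slide 23 (parallel train tasks emit feature/weight tuples, which are shuffled by feature and mixed) can be sketched in plain Python as follows. This is only an illustration of mixing by simple averaging, as in the `avg(weight) ... GROUP BY feature` queries on the other slides; the function name `mix_by_average` and the sample weights are assumptions, not Hivemall code.

```python
from collections import defaultdict


def mix_by_average(mapper_outputs):
    """Average the weights emitted for each feature across all train tasks.

    mapper_outputs: one list of (feature, weight) tuples per train task,
    standing in for the rows a training UDTF would emit.
    Returns the mixed model as {feature: averaged_weight}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for emitted in mapper_outputs:
        for feature, weight in emitted:
            # Grouping by feature plays the role of "shuffle by feature".
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Two hypothetical train tasks, each of which saw a different data split.
mapper_outputs = [
    [(1, 0.5), (2, -1.0)],
    [(1, 1.5), (9, 2.0)],
]
model = mix_by_average(mapper_outputs)
```

Because the model is just a relation of feature and weight, the mixing step needs nothing beyond a group-by and an aggregate.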
  24. Alternative Approach in Hivemall
      Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without several MapReduce steps.
        SET hivevar:xtimes=3;
        CREATE VIEW training_x3 as
        SELECT * FROM (
          SELECT amplify(${xtimes}, *) as (rowid, label, features)
          FROM training
        ) t
        CLUSTER BY rand()
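What the view on slide 24 produces can be pictured with a short plain-Python sketch: every training row is repeated `xtimes` times and the result is put in random order, so that a single pass over the amplified data resembles several passes over the original. The helper names and the `seed` parameter are assumptions for illustration; this is not the amplify UDTF itself.

```python
import random


def amplify(xtimes, rows):
    """Yield every input row xtimes times."""
    for row in rows:
        for _ in range(xtimes):
            yield row


def amplify_and_shuffle(xtimes, rows, seed=None):
    """Repeat each row xtimes times, then shuffle the result.

    The shuffle stands in for CLUSTER BY rand(), which spreads the
    repeated rows randomly instead of leaving copies next to each other.
    """
    amplified = list(amplify(xtimes, rows))
    random.Random(seed).shuffle(amplified)
    return amplified


# Made-up training rows: (rowid, label, features).
training = [
    (1, 1.0, ("a:1.0", "b:0.5")),
    (2, 0.0, ("c:1.0",)),
]
training_x3 = amplify_and_shuffle(3, training, seed=42)
```

Shuffling matters because online learners update the model row by row; seeing three identical rows in succession is less useful than seeing them interleaved with the rest of the data.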
  25. Agenda
      1. What is Hivemall
      2. Why Hivemall
      3. Hivemall Internals
      4. How to use Hivemall
  26. How to use Hivemall: Data preparation
      Workflow diagram labels: Machine Learning; Training; Prediction; Prediction Model; Label; Feature Vector.
  27. How to use Hivemall - Data preparation
      Define a Hive table for training/testing data:
        CREATE EXTERNAL TABLE e2006tfidf_train (
          rowid int,
          label float,
          features ARRAY<STRING>
        )
        ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\t'
          COLLECTION ITEMS TERMINATED BY ","
        STORED AS TEXTFILE
        LOCATION '/dataset/E2006-tfidf/train';
  28. How to use Hivemall: Feature Engineering
      Workflow diagram labels: Machine Learning; Training; Prediction; Prediction Model; Label; Feature Vector.
  29. How to use Hivemall - Feature Engineering
      Applying Min-Max Feature Normalization: transforming a label value to a value between 0.0 and 1.0.
        create view e2006tfidf_train_scaled
        as
        select
          rowid,
          rescale(label,${min_label},${max_label}) as label,
          features
        from
          e2006tfidf_train;
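The min-max normalization named on slide 29 maps a value v with known bounds to (v - min) / (max - min), which lies between 0.0 and 1.0. A minimal plain-Python sketch follows; the function name `min_max_rescale` and the choice to raise an error when min equals max are assumptions, and Hivemall's own rescale may treat that edge case differently.

```python
def min_max_rescale(value, min_value, max_value):
    """Map value from [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption for this sketch: a zero-width range is an error.
        raise ValueError("min_value and max_value must differ")
    return (value - min_value) / (max_value - min_value)


# Hypothetical label bounds, standing in for ${min_label} and ${max_label}.
min_label, max_label = -2.0, 6.0
scaled = [min_max_rescale(v, min_label, max_label) for v in (-2.0, 2.0, 6.0)]
```

In the Hive query the two bounds are passed in as variables, so they would be computed beforehand, for example with a separate min/max query over the label column.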
  30. How to use Hivemall: Training
      Workflow diagram labels: Machine Learning; Training; Prediction; Prediction Model; Label; Feature Vector.
  31. How to use Hivemall - Training
      Training by logistic regression:
        CREATE TABLE lr_model AS
        SELECT
          feature,
          avg(weight) as weight
        FROM (
          SELECT logress(features,label,..) as (feature,weight)
          FROM train
        ) t
        GROUP BY feature
      The inner query is a map-only task that learns a prediction model.
      GROUP BY feature shuffles map outputs to reducers by feature.
      Reducers perform model averaging in parallel.
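To give a feel for what a single map task does in the training query on slide 31, here is a plain-Python sketch of logistic regression trained by stochastic gradient descent over sparse features, emitting (feature, weight) pairs at the end. It is an illustration under assumptions, not Hivemall's logress: the fixed learning rate `eta`, the `epochs` parameter, the "name:value" feature strings, and the rule that a feature without a value counts as 1.0 are all choices made for this sketch.

```python
import math


def parse_feature(feature):
    """Split "name:value" into (name, value); a bare name gets value 1.0."""
    name, _, value = feature.partition(":")
    return name, float(value) if value else 1.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logress(rows, eta=0.1, epochs=1):
    """Train on rows of (features, label) with label in [0, 1].

    Returns a list of (feature, weight) tuples, mirroring the relation
    that a training UDTF emits.
    """
    weights = {}
    for _ in range(epochs):
        for features, label in rows:
            parsed = [parse_feature(f) for f in features]
            z = sum(weights.get(name, 0.0) * value for name, value in parsed)
            error = label - sigmoid(z)
            # SGD step: move each active weight along the gradient.
            for name, value in parsed:
                weights[name] = weights.get(name, 0.0) + eta * error * value
    return list(weights.items())


# Tiny made-up training set.
train = [
    (["a:1.0"], 1.0),
    (["b:1.0"], 0.0),
]
emitted = dict(train_logress(train, eta=0.1, epochs=10))
```

Each map task would emit such pairs for its own split of the data, and the outer `avg(weight) ... GROUP BY feature` then combines them into one model.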
  32. How to use Hivemall - Training
      Training of the Confidence Weighted (CW) classifier:
        CREATE TABLE news20b_cw_model1 AS
        SELECT
          feature,
          voted_avg(weight) as weight
        FROM
          (SELECT
             train_cw(features,label) as (feature,weight)
           FROM
             news20b_train
          ) t
        GROUP BY feature
      voted_avg votes on whether to use the negative or the positive weights for the average, e.g. +0.7, +0.3, +0.2, -0.1, +0.7.
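The voting idea on slide 32 can be sketched in plain Python: count how many of a feature's weights are positive and how many are negative, then average only the majority side. The function name `voted_average`, the handling of ties (the negative side is used here), and the handling of zero weights (ignored) are assumptions of this sketch, not a description of the voted_avg implementation.

```python
def voted_average(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0
    # Assumption: on a tie the negative side is used.
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)


# The example weights from the slide: four positive votes, one negative.
mixed = voted_average([0.7, 0.3, 0.2, -0.1, 0.7])
```

For the slide's example the positive side wins, so the result is the mean of 0.7, 0.3, 0.2 and 0.7, roughly 0.475, and the single -0.1 is discarded rather than dragging the average down.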
  33. How to use Hivemall: Prediction
      Workflow diagram labels: Machine Learning; Training; Prediction; Prediction Model; Label; Feature Vector.
  34. How to use Hivemall - Prediction
        CREATE TABLE lr_predict
        as
        SELECT
          t.rowid,
          sigmoid(sum(m.weight)) as prob
        FROM
          testing_exploded t LEFT OUTER JOIN
          lr_model m ON (t.feature = m.feature)
        GROUP BY
          t.rowid
      Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model.
      No need to load the entire model into memory.
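The join-based prediction on slide 34 can be mimicked in plain Python to show what the query computes: each (rowid, feature) pair of the exploded test table looks up its weight in the model, the weights are summed per row, and the sum is passed through the sigmoid. The function name `predict` and the sample data are assumptions. One deliberate simplification: a row whose features are all missing from the model gets a sum of 0.0 here (probability 0.5), whereas in SQL the sum over only NULLs would itself be NULL.

```python
import math
from collections import defaultdict


def predict(testing_exploded, model):
    """Return {rowid: probability} for exploded test rows.

    testing_exploded: iterable of (rowid, feature), one pair per feature.
    model: {feature: weight}, standing in for the lr_model table.
    """
    totals = defaultdict(float)
    for rowid, feature in testing_exploded:
        # LEFT OUTER JOIN: features absent from the model contribute nothing.
        totals[rowid] += model.get(feature, 0.0)
    return {
        rowid: 1.0 / (1.0 + math.exp(-total))
        for rowid, total in totals.items()
    }


# Made-up model and test rows.
lr_model = {"a": 1.0, "b": -1.0}
testing_exploded = [
    (1, "a"), (1, "b"),   # weights cancel out
    (2, "unseen"),        # feature not in the model
    (3, "a"),             # positive evidence only
]
lr_predict = predict(testing_exploded, lr_model)
```

In Hive the same lookup is a distributed join, which is why the model never has to fit in the memory of a single process.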
  35. Real-time prediction
      Batch Training on Hadoop → Export prediction model → Online Prediction on RDBMS
      Workflow diagram labels: Machine Learning; Prediction Model; Label; Feature Vector.
      bit.ly/hivemall-rtp
  36. Conclusion
      Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs.
      Hivemall’s Positioning
      - For SQL users that need ML
      - For those already using Hive
      - Ease of use and scalability in mind
      Does not require coding, packaging, compiling, or introducing a new programming language or APIs.
  37. Thank you!
      Makoto YUI - Research engineer / Treasure Data
      twitter: @myui
      Download Hivemall from bit.ly/hivemall
