Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig

Makoto YUI @myui, Research Engineer, Treasure Data
Takeshi Yamamuro @maropu, Research Engineer, NTT
2016/10/26 Hadoop Summit '16, Tokyo
#hivemall

Plan of the talk
1. Introduction to Hivemall
2. Hivemall on Spark

Hivemall enters Apache Incubator
on Sept 13, 2016 🎉
http://incubator.apache.org/projects/hivemall.html
@ApacheHivemall

Initial committers
• Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT>
  Hivemall on Apache Spark
• Daniel Dai <Hortonworks>
  Hivemall on Apache Pig
  Apache Pig PMC member
• Tsuyoshi Ozawa <NTT>
  Apache Hadoop PMC member
• Kai Sasaki <Treasure Data>

Champion
Nominated Mentors
Project mentors
• Reynold Xin <Databricks, ASF member>
Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member>
Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member>
Apache Spark PMC member
• Roman Shaposhnik <Pivotal, ASF member>
Apache Bigtop/Incubator PMC member

What is Apache Hivemall
Scalable machine learning library built
as a collection of Hive UDFs
github.com/myui/hivemall
hivemall.incubator.apache.org

Hivemall’s Vision: ML on SQL
Classification with Mahout
CREATE TABLE lr_model AS
SELECT
  feature,               -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t                      -- map-only task
GROUP BY feature;        -- shuffled to reducers
✓ Machine Learning made easy for SQL developers
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in
parallel on Hadoop
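
The map-then-average pattern in the query above can be sketched outside Hive. The following is a minimal Python illustration, assuming plain SGD for logistic regression on each partition; the function names, learning rate, and toy data are hypothetical and are not Hivemall's implementation of logress.

```python
import math
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_partition(rows, eta=0.1, epochs=20):
    """Map side: fit one model with SGD on a single partition and
    emit (feature, weight) pairs, like the inner SELECT."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(weights[f] * v for f, v in features.items())
            gradient = sigmoid(z) - label
            for f, v in features.items():
                weights[f] -= eta * gradient * v
    return list(weights.items())


def average_models(emitted):
    """Reduce side: the equivalent of GROUP BY feature with avg(weight)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for feature, weight in emitted:
        totals[feature] += weight
        counts[feature] += 1
    return {f: totals[f] / counts[f] for f in totals}


# Two toy partitions of (sparse feature dict, label) rows (made-up data).
partitions = [
    [({"a": 1.0}, 1), ({"b": 1.0}, 0)],
    [({"a": 1.0, "c": 1.0}, 1), ({"b": 1.0}, 0)],
]
emitted = [pair for part in partitions for pair in train_partition(part)]
model = average_models(emitted)
```

Each partition trains independently, so the map side needs no coordination; only the small (feature, weight) pairs are shuffled to the reducers.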

Hivemall’s Technology Stack
• Machine Learning: Hivemall, MLlib
• Query Processing: Hive, Pig, SparkSQL
• Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
• Resource Management: Apache YARN, MESOS
• Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3

List of supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice.
Try RandomForest if SCW does not work.
Logistic regression is good for getting the probability of a positive class.
Factorization Machines work well when features are sparse and categorical.

List of Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
The each_top_k function of Hivemall is useful for recommending top-k items.

Top-k query processing
List top-2 students for each class

Input:
student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60

Output:
student  class  score
3        a      90
2        a      80
1        b      70
6        b      60

SELECT * FROM (
  SELECT
    *,
    rank() over (partition by class order by score desc) as rank
  FROM table
) t
WHERE rank <= 2

SELECT
  each_top_k(
    2, class, score,
    class, student
  ) as (rank, score, class, student)
FROM (
  SELECT * FROM table
  DISTRIBUTE BY class SORT BY class
) t

Top-k query processing by RANK OVER()
1. Partition by class
2. On each node: sort by class, score
3. rank over()
4. Filter by rank <= 2

Top-k query processing by EACH_TOP_K
1. Distribute by class
2. On each node: sort by class
3. each_top_k
4. Output only K items

Comparison between RANK and EACH_TOP_K
• RANK OVER(): sort by class, score, then rank over(), then filter by rank <= 2. Sorting is heavy, and all rows need to be processed.
• EACH_TOP_K: distribute by class, sort by class, then each_top_k, which outputs only K items. A bounded priority queue is utilized.
each_top_k is very efficient when the number of classes is large.
Performance reported by TD customer
• 1,000 students in each class
• 20 million classes
RANK over() query does not finish in 24 hours
EACH_TOP_K finishes in 2 hours
For details, see
https://speakerdeck.com/kaky0922/hivemall-meetup-20160908

Other Supported Algorithms
Anomaly Detection
✓ Local Outlier Factor (LOF)
Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier
NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)
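
As a small illustration of the two feature-scaling variants listed above, here is a minimal Python sketch of min-max normalization and z-score standardization. The function names are hypothetical, the z-score uses the population standard deviation by assumption, and the input is the score column of the student table used elsewhere in this deck.

```python
import math


def min_max(values):
    """Rescale values linearly into [0, 1]; assumes max > min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def zscore(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


scores = [70, 80, 90, 50, 70, 60]
normalized = min_max(scores)
standardized = zscore(scores)
```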

Industry use cases of Hivemall
CTR prediction of Ad click logs
• Algorithm: Logistic regression
• Freakout Inc., Smartnews, and more
Gender prediction of Ad click logs
• Algorithm: Classification
• Scaleout Inc.
http://www.slideshare.net/eventdotsjp/hivemall

Item/User recommendation
• Algorithm: Recommendation
• Wish.com, GMO pepabo
Problem: Recommending hot items is hard in a handcrafted-product market because each creator sells only a few one-off items, which soon go out of stock
minne.com

Value prediction of Real estates
• Algorithm: Regression
• Livesense

User score calculation
• Algorithm: Regression
• Klout
bit.ly/klout-hivemall
Influencer marketing
klout.com

Anomaly/Change-point Detection by ChangeFinder
Efficient algorithm for finding change points and outliers in time-series data
J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.

Change-point detection by Singular Spectrum Transformation
• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.

Evaluation Metrics

Feature Engineering – Feature Binning
Maps quantitative variables to a fixed number of bins based on quantiles/distribution
Example: map ages into 3 bins
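
Quantile-based binning can be sketched as follows. This is a minimal Python illustration with made-up ages and hypothetical helper names; it picks boundaries at equal-frequency positions in the sorted data and is not Hivemall's implementation.

```python
import bisect


def quantile_boundaries(values, num_bins):
    """Pick num_bins - 1 cut points so each bin holds roughly the same
    number of values (equal-frequency binning)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // num_bins] for i in range(1, num_bins)]


def to_bin(value, boundaries):
    """Return the 0-based bin index of value."""
    return bisect.bisect_right(boundaries, value)


ages = [18, 22, 25, 31, 38, 45, 52, 60, 71]  # hypothetical data
boundaries = quantile_boundaries(ages, 3)
bins = [to_bin(age, boundaries) for age in ages]
```

With these nine ages the cut points fall at 31 and 52, so each of the 3 bins receives three values.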

Feature Selection – Signal Noise Ratio

Feature Selection – Chi-Square

Feature Transformation – Onehot encoding
Maps a categorical variable to a unique number starting from 1
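
The mapping described above can be sketched in Python. The helper names are hypothetical, and assigning numbers in sorted order of the category values is an assumption made here for determinism; the slides do not specify the order.

```python
def build_index(values):
    """Assign each distinct category a unique number starting from 1."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}


def onehot(value, index):
    """Expand a category into a 0/1 indicator vector using the index."""
    vector = [0] * len(index)
    vector[index[value] - 1] = 1
    return vector


classes = ["b", "a", "a", "b", "a", "b"]
index = build_index(classes)
encoded = [index[c] for c in classes]
```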

Other new features to come
• Spark 2.0 support
• XGBoost Integration
• Field-aware Factorization Machines
• Generalized Linear Model
  • Optimizer framework including ADAM
  • L1/L2 regularization

Copyright©2016 NTT corp. All Rights Reserved.
Hivemall on Spark
Takeshi Yamamuro @ NTT
Hadoop Summit 2016
Tokyo, Japan

Who am I?

What’s Spark?
1. Unified Engine
• supports end-to-end applications, e.g., MLlib and Streaming
2. High-level APIs
• easy to use, rich optimization
3. Integrates broadly
• storage systems, libraries, ...

What’s Hivemall on Spark?
• Hivemall wrapper for Spark
  • Wrapper implementations for DataFrame/SQL
  • Plus some utilities for ease of use in Spark
• The wrapper lets you...
  • run most Hivemall functions in Spark
  • try examples easily on your laptop
  • improve the performance of some functions in Spark

Why Hivemall on Spark?
• Hivemall already has many fascinating ML algorithms and useful utilities
• High barriers to adding newer algorithms to MLlib
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Current Status
• Most Hivemall functions are supported in Spark v1.6 and v2.0
- For Spark v2.0 (use -Pspark-1.6 for Spark v1.6)
$ git clone https://github.com/myui/hivemall
$ cd hivemall
$ mvn package -Pspark-2.0 -DskipTests
$ ls target/*spark*
target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar
...

Running an Example
1. Download a Spark binary
2. Fetch training and test data
3. Load these data in Spark
4. Build a model
5. Do predictions

Running an Example: 1. Download a Spark binary
http://spark.apache.org/downloads.html

Running an Example: 2. Fetch training and test data
• E2006 tfidf regression dataset
• http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2

Running an Example: 3. Load data in Spark
$ <SPARK_HOME>/bin/spark-shell \
    --conf spark.jars=hivemall-spark-2.0_2.11-XXX-with-dependencies.jar
// Create a DataFrame from the bzip2-compressed, libsvm-formatted file
scala> val trainDf = sqlContext.sparkSession.read.format("libsvm")
     |   .load("E2006.train.bz2")
scala> trainDf.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

Running an Example: 3. Load data in Spark
Sample line of the libsvm-formatted input:
0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335…
trainDf is split into Partition1, Partition2, Partition3, …, PartitionN
Loaded in parallel because bzip2 is splittable
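
To make the libsvm text format concrete, here is a minimal Python parser sketch. It is only an illustration of the "label index:value ..." layout (the function name is hypothetical, and Spark's own libsvm reader is what the example actually uses); the sample string is the beginning of the line shown on the slide.

```python
def parse_libsvm_line(line):
    """Split 'label idx:val idx:val ...' into (label, {idx: val})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


line = ("0.000357499151147113 6066:0.00079327062196048 "
        "6069:0.000311377727123504")
label, features = parse_libsvm_line(line)
```

Only the non-zero entries are stored, which is why the format suits sparse TF-IDF features like those in E2006.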

Running an Example: 4. Build a model - DataFrame
scala> :paste
val modelDf = trainDf.train_logregr($"features", $"label")
  .groupBy("feature")
  .agg("weight" -> "avg")
  .toDF("feature", "weight")  // rename avg(weight) back to weight for the later join

Running an Example: 5. Do predictions - DataFrame
scala> :paste
// testDf is assumed to be loaded from E2006.test.bz2 in the same way as trainDf
val df = testDf.select(rowid(), $"features")
  .explode_vector($"features")
  .cache

// Do predictions
df.join(modelDf, df("feature") === modelDf("feature"), "LEFT_OUTER")
  .groupBy("rowid")
  .agg(sigmoid(sum($"weight" * $"value")))
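
The prediction step above (explode the feature vector, join with the model on feature, sum weight * value per row, apply the sigmoid) can be traced with a small Python sketch. The names and toy numbers are hypothetical; treating a feature missing from the model as contributing 0 is an assumption of this sketch, whereas a SQL left outer join would yield NULL for it.

```python
import math
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def explode(rows):
    """Turn (rowid, {feature: value}) into (rowid, feature, value) rows."""
    for rowid, features in rows:
        for feature, value in features.items():
            yield rowid, feature, value


def predict(test_rows, model):
    """Join exploded rows with the model, then aggregate per rowid."""
    totals = defaultdict(float)
    for rowid, feature, value in explode(test_rows):
        # Features absent from the model contribute 0 (assumption).
        totals[rowid] += model.get(feature, 0.0) * value
    return {rowid: sigmoid(total) for rowid, total in totals.items()}


model = {1: 2.0, 2: -1.0}  # feature -> averaged weight (toy values)
test_rows = [
    ("r1", {1: 1.0}),
    ("r2", {1: 1.0, 2: 2.0}),
    ("r3", {3: 5.0}),
]
predictions = predict(test_rows, model)
```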

Running an Example: 5. Do predictions - SQL
scala> modelDf.createOrReplaceTempView("ModelTable")
scala> df.createOrReplaceTempView("TestTable")
scala> :paste
sql("""
  | SELECT rowid, sigmoid(sum(value * weight)) AS predicted
  | FROM TestTable t
  | LEFT OUTER JOIN ModelTable m
  | ON t.feature = m.feature
  | GROUP BY rowid
""".stripMargin)

Improve Some Functions in Spark
• Spark incurs overhead when calling Hive UD*Fs
• Hivemall heavily depends on them
• ex.1) Compute a sigmoid function
scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))
scala> val sparkUdf = functions.udf(sigmoidFunc)
scala> df.select(sparkUdf($"value"))

scala> val hiveUdf = HivemallOps.sigmoid
scala> df.select(hiveUdf($"value"))

• ex.2) Compute top-k for each key group
scala> :paste
df.withColumn(
  "rank",
  rank().over(Window.partitionBy($"key").orderBy($"score".desc))
).where($"rank" <= topK)

• Fixed the overhead issue for each_top_k
• See pr#353: "Implement EachTopK as a generator expression" in Spark
scala> df.each_top_k(topK, "key", "score", "value")

~4 times faster than rank!!

3rd-Party Library Integration
• XGBoost support is under development
  • fast implementation of gradient tree boosting
  • widely used in Kaggle competitions
• This integration will let you...
  • load built models and predict in parallel
  • build multiple models in parallel for cross-validation

Conclusion and Takeaway
Hivemall provides a collection of machine
learning algorithms as Hive UDFs/UDTFs
We welcome your contributions to Apache Hivemall 

Any feature requests or questions?
