SlideShare a Scribd company logo
1 of 57
Hivemall: Scalable machine learning
library for Apache Hive/Spark/Pig
1). Research Engineer, Treasure Data
Makoto YUI @myui
2016/10/26 Hadoop Summit '16, Tokyo
2). Research Engineer, NTT
Takashi Yamamuro @maropu
#hivemall
Plan of the talk
1. Introduction to Hivemall
2. Hivemall on Spark
2016/10/26 Hadoop Summit '16, Tokyo 2
2016/10/26 Hadoop Summit '16, Tokyo 3
Hivemall enters Apache Incubator
on Sept 13, 2016 🎉
http://incubator.apache.org/projects/hivemall.html
@ApacheHivemall
• Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT>
 Hivemall on Apache Spark
• Daniel Dai <Hortonworks>
 Hivemall on Apache Pig
 Apache Pig PMC member
• Tsuyoshi Ozawa <NTT>
Apache Hadoop PMC member
• Kai Sasaki <Treasure Data>
2016/10/26 Hadoop Summit '16, Tokyo 4
Initial committers
Champion
Nominated Mentors
2016/10/26 Hadoop Summit '16, Tokyo 5
Project mentors
• Reynold Xin <Databricks, ASF member>
Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member>
Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member>
Apache Spark PMC member
• Roman Shaposhnik <Pivotal, ASF member>
Apache Bigtop/Incubator PMC member
What is Apache Hivemall
Scalable machine learning library built
as a collection of Hive UDFs
2016/10/26 Hadoop Summit '16, Tokyo
github.com/myui/hivemall
hivemall.incubator.apache.org
Hivemall’s Vision: ML on SQL
2016/10/26 Hadoop Summit '16, Tokyo 7
Classification with Mahout
CREATE TABLE lr_model AS
SELECT
feature, -- reducers perform model averaging in parallel
avg(weight) as weight
FROM (
SELECT logress(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in
parallel on Hadoop
Hadoop HDFS
MapReduce
(MRv1)
Hivemall
Apache YARN
Apache Tez
DAG processing
Machine Learning
Query Processing
Parallel Data
Processing Framework
Resource Management
Distributed File System
Cloud Storage
SparkSQL
Apache Spark
MESOS
Hive Pig
MLlib
Hivemall’s Technology Stack
2016/10/26 Hadoop Summit '16, Tokyo 8
Amazon S3
List of supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight
Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
2016/10/26 Hadoop Summit '16, Tokyo 9
Regression
✓Logistic Regression (SGD)
✓AdaGrad (logistic loss)
✓AdaDELTA (logistic loss)
✓PA Regression
✓AROW Regression
✓Factorization Machines
✓RandomForest Regression
SCW is a good first choice
Try RandomForest if SCW does not work
Logistic regression is good for getting a probability of
a positive class
Factorization Machines is good where features are
sparse and categorical ones
List of Algorithms for Recommendation
2016/10/26 Hadoop Summit '16, Tokyo 10
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash
(LSH variant)
✓ Similarity Search on Vector
Space
(Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines
(regression)
each_top_k function of Hivemall is useful for
recommending top-k items
2016/10/26 Hadoop Summit '16, Tokyo 11
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
student class score
3 a 90
2 a 80
1 b 70
6 b 60
List top-2 students for each class
2016/10/26 Hadoop Summit '16, Tokyo 12
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
List top-2 students for each class
SELECT * FROM (
SELECT
*,
rank() over (partition by class order by score desc)
as rank
FROM table
) t
WHERE rank <= 2
2016/10/26 Hadoop Summit '16, Tokyo 13
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
List top-2 students for each class
SELECT
each_top_k(
2, class, score,
class, student
) as (rank, score, class, student)
FROM (
SELECT * FROM table
DISTRIBUTE BY class SORT BY class
) t
2016/10/26 Hadoop Summit '16, Tokyo 14
Top-k query processing by RANK OVER()
partition by class
Node 1
Sort by class, score
rank over()
rank >= 2
2016/10/26 Hadoop Summit '16, Tokyo 15
Top-k query processing by EACH_TOP_K
distributed by class
Node 1
Sort by class
each_top_k
OUTPUT only K
items
2016/10/26 Hadoop Summit '16, Tokyo 16
Comparison between RANK and EACH_TOP_K
distributed by class
Sort by class
each_top_k
Sort by class, score
rank over()
rank >= 2
SORTING IS HEAVY
NEED TO
PROCESS ALL
OUTPUT only K
items
Each_top_k is very efficient where the number of class is large
Bounded Priority Queue
is utilized
Performance reported by TD customer
2016/10/26 Hadoop Summit '16, Tokyo 17
•1,000 students in each class
•20 million classes
RANK over() query does not finishes in 24 hours 
EACH_TOP_K finishes in 2 hours 
Refer for detail
https://speakerdeck.com/kaky0922/hivemall-meetup-20160908
Other Supported Algorithms
2016/10/26 Hadoop Summit '16, Tokyo 18
Anomaly Detection
✓ Local Outlier Factor (LoF)
Feature Engineering
✓Feature Hashing
✓Feature Scaling
(normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion
(Feature Pairing)
✓ Amplifier
NLP
✓Basic Englist text Tokenizer
✓Japanese Tokenizer
(Kuromoji)
Industry use cases of Hivemall
CTR prediction of Ad click logs
• Algorithm: Logistic regression
• Freakout Inc., Smartnews, and more
Gender prediction of Ad click logs
• Algorithm: Classification
• Scaleout Inc.
2016/10/26 Hadoop Summit '16, Tokyo 19
http://www.slideshare.net/eventdotsjp/hivemall
Industry use cases of Hivemall
Item/User recommendation
• Algorithm: Recommendation
• Wish.com, GMO pepabo
2016/10/26 Hadoop Summit '16, Tokyo 20
Problem: Recommendation using hot-item is hard in hand-crafted
product market because each creator sells few single items (will
soon become out-of-stock)
minne.com
Industry use cases of Hivemall
Value prediction of Real estates
• Algorithm: Regression
• Livesense
2016/10/26 Hadoop Summit '16, Tokyo 21
Industry use cases of Hivemall
User score calculation
• Algrorithm: Regression
• Klout
2016/10/26 Hadoop Summit '16, Tokyo 22
bit.ly/klout-hivemall
Influencer marketing
klout.com
2016/10/26 Hadoop Summit '16, Tokyo 23
Efficient algorithm for finding change point and outliers from
timeseries data
2016/10/26 Hadoop Summit '16, Tokyo 24
J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE transactions on
Knowledge and Data Engineering, pp.482-492, 2006.
Anomaly/Change-point Detection by ChangeFinder
Efficient algorithm for finding change point and outliers from
timeseries data
2016/10/26 Hadoop Summit '16, Tokyo 25
Anomaly/Change-point Detection by ChangeFinder
J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE transactions on
Knowledge and Data Engineering, pp.482-492, 2006.
2016/10/26 Hadoop Summit '16, Tokyo 26
• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point
Correlations", Proc. SDM, 2005T.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
Change-point detection by Singular Spectrum Transformation
2016/10/26 Hadoop Summit '16, Tokyo 27
Evaluation Metrics
2016/10/26 Hadoop Summit '16, Tokyo 28
Feature Engineering – Feature Binning
Maps quantitative variables to fixed number of
bins based on quantiles/distribution
Map Ages into 3 bins
2016/10/26 Hadoop Summit '16, Tokyo 29
Feature Selection – Signal Noise Ratio
2016/10/26 Hadoop Summit '16, Tokyo 30
Feature Selection – Chi-Square
2016/10/26 Hadoop Summit '16, Tokyo 31
Feature Transformation – Onehot encoding
Maps a categorical variable to a
unique number starting from 1
 Spark 2.0 support
 XGBoost Integration
 Field-aware Factorization Machines
 Generalized Linear Model
• Optimizer framework including ADAM
• L1/L2 regularization
2016/10/26 Hadoop Summit '16, Tokyo 32
Other new features to come
Copyright©2016 NTT corp. All Rights Reserved.
Hivemall on
Takeshi Yamamuro @ NTT
Hadoop Summit 2016
Tokyo, Japan
34Copyright©2016 NTT corp. All Rights Reserved.
Who am I?
35Copyright©2016 NTT corp. All Rights Reserved.
What’s Spark?
• 1. Unified Engine
• support end-to-end APs, e.g., MLlib and Streaming
• 2. High-level APIs
• easy-to-use, rich optimization
• 3. Integrate broadly
• storages, libraries, ...
36Copyright©2016 NTT corp. All Rights Reserved.
• Hivemall wrapper for Spark
• Wrapper implementations for DataFrame/SQL
• + some utilities for easy-to-use in Spark
• The wrapper makes you...
• run most of Hivemall functions in Spark
• try examples easily in your laptop
• improve some function performance in Spark
What’s Hivemall on Spark?
37Copyright©2016 NTT corp. All Rights Reserved.
• Hivemall already has many fascinating ML
algorithms and useful utilities
• High barriers to add newer algorithms in MLlib
Why’s Hivemall on Spark?
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
38Copyright©2016 NTT corp. All Rights Reserved.
• Most of Hivemall functions supported in Spark
v1.6 and v2.0
Current Status
- For Spark v2.0
$ git clone https://github.com/myui/hivemall
$ cd hivemall
$ mvn package –Pspark-2.0 –DskipTests
$ ls target/*spark*
target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar
...
39Copyright©2016 NTT corp. All Rights Reserved.
• Most of Hivemall functions supported in Spark
v1.6 and v2.0
Current Status
- For Spark v2.0
$ git clone https://github.com/myui/hivemall
$ cd hivemall
$ mvn package –Pspark-2.0 –DskipTests
$ ls target/*spark*
target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar
...
-Pspark-1.6 for Spark v1.6
40Copyright©2016 NTT corp. All Rights Reserved.
• 1. Download a Spark binary
• 2. Fetch training and test data
• 3. Load these data in Spark
• 4. Build a model
• 5. Do predictions
Running an Example
41Copyright©2016 NTT corp. All Rights Reserved.
1. Download a Spark binary Running an Example
http://spark.apache.org/downloads.html
42Copyright©2016 NTT corp. All Rights Reserved.
• E2006 tfidf regression dataset
• http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/dat
asets/regression.html#E2006-tfidf
2. Fetch training and test data Running an Example
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
regression/E2006.test.bz2
43Copyright©2016 NTT corp. All Rights Reserved.
3. Load data in Spark Running an Example
$ <SPARK_HOME>/bin/spark-shell
-conf spark.jars=hivemall-spark-2.0_2.11-XXX-with-dependencies.jar
// Create DataFrame from the bzip’d libsvm-formatted file
scala> val trainDf = sqlContext.sparkSession.read.format("libsvm”)
.load(“E2006.train.bz2")
scala> trainDf.printSchema
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
44Copyright©2016 NTT corp. All Rights Reserved.
3. Load data in Spark Running an Example
0.000357499151147113 6066:0.0007932706219604
8 6069:0.000311377727123504 6070:0.0003067549
34580457 6071:0.000276992485786437 6072:0.000
39663531098024 6074:0.00039663531098024 6075
:0.00032548335…
trainDf
Partition1 Partition2 Partition3 PartitionN
…
…
…
Load in parallel because
bzip2 is splittable
45Copyright©2016 NTT corp. All Rights Reserved.
4. Build a model - DataFrame Running an Example
scala> paste:
val modelDf = trainDf.train_logregr($"features", $"label")
.groupBy("feature”)
.agg("weight" -> "avg”)
46Copyright©2016 NTT corp. All Rights Reserved.
4. Build a model - SQL Running an Example
scala> trainDf.createOrReplaceTempView("TrainTable")
scala> paste:
val modelDf = sql("""
| SELECT feature, AVG(weight) AS weight
| FROM (
| SELECT train_logregr(features, label)
| AS (feature, weight)
| FROM TrainTable
| )
| GROUP BY feature
""".stripMargin)
47Copyright©2016 NTT corp. All Rights Reserved.
5. Do predictions - DataFrame Running an Example
scala> paste:
val df = testDf.select(rowid(), $"features")
.explode_vector($"features")
.cache
# Do predictions
df.join(modelDf, df("feature") === model("feature"), "LEFT_OUTER")
.groupBy("rowid")
.avg(sigmoid(sum($"weight" * $"value")))
48Copyright©2016 NTT corp. All Rights Reserved.
5. Do predictions - SQL Running an Example
scala> modelDf.createOrReplaceTempView(”ModelTable")
scala> df.createOrReplaceTempView(”TestTable”)
scala> paste:
sql("""
| SELECT rowid, sigmoid(value * weight) AS predicted
| FROM TrainTable t
| LEFT OUTER JOIN ModelTable m
| ON t.feature = m.feature
| GROUP BY rowid
""".stripMargin)
49Copyright©2016 NTT corp. All Rights Reserved.
• Spark has overheads to call Hive UD*Fs
• Hivemall heavily depends on them
• ex.1) Compute a sigmoid function
Improve Some Functions in Spark
scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))
scala> val sparkUdf = functions.udf(sigmoidFunc)
scala> df.select(sparkUdf($“value”))
50Copyright©2016 NTT corp. All Rights Reserved.
• Spark has overheads to call Hive UD*Fs
• Hivemall heavily depends on them
• ex.1) Compute a sigmoid function
Improve Some Functions in Spark
scala> val hiveUdf = HivemallOps.sigmoid
scala> df.select(hiveUdf($“value”))
51Copyright©2016 NTT corp. All Rights Reserved.
• Spark has overheads to call Hive UD*Fs
• Hivemall heavily depends on them
• ex.1) Compute a sigmoid function
Improve Some Functions in Spark
52Copyright©2016 NTT corp. All Rights Reserved.
• Spark has overheads to call Hive UD*Fs
• Hivemall heavily depends on them
• ex.2) Compute top-k for each key group
Improve Some Functions in Spark
scala> paste:
df.withColumn(
“rank”,
rank().over(Window.partitionBy($"key").orderBy($"score".desc)
) .where($"rank" <= topK)
53Copyright©2016 NTT corp. All Rights Reserved.
• Spark has overheads to call Hive UD*Fs
• Hivemall heavily depends on them
• ex.2) Compute top-k for each key group
• Fixed the overhead issue for each_top_k
• See pr#353: “Implement EachTopK as a generator expression“ in Spark
Improve Some Functions in Spark
scala> df.each_top_k(topK, “key”, “score”, “value”)
54Copyright©2016 NTT corp. All Rights Reserved.
• Spark has overheads to call Hive UD*Fs
• Hivemall heavily depends on them
• ex.2) Compute top-k for each key group
Improve Some Functions in Spark
~4 times faster than rank!!
55Copyright©2016 NTT corp. All Rights Reserved.
• supports under development
• fast implementation of the gradient tree boosting
• widely used in Kaggle competitions
• This integration will make you...
• load built models and predict in parallel
• build multiple models in parallel
for cross-validation
3rd–Party Library Integration
Conclusion and Takeaway
Hivemall provides a collection of machine
learning algorithms as Hive UDFs/UDTFs
2016/10/26 Hadoop Summit '16, Tokyo 56
We welcome your contributions to Apache Hivemall 
2016/10/26 Hadoop Summit '16, Tokyo 57
Any feature request or questions?

More Related Content

What's hot

Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksDataWorks Summit
 
Hivemail: Scalable Machine Learning Library for Apache Hive
Hivemail: Scalable Machine Learning Library for Apache HiveHivemail: Scalable Machine Learning Library for Apache Hive
Hivemail: Scalable Machine Learning Library for Apache HiveDataWorks Summit
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...DataWorks Summit
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Sparkrhatr
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparktrihug
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudDataWorks Summit/Hadoop Summit
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 

What's hot (20)

Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Hivemail: Scalable Machine Learning Library for Apache Hive
Hivemail: Scalable Machine Learning Library for Apache HiveHivemail: Scalable Machine Learning Library for Apache Hive
Hivemail: Scalable Machine Learning Library for Apache Hive
 
Hadoop to spark-v2
Hadoop to spark-v2Hadoop to spark-v2
Hadoop to spark-v2
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on spark
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
 
Working with the Scalding Type -Safe API
Working with the Scalding Type -Safe APIWorking with the Scalding Type -Safe API
Working with the Scalding Type -Safe API
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 

Viewers also liked

Individual-In-The-Loop (for Ethically Aligned Artificial Intelligence)
Individual-In-The-Loop (for Ethically Aligned Artificial Intelligence)Individual-In-The-Loop (for Ethically Aligned Artificial Intelligence)
Individual-In-The-Loop (for Ethically Aligned Artificial Intelligence)John C. Havens
 
Introduction to Hadoop and Spark (before joining the other talk) and An Overv...
Introduction to Hadoop and Spark (before joining the other talk) and An Overv...Introduction to Hadoop and Spark (before joining the other talk) and An Overv...
Introduction to Hadoop and Spark (before joining the other talk) and An Overv...DataWorks Summit/Hadoop Summit
 
(Kpi summer school 2015) theano tutorial part2
(Kpi summer school 2015) theano tutorial part2(Kpi summer school 2015) theano tutorial part2
(Kpi summer school 2015) theano tutorial part2Serhii Havrylov
 
Path to 400M Members: LinkedIn’s Data Powered Journey
Path to 400M Members: LinkedIn’s Data Powered JourneyPath to 400M Members: LinkedIn’s Data Powered Journey
Path to 400M Members: LinkedIn’s Data Powered JourneyDataWorks Summit/Hadoop Summit
 
Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceMajor advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceDataWorks Summit/Hadoop Summit
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...DataWorks Summit/Hadoop Summit
 
Introduction to Hivemall
Introduction to HivemallIntroduction to Hivemall
Introduction to HivemallMakoto Yui
 
(Kpi summer school 2015) theano tutorial part1
(Kpi summer school 2015) theano tutorial part1(Kpi summer school 2015) theano tutorial part1
(Kpi summer school 2015) theano tutorial part1Serhii Havrylov
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemDataWorks Summit/Hadoop Summit
 
A3RT - the details and actual use cases of "Analytics & Artificial intelligen...
A3RT - the details and actual use cases of "Analytics & Artificial intelligen...A3RT - the details and actual use cases of "Analytics & Artificial intelligen...
A3RT - the details and actual use cases of "Analytics & Artificial intelligen...DataWorks Summit/Hadoop Summit
 

Viewers also liked (20)

Individual-In-The-Loop (for Ethically Aligned Artificial Intelligence)
Individual-In-The-Loop (for Ethically Aligned Artificial Intelligence)Individual-In-The-Loop (for Ethically Aligned Artificial Intelligence)
Individual-In-The-Loop (for Ethically Aligned Artificial Intelligence)
 
7 Myths of AI
7 Myths of AI7 Myths of AI
7 Myths of AI
 
The real world use of Big Data to change business
The real world use of Big Data to change businessThe real world use of Big Data to change business
The real world use of Big Data to change business
 
Introduction to Hadoop and Spark (before joining the other talk) and An Overv...
Introduction to Hadoop and Spark (before joining the other talk) and An Overv...Introduction to Hadoop and Spark (before joining the other talk) and An Overv...
Introduction to Hadoop and Spark (before joining the other talk) and An Overv...
 
(Kpi summer school 2015) theano tutorial part2
(Kpi summer school 2015) theano tutorial part2(Kpi summer school 2015) theano tutorial part2
(Kpi summer school 2015) theano tutorial part2
 
Path to 400M Members: LinkedIn’s Data Powered Journey
Path to 400M Members: LinkedIn’s Data Powered JourneyPath to 400M Members: LinkedIn’s Data Powered Journey
Path to 400M Members: LinkedIn’s Data Powered Journey
 
Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceMajor advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL compliance
 
To The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid AnalyticsTo The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid Analytics
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Data science lifecycle with Apache Zeppelin
Data science lifecycle with Apache ZeppelinData science lifecycle with Apache Zeppelin
Data science lifecycle with Apache Zeppelin
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
 
Apache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduceApache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduce
 
Comparison of Transactional Libraries for HBase
Comparison of Transactional Libraries for HBaseComparison of Transactional Libraries for HBase
Comparison of Transactional Libraries for HBase
 
Introduction to Hivemall
Introduction to HivemallIntroduction to Hivemall
Introduction to Hivemall
 
(Kpi summer school 2015) theano tutorial part1
(Kpi summer school 2015) theano tutorial part1(Kpi summer school 2015) theano tutorial part1
(Kpi summer school 2015) theano tutorial part1
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
 
A3RT - the details and actual use cases of "Analytics & Artificial intelligen...
A3RT - the details and actual use cases of "Analytics & Artificial intelligen...A3RT - the details and actual use cases of "Analytics & Artificial intelligen...
A3RT - the details and actual use cases of "Analytics & Artificial intelligen...
 
Apache NiFi 1.0 in Nutshell
Apache NiFi 1.0 in NutshellApache NiFi 1.0 in Nutshell
Apache NiFi 1.0 in Nutshell
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
 

Similar to Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig

Hadoopsummit16 myui
Hadoopsummit16 myuiHadoopsummit16 myui
Hadoopsummit16 myuiMakoto Yui
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Makoto Yui
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Makoto Yui
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Gyula Fóra
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem DataWorks Summit/Hadoop Summit
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNHortonworks
 
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopHopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopJim Dowling
 
Scala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusScala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusBoldRadius Solutions
 
Apache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectosApache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectosOpenSistemas
 
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiApache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiSlim Baltagi
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersKumari Surabhi
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraDataStax Academy
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersIntel® Software
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonChristian Perone
 

Similar to Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig (20)

Hadoopsummit16 myui
Hadoopsummit16 myuiHadoopsummit16 myui
Hadoopsummit16 myui
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
 
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopHopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
 
Scala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusScala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadius
 
Apache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectosApache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectos
 
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiApache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
Data streaming
Data streamingData streaming
Data streaming
 

More from DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Recently uploaded

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 

Recently uploaded (20)

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 

Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig

  • 1. Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig 1). Research Engineer, Treasure Data Makoto YUI @myui 2016/10/26 Hadoop Summit '16, Tokyo 2). Research Engineer, NTT Takashi Yamamuro @maropu #hivemall
  • 2. Plan of the talk 1. Introduction to Hivemall 2. Hivemall on Spark 2016/10/26 Hadoop Summit '16, Tokyo 2
  • 3. 2016/10/26 Hadoop Summit '16, Tokyo 3 Hivemall enters Apache Incubator on Sept 13, 2016 🎉 http://incubator.apache.org/projects/hivemall.html @ApacheHivemall
  • 4. • Makoto Yui <Treasure Data> • Takeshi Yamamuro <NTT>  Hivemall on Apache Spark • Daniel Dai <Hortonworks>  Hivemall on Apache Pig  Apache Pig PMC member • Tsuyoshi Ozawa <NTT> Apache Hadoop PMC member • Kai Sasaki <Treasure Data> 2016/10/26 Hadoop Summit '16, Tokyo 4 Initial committers
  • 5. Champion Nominated Mentors 2016/10/26 Hadoop Summit '16, Tokyo 5 Project mentors • Reynold Xin <Databricks, ASF member> Apache Spark PMC member • Markus Weimer <Microsoft, ASF member> Apache REEF PMC member • Xiangrui Meng <Databricks, ASF member> Apache Spark PMC member • Roman Shaposhnik <Pivotal, ASF member> Apache Bigtop/Incubator PMC member
  • 6. What is Apache Hivemall Scalable machine learning library built as a collection of Hive UDFs 2016/10/26 Hadoop Summit '16, Tokyo github.com/myui/hivemall hivemall.incubator.apache.org
  • 7. Hivemall’s Vision: ML on SQL 2016/10/26 Hadoop Summit '16, Tokyo 7 Classification with Mahout CREATE TABLE lr_model AS SELECT feature, -- reducers perform model averaging in parallel avg(weight) as weight FROM ( SELECT logress(features,label,..) as (feature,weight) FROM train ) t -- map-only task GROUP BY feature; -- shuffled to reducers ✓ Machine Learning made easy for SQL developers ✓ Interactive and Stable APIs w/ SQL abstraction This SQL query automatically runs in parallel on Hadoop
  • 8. Hadoop HDFS MapReduce (MRv1) Hivemall Apache YARN Apache Tez DAG processing Machine Learning Query Processing Parallel Data Processing Framework Resource Management Distributed File System Cloud Storage SparkSQL Apache Spark MESOS Hive Pig MLlib Hivemall’s Technology Stack 2016/10/26 Hadoop Summit '16, Tokyo 8 Amazon S3
  • 9. List of supported Algorithms Classification ✓ Perceptron ✓ Passive Aggressive (PA, PA1, PA2) ✓ Confidence Weighted (CW) ✓ Adaptive Regularization of Weight Vectors (AROW) ✓ Soft Confidence Weighted (SCW) ✓ AdaGrad+RDA ✓ Factorization Machines ✓ RandomForest Classification 2016/10/26 Hadoop Summit '16, Tokyo 9 Regression ✓Logistic Regression (SGD) ✓AdaGrad (logistic loss) ✓AdaDELTA (logistic loss) ✓PA Regression ✓AROW Regression ✓Factorization Machines ✓RandomForest Regression SCW is a good first choice Try RandomForest if SCW does not work Logistic regression is good for getting a probability of a positive class Factorization Machines is good where features are sparse and categorical ones
  • 10. List of Algorithms for Recommendation 2016/10/26 Hadoop Summit '16, Tokyo 10 K-Nearest Neighbor ✓ Minhash and b-Bit Minhash (LSH variant) ✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular) Matrix Completion ✓ Matrix Factorization ✓ Factorization Machines (regression) each_top_k function of Hivemall is useful for recommending top-k items
  • 11. 2016/10/26 Hadoop Summit '16, Tokyo 11 student class score 1 b 70 2 a 80 3 a 90 4 b 50 5 a 70 6 b 60 Top-k query processing student class score 3 a 90 2 a 80 1 b 70 6 b 60 List top-2 students for each class
  • 12. 2016/10/26 Hadoop Summit '16, Tokyo 12 student class score 1 b 70 2 a 80 3 a 90 4 b 50 5 a 70 6 b 60 Top-k query processing List top-2 students for each class SELECT * FROM ( SELECT *, rank() over (partition by class order by score desc) as rank FROM table ) t WHERE rank <= 2
  • 13. 2016/10/26 Hadoop Summit '16, Tokyo 13 student class score 1 b 70 2 a 80 3 a 90 4 b 50 5 a 70 6 b 60 Top-k query processing List top-2 students for each class SELECT each_top_k( 2, class, score, class, student ) as (rank, score, class, student) FROM ( SELECT * FROM table DISTRIBUTE BY class SORT BY class ) t
  • 14. 2016/10/26 Hadoop Summit '16, Tokyo 14 Top-k query processing by RANK OVER() partition by class Node 1 Sort by class, score rank over() rank >= 2
  • 15. 2016/10/26 Hadoop Summit '16, Tokyo 15 Top-k query processing by EACH_TOP_K distributed by class Node 1 Sort by class each_top_k OUTPUT only K items
  • 16. 2016/10/26 Hadoop Summit '16, Tokyo 16 Comparison between RANK and EACH_TOP_K distributed by class Sort by class each_top_k Sort by class, score rank over() rank >= 2 SORTING IS HEAVY NEED TO PROCESS ALL OUTPUT only K items Each_top_k is very efficient where the number of class is large Bounded Priority Queue is utilized
  • 17. Performance reported by TD customer 2016/10/26 Hadoop Summit '16, Tokyo 17 •1,000 students in each class •20 million classes RANK over() query does not finishes in 24 hours  EACH_TOP_K finishes in 2 hours  Refer for detail https://speakerdeck.com/kaky0922/hivemall-meetup-20160908
  • 18. Other Supported Algorithms 2016/10/26 Hadoop Summit '16, Tokyo 18 Anomaly Detection ✓ Local Outlier Factor (LoF) Feature Engineering ✓Feature Hashing ✓Feature Scaling (normalization, z-score) ✓ TF-IDF vectorizer ✓ Polynomial Expansion (Feature Pairing) ✓ Amplifier NLP ✓Basic Englist text Tokenizer ✓Japanese Tokenizer (Kuromoji)
  • 19. Industry use cases of Hivemall CTR prediction of Ad click logs • Algorithm: Logistic regression • Freakout Inc., Smartnews, and more Gender prediction of Ad click logs • Algorithm: Classification • Scaleout Inc. 2016/10/26 Hadoop Summit '16, Tokyo 19 http://www.slideshare.net/eventdotsjp/hivemall
  • 20. Industry use cases of Hivemall Item/User recommendation • Algorithm: Recommendation • Wish.com, GMO pepabo 2016/10/26 Hadoop Summit '16, Tokyo 20 Problem: Recommendation using hot-item is hard in hand-crafted product market because each creator sells few single items (will soon become out-of-stock) minne.com
  • 21. Industry use cases of Hivemall Value prediction of Real estates • Algorithm: Regression • Livesense 2016/10/26 Hadoop Summit '16, Tokyo 21
  • 22. Industry use cases of Hivemall User score calculation • Algrorithm: Regression • Klout 2016/10/26 Hadoop Summit '16, Tokyo 22 bit.ly/klout-hivemall Influencer marketing klout.com
  • 23. 2016/10/26 Hadoop Summit '16, Tokyo 23
  • 24. Efficient algorithm for finding change point and outliers from timeseries data 2016/10/26 Hadoop Summit '16, Tokyo 24 J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE transactions on Knowledge and Data Engineering, pp.482-492, 2006. Anomaly/Change-point Detection by ChangeFinder
  • 25. Efficient algorithm for finding change point and outliers from timeseries data 2016/10/26 Hadoop Summit '16, Tokyo 25 Anomaly/Change-point Detection by ChangeFinder J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE transactions on Knowledge and Data Engineering, pp.482-492, 2006.
  • 26. 2016/10/26 Hadoop Summit '16, Tokyo 26 • T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005T. • T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007. Change-point detection by Singular Spectrum Transformation
  • 27. 2016/10/26 Hadoop Summit '16, Tokyo 27 Evaluation Metrics
  • 28. 2016/10/26 Hadoop Summit '16, Tokyo 28 Feature Engineering – Feature Binning Maps quantitative variables to fixed number of bins based on quantiles/distribution Map Ages into 3 bins
  • 29. 2016/10/26 Hadoop Summit '16, Tokyo 29 Feature Selection – Signal Noise Ratio
  • 30. 2016/10/26 Hadoop Summit '16, Tokyo 30 Feature Selection – Chi-Square
  • 31. 2016/10/26 Hadoop Summit '16, Tokyo 31 Feature Transformation – Onehot encoding Maps a categorical variable to a unique number starting from 1
  • 32.  Spark 2.0 support  XGBoost Integration  Field-aware Factorization Machines  Generalized Linear Model • Optimizer framework including ADAM • L1/L2 regularization 2016/10/26 Hadoop Summit '16, Tokyo 32 Other new features to come
  • 33. Copyright©2016 NTT corp. All Rights Reserved. Hivemall on Takeshi Yamamuro @ NTT Hadoop Summit 2016 Tokyo, Japan
  • 34. 34Copyright©2016 NTT corp. All Rights Reserved. Who am I?
  • 35. 35Copyright©2016 NTT corp. All Rights Reserved. What’s Spark? • 1. Unified Engine • support end-to-end APs, e.g., MLlib and Streaming • 2. High-level APIs • easy-to-use, rich optimization • 3. Integrate broadly • storages, libraries, ...
  • 36. 36Copyright©2016 NTT corp. All Rights Reserved. • Hivemall wrapper for Spark • Wrapper implementations for DataFrame/SQL • + some utilities for easy-to-use in Spark • The wrapper makes you... • run most of Hivemall functions in Spark • try examples easily in your laptop • improve some function performance in Spark What’s Hivemall on Spark?
  • 37. 37Copyright©2016 NTT corp. All Rights Reserved. • Hivemall already has many fascinating ML algorithms and useful utilities • High barriers to add newer algorithms in MLlib Why’s Hivemall on Spark? https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
  • 38. 38Copyright©2016 NTT corp. All Rights Reserved. • Most of Hivemall functions supported in Spark v1.6 and v2.0 Current Status - For Spark v2.0 $ git clone https://github.com/myui/hivemall $ cd hivemall $ mvn package –Pspark-2.0 –DskipTests $ ls target/*spark* target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar ...
  • 39. 39Copyright©2016 NTT corp. All Rights Reserved. • Most of Hivemall functions supported in Spark v1.6 and v2.0 Current Status - For Spark v2.0 $ git clone https://github.com/myui/hivemall $ cd hivemall $ mvn package –Pspark-2.0 –DskipTests $ ls target/*spark* target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar ... -Pspark-1.6 for Spark v1.6
  • 40. 40Copyright©2016 NTT corp. All Rights Reserved. • 1. Download a Spark binary • 2. Fetch training and test data • 3. Load these data in Spark • 4. Build a model • 5. Do predictions Running an Example
  • 41. 41Copyright©2016 NTT corp. All Rights Reserved. 1. Download a Spark binary Running an Example http://spark.apache.org/downloads.html
  • 42. 42Copyright©2016 NTT corp. All Rights Reserved. • E2006 tfidf regression dataset • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/dat asets/regression.html#E2006-tfidf 2. Fetch training and test data Running an Example $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ regression/E2006.train.bz2 $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ regression/E2006.test.bz2
  • 43. 43Copyright©2016 NTT corp. All Rights Reserved. 3. Load data in Spark Running an Example $ <SPARK_HOME>/bin/spark-shell -conf spark.jars=hivemall-spark-2.0_2.11-XXX-with-dependencies.jar // Create DataFrame from the bzip’d libsvm-formatted file scala> val trainDf = sqlContext.sparkSession.read.format("libsvm”) .load(“E2006.train.bz2") scala> trainDf.printSchema root |-- label: double (nullable = true) |-- features: vector (nullable = true)
  • 44. 44Copyright©2016 NTT corp. All Rights Reserved. 3. Load data in Spark Running an Example 0.000357499151147113 6066:0.0007932706219604 8 6069:0.000311377727123504 6070:0.0003067549 34580457 6071:0.000276992485786437 6072:0.000 39663531098024 6074:0.00039663531098024 6075 :0.00032548335… trainDf Partition1 Partition2 Partition3 PartitionN … … … Load in parallel because bzip2 is splittable
  • 45. 45Copyright©2016 NTT corp. All Rights Reserved. 4. Build a model - DataFrame Running an Example scala> paste: val modelDf = trainDf.train_logregr($"features", $"label") .groupBy("feature”) .agg("weight" -> "avg”)
  • 46. 46Copyright©2016 NTT corp. All Rights Reserved. 4. Build a model - SQL Running an Example scala> trainDf.createOrReplaceTempView("TrainTable") scala> paste: val modelDf = sql(""" | SELECT feature, AVG(weight) AS weight | FROM ( | SELECT train_logregr(features, label) | AS (feature, weight) | FROM TrainTable | ) | GROUP BY feature """.stripMargin)
  • 47. 47Copyright©2016 NTT corp. All Rights Reserved. 5. Do predictions - DataFrame Running an Example scala> paste: val df = testDf.select(rowid(), $"features") .explode_vector($"features") .cache # Do predictions df.join(modelDf, df("feature") === model("feature"), "LEFT_OUTER") .groupBy("rowid") .avg(sigmoid(sum($"weight" * $"value")))
  • 48. 48Copyright©2016 NTT corp. All Rights Reserved. 5. Do predictions - SQL Running an Example scala> modelDf.createOrReplaceTempView(”ModelTable") scala> df.createOrReplaceTempView(”TestTable”) scala> paste: sql(""" | SELECT rowid, sigmoid(value * weight) AS predicted | FROM TrainTable t | LEFT OUTER JOIN ModelTable m | ON t.feature = m.feature | GROUP BY rowid """.stripMargin)
  • 49. 49Copyright©2016 NTT corp. All Rights Reserved. • Spark has overheads to call Hive UD*Fs • Hivemall heavily depends on them • ex.1) Compute a sigmoid function Improve Some Functions in Spark scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d)) scala> val sparkUdf = functions.udf(sigmoidFunc) scala> df.select(sparkUdf($“value”))
  • 50. 50Copyright©2016 NTT corp. All Rights Reserved. • Spark has overheads to call Hive UD*Fs • Hivemall heavily depends on them • ex.1) Compute a sigmoid function Improve Some Functions in Spark scala> val hiveUdf = HivemallOps.sigmoid scala> df.select(hiveUdf($“value”))
  • 51. 51Copyright©2016 NTT corp. All Rights Reserved. • Spark has overheads to call Hive UD*Fs • Hivemall heavily depends on them • ex.1) Compute a sigmoid function Improve Some Functions in Spark
  • 52. 52Copyright©2016 NTT corp. All Rights Reserved. • Spark has overheads to call Hive UD*Fs • Hivemall heavily depends on them • ex.2) Compute top-k for each key group Improve Some Functions in Spark scala> paste: df.withColumn( “rank”, rank().over(Window.partitionBy($"key").orderBy($"score".desc) ) .where($"rank" <= topK)
  • 53. 53Copyright©2016 NTT corp. All Rights Reserved. • Spark has overheads to call Hive UD*Fs • Hivemall heavily depends on them • ex.2) Compute top-k for each key group • Fixed the overhead issue for each_top_k • See pr#353: “Implement EachTopK as a generator expression“ in Spark Improve Some Functions in Spark scala> df.each_top_k(topK, “key”, “score”, “value”)
  • 54. 54Copyright©2016 NTT corp. All Rights Reserved. • Spark has overheads to call Hive UD*Fs • Hivemall heavily depends on them • ex.2) Compute top-k for each key group Improve Some Functions in Spark ~4 times faster than rank!!
  • 55. 55Copyright©2016 NTT corp. All Rights Reserved. • supports under development • fast implementation of the gradient tree boosting • widely used in Kaggle competitions • This integration will make you... • load built models and predict in parallel • build multiple models in parallel for cross-validation 3rd–Party Library Integration
  • 56. Conclusion and Takeaway Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs 2016/10/26 Hadoop Summit '16, Tokyo 56 We welcome your contributions to Apache Hivemall 
  • 57. 2016/10/26 Hadoop Summit '16, Tokyo 57 Any feature request or questions?