SlideShare a Scribd company logo
1 of 40
Download to read offline
Co-occurrence-based recommendations
with Mahout, Scala & Spark
Sebastian Schelter
@sscdotopen
BigData Beers
05/29/2014
available for free at
http://www.mapr.com/practical-machine-learning
Cooccurrence Analysis
History matrix
// real usecase: load from DFS
// val A = drmFromHDFS(...)
// our toy example
val A = drmParallelize(dense(
(1, 1, 1, 0), // Alice
(1, 0, 1, 0), // Bob
(0, 0, 1, 1)), // Charles
numPartitions = 2)
How often do items co-occur?
How often do items co-occur?
// compute co-occurrence matrix
val C = A.t %*% A
Which cooccurences are interesting?
Which cooccurences are interesting?
// compute some statistics
val interactionsPerItem =
drmBroadcast(A.colSums)
// convert to indicator matrix
val I = C.mapBlock() {
// compute LLR scores from
// cooccurrences and statistics
...
// only keep interesting cooccurrences
...
}
// save indicator matrix
I.writeDrm(...);
Cooccurrence Analysis
prototype available
• MAHOUT-1464 provides full-fledged cooccurrence analysis protoype
– applies selective downsampling to make computation tractable
– support for cross-recommendations in datasets with multiple
interaction types, e.g.
• “people who watch this video also watch those videos”
• “people who enter this search query watch those videos”
– code to run this on the Movielens and Epinions datasets
• future plan: easy indexing of indicator matrix with Apache Solr to allow
for search-as-recommendation deployments
– prototype for MR code already existing at https://github.com/pferrel/solr-recommender
– integration is in the works
Under the covers
Underlying systems
• currently: runtime based on Apache Spark
– fast and expressive cluster
computing system
– general computation graphs,
in-memory primitives, rich API,
interactive shell
• potentially supported in the future:
• Apache Flink (formerly: “Stratosphere”)
• H20
Runtime & Optimization
• Execution is defered, user
composes logical operators
• Computational actions implicitly
trigger optimization (= selection
of physical plan) and execution
• Optimization factors: size of operands, orientation of operands,
partitioning, sharing of computational paths
• e. g.: matrix multiplication:
– 5 physical operators for drmA %*% drmB
– 2 operators for drmA %*% inMemA
– 1 operator for drm A %*% x
– 1 operator for x %*% drmA
val C = A.t %*% A
I.writeDrm(path);
val inMemV =(U %*% M).collect
Optimization Example
• Computation of ATA in example
• Naïve execution
1st pass: transpose A
(requires repartitioning of A)
2nd pass: multiply result with A
(expensive, potentially requires
repartitioning again)
• Logical optimization:
rewrite plan to use specialized
logical operator for
Transpose-Times-Self matrix
multiplication
val C = A.t %*% A
Optimization Example
• Computation of ATA in example
• Naïve execution
1st pass: transpose A
(requires repartitioning of A)
2nd pass: multiply result with A
(expensive, potentially requires
repartitioning again)
• Logical optimization:
rewrite plan to use specialized
logical operator for
Transpose-Times-Self matrix
multiplication
val C = A.t %*% A
Transpose
A
Optimization Example
• Computation of ATA in example
• Naïve execution
1st pass: transpose A
(requires repartitioning of A)
2nd pass: multiply result with A
(expensive, potentially requires
repartitioning again)
• Logical optimization:
rewrite plan to use specialized
logical operator for
Transpose-Times-Self matrix
multiplication
val C = A.t %*% A
Transpose
MatrixMult
A A
C
Optimization Example
• Computation of ATA in example
• Naïve execution
1st pass: transpose A
(requires repartitioning of A)
2nd pass: multiply result with A
(expensive, potentially requires
repartitioning again)
• Logical optimization
Optimizer rewrites plan to use
specialized logical operator for
Transpose-Times-Self matrix
multiplication
val C = A.t %*% A
Transpose
MatrixMult
A A
C
Transpose-
Times-Self
A
C
Tranpose-Times-Self
• Mahout computes ATA via row-outer-product formulation
– executes in a single pass over row-partitioned A




m
i
T
ii
T
aaAA
0
Tranpose-Times-Self
• Mahout computes ATA via row-outer-product formulation
– executes in a single pass over row-partitioned A




m
i
T
ii
T
aaAA
0
A
Tranpose-Times-Self
• Mahout computes ATA via row-outer-product formulation
– executes in a single pass over row-partitioned A




m
i
T
ii
T
aaAA
0
x
AAT
Tranpose-Times-Self
• Mahout computes ATA via row-outer-product formulation
– executes in a single pass over row-partitioned A




m
i
T
ii
T
aaAA
0
x = x
AAT a1• a1•
T
Tranpose-Times-Self
• Mahout computes ATA via row-outer-product formulation
– executes in a single pass over row-partitioned A




m
i
T
ii
T
aaAA
0
x = x + x
AAT a1• a1•
T
a2• a2•
T
Tranpose-Times-Self
• Mahout computes ATA via row-outer-product formulation
– executes in a single pass over row-partitioned A




m
i
T
ii
T
aaAA
0
x = x + +x x
AAT a1• a1•
T
a2• a2•
T
a3• a3•
T
Tranpose-Times-Self
• Mahout computes ATA via row-outer-product formulation
– executes in a single pass over row-partitioned A




m
i
T
ii
T
aaAA
0
x = x + + +x x x
AAT a1• a1•
T
a2• a2•
T
a3• a3•
T
a4• a4•
T
Physical operators for
Transpose-Times-Self
• Two physical operators (concrete implementations)
available for Transpose-Times-Self operation
– standard operator AtA
– operator AtA_slim, specialized
implementation for tall & skinny
matrices
• Optimizer must choose
– currently: depends on user-defined
threshold for number of columns
– ideally: cost based decision, dependent on
estimates of intermediate result sizes
Transpose-
Times-Self
A
C
Physical operators for the
distributed computation of ATA
Physical operator AtA










1100
0101
0111
A
A2
 1100
Physical operator AtA










1100
0101
0111
A1
A
worker 1
worker 2






0101
0111
A2
 1100
Physical operator AtA










1100
0101
0111
A1
A
worker 1
worker 2






0101
0111
for 1st partition
for 1st partition
A2
 1100
Physical operator AtA










1100
0101
0111
A1
A
worker 1
worker 2






0101
0111
 0111
1
1






 1100
0
0






for 1st partition
for 1st partition
A2
 1100
Physical operator AtA










1100
0101
0111
A1
A
worker 1
worker 2






0101
0111
 0111
1
1






 1100
0
0






for 1st partition
for 1st partition
 0101
0
1






A2
 1100
Physical operator AtA










1100
0101
0111
A1
A
worker 1
worker 2






0101
0111
 0111
1
1






 1100
0
0






for 1st partition
for 1st partition
 0101
0
1






for 2nd partition
for 2nd partition
A2
 1100
Physical operator AtA










1100
0101
0111
A1
A
worker 1
worker 2






0101
0111
 0111
1
1






 1100
0
0






for 1st partition
for 1st partition
 0101
0
1






 0111
0
1






for 2nd partition
 1100
1
1






for 2nd partition
A2
 1100
Physical operator AtA










1100
0101
0111
A1
A
worker 1
worker 2






0101
0111
 0111
1
1






 1100
0
0






for 1st partition
for 1st partition
 0101
0
1






 0111
0
1






for 2nd partition
 0101
0
1






 1100
1
1






for 2nd partition
A2
 1100
Physical operator AtA










1100
0101
0111
A1
A
worker 1
worker 2






0101
0111






0111
0111






0000
0000
for 1st partition
for 1st partition






0000
0101






0000
0111
for 2nd partition






0000
0101






1100
1100
for 2nd partition
A2
 1100
Physical operator AtA










1100
0101
0111
A1
A
worker 1
worker 2






0101
0111






0111
0111






0000
0000
for 1st partition
for 1st partition






0000
0101






0000
0111
for 2nd partition






0000
0101






1100
1100
for 2nd partition






0111
0212
worker 3






1100
1312
worker 4
∑
∑
ATA
Physical operator AtA_slim










1100
0101
0111
A
A2
 1100
Physical operator AtA_slim










1100
0101
0111
A1
A
worker 1
worker 2






0101
0111
A2
TA2A2
 1100

















1
11
000
0000
Physical operator AtA_slim










1100
0101
0111
A1
TA1A1
A
worker 1
worker 2






0101
0111

















0
02
011
0212
A2
TA2A2
 1100

















1
11
000
0000
Physical operator AtA_slim










1100
0101
0111
A1
TA1A1
A C = ATA
worker 1
worker 2
A1
TA1 + A2
TA2
driver






0101
0111

















0
02
011
0212














1100
1312
0111
0212
Thank you. Questions?
Overview of Mahout‘s Scala & Spark Bindings:
http://s.apache.org/mahout-spark
Tutorial on playing with Mahout‘s Spark shell
http://s.apache.org/mahout-spark-shell

More Related Content

What's hot

Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Flink Forward
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
gothicane
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
rhatr
 

What's hot (20)

Unsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkUnsupervised Learning with Apache Spark
Unsupervised Learning with Apache Spark
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
 
A Multidimensional Distributed Array Abstraction for PGAS (HPCC'16)
A Multidimensional Distributed Array Abstraction for PGAS (HPCC'16)A Multidimensional Distributed Array Abstraction for PGAS (HPCC'16)
A Multidimensional Distributed Array Abstraction for PGAS (HPCC'16)
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
Map/Reduce intro
Map/Reduce introMap/Reduce intro
Map/Reduce intro
 
Generalized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRGeneralized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkR
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Scaling out logistic regression with Spark
Scaling out logistic regression with SparkScaling out logistic regression with Spark
Scaling out logistic regression with Spark
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
 
Parallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear ModelsParallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear Models
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!
 
Large-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PC
 

Similar to Co-occurrence Based Recommendations with Mahout, Scala and Spark

Similar to Co-occurrence Based Recommendations with Mahout, Scala and Spark (20)

MatlabIntro (1).ppt
MatlabIntro (1).pptMatlabIntro (1).ppt
MatlabIntro (1).ppt
 
Seminar on MATLAB
Seminar on MATLABSeminar on MATLAB
Seminar on MATLAB
 
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function
 
Case Study for Plant Layout :: A modern analysis
Case Study for Plant Layout :: A modern analysisCase Study for Plant Layout :: A modern analysis
Case Study for Plant Layout :: A modern analysis
 
Logistic Regression using Mahout
Logistic Regression using MahoutLogistic Regression using Mahout
Logistic Regression using Mahout
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
 
mc_simulation documentation
mc_simulation documentationmc_simulation documentation
mc_simulation documentation
 
Efficient Model Partitioning for Distributed Model Transformations
Efficient Model Partitioning for Distributed Model TransformationsEfficient Model Partitioning for Distributed Model Transformations
Efficient Model Partitioning for Distributed Model Transformations
 
Java 8
Java 8Java 8
Java 8
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parameters
 
Mat lab workshop
Mat lab workshopMat lab workshop
Mat lab workshop
 
Basic concept of MATLAB.ppt
Basic concept of MATLAB.pptBasic concept of MATLAB.ppt
Basic concept of MATLAB.ppt
 
Matlab Basic Tutorial
Matlab Basic TutorialMatlab Basic Tutorial
Matlab Basic Tutorial
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)
 
Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn PipelinesRevolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
 
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
 

More from sscdotopen

New Directions in Mahout's Recommenders
New Directions in Mahout's RecommendersNew Directions in Mahout's Recommenders
New Directions in Mahout's Recommenders
sscdotopen
 
Introduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache MahoutIntroduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache Mahout
sscdotopen
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduce
sscdotopen
 
Latent factor models for Collaborative Filtering
Latent factor models for Collaborative FilteringLatent factor models for Collaborative Filtering
Latent factor models for Collaborative Filtering
sscdotopen
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
sscdotopen
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
sscdotopen
 

More from sscdotopen (8)

Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
 
New Directions in Mahout's Recommenders
New Directions in Mahout's RecommendersNew Directions in Mahout's Recommenders
New Directions in Mahout's Recommenders
 
Introduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache MahoutIntroduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache Mahout
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduce
 
Latent factor models for Collaborative Filtering
Latent factor models for Collaborative FilteringLatent factor models for Collaborative Filtering
Latent factor models for Collaborative Filtering
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
mahout-cf
mahout-cfmahout-cf
mahout-cf
 

Recently uploaded

Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 

Recently uploaded (20)

DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 

Co-occurrence Based Recommendations with Mahout, Scala and Spark

  • 1. Co-occurrence-based recommendations with Mahout, Scala & Spark Sebastian Schelter @sscdotopen BigData Beers 05/29/2014
  • 2. available for free at http://www.mapr.com/practical-machine-learning
  • 4. History matrix // real usecase: load from DFS // val A = drmFromHDFS(...) // our toy example val A = drmParallelize(dense( (1, 1, 1, 0), // Alice (1, 0, 1, 0), // Bob (0, 0, 1, 1)), // Charles numPartitions = 2)
  • 5. How often do items co-occur?
  • 6. How often do items co-occur? // compute co-occurrence matrix val C = A.t %*% A
  • 7. Which cooccurences are interesting?
  • 8. Which cooccurences are interesting? // compute some statistics val interactionsPerItem = drmBroadcast(A.colSums) // convert to indicator matrix val I = C.mapBlock() { // compute LLR scores from // cooccurrences and statistics ... // only keep interesting cooccurrences ... } // save indicator matrix I.writeDrm(...);
  • 9. Cooccurrence Analysis prototype available • MAHOUT-1464 provides full-fledged cooccurrence analysis protoype – applies selective downsampling to make computation tractable – support for cross-recommendations in datasets with multiple interaction types, e.g. • “people who watch this video also watch those videos” • “people who enter this search query watch those videos” – code to run this on the Movielens and Epinions datasets • future plan: easy indexing of indicator matrix with Apache Solr to allow for search-as-recommendation deployments – prototype for MR code already existing at https://github.com/pferrel/solr-recommender – integration is in the works
  • 11. Underlying systems • currently: runtime based on Apache Spark – fast and expressive cluster computing system – general computation graphs, in-memory primitives, rich API, interactive shell • potentially supported in the future: • Apache Flink (formerly: “Stratosphere”) • H20
  • 12. Runtime & Optimization • Execution is defered, user composes logical operators • Computational actions implicitly trigger optimization (= selection of physical plan) and execution • Optimization factors: size of operands, orientation of operands, partitioning, sharing of computational paths • e. g.: matrix multiplication: – 5 physical operators for drmA %*% drmB – 2 operators for drmA %*% inMemA – 1 operator for drm A %*% x – 1 operator for x %*% drmA val C = A.t %*% A I.writeDrm(path); val inMemV =(U %*% M).collect
  • 13. Optimization Example • Computation of ATA in example • Naïve execution 1st pass: transpose A (requires repartitioning of A) 2nd pass: multiply result with A (expensive, potentially requires repartitioning again) • Logical optimization: rewrite plan to use specialized logical operator for Transpose-Times-Self matrix multiplication val C = A.t %*% A
  • 14. Optimization Example • Computation of ATA in example • Naïve execution 1st pass: transpose A (requires repartitioning of A) 2nd pass: multiply result with A (expensive, potentially requires repartitioning again) • Logical optimization: rewrite plan to use specialized logical operator for Transpose-Times-Self matrix multiplication val C = A.t %*% A Transpose A
  • 15. Optimization Example • Computation of ATA in example • Naïve execution 1st pass: transpose A (requires repartitioning of A) 2nd pass: multiply result with A (expensive, potentially requires repartitioning again) • Logical optimization: rewrite plan to use specialized logical operator for Transpose-Times-Self matrix multiplication val C = A.t %*% A Transpose MatrixMult A A C
  • 16. Optimization Example • Computation of ATA in example • Naïve execution 1st pass: transpose A (requires repartitioning of A) 2nd pass: multiply result with A (expensive, potentially requires repartitioning again) • Logical optimization Optimizer rewrites plan to use specialized logical operator for Transpose-Times-Self matrix multiplication val C = A.t %*% A Transpose MatrixMult A A C Transpose- Times-Self A C
  • 17. Tranpose-Times-Self • Mahout computes ATA via row-outer-product formulation – executes in a single pass over row-partitioned A     m i T ii T aaAA 0
  • 18. Tranpose-Times-Self • Mahout computes ATA via row-outer-product formulation – executes in a single pass over row-partitioned A     m i T ii T aaAA 0 A
  • 19. Tranpose-Times-Self • Mahout computes ATA via row-outer-product formulation – executes in a single pass over row-partitioned A     m i T ii T aaAA 0 x AAT
  • 20. Tranpose-Times-Self • Mahout computes ATA via row-outer-product formulation – executes in a single pass over row-partitioned A     m i T ii T aaAA 0 x = x AAT a1• a1• T
  • 21. Tranpose-Times-Self • Mahout computes ATA via row-outer-product formulation – executes in a single pass over row-partitioned A     m i T ii T aaAA 0 x = x + x AAT a1• a1• T a2• a2• T
  • 22. Tranpose-Times-Self • Mahout computes ATA via row-outer-product formulation – executes in a single pass over row-partitioned A     m i T ii T aaAA 0 x = x + +x x AAT a1• a1• T a2• a2• T a3• a3• T
  • 23. Tranpose-Times-Self • Mahout computes ATA via row-outer-product formulation – executes in a single pass over row-partitioned A     m i T ii T aaAA 0 x = x + + +x x x AAT a1• a1• T a2• a2• T a3• a3• T a4• a4• T
  • 24. Physical operators for Transpose-Times-Self • Two physical operators (concrete implementations) available for Transpose-Times-Self operation – standard operator AtA – operator AtA_slim, specialized implementation for tall & skinny matrices • Optimizer must choose – currently: depends on user-defined threshold for number of columns – ideally: cost based decision, dependent on estimates of intermediate result sizes Transpose- Times-Self A C
  • 25. Physical operators for the distributed computation of ATA
  • 27. A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111
  • 28. A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111 for 1st partition for 1st partition
  • 29. A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111  0111 1 1        1100 0 0       for 1st partition for 1st partition
  • 30. A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111  0111 1 1        1100 0 0       for 1st partition for 1st partition  0101 0 1      
  • 31. A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111  0111 1 1        1100 0 0       for 1st partition for 1st partition  0101 0 1       for 2nd partition for 2nd partition
  • 32. A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111  0111 1 1        1100 0 0       for 1st partition for 1st partition  0101 0 1        0111 0 1       for 2nd partition  1100 1 1       for 2nd partition
  • 33. A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111  0111 1 1        1100 0 0       for 1st partition for 1st partition  0101 0 1        0111 0 1       for 2nd partition  0101 0 1        1100 1 1       for 2nd partition
  • 34. A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111       0111 0111       0000 0000 for 1st partition for 1st partition       0000 0101       0000 0111 for 2nd partition       0000 0101       1100 1100 for 2nd partition
  • 35. A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111       0111 0111       0000 0000 for 1st partition for 1st partition       0000 0101       0000 0111 for 2nd partition       0000 0101       1100 1100 for 2nd partition       0111 0212 worker 3       1100 1312 worker 4 ∑ ∑ ATA
  • 37. A2  1100 Physical operator AtA_slim           1100 0101 0111 A1 A worker 1 worker 2       0101 0111
  • 38. A2 TA2A2  1100                  1 11 000 0000 Physical operator AtA_slim           1100 0101 0111 A1 TA1A1 A worker 1 worker 2       0101 0111                  0 02 011 0212
  • 39. A2 TA2A2  1100                  1 11 000 0000 Physical operator AtA_slim           1100 0101 0111 A1 TA1A1 A C = ATA worker 1 worker 2 A1 TA1 + A2 TA2 driver       0101 0111                  0 02 011 0212               1100 1312 0111 0212
  • 40. Thank you. Questions? Overview of Mahout‘s Scala & Spark Bindings: http://s.apache.org/mahout-spark Tutorial on playing with Mahout‘s Spark shell http://s.apache.org/mahout-spark-shell