• Like
  • Save
Co-occurrence Based Recommendations with Mahout, Scala and Spark
Upcoming SlideShare
Loading in...5
×
 

Co-occurrence Based Recommendations with Mahout, Scala and Spark

on

  • 450 views

 

Statistics

Views

Total Views
450
Views on SlideShare
446
Embed Views
4

Actions

Likes
2
Downloads
11
Comments
0

1 Embed 4

http://www.slideee.com 4

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Co-occurrence Based Recommendations with Mahout, Scala and Spark Co-occurrence Based Recommendations with Mahout, Scala and Spark Presentation Transcript

    • Co-occurrence-based recommendations with Mahout, Scala & Spark Sebastian Schelter @sscdotopen BigData Beers 05/29/2014
    • available for free at http://www.mapr.com/practical-machine-learning
    • Cooccurrence Analysis
    • History matrix // real usecase: load from DFS // val A = drmFromHDFS(...) // our toy example val A = drmParallelize(dense( (1, 1, 1, 0), // Alice (1, 0, 1, 0), // Bob (0, 0, 1, 1)), // Charles numPartitions = 2)
    • How often do items co-occur?
    • How often do items co-occur? // compute co-occurrence matrix val C = A.t %*% A
    • Which cooccurences are interesting?
    • Which cooccurences are interesting? // compute some statistics val interactionsPerItem = drmBroadcast(A.colSums) // convert to indicator matrix val I = C.mapBlock() { // compute LLR scores from // cooccurrences and statistics ... // only keep interesting cooccurrences ... } // save indicator matrix I.writeDrm(...);
    • Cooccurrence Analysis prototype available • MAHOUT-1464 provides full-fledged cooccurrence analysis protoype – applies selective downsampling to make computation tractable – support for cross-recommendations in datasets with multiple interaction types, e.g. • “people who watch this video also watch those videos” • “people who enter this search query watch those videos” – code to run this on the Movielens and Epinions datasets • future plan: easy indexing of indicator matrix with Apache Solr to allow for search-as-recommendation deployments – prototype for MR code already existing at https://github.com/pferrel/solr-recommender – integration is in the works
    • Under the covers
    • Underlying systems • currently: runtime based on Apache Spark – fast and expressive cluster computing system – general computation graphs, in-memory primitives, rich API, interactive shell • potentially supported in the future: • Apache Flink (formerly: “Stratosphere”) • H20
    • Runtime & Optimization • Execution is defered, user composes logical operators • Computational actions implicitly trigger optimization (= selection of physical plan) and execution • Optimization factors: size of operands, orientation of operands, partitioning, sharing of computational paths • e. g.: matrix multiplication: – 5 physical operators for drmA %*% drmB – 2 operators for drmA %*% inMemA – 1 operator for drm A %*% x – 1 operator for x %*% drmA val C = A.t %*% A I.writeDrm(path); val inMemV =(U %*% M).collect
    • Optimization Example • Computation of ATA in example • Naïve execution 1st pass: transpose A (requires repartitioning of A) 2nd pass: multiply result with A (expensive, potentially requires repartitioning again) • Logical optimization: rewrite plan to use specialized logical operator for Transpose-Times-Self matrix multiplication val C = A.t %*% A
    • Optimization Example • Computation of ATA in example • Naïve execution 1st pass: transpose A (requires repartitioning of A) 2nd pass: multiply result with A (expensive, potentially requires repartitioning again) • Logical optimization: rewrite plan to use specialized logical operator for Transpose-Times-Self matrix multiplication val C = A.t %*% A Transpose A
    • Optimization Example • Computation of ATA in example • Naïve execution 1st pass: transpose A (requires repartitioning of A) 2nd pass: multiply result with A (expensive, potentially requires repartitioning again) • Logical optimization: rewrite plan to use specialized logical operator for Transpose-Times-Self matrix multiplication val C = A.t %*% A Transpose MatrixMult A A C
    • Optimization Example • Computation of ATA in example • Naïve execution 1st pass: transpose A (requires repartitioning of A) 2nd pass: multiply result with A (expensive, potentially requires repartitioning again) • Logical optimization Optimizer rewrites plan to use specialized logical operator for Transpose-Times-Self matrix multiplication val C = A.t %*% A Transpose MatrixMult A A C Transpose- Times-Self A C
    • Tranpose-Times-Self • Mahout computes ATA via row-outer-product formulation – executes in a single pass over row-partitioned A     m i T ii T aaAA 0
    • Tranpose-Times-Self • Mahout computes ATA via row-outer-product formulation – executes in a single pass over row-partitioned A     m i T ii T aaAA 0 A
    • Tranpose-Times-Self • Mahout computes ATA via row-outer-product formulation – executes in a single pass over row-partitioned A     m i T ii T aaAA 0 x AAT
    • Tranpose-Times-Self • Mahout computes ATA via row-outer-product formulation – executes in a single pass over row-partitioned A     m i T ii T aaAA 0 x = x AAT a1• a1• T
    • Tranpose-Times-Self • Mahout computes ATA via row-outer-product formulation – executes in a single pass over row-partitioned A     m i T ii T aaAA 0 x = x + x AAT a1• a1• T a2• a2• T
    • Tranpose-Times-Self • Mahout computes ATA via row-outer-product formulation – executes in a single pass over row-partitioned A     m i T ii T aaAA 0 x = x + +x x AAT a1• a1• T a2• a2• T a3• a3• T
    • Tranpose-Times-Self • Mahout computes ATA via row-outer-product formulation – executes in a single pass over row-partitioned A     m i T ii T aaAA 0 x = x + + +x x x AAT a1• a1• T a2• a2• T a3• a3• T a4• a4• T
    • Physical operators for Transpose-Times-Self • Two physical operators (concrete implementations) available for Transpose-Times-Self operation – standard operator AtA – operator AtA_slim, specialized implementation for tall & skinny matrices • Optimizer must choose – currently: depends on user-defined threshold for number of columns – ideally: cost based decision, dependent on estimates of intermediate result sizes Transpose- Times-Self A C
    • Physical operators for the distributed computation of ATA
    • Physical operator AtA           1100 0101 0111 A
    • A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111
    • A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111 for 1st partition for 1st partition
    • A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111  0111 1 1        1100 0 0       for 1st partition for 1st partition
    • A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111  0111 1 1        1100 0 0       for 1st partition for 1st partition  0101 0 1      
    • A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111  0111 1 1        1100 0 0       for 1st partition for 1st partition  0101 0 1       for 2nd partition for 2nd partition
    • A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111  0111 1 1        1100 0 0       for 1st partition for 1st partition  0101 0 1        0111 0 1       for 2nd partition  1100 1 1       for 2nd partition
    • A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111  0111 1 1        1100 0 0       for 1st partition for 1st partition  0101 0 1        0111 0 1       for 2nd partition  0101 0 1        1100 1 1       for 2nd partition
    • A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111       0111 0111       0000 0000 for 1st partition for 1st partition       0000 0101       0000 0111 for 2nd partition       0000 0101       1100 1100 for 2nd partition
    • A2  1100 Physical operator AtA           1100 0101 0111 A1 A worker 1 worker 2       0101 0111       0111 0111       0000 0000 for 1st partition for 1st partition       0000 0101       0000 0111 for 2nd partition       0000 0101       1100 1100 for 2nd partition       0111 0212 worker 3       1100 1312 worker 4 ∑ ∑ ATA
    • Physical operator AtA_slim           1100 0101 0111 A
    • A2  1100 Physical operator AtA_slim           1100 0101 0111 A1 A worker 1 worker 2       0101 0111
    • A2 TA2A2  1100                  1 11 000 0000 Physical operator AtA_slim           1100 0101 0111 A1 TA1A1 A worker 1 worker 2       0101 0111                  0 02 011 0212
    • A2 TA2A2  1100                  1 11 000 0000 Physical operator AtA_slim           1100 0101 0111 A1 TA1A1 A C = ATA worker 1 worker 2 A1 TA1 + A2 TA2 driver       0101 0111                  0 02 011 0212               1100 1312 0111 0212
    • Thank you. Questions? Overview of Mahout‘s Scala & Spark Bindings: http://s.apache.org/mahout-spark Tutorial on playing with Mahout‘s Spark shell http://s.apache.org/mahout-spark-shell