1©MapR Technologies 2013- Confidential
Introduction to Mahout
And How To Build a Recommender

Me, Us
▪ Ted Dunning, Chief Application Architect, MapR
  – Committer and PMC member: Mahout, ZooKeeper, Drill
  – Bought the beer at the first HUG
▪ MapR
  – Distributes more open source components for Hadoop
  – Adds major technology for performance, HA, industry-standard APIs
▪ Tonight
  – Hash tag: #tchug
  – See also: @ApacheMahout, @ApacheDrill, @ted_dunning and @mapR

Sidebar on Drill
▪ Apache Drill
  – SQL on Hadoop (and other things)
  – Intended to solve the problems of 1-5 years from now, not the problems of 1-10 years ago
  – Multiple levels of API supported
    • SQL-2003
    • Logical plan language (DAG in JSON)
    • Physical plan language (DAG with push-down, exchange markers)
    • Execution plan language (many DAGs)
▪ Current state
  – SQL 2003 support in place
  – Logical plan interpreter useful for testing
  – Value vectors near completion
  – High performance RPC working

More on Drill
▪ Just completed OSCON workshop
▪ Workshop materials available shortly
  – Extracted technology demonstrators
  – Sample queries
▪ Send me email or tweet for more info

What’s Up?
▪ What is Mahout?
  – Math library
  – Clustering, classifiers, other stuff
▪ Recommendation
  – Generalities
  – Algorithm specifics
  – System design
  – Important things never mentioned
▪ Final thoughts

What is Mahout?
▪ “Scalable machine learning”
  – not just Hadoop-oriented machine learning
  – not entirely, that is. Just mostly.
▪ Components
  – math library
  – clustering
  – classification
  – decompositions
  – recommendations

Mahout Math
▪ Goals are
  – basic linear algebra,
  – and statistical sampling,
  – and good clustering,
  – decent speed,
  – extensibility,
  – especially for sparse data
▪ But not
  – totally badass speed
  – comprehensive set of algorithms
  – optimization, root finders, quadrature

Matrices and Vectors
▪ At the core:
  – DenseVector, RandomAccessSparseVector
  – DenseMatrix, SparseRowMatrix
▪ Highly composable API
    m.viewDiagonal().assign(v)
▪ Important ideas:
  – view*, assign and aggregate
  – iteration

Assign? View?
▪ Why assign?
  – Copying is the major cost for naïve matrix packages
  – In-place operations critical to reasonable performance
  – Many kinds of updates required, so functional style very helpful
▪ Why view?
  – In-place operations often required for blocks, rows, columns or diagonals
  – With views, we need #assign + #views methods
  – Without views, we need #assign × #views methods
▪ Synergies
  – With both views and assign, many loops become a single line
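The "#assign + #views" point can be sketched in plain Java. This is not Mahout's API: `Views`, `VecView`, `row` and `diagonal` are hypothetical names invented for the illustration. Each view only knows how to reach its elements, while `assign` and `zSum` are written once and work on every view.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical mini-API (not Mahout's) showing views combined with in-place assign. */
final class Views {
    private Views() {}

    /** A mutable window onto storage owned by someone else. */
    interface VecView {
        int size();
        double get(int i);
        void set(int i, double value);

        /** Updates every element in place; returns this so calls can be chained. */
        default VecView assign(DoubleUnaryOperator f) {
            for (int i = 0; i < size(); i++) {
                set(i, f.applyAsDouble(get(i)));
            }
            return this;
        }

        /** Sum of all elements seen through the view. */
        default double zSum() {
            double sum = 0;
            for (int i = 0; i < size(); i++) {
                sum += get(i);
            }
            return sum;
        }
    }

    /** View of row r; writes go straight through to m. */
    static VecView row(double[][] m, int r) {
        return new VecView() {
            public int size() { return m[r].length; }
            public double get(int i) { return m[r][i]; }
            public void set(int i, double v) { m[r][i] = v; }
        };
    }

    /** View of the main diagonal; writes go straight through to m. */
    static VecView diagonal(double[][] m) {
        return new VecView() {
            public int size() { return Math.min(m.length, m[0].length); }
            public double get(int i) { return m[i][i]; }
            public void set(int i, double v) { m[i][i] = v; }
        };
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2}, {3, 4}};
        System.out.println(diagonal(m).zSum());   // trace of m
        diagonal(m).assign(x -> 0.0);             // zero the diagonal, no copy made
        row(m, 1).assign(x -> 2 * x);             // double row 1 in place
        System.out.println(Arrays.deepToString(m));
    }
}
```

Adding a new kind of view (a column, a block) costs one small class and automatically gets every existing operation, which is the additive rather than multiplicative growth the slide describes.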

Assign
▪ Matrices
    Matrix assign(double value);
    Matrix assign(double[][] values);
    Matrix assign(Matrix other);
    Matrix assign(DoubleFunction f);
    Matrix assign(Matrix other, DoubleDoubleFunction f);
▪ Vectors
    Vector assign(double value);
    Vector assign(double[] values);
    Vector assign(Vector other);
    Vector assign(DoubleFunction f);
    Vector assign(Vector other, DoubleDoubleFunction f);
    Vector assign(DoubleDoubleFunction f, double y);

Views
▪ Matrices
    Matrix viewPart(int[] offset, int[] size);
    Matrix viewPart(int row, int rlen, int col, int clen);
    Vector viewRow(int row);
    Vector viewColumn(int column);
    Vector viewDiagonal();
▪ Vectors
    Vector viewPart(int offset, int length);

Aggregates
▪ Matrices
    double zSum();
    Vector aggregateRows(VectorFunction f);
    Vector aggregateColumns(VectorFunction f);
    double aggregate(DoubleDoubleFunction combiner,
                     DoubleFunction mapper);
▪ Vectors
    double zSum();
    double aggregate(
        DoubleDoubleFunction reduce, DoubleFunction map);
    double aggregate(Vector other,
                     DoubleDoubleFunction aggregator,
                     DoubleDoubleFunction combiner);

Predefined Functions
▪ Many handy functions
    ABS         LOG2
    ACOS        NEGATE
    ASIN        RINT
    ATAN        SIGN
    CEIL        SIN
    COS         SQRT
    EXP         SQUARE
    FLOOR       SIGMOID
    IDENTITY    SIGMOIDGRADIENT
    INV         TAN
    LOGARITHM

Examples
▪ \(A = \alpha\)
    double alpha; a.assign(alpha);
▪ \(A = \alpha B + \beta\)
    a.assign(b, Functions.chain(
        Functions.plus(beta),
        Functions.times(alpha)));

Sparse Optimizations
▪ DoubleDoubleFunction abstract properties
    public boolean isLikeRightPlus();
    public boolean isLikeLeftMult();
    public boolean isLikeRightMult();
    public boolean isLikeMult();
    public boolean isCommutative();
    public boolean isAssociative();
    public boolean isAssociativeAndCommutative();
    public boolean isDensifying();
▪ And Vector properties
    public boolean isDense();
    public boolean isSequentialAccess();
    public double getLookupCost();
    public double getIteratorAdvanceCost();
    public boolean isAddConstantTime();

More Examples
▪ The trace of a matrix
    m.viewDiagonal().zSum()
▪ Set diagonal to zero
    m.viewDiagonal().assign(0)
▪ Set diagonal to negative of row sums excluding the diagonal
    Vector diag = m.viewDiagonal().assign(0);
    // NEGATE is the unary function; MINUS takes two arguments
    diag.assign(m.rowSums().assign(Functions.NEGATE));

Iteration
▪ Matrices are Iterable in Mahout
    // compute both row and column sums in one pass
    for (MatrixSlice row : m) {
      rSums.set(row.index(), row.zSum());
      cSums.assign(row, Functions.PLUS);
    }
▪ Vectors are densely or sparsely iterable
    // entropy of a probability vector: -sum(p * log(p)) over the non-zero entries
    double entropy = 0;
    for (Vector.Element e : v.nonZeroes()) {
      entropy -= e.get() * Math.log(e.get());
    }

Random Sampling
▪ Samples from some type
    public interface Sampler<T> {
      T sample();
    }
▪ Lots of kinds
    ChineseRestaurant    Missing        Normal
    Empirical            Multinomial    PoissonSampler
    IndianBuffet         MultiNormal    Sampler
▪ Samplers as functions
    public abstract class AbstractSamplerFunction
        extends DoubleFunction
        implements Sampler<Double>

Clustering and Such
▪ Streaming k-means and ball k-means
  – streaming reduces very large data to a cluster sketch
  – ball k-means is a high quality k-means implementation
  – the cluster sketch is also usable for other applications
  – single machine threaded and map-reduce versions available
▪ SVD and friends
  – stochastic SVD has in-memory, single machine out-of-core and map-reduce versions
  – good for reducing very large sparse matrices to tall skinny dense ones
▪ Spectral clustering
  – based on SVD, allows massive dimensional clustering

Mahout Math Summary
▪ Matrices, Vectors
  – views
  – in-place assignment
  – aggregations
  – iterations
▪ Functions
  – lots built-in
  – cooperate with sparse vector optimizations
▪ Sampling
  – abstract samplers
  – samplers as functions
▪ Other stuff … clustering, SVD

Recommenders

Recommendations
▪ Often known as collaborative filtering
▪ Actors interact with items
  – observe successful interaction
▪ We want to suggest additional successful interactions
▪ Observations inherently very sparse

The Big Ideas
▪ Cooccurrence is the core operation (and it is pretty simple)
▪ Cooccurrence can be extended to handle important new capabilities
▪ Recommendation systems can be deployed ideally using search technology

Examples of Recommendations
▪ Customers buying books (Linden et al.)
▪ Web visitors rating music (Shardanand and Maes) or movies (Riedl et al.), (Netflix)
▪ Internet radio listeners not skipping songs (Musicmatch)
▪ Internet video watchers watching >30 s (Veoh)
▪ Visibility in a map UI (new Google maps)

A simple recommendation architecture
▪ Look at the history of interactions
▪ Find significant item cooccurrence in user histories
▪ Use these cooccurring items as “indicators”
▪ For all indicators in user history, accumulate scores for related items

Recommendation Basics
▪ History:
    User  Thing
    1     3
    2     4
    3     4
    2     3
    3     2
    1     1
    2     1

Recommendation Basics
▪ History as matrix:
         t1  t2  t3  t4
    u1    1   0   1   0
    u2    1   0   1   1
    u3    0   1   0   1
▪ (t1, t3) cooccur 2 times,
▪ (t1, t4) once,
▪ (t2, t4) once,
▪ (t3, t4) once
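As a minimal sketch of how these counts arise, the following plain Java (no Mahout; the class and method names are made up for this illustration) turns the (user, thing) history pairs into the binary matrix A and then counts, for every pair of things, the users who touched both. That count matrix is \(A^T A\).

```java
import java.util.Arrays;

/** Illustrative only: builds the history matrix and its cooccurrence counts. */
final class Cooccurrence {
    private Cooccurrence() {}

    /** Builds the binary user-by-thing matrix A from 1-based (user, thing) pairs. */
    static int[][] historyMatrix(int[][] pairs, int users, int things) {
        int[][] a = new int[users][things];
        for (int[] p : pairs) {
            a[p[0] - 1][p[1] - 1] = 1;
        }
        return a;
    }

    /** Computes A^T A: entry (i, j) counts the users who interacted with both i and j. */
    static int[][] cooccurrence(int[][] a) {
        int things = a[0].length;
        int[][] c = new int[things][things];
        for (int[] userRow : a) {
            for (int i = 0; i < things; i++) {
                if (userRow[i] == 0) {
                    continue;            // sparse shortcut: this user never touched thing i
                }
                for (int j = 0; j < things; j++) {
                    c[i][j] += userRow[j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        // The history pairs from the slide, in the same order.
        int[][] pairs = {{1, 3}, {2, 4}, {3, 4}, {2, 3}, {3, 2}, {1, 1}, {2, 1}};
        int[][] c = cooccurrence(historyMatrix(pairs, 3, 4));
        for (int[] row : c) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The off-diagonal entries reproduce the counts listed above; each diagonal entry is simply the number of users who interacted with that thing.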

A Quick Simplification
▪ Users who do h: \(A h\)
▪ Also do r
▪ User-centric recommendations: \(r = A^T (A h)\)
▪ Item-centric recommendations: \(r = (A^T A) h\)
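The two groupings give the same vector; only the order of the work changes. A small check in plain Java, using the 3 × 4 history matrix from the earlier slide and an assumed history h containing only t1 (the class and method names are invented for the sketch):

```java
import java.util.Arrays;

/** Illustrative only: A^T (A h) versus (A^T A) h on a tiny example. */
final class TwoOrderings {
    private TwoOrderings() {}

    /** Returns the matrix-vector product m v. */
    static int[] times(int[][] m, int[] v) {
        int[] out = new int[m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < v.length; j++) {
                out[i] += m[i][j] * v[j];
            }
        }
        return out;
    }

    /** Returns the matrix-matrix product a b. */
    static int[][] times(int[][] a, int[][] b) {
        int[][] out = new int[a.length][b[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int k = 0; k < b.length; k++) {
                for (int j = 0; j < b[0].length; j++) {
                    out[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return out;
    }

    /** Returns the transpose of m. */
    static int[][] transpose(int[][] m) {
        int[][] t = new int[m[0].length][m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[0].length; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    /** Finds the users who share h, then sums what those users did. */
    static int[] userCentric(int[][] a, int[] h) {
        return times(transpose(a), times(a, h));
    }

    /** Precomputes the item-item matrix once, then applies it to h. */
    static int[] itemCentric(int[][] a, int[] h) {
        return times(times(transpose(a), a), h);
    }

    public static void main(String[] args) {
        int[][] a = {{1, 0, 1, 0}, {1, 0, 1, 1}, {0, 1, 0, 1}};
        int[] h = {1, 0, 0, 0};                       // assumed history: only t1
        System.out.println(Arrays.toString(userCentric(a, h)));
        System.out.println(Arrays.toString(itemCentric(a, h)));
    }
}
```

The item-centric form is the one that suits offline computation: \(A^T A\) can be built ahead of time, and serving a recommendation only needs the user's own history.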

Recommendation Basics
▪ Cooccurrence
         t1  t2  t3  t4
    t1    2   0   2   1
    t2    0   1   0   1
    t3    2   0   2   1
    t4    1   1   1   2

Problems with Raw Cooccurrence
▪ Very popular items co-occur with everything
  – Welcome document
  – Elevator music
▪ That isn’t interesting
  – We want anomalous cooccurrence

Recommendation Basics
▪ Cooccurrence
         t1  t2  t3  t4
    t1    2   0   2   1
    t2    0   1   0   1
    t3    2   0   2   1
    t4    1   1   1   2
▪ Contingency table for (t1, t3)
              t3   not t3
    t1         2     1
    not t1     1     1

Spot the Anomaly
▪ Root LLR is roughly like standard deviations

               A     not A
    B         13     1,000
    not B  1,000   100,000
    Score: 0.44

               A     not A
    B          1         0
    not B      0         2
    Score: 0.98

               A     not A
    B          1         0
    not B      0    10,000
    Score: 2.26

               A     not A
    B         10         0
    not B      0   100,000
    Score: 7.15
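A sketch of how such a score can be computed for a 2 × 2 table, using the textbook log-likelihood ratio (G²) statistic written in entropy form. Mahout ships its own implementation; the class below is an independent stand-in with made-up names, and it returns the unsigned root. With this convention the four tables above score roughly 0.90, 1.95, 4.52 and 14.29, about twice the figures on the slide, which suggests the slide uses a different scaling; the ranking of the tables is the same.

```java
/** Illustrative only: root log-likelihood ratio for a 2x2 contingency table. */
final class Llr {
    private Llr() {}

    /** x ln x with the usual convention 0 ln 0 = 0. */
    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized entropy of a set of counts: N ln N - sum(k ln k). */
    private static double entropy(long... counts) {
        long total = 0;
        double sum = 0.0;
        for (long k : counts) {
            total += k;
            sum += xLogX(k);
        }
        return xLogX(total) - sum;
    }

    /**
     * G^2 statistic. k11 = A and B together, k12 = B without A,
     * k21 = A without B, k22 = neither.
     */
    static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matrixEntropy));
    }

    /** Square root of G^2, which behaves roughly like a count of standard deviations. */
    static double rootLlr(long k11, long k12, long k21, long k22) {
        return Math.sqrt(llr(k11, k12, k21, k22));
    }

    public static void main(String[] args) {
        System.out.println(rootLlr(13, 1000, 1000, 100000));
        System.out.println(rootLlr(1, 0, 0, 2));
        System.out.println(rootLlr(1, 0, 0, 10000));
        System.out.println(rootLlr(10, 0, 0, 100000));
    }
}
```

The first table has the largest cooccurrence count (13) yet the lowest score, because 13 is close to what independence alone would predict for two items that each appear about a thousand times.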

Threshold by Score
▪ Cooccurrence
         t1  t2  t3  t4
    t1    2   0   2   1
    t2    0   1   0   1
    t3    2   0   2   1
    t4    1   1   1   2
▪ Significant cooccurrence => Indicators
         t1  t2  t3  t4
    t1    1   0   0   1
    t2    0   1   0   1
    t3    0   0   1   1
    t4    1   0   0   1
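Once indicators exist, scoring is a matter of counting matches against the user's history. The sketch below assumes that each row of the indicator matrix belongs to one item and that a 1 in a column marks that column's thing as an indicator for the item; that reading of the table, like the class and method names, is an assumption made for the illustration.

```java
import java.util.Arrays;

/** Illustrative only: score items by how many of their indicators occur in a history. */
final class IndicatorScoring {
    private IndicatorScoring() {}

    /** scores[item] = number of that item's indicators present in the user's history. */
    static int[] score(int[][] indicators, boolean[] history) {
        int[] scores = new int[indicators.length];
        for (int item = 0; item < indicators.length; item++) {
            for (int thing = 0; thing < history.length; thing++) {
                if (history[thing] && indicators[item][thing] != 0) {
                    scores[item]++;
                }
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        int[][] indicators = {
            {1, 0, 0, 1},   // t1
            {0, 1, 0, 1},   // t2
            {0, 0, 1, 1},   // t3
            {1, 0, 0, 1},   // t4
        };
        boolean[] history = {true, false, false, false};   // assumed user who has only seen t1
        System.out.println(Arrays.toString(score(indicators, history)));
    }
}
```

A real deployment would typically drop items the user has already seen and weight matches rather than count them, which is the job a search engine's ranking does.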

So Far, So Good
▪ Classic recommendation systems based on these approaches
  – Musicmatch (ca. 2000)
  – Veoh Networks (ca. 2005)
▪ Currently available in Mahout
  – See RowSimilarityJob
▪ Very simple to deploy
  – Compute indicators
  – Store in search engine
  – Works very well with enough data

What’s right about this?

Virtues of Current State of the Art
▪ Lots of well publicized history
  – Musicmatch, Veoh, Netflix, Amazon, Overstock
▪ Lots of support
  – Mahout, commercial offerings like Myrrix
▪ Lots of existing code
  – Mahout, commercial codes
▪ Proven track record
▪ Well socialized solution

What’s wrong about this?

Problems for Recommenders
▪ Cold start
▪ Disjoint populations
▪ Long tail
▪ Multiple kinds of evidence (multi-modal recommendations)
  – unstructured add-on data
  – other transaction streams
  – textual descriptions

What is this multi-modal stuff?
▪ But people don’t just do one thing
▪ One kind of behavior is useful for predicting other kinds
▪ Having a complete picture is important for accuracy
▪ What has the user said, viewed, clicked, closed, bought lately?

Example Multi-modal Inputs
▪ Overlap in restaurant visits is useful
▪ Big spender cues
▪ Cuisine as an indicator
▪ Review text as an indicator

Too Limited
▪ People do more than one kind of thing
▪ Different kinds of behaviors give different quality, quantity and kind of information
▪ We don’t have to do co-occurrence
▪ We can do cross-occurrence
▪ Result is cross-recommendation

Heh?

For example
▪ Users enter queries (A)
  – (actor = user, item = query)
▪ Users view videos (B)
  – (actor = user, item = video)
▪ \(A^T A\) gives query recommendation
  – “did you mean to ask for”
▪ \(B^T B\) gives video recommendation
  – “you might like these videos”

The punch-line
▪ \(B^T A\) recommends videos in response to a query
  – (isn’t that a search engine?)
  – (not quite, it doesn’t look at content or meta-data)
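A toy illustration of the cross-occurrence product in plain Java. The data is invented (3 users, 2 queries, 3 videos) and so are the names; the point is only that entry (video, query) of \(B^T A\) counts the users who both issued that query and watched that video, without looking at any content.

```java
import java.util.Arrays;

/** Illustrative only: cross-occurrence counts between two kinds of behavior. */
final class CrossOccurrence {
    private CrossOccurrence() {}

    /**
     * Computes B^T A for two matrices with the same users as rows.
     * result[v][q] = number of users with b[u][v] != 0 and a[u][q] != 0 (for binary inputs).
     */
    static int[][] crossOccurrence(int[][] b, int[][] a) {
        int[][] out = new int[b[0].length][a[0].length];
        for (int u = 0; u < b.length; u++) {
            for (int v = 0; v < b[0].length; v++) {
                if (b[u][v] == 0) {
                    continue;            // user u never watched video v
                }
                for (int q = 0; q < a[0].length; q++) {
                    out[v][q] += b[u][v] * a[u][q];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical data: rows are users u1..u3.
        int[][] a = {{1, 0}, {1, 0}, {0, 1}};             // users x queries
        int[][] b = {{1, 1, 0}, {1, 0, 0}, {0, 0, 1}};    // users x videos
        for (int[] row : crossOccurrence(b, a)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

As with plain cooccurrence, the raw counts would normally be filtered by an anomaly score before being used as indicators.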

Real-life example
▪ Query: “Paco de Lucia”
▪ Conventional meta-data search results:
  – “hombres del paco” times 400
  – not much else
▪ Recommendation based search:
  – Flamenco guitar and dancers
  – Spanish and classical guitar
  – Van Halen doing a classical/flamenco riff

Hypothetical Example
▪ Want a navigational ontology?
▪ Just put labels on a web page with traffic
  – This gives A = users × label clicks
▪ Remember viewing history
  – This gives B = users × items
▪ Cross recommend
  – \(B^T A\) = label to item mapping
▪ After several users click, results are whatever users think they should be

Nice. But we can do better?

▪ \(A\): rows are users, columns are things

▪ \(\begin{bmatrix} A_1 & A_2 \end{bmatrix}\): rows are users; the columns of \(A_1\) are things of type 1, the columns of \(A_2\) are things of type 2
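
\[
\begin{bmatrix} A_1 & A_2 \end{bmatrix}^T \begin{bmatrix} A_1 & A_2 \end{bmatrix}
= \begin{bmatrix} A_1^T \\ A_2^T \end{bmatrix} \begin{bmatrix} A_1 & A_2 \end{bmatrix}
= \begin{bmatrix} A_1^T A_1 & A_1^T A_2 \\ A_2^T A_1 & A_2^T A_2 \end{bmatrix}
\]

\[
\begin{bmatrix} r_1 \\ r_2 \end{bmatrix}
= \begin{bmatrix} A_1^T A_1 & A_1^T A_2 \\ A_2^T A_1 & A_2^T A_2 \end{bmatrix}
\begin{bmatrix} h_1 \\ h_2 \end{bmatrix}
\]

\[
r_1 = \begin{bmatrix} A_1^T A_1 & A_1^T A_2 \end{bmatrix}
\begin{bmatrix} h_1 \\ h_2 \end{bmatrix}
\]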
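The last line can be read as "recommend type-1 things from both kinds of history". A minimal plain-Java sketch with invented toy data and names: it evaluates the first block row as \(A_1^T (A_1 h_1 + A_2 h_2)\), which equals \(A_1^T A_1 h_1 + A_1^T A_2 h_2\).

```java
import java.util.Arrays;

/** Illustrative only: the first block row of the multi-modal recommendation. */
final class MultiModal {
    private MultiModal() {}

    /** r1 = A1^T (A1 h1 + A2 h2); a1 and a2 share the same users as rows. */
    static int[] recommend(int[][] a1, int[][] a2, int[] h1, int[] h2) {
        int users = a1.length;
        // How strongly each user overlaps with the combined history.
        int[] userWeights = new int[users];
        for (int u = 0; u < users; u++) {
            for (int j = 0; j < h1.length; j++) {
                userWeights[u] += a1[u][j] * h1[j];
            }
            for (int j = 0; j < h2.length; j++) {
                userWeights[u] += a2[u][j] * h2[j];
            }
        }
        // Project the user weights back onto type-1 things.
        int[] r1 = new int[a1[0].length];
        for (int u = 0; u < users; u++) {
            for (int i = 0; i < r1.length; i++) {
                r1[i] += a1[u][i] * userWeights[u];
            }
        }
        return r1;
    }

    public static void main(String[] args) {
        // Hypothetical data: 3 users, 3 type-1 things (videos), 2 type-2 things (queries).
        int[][] a1 = {{1, 1, 0}, {1, 0, 0}, {0, 0, 1}};
        int[][] a2 = {{1, 0}, {1, 0}, {0, 1}};
        int[] h1 = {0, 1, 0};   // the current user watched the second video
        int[] h2 = {1, 0};      // and issued the first query
        System.out.println(Arrays.toString(recommend(a1, a2, h1, h2)));
    }
}
```

Setting h1 to all zeros leaves only the \(A_1^T A_2 h_2\) term, which is the cross-recommendation case.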

Summary
▪ Input: Multiple kinds of behavior on one set of things
▪ Output: Recommendations for one kind of behavior with a different set of things
▪ Cross recommendation is a special case

Now again, without the scary math

Input Data
▪ User transactions
  – user id, merchant id
  – SIC code, amount
  – Descriptions, cuisine, …
▪ Offer transactions
  – user id, offer id
  – vendor id, merchant ids
  – offers, views, accepts
▪ Derived user data
  – merchant ids
  – anomalous descriptor terms
  – offer & vendor ids
▪ Derived merchant data
  – local top40
  – SIC code
  – vendor code
  – amount distribution

Cross-recommendation
▪ Per merchant indicators
  – merchant ids
  – chain ids
  – SIC codes
  – indicator terms from text
  – offer vendor ids
▪ Computed by finding anomalous (indicator => merchant) rates

How can we deploy this?

Search-based Recommendations
▪ Sample document
  Original data and meta-data:
  – Merchant Id
  – Field for text description
  – Phone
  – Address
  – Location
  Derived from cooccurrence and cross-occurrence analysis:
  – Indicator merchant ids
  – Indicator industry (SIC) ids
  – Indicator offers
  – Indicator text
  – Local top40
▪ Sample query (the recommendation query)
  – Current location
  – Recent merchant descriptions
  – Recent merchant ids
  – Recent SIC codes
  – Recent accepted offers
  – Local top40
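One way to picture the recommendation query is as an ordinary fielded search whose terms come from the user's recent history. The sketch below only builds a query string in Lucene-style `field:(term term)` syntax; the field names (`indicator_merchant_ids`, `indicator_sic_ids`, `indicator_offers`) and the ids are hypothetical, and a real deployment would go through the search engine's client library, add the location filter and tune field weights.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Illustrative only: turn recent user history into a fielded search query. */
final class RecommendationQuery {
    private RecommendationQuery() {}

    /** Builds "field:(v1 v2) OR other:(v3)" and skips fields with no recent values. */
    static String build(Map<String, List<String>> recentByField) {
        StringJoiner query = new StringJoiner(" OR ");
        for (Map.Entry<String, List<String>> e : recentByField.entrySet()) {
            if (e.getValue().isEmpty()) {
                continue;
            }
            query.add(e.getKey() + ":(" + String.join(" ", e.getValue()) + ")");
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // Hypothetical field names and ids; LinkedHashMap keeps the field order stable.
        Map<String, List<String>> recent = new LinkedHashMap<>();
        recent.put("indicator_merchant_ids", Arrays.asList("m17", "m42"));
        recent.put("indicator_sic_ids", Arrays.asList("5812"));
        recent.put("indicator_offers", Arrays.asList("o9"));
        System.out.println(build(recent));
    }
}
```

Documents whose indicator fields match more of the user's recent behavior rank higher, so the search engine's relevance scoring does the accumulation step of the recommender.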

Analyze with Map-Reduce
[Diagram components: complete history, cooccurrence (Mahout), item meta-data, Solr indexing (Solr indexers), index shards]

Deploy with Conventional Search System
[Diagram components: user history, web tier, Solr search, item meta-data, Solr indexers, index shards]

Objective Results
▪ At a very large credit card company
▪ History is all transactions
▪ Development time to minimal viable product about 4 months
▪ General release 2-3 months later
▪ Search-based recs at or equal in quality to other techniques

▪ Contact:
  – tdunning@maprtech.com
  – @ted_dunning
  – @apachemahout
  – user-subscribe@mahout.apache.org
▪ Slides and such: http://www.slideshare.net/tdunning
▪ Hash tags: #mapr #apachemahout #recommendations
