Seattle Scalability Mahout


Talk given at the Seattle Scalability / NoSQL / Hadoop / etc MeetUp on March 31, 2010

Published in: Technology
  • And the usual references for LSI and Spectral Decomposition

    1. 1. Numerical Recipes<br />in <br />Hadoop<br />Jake Mannix<br />linkedin/in/jakemannix<br />twitter/pbrane<br /><br /><br />Principal SDE, LinkedIn<br />Committer, Apache Mahout, Zoie, Bobo-Browse, Decomposer<br />Author, Lucene in Depth (Manning MM/DD/2010)<br />
    2. 2. A Mathematician’s Apology<br />What mathematical structure describes all of these?<br />Full-text search:<br />Score documents matching “query string”<br />Collaborative filtering recommendation:<br />Users who liked {those} also liked {these}<br />(Social/web)-graph proximity:<br />People/pages “close” to {this} are {these}<br />
    3. 3. Matrix Multiplication!<br />
    4. 4. Full-text Search<br />Vector Space Model of IR<br />Corpus as term-document matrix<br />Query as bag-of-words vector<br />Full-text search is just: <br />
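The slide's formula did not survive extraction. As a minimal sketch of the idea, assume a term-by-document matrix `A` and a bag-of-words query vector `q`; scoring every document is then the matrix-vector product of `A` transposed with `q`. The class name, method name, and toy corpus below are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;

public class VectorSpaceSearch {
    // scores[d] = sum over terms t of termDoc[t][d] * query[t],
    // i.e. the product A^T q for a term-by-document matrix A.
    public static double[] score(double[][] termDoc, double[] query) {
        int numDocs = termDoc[0].length;
        double[] scores = new double[numDocs];
        for (int t = 0; t < termDoc.length; t++) {
            if (query[t] == 0.0) continue; // queries are sparse: skip absent terms
            for (int d = 0; d < numDocs; d++) {
                scores[d] += termDoc[t][d] * query[t];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Hypothetical corpus: rows = terms {hadoop, mahout, lucene}, columns = 3 documents.
        double[][] termDoc = {
            {1, 0, 2},
            {0, 3, 1},
            {1, 1, 0}
        };
        double[] query = {1, 1, 0}; // bag of words for "hadoop mahout"
        System.out.println(Arrays.toString(score(termDoc, query)));
    }
}
```

A real engine would use weighted (for example tf-idf) entries and an inverted index rather than a dense array, but the arithmetic is the same product.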
    5. 5. Collaborative Filtering<br />User preference matrix <br />(and item-item similarity matrix )<br />Input user as vector of preferences <br />(simple) Item-based CF recommendations are:<br />T<br />
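The symbols on this slide were lost in extraction (only a stray transpose "T" remains). One plausible reconstruction, with notation that is an assumption rather than the speaker's: let $A$ be the users-by-items preference matrix, $S$ the item-item similarity matrix, and $u$ one user's preference vector over items.

```latex
\[
S = A^{T} A, \qquad r = S\,u = A^{T} A\,u
\]
```

Here $r$ holds one recommendation score per item for the input user.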
    6. 6. Graph Proximity<br />Adjacency matrix:<br />2nd degree adjacency matrix: <br /> Input all of a user’s “friends” or page links:<br />(weighted) distance measure of 1st – 3rd degree connections is then:<br />
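The formulas on this slide were also lost. A hedged reading, with assumed notation: let $A$ be the adjacency matrix, so $A^{2}$ counts two-step paths, and let $f$ be the indicator vector of a user's direct connections (or a page's links). Because $f$ already encodes the 1st degree, applying $A$ once and twice reaches the 2nd and 3rd degrees, and a weighted proximity score could look like:

```latex
\[
r = \left( w_{1} I + w_{2} A + w_{3} A^{2} \right) f
\]
```

The weights $w_{1}, w_{2}, w_{3}$ are hypothetical and would normally decrease with distance.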
    7. 7. Dictionary<br />Applications Linear Algebra<br />
    8. 8. How does this help?<br />In Search:<br />Latent Semantic Indexing (LSI)<br />probabilistic LSI<br />Latent Dirichlet Allocation<br />In Recommenders:<br />Singular Value Decomposition<br />Layered Restricted Boltzmann Machines <br />(Deep Belief Networks)<br />In Graphs:<br />PageRank<br />Spectral Decomposition / Spectral Clustering<br />
    9. 9. Often use “Dimensional Reduction”<br />To alleviate the sparse Big Data problem of “the curse of dimensionality”<br />Used to improve recall and relevance <br />in general: smooth the metric on your data set<br />
    10. 10. New applications with Matrices<br />If Search is finding doc-vector by: <br />and users query with data represented: Q = <br />Giving implicit feedback based on click-through per session: C =<br />
    11. 11. … continued<br />Then has the form (docs-by-terms) for search!<br />Approach has been used by Ted Dunning at Veoh<br />(and probably others)<br />
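The matrix expressions on these two slides did not survive. A sketch of what the shapes suggest, under assumed notation: if $Q$ is a sessions-by-terms matrix of what users typed and $C$ is a sessions-by-docs matrix of what they clicked, then the product below is docs-by-terms, which is the shape a search index needs.

```latex
\[
\left( C^{T} Q \right)_{d,t} = \sum_{s} C_{s,d}\, Q_{s,t}
\]
```

Entry $(d,t)$ accumulates how often term $t$ was queried in sessions where document $d$ was clicked.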
    12. 12. Linear Algebra performance tricks<br />Naïve item-based recommendations:<br />Calculate item similarity matrix:<br />Calculate item recs:<br />Express in one step:<br />In matrix notation:<br />Re-writing as:<br /> is the vector of preferences for user “v”, <br /> is the vector of preferences of item “i”<br />The result is the matrix sum of the outer (tensor) products of these vectors, scaled by the entry they intersect at.<br />
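The equations on this slide were lost, but the closing sentence pins down the identity. Assuming `A` is the users-by-items preference matrix, all recommendations are `A (A^T A)`, and the same result can be written as a sum over entries `A[v][i]` of the outer product of item column `i` with user row `v`, scaled by that entry. The sketch below checks the two routes against each other; the class and method names are hypothetical.

```java
import java.util.Arrays;

public class OuterProductRecs {
    // Two-step route: build item-item similarity A^T A, then multiply A by it.
    public static double[][] viaSimilarity(double[][] a) {
        int users = a.length, items = a[0].length;
        double[][] sim = new double[items][items];
        for (double[] row : a)
            for (int i = 0; i < items; i++)
                for (int j = 0; j < items; j++)
                    sim[i][j] += row[i] * row[j];
        double[][] recs = new double[users][items];
        for (int u = 0; u < users; u++)
            for (int i = 0; i < items; i++)
                for (int j = 0; j < items; j++)
                    recs[u][j] += a[u][i] * sim[i][j];
        return recs;
    }

    // One-step route: for each nonzero A[v][i], add
    // A[v][i] * (column i of A) outer (row v of A).
    public static double[][] viaOuterProducts(double[][] a) {
        int users = a.length, items = a[0].length;
        double[][] recs = new double[users][items];
        for (int v = 0; v < users; v++)
            for (int i = 0; i < items; i++) {
                double entry = a[v][i];
                if (entry == 0.0) continue; // sparse data: most entries are skipped
                for (int u = 0; u < users; u++)
                    for (int j = 0; j < items; j++)
                        recs[u][j] += entry * a[u][i] * a[v][j];
            }
        return recs;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 0, 2}, {0, 1, 1}}; // 2 users, 3 items (toy data)
        System.out.println(Arrays.deepToString(viaSimilarity(a)));
        System.out.println(Arrays.deepToString(viaOuterProducts(a)));
    }
}
```

The one-step form touches only nonzero entries, and each term is an independent unit of work, which is what makes it a natural fit for a map and sum job.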
    13. 13. Item Recommender via Hadoop<br />
    14. 14. Apache Mahout<br />Apache Mahout is currently on release 0.3<br />Will be a “Top Level Project” soon (before 0.4)<br />“Scalable Machine Learning with commercially friendly licensing”<br />
    15. 15. Mahout Features<br />Recommenders<br />absorbed the Taste project<br />Classification (Naïve Bayes, C-Bayes, more)<br />Clustering (Canopy, fuzzy-K-means, Dirichlet, etc…)<br />Fast non-distributed linear mathematics<br />absorbed the classic CERN Colt project<br />Distributed Matrices and decomposition<br />absorbed the Decomposer project<br />mahout shell-script analogous to $HADOOP_HOME/bin/hadoop<br />$MAHOUT_HOME/bin/mahout kmeans -i "in" -o "out" -k 100<br />$MAHOUT_HOME/bin/mahout svd -i "in" -o "out" -k 300<br />etc…<br />Taste web-app for real-time recommendations<br />
    16. 16. DistributedRowMatrix<br />Wrapper around a SequenceFile<IntWritable,VectorWritable><br />Distributed methods like:<br />Matrix transpose();<br />Matrix times(Matrix other);<br />Vector times(Vector v);<br />Vector timesSquared(Vector v);<br />To get SVD: pass into DistributedLanczosSolver:<br />LanczosSolver.solve(Matrix input, Matrix eigenVectors, List<Double> eigenValues, int rank); <br />
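To illustrate why a `timesSquared` operation is useful for Lanczos-style solvers, here is a plain-array sketch, assuming it means "multiply by the matrix, then by its transpose". It does not use the Mahout API; the class name is hypothetical. Each row contributes `(row . v) * row`, so the product `A^T A v` needs one pass over the rows and never materializes `A^T A`.

```java
import java.util.Arrays;

public class TimesSquaredSketch {
    // Computes A^T (A v) in a single pass over the rows of A.
    public static double[] timesSquared(double[][] rows, double[] v) {
        double[] result = new double[v.length];
        for (double[] row : rows) {
            double dot = 0.0;
            for (int j = 0; j < v.length; j++) dot += row[j] * v[j];
            // Accumulate this row's contribution, scaled by its dot product with v.
            for (int j = 0; j < v.length; j++) result[j] += dot * row[j];
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] rows = {{1, 2}, {3, 4}};
        System.out.println(Arrays.toString(timesSquared(rows, new double[]{1, 1})));
    }
}
```

In a distributed setting each mapper could emit its rows' contributions and a reducer could sum them, since the per-row terms are independent.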
    17. 17. Questions?<br />Contact: <br />
    18. 18. Appendix<br />There are many ways to deal with sparse Big Data. Many (not all) must cope with the dimensionality of the feature space growing beyond reasonable limits, and the right technique depends heavily on your data…<br />That said, there are some general techniques<br />
    19. 19. Dealing with Curse of Dimensionality<br />Sparseness means fast, but overlap is too small<br />Can we reduce the dimensionality (from “all possible text tokens” or “all userIds”) while keeping the nice aspects of the search problem?<br />If possible, collapse “similar” vectors (synonymous terms, userIds with high overlap, etc…) towards each other while keeping “dissimilar” vectors far apart…<br />
    20. 20. Solution A: Matrix decomposition<br />Singular Value Decomposition (truncated)<br />“best” approximation to your matrix<br />Used in Latent Semantic Indexing (LSI)<br />For graphs: spectral decomposition<br />Collaborative filtering (Netflix leaderboard)<br />Issues: very computation intensive <br />no parallelized open-source packages (but see Apache Mahout)<br />Makes things too dense<br />
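For reference, the sense in which the truncated SVD is the "best" approximation can be stated as follows (standard notation, not taken from the slides): keep only the $k$ largest singular values and their singular vectors.

```latex
\[
A \approx A_{k} = U_{k} \Sigma_{k} V_{k}^{T},
\qquad
A_{k} = \arg\min_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_{F}
\]
```

The factors $U_{k}$ and $V_{k}$ are generally dense even when $A$ is sparse, which is the "makes things too dense" issue noted on the slide.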
    21. 21. SVD: continued<br />Hadoop impl. in Mahout (Lanczos)<br />O(N*d*k) for rank-k SVD on N docs, d elts each<br />Density can be dealt with by doing Canopy Clustering offline<br />But only extracting linear feature mixes<br />Also, still very computation intensive and I/O intensive (k passes over data set)<br />Are there better dimensional reduction methods?<br />
    22. 22. Solution B: Stochastic Decomposition<br />co-occurrence-based kernel + online Random Projection + SVD<br />
    23. 23. Co-occurrence-based kernel<br />Extract bigram phrases / pairs of items rated by the same person (using Log-Likelihood Ratio test to pick the best)<br />“Disney on Ice was Amazing!” -> {“disney”, “disney on ice”, “ice”, “was”, “amazing”}<br />{item1:4, item2:5, item5:3, item9:1} -> {item1:4, (items1+2):4.5, item2:5, item5:3,…}<br />Dim(features) goes from 10^5 to 10^8+ (yikes!)<br />
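The Log-Likelihood Ratio test mentioned above can be sketched for a 2x2 contingency table of co-occurrence counts (`k11` = both together, `k12` and `k21` = one without the other, `k22` = neither). This is a minimal sketch of the standard G-squared statistic written in terms of unnormalized entropies; the class and method names are hypothetical and this is not presented as the code the talk used.

```java
public class LogLikelihoodRatio {
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts: N log N - sum(k log k).
    static double entropy(long... counts) {
        long total = 0;
        double sumXLogX = 0.0;
        for (long k : counts) {
            total += k;
            sumXLogX += xLogX(k);
        }
        return xLogX(total) - sumXLogX;
    }

    // LLR = 2 * (H(row sums) + H(column sums) - H(all four cells)).
    public static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
    }

    public static void main(String[] args) {
        System.out.println(llr(1, 1, 1, 1));    // independent counts: score near zero
        System.out.println(llr(10, 0, 0, 10));  // perfectly associated counts: large score
    }
}
```

Pairs whose score exceeds a chosen threshold would be kept as new features; the threshold itself is a tuning choice the slides do not specify.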
    24. 24. Online Random Projection<br />Randomly project kernelized text vectors down to “merely” 10^3 dimensions with a Gaussian matrix<br />Or project each nGram down to a random (but sparse) 10^3-dim vector:<br />V = {123876244 => 1.3} (tf-IDF of “disney”)<br />V’ = c*{h(i) => 1, h(h(i)) => 1, h(h(h(i))) => 1}<br />(c = 1.3 / sqrt(3))<br />
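The sparse variant can be sketched as follows: each input feature id `i` with weight `w` is spread over three target indices `h(i)`, `h(h(i))`, `h(h(h(i)))`, each receiving `w / sqrt(3)`. The hash function `h`, the class name, and the constants are assumptions made for illustration; the slides do not say which hash was used.

```java
import java.util.HashMap;
import java.util.Map;

public class HashedProjection {
    static final int DIM = 1000;  // target dimensionality (10^3, as on the slide)
    static final int PROBES = 3;  // nonzero entries written per input feature

    // Hypothetical integer hash into [0, DIM); any well-mixed hash would do.
    static int h(int x) {
        int z = x * 0x9E3779B1;
        z ^= z >>> 16;
        return Math.floorMod(z, DIM);
    }

    // Projects a sparse vector {featureId -> weight} to a sparse DIM-dimensional vector.
    public static Map<Integer, Double> project(Map<Integer, Double> sparse) {
        Map<Integer, Double> out = new HashMap<>();
        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
            double c = e.getValue() / Math.sqrt(PROBES);
            int index = e.getKey();
            for (int p = 0; p < PROBES; p++) {
                index = h(index);                 // h(i), then h(h(i)), then h(h(h(i)))
                out.merge(index, c, Double::sum); // colliding indices accumulate
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Double> v = new HashMap<>();
        v.put(123876244, 1.3); // tf-idf weight of one term, as in the slide's example
        System.out.println(project(v));
    }
}
```

When the three indices are distinct, the scaling by `1 / sqrt(3)` keeps the Euclidean norm of the projected feature equal to its original weight.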
    25. 25. Outer-product and Sum<br />Take the 10^3-dim projected vectors and outer-product with themselves,<br />result is 10^3 x 10^3-dim matrix<br />sum these in a Combiner<br />All results go to single Reducer, where you compute…<br />
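The outer-product-and-sum step can be sketched with a small accumulator: each projected vector `y` adds `y y^T` to a running matrix, and partial sums from different workers are merged by entrywise addition, which is the property that lets a Combiner do most of the work. This is plain Java rather than actual Hadoop code, and the class name is hypothetical.

```java
import java.util.Arrays;

public class GramAccumulator {
    public final double[][] sum;

    public GramAccumulator(int dim) {
        sum = new double[dim][dim];
    }

    // Map side: add the outer product y y^T of one projected vector.
    public void add(double[] y) {
        for (int i = 0; i < y.length; i++) {
            if (y[i] == 0.0) continue; // projected vectors are sparse
            for (int j = 0; j < y.length; j++) sum[i][j] += y[i] * y[j];
        }
    }

    // Combiner/reducer side: partial sums are simply added entrywise.
    public void merge(GramAccumulator other) {
        for (int i = 0; i < sum.length; i++)
            for (int j = 0; j < sum.length; j++) sum[i][j] += other.sum[i][j];
    }

    public static void main(String[] args) {
        GramAccumulator a = new GramAccumulator(2);
        GramAccumulator b = new GramAccumulator(2);
        a.add(new double[]{1, 2});
        b.add(new double[]{3, 0});
        a.merge(b);
        System.out.println(Arrays.deepToString(a.sum));
    }
}
```

At 10^3 dimensions the accumulated matrix has 10^6 entries, small enough to hold in memory on the single reducer.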
    26. 26. SVD <br />SVD-them quickly (they fit in memory) <br />Over and over again (as new data comes in)<br />Use the most recent SVD to project your (already randomly projected) text still further (now encoding “semantic” similarity).<br />SVD-projected vectors can be assigned immediately to nearest clusters if desired<br />
    27. 27. References<br />Randomized matrix decomposition review:<br />Sparse hashing/projection:<br />John Langford et al. “Vowpal Wabbit”<br />