Talk given at the Seattle Scalability / NoSQL / Hadoop / etc MeetUp on March 31, 2010


- 1. Numerical Recipes in Hadoop
  Jake Mannix
  linkedin/in/jakemannix
  twitter/pbrane
  jake.mannix@gmail.com
  jmannix@apache.org
  Principal SDE, LinkedIn
  Committer, Apache Mahout, Zoie, Bobo-Browse, Decomposer
  Author, Lucene in Depth (Manning MM/DD/2010)
- 2. A Mathematician's Apology
  What mathematical structure describes all of these?
  Full-text search: score documents matching "query string"
  Collaborative filtering recommendation: users who liked {those} also liked {these}
  (Social/web)-graph proximity: people/pages "close" to {this} are {these}
- 3. Matrix Multiplication!
- 4. Full-text Search
  Vector Space Model of IR
  Corpus as term-document matrix
  Query as bag-of-words vector
  Full-text search is then just a matrix-vector product of the two.
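As a toy sketch of the slide's point (plain Python, hypothetical 3-term vocabulary and hand-made weights, not from the deck), scoring every document against a query is one matrix-vector product:

```python
# Vector-space retrieval as a matrix-vector product (toy sketch).
# Rows of A are documents, columns are terms; entries are term weights.
A = [
    [1.0, 0.0, 2.0],  # doc 0: weights for terms t0, t1, t2
    [0.0, 1.0, 1.0],  # doc 1
    [3.0, 0.0, 0.0],  # doc 2
]
q = [1.0, 0.0, 1.0]   # bag-of-words query over the same 3 terms

def mat_vec(m, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

scores = mat_vec(A, q)   # one relevance score per document
ranking = sorted(range(len(scores)), key=lambda d: -scores[d])
```

A real engine uses a sparse inverted index, but the algebraic shape is the same.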
- 5. Collaborative Filtering
  User preference matrix (and item-item similarity matrix)
  Input user as vector of preferences
  (Simple) item-based CF recommendations are the item-similarity matrix applied to that preference vector.
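A minimal sketch of the same idea (pure Python, invented 3-user, 3-item ratings): build the item-item co-rating matrix as PᵀP, then score items for a user by multiplying it against their preference vector:

```python
# Item-based CF as matrix math (toy sketch): similarity = P^T P,
# recommendations for a user = similarity . preference-vector.
P = [                # users x items preference matrix
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [0.0, 2.0, 5.0],
]

def transpose(m):
    return [list(col) for col in zip(*m)]

def mat_mul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

S = mat_mul(transpose(P), P)   # 3x3 item-item co-rating "similarity"
u = [5.0, 0.0, 0.0]            # a user who only rated item 0
recs = mat_vec(S, u)           # scores for all items, incl. unrated ones
```

Production systems normalize S (e.g. cosine or log-likelihood), but raw co-rating keeps the algebra visible.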
- 6. Graph Proximity
  Adjacency matrix
  2nd-degree adjacency matrix
  Input all of a user's "friends" or page links
  A (weighted) distance measure of 1st to 3rd degree connections is then a weighted sum of powers of the adjacency matrix applied to that input vector.
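A toy sketch of graph proximity (plain Python, made-up 4-node path graph; the 0.5 hop-decay weight is an arbitrary choice, not from the deck): repeated matrix-vector products walk the graph one hop at a time.

```python
# Graph proximity via powers of the adjacency matrix (toy sketch).
A = [                # undirected 4-node path graph: 0-1, 1-2, 2-3
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

v = [0, 1, 0, 0]                 # "friends of node 0": just node 1
first  = mat_vec(A, v)           # 1-hop reach of those friends
second = mat_vec(A, first)       # 2-hop reach (paths of length 2)
# Weighted proximity: nearer hops count more (decay factor is arbitrary).
proximity = [f + 0.5 * s for f, s in zip(first, second)]
```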
- 7. Dictionary
  Applications <-> Linear Algebra
- 8. How does this help?
  In Search:
  - Latent Semantic Indexing (LSI)
  - probabilistic LSI
  - Latent Dirichlet Allocation
  In Recommenders:
  - Singular Value Decomposition
  - Layered Restricted Boltzmann Machines (Deep Belief Networks)
  In Graphs:
  - PageRank
  - Spectral Decomposition / Spectral Clustering
- 9. Often use "Dimensional Reduction"
  To alleviate the sparse Big Data problem of "the curse of dimensionality"
  Used to improve recall and relevance
  In general: smooth the metric on your data set
- 10. New applications with Matrices
  If Search is finding doc-vectors by multiplying the corpus matrix against a query vector,
  and users query with data represented as a matrix Q of query terms per session,
  giving implicit feedback based on click-through per session as a matrix C...
- 11. ... continued
  Then CᵀQ has the form (docs-by-terms), so it can be searched like a corpus!
  This approach has been used by Ted Dunning at Veoh (and probably others).
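The click-through trick can be sketched in a few lines (toy Python; I'm assuming Q is sessions-by-terms and C is sessions-by-docs, which is consistent with the docs-by-terms result stated above):

```python
# Click-through feedback as a derived docs-by-terms matrix (toy sketch).
# Q: sessions x terms (what users asked); C: sessions x docs (what they clicked).
Q = [
    [1, 0, 1],   # session 0 queried terms t0 and t2
    [0, 1, 0],   # session 1 queried term t1
]
C = [
    [0, 1],      # session 0 clicked doc 1
    [1, 0],      # session 1 clicked doc 0
]

def transpose(m):
    return [list(col) for col in zip(*m)]

def mat_mul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

# C^T Q is docs x terms: each click "votes" the session's query terms
# onto the clicked document, yielding a searchable matrix.
feedback = mat_mul(transpose(C), Q)
```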
- 12. Linear Algebra performance tricks
  Naive item-based recommendations:
  - calculate the item similarity matrix
  - calculate the item recs
  Express both in one step, in matrix notation, re-writing so that:
  - one factor is the vector of preferences for user "v"
  - the other is the vector of preferences of item "i"
  The result is the matrix sum of the outer (tensor) products of these vectors, scaled by the entry they intersect at.
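The outer-product identity behind this one-pass trick can be checked directly (toy Python, invented 2-user ratings): summing outer(prefs, prefs) over user rows reproduces PᵀP without ever materializing the transpose.

```python
# P^T P as a sum of per-user outer products (the trick behind the
# one-pass item recommender): each user row contributes
# outer(prefs, prefs), summed across users.
P = [
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
]

def outer(u, v):
    return [[a * b for b in v] for a in u]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

n = len(P[0])
S = [[0.0] * n for _ in range(n)]
for prefs in P:                      # one map/combine step per user row
    S = mat_add(S, outer(prefs, prefs))
# S now equals the item-item matrix P^T P, computed in a single pass.
```

This is exactly the shape that maps onto Hadoop: mappers emit per-user outer products, combiners and the reducer just add matrices.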
- 13. Item Recommender via Hadoop
- 14. Apache Mahout
  Apache Mahout is currently on release 0.3
  http://lucene.apache.org/mahout
  Will be a "Top Level Project" soon (before 0.4): http://mahout.apache.org
  "Scalable Machine Learning with commercially friendly licensing"
- 15. Mahout Features
  Recommenders (absorbed the Taste project)
  Classification (Naive Bayes, C-Bayes, more)
  Clustering (Canopy, fuzzy k-means, Dirichlet, etc.)
  Fast non-distributed linear mathematics (absorbed the classic CERN Colt project)
  Distributed matrices and decomposition (absorbed the Decomposer project)
  mahout shell script, analogous to $HADOOP_HOME/bin/hadoop:
    $MAHOUT_HOME/bin/mahout kmeans -i "in" -o "out" -k 100
    $MAHOUT_HOME/bin/mahout svd -i "in" -o "out" -k 300
    etc.
  Taste web-app for real-time recommendations
- 16. DistributedRowMatrix
  Wrapper around a SequenceFile<IntWritable,VectorWritable>
  Distributed methods like:
    Matrix transpose();
    Matrix times(Matrix other);
    Vector times(Vector v);
    Vector timesSquared(Vector v);
  To get the SVD, pass into DistributedLanczosSolver:
    LanczosSolver.solve(Matrix input, Matrix eigenVectors, List<Double> eigenValues, int rank);
- 17. Questions?
  Contact:
  jake.mannix@gmail.com
  jmannix@apache.org
  http://twitter.com/pbrane
  http://www.decomposer.org/blog
  http://www.linkedin.com/in/jakemannix
- 18. Appendix
  There are lots of ways to deal with sparse Big Data, and many (not all) need to handle the dimensionality of the feature space growing beyond reasonable limits; the techniques for this depend heavily on your data.
  That said, there are some general techniques.
- 19. Dealing with the Curse of Dimensionality
  Sparseness means fast, but overlap is too small.
  Can we reduce the dimensionality (from "all possible text tokens" or "all userIds") while keeping the nice aspects of the search problem?
  If possible, collapse "similar" vectors (synonymous terms, userIds with high overlap, etc.) towards each other, while keeping "dissimilar" vectors far apart.
- 20. Solution A: Matrix decomposition
  Singular Value Decomposition (truncated): the "best" low-rank approximation to your matrix
  Used in Latent Semantic Indexing (LSI)
  For graphs: spectral decomposition
  Collaborative filtering (Netflix leaderboard)
  Issues:
  - very computation intensive
  - no parallelized open-source packages (but see Apache Mahout)
  - makes things too dense
- 21. SVD: continued
  Hadoop impl. in Mahout (Lanczos)
  O(N*d*k) for a rank-k SVD on N docs with d elements each
  Density can be dealt with by doing Canopy Clustering offline
  But only extracts linear feature mixes
  Also still very computation- and I/O-intensive (k passes over the data set); are there better dimensional reduction methods?
- 22. Solution B: Stochastic Decomposition
  Co-occurrence-based kernel + online Random Projection + SVD
- 23. Co-occurrence-based kernel
  Extract bigram phrases / pairs of items rated by the same person (using the Log-Likelihood Ratio test to pick the best)
  "Disney on Ice was Amazing!" -> {"disney", "disney on ice", "ice", "was", "amazing"}
  {item1:4, item2:5, item5:3, item9:1} -> {item1:4, (items1+2):4.5, item2:5, item5:3, ...}
  Dim(features) goes from 10^5 to 10^8+ (yikes!)
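The token side of this expansion can be sketched in a couple of lines (plain Python; I substitute naive adjacent-bigram extraction for the LLR-filtered phrase selection the slide describes, so this over-generates features):

```python
# Co-occurrence feature expansion (toy sketch): augment unigram tokens
# with adjacent bigrams. A real pipeline would keep only bigrams that
# pass a Log-Likelihood Ratio significance test.
def kernelize(tokens):
    features = list(tokens)                                   # unigrams
    features += [" ".join(p) for p in zip(tokens, tokens[1:])]  # bigrams
    return features

feats = kernelize(["disney", "on", "ice"])
```

The blow-up is visible even here: n tokens become 2n - 1 features, and over a whole vocabulary the feature space grows quadratically in the worst case.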
- 24. Online Random Projection
  Randomly project kernelized text vectors down to "merely" 10^3 dimensions with a Gaussian matrix,
  or project each nGram down to a random (but sparse) 10^3-dim vector:
    V  = {123876244 => 1.3}  (tf-IDF of "disney")
    V' = c * {h(i) => 1, h(h(i)) => 1, h(h(h(i))) => 1}
    (c = 1.3 / sqrt(3))
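The sparse variant of the recipe above can be sketched as follows (plain Python; the particular hash function and the 10^3 dimension are illustrative assumptions, only the h(i), h(h(i)), h(h(h(i))) structure and the 1/sqrt(3) scaling come from the slide):

```python
import math

# Sparse random projection by iterated hashing (sketch of the slide's
# recipe): one weighted feature id becomes up to 3 pseudo-random
# coordinates in a low-dim space, with weight rescaled by 1/sqrt(3).
DIM = 1000

def h(i):
    # Any cheap deterministic integer hash works; this one is an
    # arbitrary multiplicative hash chosen for the sketch.
    return (i * 2654435761 + 12345) % (2 ** 32)

def project(feature_id, weight):
    c = weight / math.sqrt(3)
    v = {}
    idx = feature_id
    for _ in range(3):            # h(i), h(h(i)), h(h(h(i)))
        idx = h(idx)
        v[idx % DIM] = v.get(idx % DIM, 0.0) + c
    return v

sparse = project(123876244, 1.3)  # the slide's tf-IDF weight for "disney"
```

Summing on collision (rather than overwriting) keeps the projection linear, which is what makes the later SVD step meaningful.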
- 25. Outer-product and Sum
  Take the 10^3-dim projected vectors and outer-product them with themselves;
  the result is a 10^3 x 10^3 matrix.
  Sum these in a Combiner.
  All results go to a single Reducer, where you compute...
- 26. SVD
  SVD them quickly (they fit in memory),
  over and over again (as new data comes in).
  Use the most recent SVD to project your (already randomly projected) text still further (now encoding "semantic" similarity).
  SVD-projected vectors can be assigned immediately to nearest clusters if desired.
- 27. References
  Randomized matrix decomposition review: http://arxiv.org/abs/0909.4061
  Sparse hashing/projection: John Langford et al., "Vowpal Wabbit", http://hunch.net/~vw/
