Your SlideShare is downloading. ×
0
Building Mini-Google in Ruby


                                                                              Ilya Grigorik...
postrank.com/topic/ruby




                               The slides…                           Twitter                  ...
Ruby + Math
                                                                            PageRank
              Optimizatio...
PageRank                                       PageRank + Ruby




      Tools
        +                       Examples   ...
Consume with care…
                     everything that follows is based on released / public domain info




Building Min...
Search-engine graveyard
                                                                  Google did pretty well…




Buil...
Query: Ruby




                                                                                           Results




   ...
Query: Ruby




                                                                                                Results


...
CPU Speed                                     333Mhz
          RAM                                           32-64MB

    ...
Creating & Maintaining an Inverted Index
                                                                    DIY and the g...
require 'set'
                                                      {
                                                    ...
require 'set'
                                                      {
                                                    ...
require 'set'
                                                      {
                                                    ...
# query: quot;what is bananaquot;
 p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;]
 # > #<Set: {}>...
# query: quot;what is bananaquot;
 p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;]
 # > #<Set: {}>...
# query: quot;what is bananaquot;
 p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;]
 # > #<Set: {}>...
# query: quot;what is bananaquot;
 p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;]
 # > #<Set: {}>...
require 'set'

    pages = {
     quot;1quot; => quot;it is what it isquot;,
     quot;2quot; => quot;what is itquot;,
   ...
Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Ferret is a high-performance, full-featured text search engine library written for Ruby



Building Mini-Google in Ruby   ...
require 'ferret'
    include Ferret

    index = Index::Index.new()

    index << {:title => quot;1quot;, :content => quot...
require 'ferret'
    include Ferret

    index = Index::Index.new()

    index << {:title => quot;1quot;, :content => quot...
class Ferret::Analysis::Analyzer                                        class Ferret::Search::BooleanQuery
   class Ferret...
ferret.davebalmain.com/trac




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Ranking Results
                                                                      0-60 with PageRank…




Building Min...
index.search_each('content:quot;the brown cowquot;') do |id, score|
     puts quot;Score: #{score}, #{index[id][:title]} q...
index.search_each('content:quot;the brown cowquot;') do |id, score|
     puts quot;Score: #{score}, #{index[id][:title]} q...
3                         5                4
            the                4                         3                5
 ...
3                         5                  4
            the                 4                         3                ...
W1        W2   …           …             …        …       …             …     WN

         Doc 1       15        23   …
  ...
NArray is an Numerical N-dimensional Array class (implemented in C)



                                                   ...
NArray is an Numerical N-dimensional Array class (implemented in C)




                                                  ...
Links as votes




                                                                                        PageRank
      ...
P = 0.85



                               Follow link from page he/she is currently on.



                              ...
Follow link from page he/she is currently on.
                                                                            ...
On Page P, clicks on link to K
                                                                                         P ...
P = 0.05                                                      P = 0.20
                                                   ...
What is PageRank?
                                                                                             It’s a scal...
P = 0.05                                                      P = 0.20
                                                   ...
P = 0.05                                                      P = 0.20
                                                   ...
Teleportation?
                                                                                           sci-fi fans, … ?...
1. No in-links!                                                        3. Isolated Web



                                ...
•Breadth First Search
                               •Depth First Search
                               •A* Search
       ...
P(T) = 0.03
                                                                     P(T) = 0.15 / # of pages
        P(T) = 0...
Assume the web is N pages big
    Assume that probability of teleportation (t) is 0.15, and following link (s) is 0.85
   ...
Link Graph                                                 No link from 1 to N



                           1   2        ...
Links to…

                               {
                                   quot;1quot;      =         [25, 26],
      ...
Follow link from page he/she is currently on.
                                                                            ...
Don’t trust me! Verify it yourself!


                                                                                1
  ...
Enough hand-waving, dammit!
                                                                                  show me the ...
Hot, Fast, Awesome

                                                                  Birth of EM-Proxy
                  ...
http://rb-gsl.rubyforge.org/




                                                                        Hot, Fast, Awesom...
http://ruby-gsl.sourceforge.net/
                       Click there! … Give yourself a weekend.


Building Mini-Google in ...
require quot;gslquot;
   include GSL

   # INPUT: link structure matrix (NxN)
   # OUTPUT: pagerank scores
   def pagerank...
require quot;gslquot;
   include GSL

   # INPUT: link structure matrix (NxN)
   # OUTPUT: pagerank scores
   def pagerank...
require quot;gslquot;
   include GSL

   # INPUT: link structure matrix (NxN)
   # OUTPUT: pagerank scores
   def pagerank...
X
                 P = 0.33                                              P = 0.33
                               N


     ...
X
                 P = 0.05                                               P = 0.07
                               N


    ...
PageRank + Ferret
                                                                             awesome search, ftw!




Bu...
2
                               P = 0.05                                                          P = 0.07
              ...
index.search_each('content:quot;worldquot;') do |id, score|
 puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[...
index.search_each('content:quot;worldquot;') do |id, score|
 puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[...
index.search_each('content:quot;worldquot;') do |id, score|
 puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[...
Search*: Graphs are ubiquitous!
                                                 PageRank is a general purpose hammer




...
Username               GitCred
                                                                    =======================...
Hmm…




                                                                  Analyze the social graph:
                     ...
PageRank + Product Graph
                                                                                          E-comme...
PageRank = Powerful Hammer
                                                                                          use i...
Personalization
                                                                      how would you do it?




Building Mi...
0.15
                                        Teleportation distribution doesn’t
        =            ⋮                    ...
Make pages with links!




                                                                      Gaming PageRank
       ht...
Slides: http://bit.ly/railsconf-pagerank

    Ferret: http://bit.ly/ferret
    RB-GSL: http://bit.ly/rb-gsl

    PageRank ...
Upcoming SlideShare
Loading in...5
×

Building A Mini Google High Performance Computing In Ruby

1,954

Published on

Published in: Technology, News & Politics

Transcript of "Building A Mini Google High Performance Computing In Ruby"

  1. 1. Building Mini-Google in Ruby Ilya Grigorik @igrigorik Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  2. 2. postrank.com/topic/ruby The slides… Twitter My blog Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  3. 3. Ruby + Math PageRank Optimization Examples Indexing Misc Fun Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  4. 4. PageRank PageRank + Ruby Tools + Examples Indexing Optimization Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  5. 5. Consume with care… everything that follows is based on released / public domain info Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  6. 6. Search-engine graveyard Google did pretty well… Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  7. 7. Query: Ruby Results 1. Crawl 2. Index 3. Rank Search pipeline 50,000-foot view Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  8. 8. Query: Ruby Results 1. Crawl 2. Index 3. Rank Bah Interesting Fun Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  9. 9. CPU Speed 333Mhz RAM 32-64MB Index 27,000,000 documents Index refresh once a month~ish PageRank computation several days Laptop CPU 2.1Ghz VM RAM 1GB 1-Million page web ~10 minutes circa 1997-1998 Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  10. 10. Creating & Maintaining an Inverted Index DIY and the gotchas within Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  11. 11. require 'set' { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, pages = { quot;aquot;=>#<Set: {quot;3quot;}>, quot;1quot; => quot;it is what it isquot;, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;2quot; => quot;what is itquot;, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;3quot; => quot;it is a bananaquot; quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} } } index = {} pages.each do |page, content| content.split(/s/).each do |word| if index[word] index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  12. 12. require 'set' { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, pages = { quot;aquot;=>#<Set: {quot;3quot;}>, quot;1quot; => quot;it is what it isquot;, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;2quot; => quot;what is itquot;, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;3quot; => quot;it is a bananaquot; quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} } } index = {} pages.each do |page, content| content.split(/s/).each do |word| if index[word] index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  13. 13. require 'set' { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, pages = { quot;aquot;=>#<Set: {quot;3quot;}>, quot;1quot; => quot;it is what it isquot;, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;2quot; => quot;what is itquot;, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;3quot; => quot;it is a bananaquot; quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} } } index = {} pages.each do |page, content| Word => [Document] content.split(/s/).each do |word| if index[word] index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  14. 14. # query: quot;what is bananaquot; p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;] # > #<Set: {}> # query: quot;a bananaquot; p index[quot;aquot;] & index[quot;bananaquot;] # > #<Set: {quot;3quot;}> 1 3 2 # query: quot;what isquot; p index[quot;whatquot;] & index[quot;isquot;] # > #<Set: {quot;1quot;, quot;2quot;}> { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, quot;aquot;=>#<Set: {quot;3quot;}>, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} Querying the index } Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  15. 15. # query: quot;what is bananaquot; p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;] # > #<Set: {}> # query: quot;a bananaquot; p index[quot;aquot;] & index[quot;bananaquot;] # > #<Set: {quot;3quot;}> 1 3 2 # query: quot;what isquot; p index[quot;whatquot;] & index[quot;isquot;] # > #<Set: {quot;1quot;, quot;2quot;}> { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, quot;aquot;=>#<Set: {quot;3quot;}>, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} Querying the index } Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  16. 16. # query: quot;what is bananaquot; p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;] # > #<Set: {}> # query: quot;a bananaquot; p index[quot;aquot;] & index[quot;bananaquot;] # > #<Set: {quot;3quot;}> 1 3 2 # query: quot;what isquot; p index[quot;whatquot;] & index[quot;isquot;] # > #<Set: {quot;1quot;, quot;2quot;}> { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, quot;aquot;=>#<Set: {quot;3quot;}>, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} Querying the index } Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  17. 17. # query: quot;what is bananaquot; p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;] # > #<Set: {}> # query: quot;a bananaquot; p index[quot;aquot;] & index[quot;bananaquot;] # > #<Set: {quot;3quot;}> What order? # query: quot;what isquot; p index[quot;whatquot;] & index[quot;isquot;] [1, 2] or [2,1] # > #<Set: {quot;1quot;, quot;2quot;}> { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, quot;aquot;=>#<Set: {quot;3quot;}>, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} Querying the index } Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  18. 18. require 'set' pages = { quot;1quot; => quot;it is what it isquot;, quot;2quot; => quot;what is itquot;, quot;3quot; => quot;it is a bananaquot; } PDF, HTML, RSS? index = {} Lowercase / Upcase? pages.each do |page, content| Compact Index? Hmmm? content.split(/s/).each do |word| Stop words? if index[word] Persistence? index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  19. 19. Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  20. 20. Ferret is a high-performance, full-featured text search engine library written for Ruby Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  21. 21. require 'ferret' include Ferret index = Index::Index.new() index << {:title => quot;1quot;, :content => quot;it is what it isquot;} index << {:title => quot;2quot;, :content => quot;what is itquot;} index << {:title => quot;3quot;, :content => quot;it is a bananaquot;} index.search_each('content:quot;bananaquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} quot; end > Score: 1.0, 3 Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  22. 22. require 'ferret' include Ferret index = Index::Index.new() index << {:title => quot;1quot;, :content => quot;it is what it isquot;} index << {:title => quot;2quot;, :content => quot;what is itquot;} index << {:title => quot;3quot;, :content => quot;it is a bananaquot;} index.search_each('content:quot;bananaquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} quot; end > Score: 1.0, 3 Hmmm? Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  23. 23. class Ferret::Analysis::Analyzer class Ferret::Search::BooleanQuery class Ferret::Analysis::AsciiLetterAnalyzer class Ferret::Search::ConstantScoreQuery class Ferret::Analysis::AsciiLetterTokenizer class Ferret::Search::Explanation class Ferret::Analysis::AsciiLowerCaseFilter class Ferret::Search::Filter class Ferret::Analysis::AsciiStandardAnalyzer class Ferret::Search::FilteredQuery class Ferret::Analysis::AsciiStandardTokenizer class Ferret::Search::FuzzyQuery class Ferret::Analysis::AsciiWhiteSpaceAnalyzer class Ferret::Search::Hit class Ferret::Analysis::AsciiWhiteSpaceTokenizer class Ferret::Search::MatchAllQuery class Ferret::Analysis::HyphenFilter class Ferret::Search::MultiSearcher class Ferret::Analysis::LetterAnalyzer class Ferret::Search::MultiTermQuery class Ferret::Analysis::LetterTokenizer class Ferret::Search::PhraseQuery class Ferret::Analysis::LowerCaseFilter class Ferret::Search::PrefixQuery class Ferret::Analysis::MappingFilter class Ferret::Search::Query class Ferret::Analysis::PerFieldAnalyzer class Ferret::Search::QueryFilter class Ferret::Analysis::RegExpAnalyzer class Ferret::Search::RangeFilter class Ferret::Analysis::RegExpTokenizer class Ferret::Search::RangeQuery class Ferret::Analysis::StandardAnalyzer class Ferret::Search::Searcher class Ferret::Analysis::StandardTokenizer class Ferret::Search::Sort class Ferret::Analysis::StemFilter class Ferret::Search::SortField class Ferret::Analysis::StopFilter class Ferret::Search::TermQuery class Ferret::Analysis::Token class Ferret::Search::TopDocs class Ferret::Analysis::TokenStream class Ferret::Search::TypedRangeFilter class Ferret::Analysis::WhiteSpaceAnalyzer class Ferret::Search::TypedRangeQuery class Ferret::Search::WildcardQuery class Ferret::Analysis::WhiteSpaceTokenizer Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  24. 24. ferret.davebalmain.com/trac Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  25. 25. Ranking Results 0-60 with PageRank… Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  26. 26. index.search_each('content:quot;the brown cowquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} quot; end > Score: 0.827, 3 > Score: 0.523, 5 Relevance? > Score: 0.125, 4 3 5 4 the 4 3 5 brown 1 3 1 cow 1 4 1 Score 6 10 7 Naïve: Term Frequency Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  27. 27. index.search_each('content:quot;the brown cowquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} quot; end > Score: 0.827, 3 > Score: 0.523, 5 > Score: 0.125, 4 3 5 4 the 4 3 5 Skew brown 1 3 1 cow 1 4 1 Score 6 10 7 Naïve: Term Frequency Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  28. 28. 3 5 4 the 4 3 5 Skew brown 1 3 1 cow 1 4 1 # of docs Score = TF * IDF the 6 TF = # occurrences / # words brown 3 IDF = # docs / # docs with W cow 4 Total # of documents: 10 TF-IDF Term Frequency * Inverse Document Frequency Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  29. 29. 3 5 4 the 4 3 5 brown 1 3 1 cow 1 4 1 Doc # 3 score for ‘the’: # of docs 4/10 * ln(10/6) = 0.204 the 6 Doc # 3 score for ‘brown’: brown 3 1/10 * ln(10/3) = 0.120 cow 4 Doc # 3 score for ‘cow’: 1/10 * ln(10/4) = 0.092 Total # of documents: 10 # words in document: 10 TF-IDF Score = 0.204 + 0.120 + 0.092 = 0.416 Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  30. 30. W1 W2 … … … … … … WN Doc 1 15 23 … Doc 2 24 12 … … … … … … Doc K Size = N * K * size of Ruby object Ouch. Pages = N = 10,000 Words = K = 2,000 Ruby Object = 20+ bytes Frequency Matrix Footprint = 384 MB Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  31. 31. NArray is an Numerical N-dimensional Array class (implemented in C) # create new NArray. initialize with 0. NArray.new(typecode, size, ...) # 1 byte unsigned integer NArray.byte(size,...) # 2 byte signed integer NArray.sint(size,...) # 4 byte signed integer NArray.int(size,...) # single precision float NArray.sfloat(size,...) # double precision float NArray.float(size,...) # single precision complex NArray.scomplex(size,...) # double precision complex NArray.complex(size,...) # Ruby object NArray.object(size,...) NArray http://narray.rubyforge.org/ Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  32. 32. NArray is an Numerical N-dimensional Array class (implemented in C) NArray http://narray.rubyforge.org/ Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  33. 33. Links as votes PageRank the google juice Problem: link gaming Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  34. 34. P = 0.85 Follow link from page he/she is currently on. Teleport to a random location on the web. P = 0.15 Random Surfer powerful abstraction Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  35. 35. Follow link from page he/she is currently on. Page K Teleport to a random location on the web. Page N Page M Surfin’ rinse & repeat, ad naseum Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  36. 36. On Page P, clicks on link to K P = 0.85 On Page K clicks on link to M P = 0.85 On Page M teleports to X P = 0.15 Surfin’ … rinse & repeat, ad naseum Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  37. 37. P = 0.05 P = 0.20 X N P = 0.15 M K P = 0.6 Analyzing the Web Graph extracting PageRank Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  38. 38. What is PageRank? It’s a scalar! Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  39. 39. P = 0.05 P = 0.20 X N P = 0.15 M K P = 0.6 What is PageRank? it’s a probability! Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  40. 40. P = 0.05 P = 0.20 X N P = 0.15 M K P = 0.6 What is PageRank? Higher Pr, Higher Importance? it’s a probability! Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  41. 41. Teleportation? sci-fi fans, … ? Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  42. 42. 1. No in-links! 3. Isolated Web X N K 2. No out-links! M M Reasons for teleportation enumerating edge cases Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  43. 43. •Breadth First Search •Depth First Search •A* Search •Lexicographic Search •Dijkstra’s Algorithm •Floyd-Warshall •Triangulation and Comparability detection require 'gratr/import' dg = Digraph[1,2, 2,3, 2,4, 4,5, 6,4, 1,6] dg.directed? # true dg.vertex?(4) # true dg.edge?(2,4) # true dg.vertices # [5, 6, 1, 2, 3, 4] Exploring Graphs Graph[1,2,1,3,1,4,2,5].bfs # [1, 2, 3, 4, 5] gratr.rubyforge.com Graph[1,2,1,3,1,4,2,5].dfs # [1, 2, 5, 3, 4] Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  44. 44. P(T) = 0.03 P(T) = 0.15 / # of pages P(T) = 0.03 P(T) = 0.03 X N K P(T) = 0.03 M P(T) = 0.03 M P(T) = 0.03 Teleportation probabilities Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  45. 45. Assume the web is N pages big Assume that probability of teleportation (t) is 0.15, and following link (s) is 0.85 Assume that teleportation probability (E) is uniform Assume that you start on any random page (uniform distribution L), then 0.15 = = ⋮ 0.15 Then after one step, the probability your on page X is: ∗ + ∗ (0.85 ∗ + 0.15 ∗ ) PageRank: Simplified Mathematical Def’n cause that’s how we roll Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  46. 46. Link Graph No link from 1 to N 1 2 … … N 1 1 0 … … 0 2 0 1 … … 1 … … … … … … … … … … … … N 0 1 … … 1 G = The Link Graph Huge! ginormous and sparse Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  47. 47. Links to… { quot;1quot; = [25, 26], quot;2quot; = [1], Page quot;5quot; = [123,2], quot;6quot; = [67, 1] } G as a dictionary more compact… Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  48. 48. Follow link from page he/she is currently on. Page K Teleport to a random location on the web. Computing PageRank the tedious way Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  49. 49. Don’t trust me! Verify it yourself! 1 −1 ⋮ = − = Identity matrix Computing PageRank in one swoop Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  50. 50. Enough hand-waving, dammit! show me the code Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  51. 51. Hot, Fast, Awesome Birth of EM-Proxy flash of the obvious Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  52. 52. http://rb-gsl.rubyforge.org/ Hot, Fast, Awesome Click there! … Give yourself a weekend. Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  53. 53. http://ruby-gsl.sourceforge.net/ Click there! … Give yourself a weekend. Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  54. 54. require quot;gslquot; include GSL # INPUT: link structure matrix (NxN) # OUTPUT: pagerank scores def pagerank(g) Verify NxN raise if g.size1 != g.size2 i = Matrix.I(g.size1) # identity matrix p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector s = 0.85 # probability of following a link t = 1-s # probability of teleportation t*((i-s*g).invert)*p end PageRank in Ruby 6 lines, or less Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  55. 55. require quot;gslquot; include GSL # INPUT: link structure matrix (NxN) # OUTPUT: pagerank scores def pagerank(g) Constants… raise if g.size1 != g.size2 i = Matrix.I(g.size1) # identity matrix p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector s = 0.85 # probability of following a link t = 1-s # probability of teleportation t*((i-s*g).invert)*p end PageRank in Ruby 6 lines, or less Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  56. 56. require quot;gslquot; include GSL # INPUT: link structure matrix (NxN) # OUTPUT: pagerank scores def pagerank(g) raise if g.size1 != g.size2 i = Matrix.I(g.size1) # identity matrix p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector s = 0.85 # probability of following a link t = 1-s # probability of teleportation t*((i-s*g).invert)*p end PageRank in Ruby PageRank! 6 lines, or less Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  57. 57. X P = 0.33 P = 0.33 N P = 0.33 K pagerank(Matrix[[0,0,1], [0,0,1], [1,0,0]]) [0.33, 0.33, 0.33] Ex: Circular Web testing intuition… Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  58. 58. X P = 0.05 P = 0.07 N P = 0.87 K pagerank(Matrix[[0,0,0], [0.5,0,0], [0.5,1,1]]) [0.05, 0.07, 0.87] Ex: All roads lead to K testing intuition… Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  59. 59. PageRank + Ferret awesome search, ftw! Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  60. 60. 2 P = 0.05 P = 0.07 1 require 'ferret' P = 0.87 3 include Ferret index = Index::Index.new() index {:title = quot;1quot;, :content = quot;it is what it isquot;, :pr = 0.05 } index {:title = quot;2quot;, :content = quot;what is itquot;, :pr = 0.07 } index {:title = quot;3quot;, :content = quot;it is a bananaquot;, :pr = 0.87 } Store PageRank Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  61. 61. index.search_each('content:quot;worldquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})quot; end TF-IDF Search puts quot;*quot; * 50 sf_pr = Search::SortField.new(:pr, :type = :float, :reverse = true) index.search_each('content:quot;worldquot;', :sort = sf_pr) do |id, score| puts quot;Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})quot; end # Score: 0.267119228839874, 3 (PR: 0.87) # Score: 0.17807948589325, 1 (PR: 0.05) # Score: 0.17807948589325, 2 (PR: 0.07) # *********************************** # Score: 0.267119228839874, 3, (PR: 0.87) # Score: 0.17807948589325, 2, (PR: 0.07) # Score: 0.17807948589325, 1, (PR: 0.05) Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  62. 62. index.search_each('content:quot;worldquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})quot; end PageRank FTW! puts quot;*quot; * 50 sf_pr = Search::SortField.new(:pr, :type = :float, :reverse = true) index.search_each('content:quot;worldquot;', :sort = sf_pr) do |id, score| puts quot;Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})quot; end # Score: 0.267119228839874, 3 (PR: 0.87) # Score: 0.17807948589325, 1 (PR: 0.05) # Score: 0.17807948589325, 2 (PR: 0.07) # *********************************** # Score: 0.267119228839874, 3, (PR: 0.87) # Score: 0.17807948589325, 2, (PR: 0.07) # Score: 0.17807948589325, 1, (PR: 0.05) Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  63. 63. index.search_each('content:quot;worldquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})quot; end puts quot;*quot; * 50 sf_pr = Search::SortField.new(:pr, :type = :float, :reverse = true) index.search_each('content:quot;worldquot;', :sort = sf_pr) do |id, score| puts quot;Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})quot; end # Score: 0.267119228839874, 3 (PR: 0.87) # Score: 0.17807948589325, 1 (PR: 0.05) Others # Score: 0.17807948589325, 2 (PR: 0.07) # *********************************** # Score: 0.267119228839874, 3, (PR: 0.87) # Score: 0.17807948589325, 2, (PR: 0.07) Google # Score: 0.17807948589325, 1, (PR: 0.05) Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  64. 64. Search*: Graphs are ubiquitous! PageRank is a general purpose hammer Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  65. 65. Username GitCred ============================== 37signals 10.00 imbriaco 9.76 why 8.74 rails 8.56 defunkt 8.17 technoweenie 7.83 jeresig 7.60 mojombo 7.51 yui 7.34 drnic 7.34 pjhyett 6.91 wycats 6.85 dhh 6.84 http://bit.ly/3YQPU PageRank + Social Graph GitHub Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  66. 66. Hmm… Analyze the social graph: - Filter messages by ‘TwitterRank’ - Suggest users by ‘TwitterRank’ -… PageRank + Social Graph Twitter Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  67. 67. PageRank + Product Graph E-commerce Link items purchased in same cart… Run PR on it. Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  68. 68. PageRank = Powerful Hammer use it! Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  69. 69. Personalization how would you do it? Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  70. 70. 0.15 Teleportation distribution doesn’t = ⋮ have to be uniform! 0.15 yahoo.com is my homepage! PageRank + Personalization customize the teleportation vector Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  71. 71. Make pages with links! Gaming PageRank http://bit.ly/pagerank-spam for fun and profit (I don’t endorse it) Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  72. 72. Slides: http://bit.ly/railsconf-pagerank Ferret: http://bit.ly/ferret RB-GSL: http://bit.ly/rb-gsl PageRank on Wikipedia: http://bit.ly/wp-pagerank Gaming PageRank: http://bit.ly/pagerank-spam Michael Nielsen’s lectures on PageRank: http://michaelnielsen.org/blog Questions? The slides… Twitter My blog Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×