Building Mini‐Google in Ruby 


                                                                               Ilya Grigor...
Building A Mini Google High Performance Computing In Ruby Presentation 1

  1. Building Mini-Google in Ruby. Ilya Grigorik, @igrigorik. http://bit.ly/railsconf-pagerank #railsconf
  2. postrank.com/topic/ruby. The slides… Twitter. My blog.
  3. Ruby + Math: PageRank, Optimization, Misc Fun, Examples, Indexing
  4. PageRank, PageRank + Ruby, Tools + Examples, Indexing, Optimization
  5. Consume with care… everything that follows is based on released / public domain info
  6. Search-engine graveyard: Google did pretty well…
  7. Search pipeline, 50,000-foot view. Query: Ruby. 1. Crawl, 2. Index, 3. Rank. Results.
  8. Query: Ruby. 1. Crawl (Bah), 2. Index (Interesting), 3. Rank (Fun). Results.
  9. circa 1997-1998:

        CPU Speed             333 MHz
        RAM                   32-64 MB
        Index                 27,000,000 documents
        Index refresh         once a month~ish
        PageRank computation  several days

        Laptop CPU            2.1 GHz
        VM RAM                1 GB
        1-million page web    ~10 minutes

  10. Creating & Maintaining an Inverted Index: DIY and the gotchas within
  11-13. Building an Inverted Index (Word => [Document])

        require 'set'

        pages = {
          "1" => "it is what it is",
          "2" => "what is it",
          "3" => "it is a banana"
        }

        index = {}

        pages.each do |page, content|
          # split on whitespace (the slide text lost the backslash in /\s/)
          content.split(/\s/).each do |word|
            if index[word]
              index[word] << page
            else
              # wrap the page id in an array: Set.new expects an enumerable
              index[word] = Set.new([page])
            end
          end
        end

      Resulting index:

        {
          "it"     => #<Set: {"1", "2", "3"}>,
          "a"      => #<Set: {"3"}>,
          "banana" => #<Set: {"3"}>,
          "what"   => #<Set: {"1", "2"}>,
          "is"     => #<Set: {"1", "2", "3"}>
        }
  14-17. Querying the index (pages 1, 2, 3)

        # query: "what is banana"
        p index["what"] & index["is"] & index["banana"]
        # > #<Set: {}>

        # query: "a banana"
        p index["a"] & index["banana"]
        # > #<Set: {"3"}>

        # query: "what is"
        p index["what"] & index["is"]
        # > #<Set: {"1", "2"}>

      What order? [1, 2] or [2, 1]
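The three queries above all follow one pattern: look up each word and intersect the resulting sets. A minimal sketch of that pattern follows; the `search` helper is a hypothetical name, not something from the talk, and it returns matches unranked, which is exactly the "what order?" problem the slide raises.

```ruby
require 'set'

# Hypothetical helper: return the set of pages that contain every query word.
# Words missing from the index contribute an empty set, so the result is empty.
def search(index, query)
  query.split.map { |word| index.fetch(word, Set.new) }
       .reduce { |hits, pages| hits & pages } || Set.new
end

# The index built on the earlier slides, written out literally.
index = {
  "it"     => Set["1", "2", "3"],
  "a"      => Set["3"],
  "banana" => Set["3"],
  "what"   => Set["1", "2"],
  "is"     => Set["1", "2", "3"]
}

p search(index, "what is banana")
p search(index, "a banana")
p search(index, "what is")
p search(index, "purple banana")
```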
  18. Building an Inverted Index: the same code as slide 11, with the open questions called out. PDF, HTML, RSS? Lowercase / Upcase? Compact Index? Stop words? Persistence? Hmmm?
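Two of those gotchas, case and stop words, can be handled in a few lines. The sketch below is one possible approach, not the talk's: `STOP_WORDS`, `tokenize`, and `build_index` are hypothetical names, the stop-word list is a toy, and document formats, compaction, and persistence are left untouched.

```ruby
require 'set'

# Assumption: a tiny hand-picked stop-word list; real lists are much longer.
STOP_WORDS = Set["a", "is", "it"].freeze

# Lowercase the text, keep only alphanumeric runs, and drop stop words.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/).reject { |word| STOP_WORDS.include?(word) }
end

# Build word => Set of page ids, creating each set on first use.
def build_index(pages)
  index = Hash.new { |hash, word| hash[word] = Set.new }
  pages.each do |page, content|
    tokenize(content).each { |word| index[word] << page }
  end
  index
end

pages = {
  "1" => "It is what it is",
  "2" => "What is it?",
  "3" => "It is a BANANA!"
}

index = build_index(pages)
p index.keys.sort
p index["what"]
p index["banana"]
```

With stop words removed, only "banana" and "what" remain as index terms, which also shrinks the index.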
  20. Ferret is a high-performance, full-featured text search engine library written for Ruby
  21-22.

        require 'ferret'
        include Ferret

        index = Index::Index.new()

        index << {:title => "1", :content => "it is what it is"}
        index << {:title => "2", :content => "what is it"}
        index << {:title => "3", :content => "it is a banana"}

        index.search_each('content:"banana"') do |id, score|
          puts "Score: #{score}, #{index[id][:title]} "
        end

        > Score: 1.0, 3

      Hmmm?
  23. Ferret classes:

        class Ferret::Analysis::Analyzer
        class Ferret::Analysis::AsciiLetterAnalyzer
        class Ferret::Analysis::AsciiLetterTokenizer
        class Ferret::Analysis::AsciiLowerCaseFilter
        class Ferret::Analysis::AsciiStandardAnalyzer
        class Ferret::Analysis::AsciiStandardTokenizer
        class Ferret::Analysis::AsciiWhiteSpaceAnalyzer
        class Ferret::Analysis::AsciiWhiteSpaceTokenizer
        class Ferret::Analysis::HyphenFilter
        class Ferret::Analysis::LetterAnalyzer
        class Ferret::Analysis::LetterTokenizer
        class Ferret::Analysis::LowerCaseFilter
        class Ferret::Analysis::MappingFilter
        class Ferret::Analysis::PerFieldAnalyzer
        class Ferret::Analysis::RegExpAnalyzer
        class Ferret::Analysis::RegExpTokenizer
        class Ferret::Analysis::StandardAnalyzer
        class Ferret::Analysis::StandardTokenizer
        class Ferret::Analysis::StemFilter
        class Ferret::Analysis::StopFilter
        class Ferret::Analysis::Token
        class Ferret::Analysis::TokenStream
        class Ferret::Analysis::WhiteSpaceAnalyzer
        class Ferret::Analysis::WhiteSpaceTokenizer

        class Ferret::Search::BooleanQuery
        class Ferret::Search::ConstantScoreQuery
        class Ferret::Search::Explanation
        class Ferret::Search::Filter
        class Ferret::Search::FilteredQuery
        class Ferret::Search::FuzzyQuery
        class Ferret::Search::Hit
        class Ferret::Search::MatchAllQuery
        class Ferret::Search::MultiSearcher
        class Ferret::Search::MultiTermQuery
        class Ferret::Search::PhraseQuery
        class Ferret::Search::PrefixQuery
        class Ferret::Search::Query
        class Ferret::Search::QueryFilter
        class Ferret::Search::RangeFilter
        class Ferret::Search::RangeQuery
        class Ferret::Search::Searcher
        class Ferret::Search::Sort
        class Ferret::Search::SortField
        class Ferret::Search::TermQuery
        class Ferret::Search::TopDocs
        class Ferret::Search::TypedRangeFilter
        class Ferret::Search::TypedRangeQuery
        class Ferret::Search::WildcardQuery
  24. ferret.davebalmain.com/trac
  25. Ranking Results: 0-60 with PageRank…
  26-27. Naïve: Term Frequency

        index.search_each('content:"the brown cow"') do |id, score|
          puts "Score: #{score}, #{index[id][:title]} "
        end

        > Score: 0.827, 3
        > Score: 0.523, 5
        > Score: 0.125, 4

      Relevance? Raw term counts per document:

                   Doc 3   Doc 5   Doc 4
        the          4       3       5     (skew)
        brown        1       3       1
        cow          1       4       1
        Score        6      10       7
  28. TF-IDF: Term Frequency * Inverse Document Frequency

                   Doc 3   Doc 5   Doc 4   # of docs containing the term
        the          4       3       5       6     (skew)
        brown        1       3       1       3
        cow          1       4       1       4

      Total # of documents: 10

        Score = TF * IDF
        TF    = # occurrences / # words
        IDF   = ln(# docs / # docs with W)

  29. TF-IDF, worked for Doc # 3 (total # of documents: 10, # words in document: 10)

        Doc # 3 score for 'the':   4/10 * ln(10/6) = 0.204
        Doc # 3 score for 'brown': 1/10 * ln(10/3) = 0.120
        Doc # 3 score for 'cow':   1/10 * ln(10/4) = 0.092

        Score = 0.204 + 0.120 + 0.092 = 0.416
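The arithmetic on this slide can be checked with a short script. This is a sketch using the slide's numbers; the constant and method names are made up for illustration, and the natural logarithm is used because that is what the worked figures imply.

```ruby
# Counts taken from the slide: occurrences of each term in document 3,
# and how many of the 10 documents contain each term.
TERM_COUNTS_DOC3 = { "the" => 4, "brown" => 1, "cow" => 1 }
DOCS_WITH_TERM   = { "the" => 6, "brown" => 3, "cow" => 4 }
TOTAL_DOCS   = 10
WORDS_IN_DOC = 10

# TF = occurrences / words in document; IDF = ln(total docs / docs with term).
def tf_idf(occurrences, words_in_doc, total_docs, docs_with_term)
  tf  = occurrences.to_f / words_in_doc
  idf = Math.log(total_docs.to_f / docs_with_term)
  tf * idf
end

scores = TERM_COUNTS_DOC3.map { |term, count|
  [term, tf_idf(count, WORDS_IN_DOC, TOTAL_DOCS, DOCS_WITH_TERM[term])]
}.to_h

scores.each { |term, score| puts format("%-6s %.3f", term, score) }
total = scores.values.sum
puts format("total  %.3f", total)
```

Note how "the" still contributes the largest share even after the IDF damping, because it appears four times in a ten-word document.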
  30. Frequency Matrix

                  W1    W2    …    WN
        Doc 1     15    23    …
        Doc 2     24    12    …
        …         …     …     …
        Doc K

      Size = N * K * size of Ruby object. Ouch.

        Pages = N = 10,000
        Words = K = 2,000
        Ruby Object = 20+ bytes
        Footprint = 384 MB

  31-32. NArray (http://narray.rubyforge.org/): NArray is a Numerical N-dimensional Array class (implemented in C)

        # create new NArray. initialize with 0.
        NArray.new(typecode, size, ...)

        # 1 byte unsigned integer
        NArray.byte(size,...)
        # 2 byte signed integer
        NArray.sint(size,...)
        # 4 byte signed integer
        NArray.int(size,...)
        # single precision float
        NArray.sfloat(size,...)
        # double precision float
        NArray.float(size,...)
        # single precision complex
        NArray.scomplex(size,...)
        # double precision complex
        NArray.complex(size,...)
        # Ruby object
        NArray.object(size,...)
  33. PageRank: the google juice. Links as votes. Problem: link gaming.
  34. Random Surfer: powerful abstraction
        P = 0.85: Follow link from page he/she is currently on.
        P = 0.15: Teleport to a random location on the web.
  35. Surfin': rinse & repeat, ad nauseam. Follow link from page he/she is currently on (Page K), or teleport to a random location on the web (Page N, Page M).
  36. Surfin': rinse & repeat, ad nauseam
        On Page P, clicks on link to K    P = 0.85
        On Page K, clicks on link to M    P = 0.85
        On Page M, teleports to X         P = 0.15
        …
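The random-surfer walk can be simulated directly: follow a link with probability 0.85, otherwise teleport, and count where the surfer lands. The sketch below is an illustration, not code from the talk. The three-page web, the `surf` name, and the fixed seed are assumptions, and a page with no out-links is treated as a forced teleport.

```ruby
# Hypothetical three-page web (X, N, K); each page maps to the pages it links to.
LINKS = {
  "X" => ["N", "K"],
  "N" => ["K"],
  "K" => ["K"]
}

# Walk the graph for `steps` steps and return the share of visits per page.
def surf(links, steps, follow_probability = 0.85, rng = Random.new(42))
  pages  = links.keys
  visits = Hash.new(0)
  page   = pages.sample(random: rng)

  steps.times do
    out_links = links[page]
    page = if rng.rand < follow_probability && !out_links.empty?
             out_links.sample(random: rng)   # follow a link on the current page
           else
             pages.sample(random: rng)       # teleport to a random page
           end
    visits[page] += 1
  end

  visits.transform_values { |count| count.to_f / steps }
end

frequencies = surf(LINKS, 100_000)
frequencies.sort.each { |page, share| puts format("%s %.2f", page, share) }
```

The visit shares are estimates of each page's PageRank; they get closer to the exact values as the number of steps grows.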
  37. Analyzing the Web Graph: extracting PageRank. X (P = 0.05), N (P = 0.20), K (P = 0.15), M (P = 0.6)
  38. What is PageRank? It's a scalar!
  39-40. What is PageRank? It's a probability! X (P = 0.05), N (P = 0.20), K (P = 0.15), M (P = 0.6). Higher Pr, Higher Importance?
  41. Teleportation? sci-fi fans, … ?
  42. Reasons for teleportation: enumerating edge cases
        1. No in-links!
        2. No out-links!
        3. Isolated Web
  43. Exploring Graphs (gratr.rubyforge.com)
        • Breadth First Search
        • Depth First Search
        • A* Search
        • Lexicographic Search
        • Dijkstra's Algorithm
        • Floyd-Warshall
        • Triangulation and Comparability detection

        require 'gratr/import'

        dg = Digraph[1,2, 2,3, 2,4, 4,5, 6,4, 1,6]

        dg.directed?   # true
        dg.vertex?(4)  # true
        dg.edge?(2,4)  # true
        dg.vertices    # [5, 6, 1, 2, 3, 4]

        Graph[1,2,1,3,1,4,2,5].bfs # [1, 2, 3, 4, 5]
        Graph[1,2,1,3,1,4,2,5].dfs # [1, 2, 5, 3, 4]
  44. Teleportation probabilities: P(T) = 0.15 / # of pages, so each of the five pages on the slide gets P(T) = 0.03
  45. PageRank: Simplified Mathematical Def'n (cause that's how we roll)
        Assume the web is N pages big
        Assume that probability of teleportation (t) is 0.15, and following link (s) is 0.85
        Assume that teleportation probability (E) is uniform
        Assume that you start on any random page (uniform distribution L)
        Then after one step, the probability you're on page X is: [equation shown as an image on the slide]
  46. G = The Link Graph: ginormous and sparse. Huge!

                1    2    …    …    N
          1     1    0    …    …    0      (0 = no link from 1 to N)
          2     0    1    …    …    1
          …     …    …    …    …    …
          …     …    …    …    …    …
          N     0    1    …    …    1

  47. G as a dictionary: more compact… (Page => Links to…)

        {
          "1" => [25, 26],
          "2" => [1],
          "5" => [123, 2],
          "6" => [67, 1]
        }
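The dictionary form can be expanded back into a matrix when an algorithm needs one. The sketch below is an illustration under stated assumptions: pages are numbered 1..N, the `link_matrix` name is hypothetical, and the orientation (entry `g[to][from]`, each column summing to 1) is chosen to match what the "All roads lead to K" example later in the deck appears to use.

```ruby
# Hypothetical four-page web stored as "page => pages it links to".
links = {
  1 => [2, 3],
  2 => [3],
  3 => [1],
  4 => [1, 3]
}

# Build a dense N x N array where g[to][from] is the probability of
# following a link from `from` to `to`. Assumes pages are numbered 1..N
# and that every link target is itself a key of the hash.
def link_matrix(links)
  n = links.keys.max
  g = Array.new(n) { Array.new(n, 0.0) }
  links.each do |from, targets|
    targets.each { |to| g[to - 1][from - 1] = 1.0 / targets.size }
  end
  g
end

g = link_matrix(links)
g.each { |row| puts row.map { |v| format("%.2f", v) }.join(" ") }
```

For a real web graph the dense form is mostly zeros, which is why the dictionary (or another sparse format) is the practical storage choice.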
  48. Computing PageRank the tedious way: follow link from page he/she is currently on (Page K), or teleport to a random location on the web.
  49. Computing PageRank in one swoop. Don't trust me! Verify it yourself! (The slide's equation uses the identity matrix.)
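The equations on these slides are images and did not survive in the transcript. The derivation below is a reconstruction, not the author's own notation: it assumes $r$ is the PageRank vector, $G$ the link matrix with columns summing to 1, $E$ the uniform teleportation vector, $s = 0.85$ and $t = 0.15$, and it is consistent with the expression `t*((i-s*g).invert)*p` used in the Ruby code later in the deck.

```latex
\begin{aligned}
r &= s\,G\,r + t\,E
  && \text{(one more step of surfing leaves the distribution unchanged)} \\
(I - s\,G)\,r &= t\,E
  && \text{(collect the terms in } r\text{)} \\
r &= t\,(I - s\,G)^{-1}E
  && \text{(invert; } I \text{ is the identity matrix)}
\end{aligned}
```

The "tedious way" corresponds to applying the first line repeatedly until $r$ stops changing; the "one swoop" is the last line.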
  50. Enough hand-waving, dammit! show me the code
  51. Hot, Fast, Awesome. Birth of EM-Proxy: flash of the obvious
  52. http://rb-gsl.rubyforge.org/ Hot, Fast, Awesome. Click there! … Give yourself a weekend.
  53. http://ruby-gsl.sourceforge.net/ Click there! … Give yourself a weekend.
  54-56. PageRank in Ruby: 6 lines, or less

        require "gsl"
        include GSL

        # INPUT: link structure matrix (NxN)
        # OUTPUT: pagerank scores
        def pagerank(g)
          raise if g.size1 != g.size2                 # verify NxN

          # constants…
          i = Matrix.I(g.size1)                       # identity matrix
          p = (1.0/g.size1) * Matrix.ones(g.size1,1)  # teleportation vector

          s = 0.85                                    # probability of following a link
          t = 1-s                                     # probability of teleportation

          t*((i-s*g).invert)*p                        # PageRank!
        end
  57. Ex: Circular Web (testing intuition…). X (P = 0.33), N (P = 0.33), K (P = 0.33)

        pagerank(Matrix[[0,0,1], [0,0,1], [1,0,0]])
        > [0.33, 0.33, 0.33]

  58. Ex: All roads lead to K (testing intuition…). X (P = 0.05), N (P = 0.07), K (P = 0.87)

        pagerank(Matrix[[0,0,0], [0.5,0,0], [0.5,1,1]])
        > [0.05, 0.07, 0.87]
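For readers without GSL installed, the same scores can be approximated in plain Ruby by power iteration. This swaps the matrix inversion for a different technique: repeatedly applying r <- s*G*r + t*p until r stops changing. The sketch below uses a hypothetical `pagerank_iterative` name and assumes `g[to][from]` holds the probability of following a link, as in the "All roads lead to K" matrix above.

```ruby
# Power iteration: repeatedly apply r <- s*G*r + t*p until r stops changing.
# g is an N x N array of arrays with g[to][from] = P(follow link from -> to).
def pagerank_iterative(g, s = 0.85, tolerance = 1e-10, max_iterations = 1_000)
  n = g.size
  t = 1.0 - s
  teleport = Array.new(n, 1.0 / n)   # uniform teleportation vector
  r = teleport.dup                   # start from the uniform distribution

  max_iterations.times do
    next_r = Array.new(n) do |i|
      s * (0...n).sum { |j| g[i][j] * r[j] } + t * teleport[i]
    end
    delta = next_r.zip(r).sum { |a, b| (a - b).abs }
    r = next_r
    break if delta < tolerance
  end
  r
end

# The "All roads lead to K" web from the slide.
ranks = pagerank_iterative([[0, 0, 0], [0.5, 0, 0], [0.5, 1, 1]])
puts ranks.map { |v| v.round(3) }.inspect
```

The result should land close to the slide's [0.05, 0.07, 0.87]; iteration avoids inverting an N x N matrix, which matters once N is large.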
  59. PageRank + Ferret: awesome search, ftw!
  60. Store PageRank. Page 1 (P = 0.05), page 2 (P = 0.07), page 3 (P = 0.87)

        require 'ferret'
        include Ferret

        index = Index::Index.new()

        index << {:title => "1", :content => "it is what it is", :pr => 0.05 }
        index << {:title => "2", :content => "what is it", :pr => 0.07 }
        index << {:title => "3", :content => "it is a banana", :pr => 0.87 }
  61-63. TF-IDF search versus PageRank-sorted search. PageRank FTW!

        # TF-IDF search
        index.search_each('content:"world"') do |id, score|
          puts "Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})"
        end

        puts "*" * 50

        # same search, sorted by stored PageRank, highest first
        sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true)
        index.search_each('content:"world"', :sort => sf_pr) do |id, score|
          puts "Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})"
        end

        # Others:
        # Score: 0.267119228839874, 3 (PR: 0.87)
        # Score: 0.17807948589325, 1 (PR: 0.05)
        # Score: 0.17807948589325, 2 (PR: 0.07)
        # ***********************************
        # Google:
        # Score: 0.267119228839874, 3, (PR: 0.87)
        # Score: 0.17807948589325, 2, (PR: 0.07)
        # Score: 0.17807948589325, 1, (PR: 0.05)
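The effect of the sort can be reproduced without Ferret by ordering plain hashes. The sketch below copies the scores printed on the slide; the `hits` structure and the tie-breaking blend are assumptions for illustration, not how Ferret or Google combine the two signals.

```ruby
# Hits as the slide prints them: the text score plus the stored PageRank.
hits = [
  { title: "3", score: 0.267119228839874, pr: 0.87 },
  { title: "1", score: 0.17807948589325,  pr: 0.05 },
  { title: "2", score: 0.17807948589325,  pr: 0.07 }
]

# Order by PageRank only, highest first.
by_pagerank = hits.sort_by { |hit| -hit[:pr] }

# Assumption: one of many possible blends. Text score decides first,
# and PageRank only breaks ties between equally relevant pages.
by_score_then_pr = hits.sort_by { |hit| [-hit[:score], -hit[:pr]] }

puts by_pagerank.map { |hit| hit[:title] }.join(", ")
puts by_score_then_pr.map { |hit| hit[:title] }.join(", ")
```

On this tiny data set both orderings agree; they diverge as soon as a highly ranked page has a weak text match.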
  64. Search*: Graphs are ubiquitous! PageRank is a general purpose hammer
  65. PageRank + Social Graph: GitHub (http://bit.ly/3YQPU)

        Username       GitCred
        ==============================
        37signals      10.00
        imbriaco        9.76
        why             8.74
        rails           8.56
        defunkt         8.17
        technoweenie    7.83
        jeresig         7.60
        mojombo         7.51
        yui             7.34
        drnic           7.34
        pjhyett         6.91
        wycats          6.85
        dhh             6.84

  66. PageRank + Social Graph: Twitter. Hmm… Analyze the social graph:
        - Filter messages by 'TwitterRank'
        - Suggest users by 'TwitterRank'
        - …
  67. PageRank + Product Graph: E-commerce. Link items purchased in same cart… Run PR on it.
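Turning carts into a link graph is mostly bookkeeping. The sketch below shows one way to do it; the cart data and the `product_graph` name are hypothetical, and links are added in both directions on the assumption that co-purchase has no natural direction.

```ruby
require 'set'

# Hypothetical order history: each cart is the list of items bought together.
carts = [
  ["ruby book", "keyboard"],
  ["ruby book", "coffee", "keyboard"],
  ["coffee", "mug"]
]

# Link every pair of items that shared a cart (in both directions).
def product_graph(carts)
  graph = Hash.new { |hash, item| hash[item] = Set.new }
  carts.each do |cart|
    cart.uniq.permutation(2) { |a, b| graph[a] << b }
  end
  graph
end

graph = product_graph(carts)
graph.each { |item, linked| puts "#{item} => #{linked.to_a.sort.join(', ')}" }
```

The resulting hash has the same "page => links to" shape as the dictionary form of G, so it can be fed to any PageRank routine that accepts that shape.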
  68. PageRank = Powerful Hammer: use it!
  69. Personalization: how would you do it?
  70. PageRank + Personalization: customize the teleportation vector. Teleportation distribution doesn't have to be uniform! "yahoo.com is my homepage!"
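A non-uniform teleportation vector is a small change to the iteration. The sketch below is an illustration under assumptions: the `personalized_pagerank` name, the three-page cycle, and the "always teleport to page 0" preference are all hypothetical, and the computation is plain power iteration rather than the GSL inversion shown earlier.

```ruby
# Same fixed-point iteration r <- s*G*r + t*p, but the teleportation vector p
# is supplied by the caller instead of being uniform.
def personalized_pagerank(g, teleport, s = 0.85, iterations = 200)
  n = g.size
  t = 1.0 - s
  r = teleport.dup
  iterations.times do
    r = Array.new(n) do |i|
      s * (0...n).sum { |j| g[i][j] * r[j] } + t * teleport[i]
    end
  end
  r
end

# Hypothetical three-page cycle: page 0 -> 1 -> 2 -> 0 (g[to][from]).
cycle = [[0, 0, 1],
         [1, 0, 0],
         [0, 1, 0]]

uniform  = personalized_pagerank(cycle, [1.0 / 3, 1.0 / 3, 1.0 / 3])
homepage = personalized_pagerank(cycle, [1.0, 0.0, 0.0])  # always teleport to page 0

puts uniform.map  { |v| v.round(3) }.inspect
puts homepage.map { |v| v.round(3) }.inspect
```

With uniform teleportation the cycle is perfectly symmetric; biasing teleportation toward page 0 lifts page 0 and, to a lesser degree, the pages just downstream of it.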
  71. Gaming PageRank for fun and profit (I don't endorse it). Make pages with links! http://bit.ly/pagerank-spam
  72. Questions?
        Slides: http://bit.ly/railsconf-pagerank
        Ferret: http://bit.ly/ferret
        RB-GSL: http://bit.ly/rb-gsl
        PageRank on Wikipedia: http://bit.ly/wp-pagerank
        Gaming PageRank: http://bit.ly/pagerank-spam
        Michael Nielsen's lectures on PageRank: http://michaelnielsen.org/blog
      The slides… Twitter. My blog.
