SlideShare a Scribd company logo
Building Mini-Google in Ruby


                                                                              Ilya Grigorik
                                                                                          @igrigorik


Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
postrank.com/topic/ruby




                               The slides…                           Twitter                           My blog


Building Mini-Google in Ruby      http://bit.ly/railsconf-pagerank             @igrigorik #railsconf
Ruby + Math
                                                                            PageRank
              Optimization




                               Examples                                          Indexing
   Misc Fun




Building Mini-Google in Ruby     http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
PageRank                                       PageRank + Ruby




      Tools
        +                       Examples                                       Indexing
   Optimization

Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Consume with care…
                     everything that follows is based on released / public domain info




Building Mini-Google in Ruby        http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
Search-engine graveyard
                                                                  Google did pretty well…




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Query: Ruby




                                                                                           Results




       1. Crawl                           2. Index                                         3. Rank




                                                                  Search pipeline
                                                                                  50,000-foot view



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Query: Ruby




                                                                                                Results




       1. Crawl                           2. Index                                          3. Rank




            Bah                           Interesting                                     Fun




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
CPU Speed                                     333Mhz
          RAM                                           32-64MB

          Index                                         27,000,000 documents
          Index refresh                                 once a month~ish
          PageRank computation                          several days

          Laptop CPU                                    2.1Ghz
          VM RAM                                        1GB
          1-Million page web                            ~10 minutes


                                                                  circa 1997-1998



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Creating & Maintaining an Inverted Index
                                                                    DIY and the gotchas within




Building Mini-Google in Ruby     http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
require 'set'
                                                      {
                                                         quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>,
    pages = {
                                                         quot;aquot;=>#<Set: {quot;3quot;}>,
     quot;1quot; => quot;it is what it isquot;,
                                                         quot;bananaquot;=>#<Set: {quot;3quot;}>,
     quot;2quot; => quot;what is itquot;,
                                                         quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>,
     quot;3quot; => quot;it is a bananaquot;
                                                         quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>}
    }
                                                        }
    index = {}

    pages.each do |page, content|
     content.split(/s/).each do |word|
      if index[word]
        index[word] << page
      else
        index[word] = Set.new(page)
      end
     end
    end


                                                Building an Inverted Index

Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
require 'set'
                                                      {
                                                         quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>,
    pages = {
                                                         quot;aquot;=>#<Set: {quot;3quot;}>,
     quot;1quot; => quot;it is what it isquot;,
                                                         quot;bananaquot;=>#<Set: {quot;3quot;}>,
     quot;2quot; => quot;what is itquot;,
                                                         quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>,
     quot;3quot; => quot;it is a bananaquot;
                                                         quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>}
    }
                                                        }
    index = {}

    pages.each do |page, content|
     content.split(/s/).each do |word|
      if index[word]
        index[word] << page
      else
        index[word] = Set.new(page)
      end
     end
    end


                                                Building an Inverted Index

Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
require 'set'
                                                      {
                                                         quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>,
    pages = {
                                                         quot;aquot;=>#<Set: {quot;3quot;}>,
     quot;1quot; => quot;it is what it isquot;,
                                                         quot;bananaquot;=>#<Set: {quot;3quot;}>,
     quot;2quot; => quot;what is itquot;,
                                                         quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>,
     quot;3quot; => quot;it is a bananaquot;
                                                         quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>}
    }
                                                        }
    index = {}

    pages.each do |page, content|
                                                                  Word => [Document]
     content.split(/s/).each do |word|
      if index[word]
        index[word] << page
      else
        index[word] = Set.new(page)
      end
     end
    end


                                                Building an Inverted Index

Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
# query: quot;what is bananaquot;
 p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;]
 # > #<Set: {}>


 # query: quot;a bananaquot;
 p index[quot;aquot;] & index[quot;bananaquot;]
 # > #<Set: {quot;3quot;}>


                                                                    1                        3
                                                                                 2
 # query: quot;what isquot;
 p index[quot;whatquot;] & index[quot;isquot;]
 # > #<Set: {quot;1quot;, quot;2quot;}>


 {
   quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>,
   quot;aquot;=>#<Set: {quot;3quot;}>,
   quot;bananaquot;=>#<Set: {quot;3quot;}>,
   quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>,
   quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>}
                                                                  Querying the index
  }



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank      @igrigorik #railsconf
# query: quot;what is bananaquot;
 p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;]
 # > #<Set: {}>


 # query: quot;a bananaquot;
 p index[quot;aquot;] & index[quot;bananaquot;]
 # > #<Set: {quot;3quot;}>


                                                                    1                        3
                                                                                 2
 # query: quot;what isquot;
 p index[quot;whatquot;] & index[quot;isquot;]
 # > #<Set: {quot;1quot;, quot;2quot;}>


 {
   quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>,
   quot;aquot;=>#<Set: {quot;3quot;}>,
   quot;bananaquot;=>#<Set: {quot;3quot;}>,
   quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>,
   quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>}
                                                                  Querying the index
  }



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank      @igrigorik #railsconf
# query: quot;what is bananaquot;
 p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;]
 # > #<Set: {}>


 # query: quot;a bananaquot;
 p index[quot;aquot;] & index[quot;bananaquot;]
 # > #<Set: {quot;3quot;}>


                                                                    1                        3
                                                                                 2
 # query: quot;what isquot;
 p index[quot;whatquot;] & index[quot;isquot;]
 # > #<Set: {quot;1quot;, quot;2quot;}>


 {
   quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>,
   quot;aquot;=>#<Set: {quot;3quot;}>,
   quot;bananaquot;=>#<Set: {quot;3quot;}>,
   quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>,
   quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>}
                                                                  Querying the index
  }



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank      @igrigorik #railsconf
# query: quot;what is bananaquot;
 p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;]
 # > #<Set: {}>


 # query: quot;a bananaquot;
 p index[quot;aquot;] & index[quot;bananaquot;]
 # > #<Set: {quot;3quot;}>

                                                                  What order?
 # query: quot;what isquot;
 p index[quot;whatquot;] & index[quot;isquot;]
                                                                  [1, 2] or [2,1]
 # > #<Set: {quot;1quot;, quot;2quot;}>


 {
   quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>,
   quot;aquot;=>#<Set: {quot;3quot;}>,
   quot;bananaquot;=>#<Set: {quot;3quot;}>,
   quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>,
   quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>}
                                                                    Querying the index
  }



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank            @igrigorik #railsconf
require 'set'

    pages = {
     quot;1quot; => quot;it is what it isquot;,
     quot;2quot; => quot;what is itquot;,
     quot;3quot; => quot;it is a bananaquot;
    }
                                                                    PDF, HTML, RSS?
    index = {}
                                                                  Lowercase / Upcase?
    pages.each do |page, content|                                   Compact Index?
                                                                        Hmmm?
     content.split(/s/).each do |word|                               Stop words?
      if index[word]                                                  Persistence?
        index[word] << page
      else
        index[word] = Set.new(page)
      end
     end
    end


                                                Building an Inverted Index

Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank         @igrigorik #railsconf
Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Ferret is a high-performance, full-featured text search engine library written for Ruby



Building Mini-Google in Ruby           http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
require 'ferret'
    include Ferret

    index = Index::Index.new()

    index << {:title => quot;1quot;, :content => quot;it is what it isquot;}
    index << {:title => quot;2quot;, :content => quot;what is itquot;}
    index << {:title => quot;3quot;, :content => quot;it is a bananaquot;}

    index.search_each('content:quot;bananaquot;') do |id, score|
     puts quot;Score: #{score}, #{index[id][:title]} quot;
    end


    > Score: 1.0, 3




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
require 'ferret'
    include Ferret

    index = Index::Index.new()

    index << {:title => quot;1quot;, :content => quot;it is what it isquot;}
    index << {:title => quot;2quot;, :content => quot;what is itquot;}
    index << {:title => quot;3quot;, :content => quot;it is a bananaquot;}

    index.search_each('content:quot;bananaquot;') do |id, score|
     puts quot;Score: #{score}, #{index[id][:title]} quot;
    end


    > Score: 1.0, 3


                               Hmmm?




Building Mini-Google in Ruby           http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
class Ferret::Analysis::Analyzer                                        class Ferret::Search::BooleanQuery
   class Ferret::Analysis::AsciiLetterAnalyzer                             class Ferret::Search::ConstantScoreQuery
   class Ferret::Analysis::AsciiLetterTokenizer                            class Ferret::Search::Explanation
   class Ferret::Analysis::AsciiLowerCaseFilter                            class Ferret::Search::Filter
   class Ferret::Analysis::AsciiStandardAnalyzer                           class Ferret::Search::FilteredQuery
   class Ferret::Analysis::AsciiStandardTokenizer                          class Ferret::Search::FuzzyQuery
   class Ferret::Analysis::AsciiWhiteSpaceAnalyzer                         class Ferret::Search::Hit
   class Ferret::Analysis::AsciiWhiteSpaceTokenizer                        class Ferret::Search::MatchAllQuery
   class Ferret::Analysis::HyphenFilter                                    class Ferret::Search::MultiSearcher
   class Ferret::Analysis::LetterAnalyzer                                  class Ferret::Search::MultiTermQuery
   class Ferret::Analysis::LetterTokenizer                                 class Ferret::Search::PhraseQuery
   class Ferret::Analysis::LowerCaseFilter                                 class Ferret::Search::PrefixQuery
   class Ferret::Analysis::MappingFilter                                   class Ferret::Search::Query
   class Ferret::Analysis::PerFieldAnalyzer                                class Ferret::Search::QueryFilter
   class Ferret::Analysis::RegExpAnalyzer                                  class Ferret::Search::RangeFilter
   class Ferret::Analysis::RegExpTokenizer                                 class Ferret::Search::RangeQuery
   class Ferret::Analysis::StandardAnalyzer                                class Ferret::Search::Searcher
   class Ferret::Analysis::StandardTokenizer                               class Ferret::Search::Sort
   class Ferret::Analysis::StemFilter                                      class Ferret::Search::SortField
   class Ferret::Analysis::StopFilter                                      class Ferret::Search::TermQuery
   class Ferret::Analysis::Token                                           class Ferret::Search::TopDocs
   class Ferret::Analysis::TokenStream                                     class Ferret::Search::TypedRangeFilter
   class Ferret::Analysis::WhiteSpaceAnalyzer                              class Ferret::Search::TypedRangeQuery
                                                                           class Ferret::Search::WildcardQuery
   class Ferret::Analysis::WhiteSpaceTokenizer



Building Mini-Google in Ruby            http://bit.ly/railsconf-pagerank              @igrigorik #railsconf
ferret.davebalmain.com/trac




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Ranking Results
                                                                      0-60 with PageRank…




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
index.search_each('content:quot;the brown cowquot;') do |id, score|
     puts quot;Score: #{score}, #{index[id][:title]} quot;
    end

    > Score: 0.827, 3
    > Score: 0.523, 5                                                 Relevance?
    > Score: 0.125, 4

                               3                     5                   4
            the                4                     3                   5
          brown                1                     3                   1
            cow                1                     4                   1
     Score                     6                    10                   7


                                                              Naïve: Term Frequency

Building Mini-Google in Ruby       http://bit.ly/railsconf-pagerank          @igrigorik #railsconf
index.search_each('content:quot;the brown cowquot;') do |id, score|
     puts quot;Score: #{score}, #{index[id][:title]} quot;
    end

    > Score: 0.827, 3
    > Score: 0.523, 5
    > Score: 0.125, 4

                               3                     5                4
            the                4                     3                5
                                                                                                  Skew
          brown                1                     3                1
            cow                1                     4                1
     Score                     6                    10                7


                                                              Naïve: Term Frequency

Building Mini-Google in Ruby       http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
3                         5                4
            the                4                         3                5
                                                                                                       Skew
          brown                1                         3                1
            cow                1                         4                1


                               # of docs
                                                              Score = TF * IDF
                   the             6
                                                              TF = # occurrences / # words
                  brown            3
                                                              IDF = # docs / # docs with W
                   cow             4


     Total # of documents:                     10


                                                                                                      TF-IDF
                                           Term Frequency * Inverse Document Frequency


Building Mini-Google in Ruby           http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
3                         5                  4
            the                 4                         3                  5
          brown                 1                         3                  1
            cow                 1                         4                  1


                                                                      Doc # 3 score for ‘the’:
                               # of docs
                                                                      4/10 * ln(10/6) = 0.204
                   the              6
                                                                      Doc # 3 score for ‘brown’:
                  brown             3
                                                                      1/10 * ln(10/3) = 0.120
                   cow              4
                                                                      Doc # 3 score for ‘cow’:
                                                                      1/10 * ln(10/4) = 0.092
     Total # of documents:                      10
     # words in document:                       10


                                                                                                         TF-IDF
                               Score = 0.204 + 0.120 + 0.092 = 0.416



Building Mini-Google in Ruby            http://bit.ly/railsconf-pagerank         @igrigorik #railsconf
W1        W2   …           …             …        …       …             …     WN

         Doc 1       15        23   …
         Doc 2       24        12   …
         …            …        …    …
         …
         Doc K

         Size = N * K * size of Ruby object
                                                                                   Ouch.
          Pages = N = 10,000
          Words = K = 2,000
          Ruby Object = 20+ bytes

                                                                       Frequency Matrix
          Footprint = 384 MB



Building Mini-Google in Ruby        http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
NArray is an Numerical N-dimensional Array class (implemented in C)



                                                      #    create new NArray. initialize with 0.
       NArray.new(typecode, size, ...)
                                                      #    1 byte unsigned integer
       NArray.byte(size,...)
                                                      #    2 byte signed integer
       NArray.sint(size,...)
                                                      #    4 byte signed integer
       NArray.int(size,...)
                                                      #    single precision float
       NArray.sfloat(size,...)
                                                      #    double precision float
       NArray.float(size,...)
                                                      #    single precision complex
       NArray.scomplex(size,...)
                                                      #    double precision complex
       NArray.complex(size,...)
                                                      #    Ruby object
       NArray.object(size,...)




                                                                                              NArray
                                                                    http://narray.rubyforge.org/



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
NArray is an Numerical N-dimensional Array class (implemented in C)




                                                                                            NArray
                                                                  http://narray.rubyforge.org/



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
Links as votes




                                                                                        PageRank
                                                                                     the google juice
               Problem: link gaming




Building Mini-Google in Ruby      http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
P = 0.85



                               Follow link from page he/she is currently on.



                               Teleport to a random location on the web.



                                   P = 0.15

                                                                               Random Surfer
                                                                                    powerful abstraction




Building Mini-Google in Ruby           http://bit.ly/railsconf-pagerank        @igrigorik #railsconf
Follow link from page he/she is currently on.
                                                                                                      Page K

                               Teleport to a random location on the web.



        Page N                         Page M
                                                                                                    Surfin’
                                                                          rinse & repeat, ad naseum




Building Mini-Google in Ruby           http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
On Page P, clicks on link to K
                                                                                         P = 0.85


                               On Page K clicks on link to M
                                                                                         P = 0.85


                               On Page M teleports to X

           P = 0.15
                                                                                                    Surfin’
                                                …
                                                                          rinse & repeat, ad naseum




Building Mini-Google in Ruby           http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
P = 0.05                                                      P = 0.20
                                                               X
                               N

                                                                                     P = 0.15
                                                                      M
                                   K
                  P = 0.6




                                                       Analyzing the Web Graph
                                                                               extracting PageRank




Building Mini-Google in Ruby       http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
What is PageRank?
                                                                                             It’s a scalar!

Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank      @igrigorik #railsconf
P = 0.05                                                      P = 0.20
                                                               X
                               N

                                                                                     P = 0.15
                                                                      M
                                   K
                  P = 0.6




                                                                      What is PageRank?
                                                                                       it’s a probability!




Building Mini-Google in Ruby       http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
P = 0.05                                                      P = 0.20
                                                               X
                               N

                                                                                     P = 0.15
                                                                      M
                                   K
                  P = 0.6




                                                                      What is PageRank?
          Higher Pr, Higher Importance?
                                                                                       it’s a probability!




Building Mini-Google in Ruby       http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
Teleportation?
                                                                                           sci-fi fans, … ?




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank    @igrigorik #railsconf
1. No in-links!                                                        3. Isolated Web



                                           X
         N
                          K
                                                                                       2. No out-links!
                                        M
                                                                  M



                                                  Reasons for teleportation
                                                                      enumerating edge cases



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
•Breadth First Search
                               •Depth First Search
                               •A* Search
                               •Lexicographic Search
                               •Dijkstra’s Algorithm
                               •Floyd-Warshall
                               •Triangulation and Comparability detection

require 'gratr/import'

dg = Digraph[1,2, 2,3, 2,4, 4,5, 6,4, 1,6]

dg.directed? # true
dg.vertex?(4) # true
dg.edge?(2,4) # true
dg.vertices # [5, 6, 1, 2, 3, 4]
                                                                       Exploring Graphs
Graph[1,2,1,3,1,4,2,5].bfs # [1, 2, 3, 4, 5]                                  gratr.rubyforge.com
Graph[1,2,1,3,1,4,2,5].dfs # [1, 2, 5, 3, 4]



Building Mini-Google in Ruby        http://bit.ly/railsconf-pagerank    @igrigorik #railsconf
P(T) = 0.03
                                                                     P(T) = 0.15 / # of pages
        P(T) = 0.03
                                                                     P(T) = 0.03
                                            X
         N
                          K                                        P(T) = 0.03

                                         M
                P(T) = 0.03
                                                                   M
                               P(T) = 0.03


                                                                             Teleportation
                                                                                                probabilities



Building Mini-Google in Ruby    http://bit.ly/railsconf-pagerank        @igrigorik #railsconf
Assume the web is N pages big
    Assume that probability of teleportation (t) is 0.15, and following link (s) is 0.85
    Assume that teleportation probability (E) is uniform
    Assume that you start on any random page (uniform distribution L), then

                                                  0.15
                                                            ������
                                ������ = ������ =            ⋮
                                                  0.15
                                                            ������
    Then after one step, the probability your on page X is:
                                      ������ ∗ ������������ + ������������

                               ������ ∗ (0.85 ∗ ������ + 0.15 ∗ ������)


                     PageRank: Simplified Mathematical Def’n
                                                                    cause that’s how we roll



Building Mini-Google in Ruby     http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Link Graph                                                 No link from 1 to N



                           1   2                …                 …         N
             1             1   0                 …                …          0

             2             0   1                 …                …          1

             …             …   …                 …                …         …

             …             …   …                 …                …         …

             N             0   1                 …                …          1



                                                                  G = The Link Graph
                     Huge!
                                                                        ginormous and sparse



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
Links to…

                               {
                                   quot;1quot;      =>         [25, 26],
                                   quot;2quot;      =>         [1],
                  Page
                                   quot;5quot;      =>         [123,2],
                                   quot;6quot;      =>         [67, 1]
                               }



                                                                          G as a dictionary
                                                                                         more compact…



Building Mini-Google in Ruby        http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
Follow link from page he/she is currently on.
                                                                                                       Page K

                               Teleport to a random location on the web.




                                                                          Computing PageRank
                                                                                              the tedious way



Building Mini-Google in Ruby           http://bit.ly/railsconf-pagerank        @igrigorik #railsconf
Don’t trust me! Verify it yourself!


                                                                                ������1
                                                               −1                ⋮
                               ������ = ������ ������ − ������������                         ������ =
                                                                                ������������
                                    Identity matrix




                                                                         Computing PageRank
                                                                                                    in one swoop



Building Mini-Google in Ruby          http://bit.ly/railsconf-pagerank          @igrigorik #railsconf
Enough hand-waving, dammit!
                                                                                  show me the code




Building Mini-Google in Ruby     http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Hot, Fast, Awesome

                                                                  Birth of EM-Proxy
                                                                            flash of the obvious




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
http://rb-gsl.rubyforge.org/




                                                                        Hot, Fast, Awesome




                       Click there! … Give yourself a weekend.


Building Mini-Google in Ruby         http://bit.ly/railsconf-pagerank      @igrigorik #railsconf
http://ruby-gsl.sourceforge.net/
                       Click there! … Give yourself a weekend.


Building Mini-Google in Ruby      http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
require quot;gslquot;
   include GSL

   # INPUT: link structure matrix (NxN)
   # OUTPUT: pagerank scores
   def pagerank(g)
                                                                         Verify NxN
    raise if g.size1 != g.size2

     i = Matrix.I(g.size1)                      # identity matrix
     p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector

     s = 0.85              # probability of following a link
     t = 1-s               # probability of teleportation

    t*((i-s*g).invert)*p
   end



                                                                         PageRank in Ruby
                                                                                              6 lines, or less



Building Mini-Google in Ruby          http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
require quot;gslquot;
   include GSL

   # INPUT: link structure matrix (NxN)
   # OUTPUT: pagerank scores
   def pagerank(g)                                                          Constants…
    raise if g.size1 != g.size2

     i = Matrix.I(g.size1)                      # identity matrix
     p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector

     s = 0.85              # probability of following a link
     t = 1-s               # probability of teleportation

    t*((i-s*g).invert)*p
   end



                                                                         PageRank in Ruby
                                                                                              6 lines, or less



Building Mini-Google in Ruby          http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
require quot;gslquot;
   include GSL

   # INPUT: link structure matrix (NxN)
   # OUTPUT: pagerank scores
   def pagerank(g)
    raise if g.size1 != g.size2

     i = Matrix.I(g.size1)                      # identity matrix
     p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector

     s = 0.85              # probability of following a link
     t = 1-s               # probability of teleportation

    t*((i-s*g).invert)*p
   end



                                                                         PageRank in Ruby
                  PageRank!
                                                                                              6 lines, or less



Building Mini-Google in Ruby          http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
X
                 P = 0.33                                              P = 0.33
                               N


                                                                  P = 0.33
                                                    K


        pagerank(Matrix[[0,0,1], [0,0,1], [1,0,0]])
        > [0.33, 0.33, 0.33]


                                                                  Ex: Circular Web
                                                                               testing intuition…




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank    @igrigorik #railsconf
X
                 P = 0.05                                               P = 0.07
                               N


                                                                   P = 0.87
                                                    K


        pagerank(Matrix[[0,0,0], [0.5,0,0], [0.5,1,1]])
        > [0.05, 0.07, 0.87]


                                                              Ex: All roads lead to K
                                                                                testing intuition…




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
PageRank + Ferret
                                                                             awesome search, ftw!




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
2
                               P = 0.05                                                          P = 0.07
                                                           1


require 'ferret'                                                                              P = 0.87
                                                                      3
include Ferret

index = Index::Index.new()

index << {:title => quot;1quot;, :content => quot;it is what it isquot;, :pr => 0.05 }
index << {:title => quot;2quot;, :content => quot;what is itquot;, :pr => 0.07 }
index << {:title => quot;3quot;, :content => quot;it is a bananaquot;, :pr => 0.87 }



                                                                               Store PageRank




Building Mini-Google in Ruby       http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
index.search_each('content:quot;worldquot;') do |id, score|
 puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})quot;
end

                               TF-IDF Search
puts quot;*quot; * 50

sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true)

index.search_each('content:quot;worldquot;', :sort => sf_pr) do |id, score|
 puts quot;Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})quot;
end


#   Score: 0.267119228839874, 3 (PR: 0.87)
#   Score: 0.17807948589325, 1 (PR: 0.05)
#   Score: 0.17807948589325, 2 (PR: 0.07)
#   ***********************************
#   Score: 0.267119228839874, 3, (PR: 0.87)
#   Score: 0.17807948589325, 2, (PR: 0.07)
#   Score: 0.17807948589325, 1, (PR: 0.05)



Building Mini-Google in Ruby       http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
index.search_each('content:quot;worldquot;') do |id, score|
 puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})quot;
end
                                                PageRank FTW!
puts quot;*quot; * 50

sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true)

index.search_each('content:quot;worldquot;', :sort => sf_pr) do |id, score|
 puts quot;Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})quot;
end


#   Score: 0.267119228839874, 3 (PR: 0.87)
#   Score: 0.17807948589325, 1 (PR: 0.05)
#   Score: 0.17807948589325, 2 (PR: 0.07)
#   ***********************************
#   Score: 0.267119228839874, 3, (PR: 0.87)
#   Score: 0.17807948589325, 2, (PR: 0.07)
#   Score: 0.17807948589325, 1, (PR: 0.05)



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
index.search_each('content:quot;worldquot;') do |id, score|
 puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})quot;
end

puts quot;*quot; * 50

sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true)

index.search_each('content:quot;worldquot;', :sort => sf_pr) do |id, score|
 puts quot;Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})quot;
end


#    Score: 0.267119228839874, 3 (PR: 0.87)
#    Score: 0.17807948589325, 1 (PR: 0.05)                                         Others
#    Score: 0.17807948589325, 2 (PR: 0.07)
#    ***********************************
#    Score: 0.267119228839874, 3, (PR: 0.87)
#    Score: 0.17807948589325, 2, (PR: 0.07)                                        Google
#    Score: 0.17807948589325, 1, (PR: 0.05)



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Search*: Graphs are ubiquitous!
                                                 PageRank is a general purpose hammer




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Username               GitCred
                                                                    ==============================
                                                                    37signals              10.00
                                                                    imbriaco               9.76
                                                                    why                    8.74
                                                                    rails                  8.56
                                                                    defunkt                8.17
                                                                    technoweenie           7.83
                                                                    jeresig                7.60
                                                                    mojombo                7.51
                                                                    yui                    7.34
                                                                    drnic                  7.34
                                                                    pjhyett                6.91
                                                                    wycats                 6.85
                                                                    dhh                    6.84

           http://bit.ly/3YQPU

                                                       PageRank + Social Graph
                                                                                                    GitHub




Building Mini-Google in Ruby     http://bit.ly/railsconf-pagerank           @igrigorik #railsconf
Hmm…




                                                                  Analyze the social graph:
                                                                  - Filter messages by ‘TwitterRank’
                                                                  - Suggest users by ‘TwitterRank’
                                                                  -…
                                                     PageRank + Social Graph
                                                                                                   Twitter




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank            @igrigorik #railsconf
PageRank + Product Graph
                                                                                          E-commerce

                                  Link items purchased in same cart… Run PR on it.



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
PageRank = Powerful Hammer
                                                                                          use it!




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Personalization
                                                                      how would you do it?




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
0.15
                               ������         Teleportation distribution doesn’t
       ������ =            ⋮                          have to be uniform!
                    0.15
                               ������


                               yahoo.com is
                               my homepage!


                                                  PageRank + Personalization
                                                                 customize the teleportation vector




Building Mini-Google in Ruby        http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
Make pages with links!




                                                                      Gaming PageRank
       http://bit.ly/pagerank-spam                       for fun and profit (I don’t endorse it)




Building Mini-Google in Ruby    http://bit.ly/railsconf-pagerank          @igrigorik #railsconf
Slides: http://bit.ly/railsconf-pagerank

    Ferret: http://bit.ly/ferret
    RB-GSL: http://bit.ly/rb-gsl

    PageRank on Wikipedia: http://bit.ly/wp-pagerank
    Gaming PageRank: http://bit.ly/pagerank-spam

    Michael Nielsen’s lectures on PageRank:
    http://michaelnielsen.org/blog


                                                                                     Questions?

                               The slides…                           Twitter                           My blog


Building Mini-Google in Ruby      http://bit.ly/railsconf-pagerank             @igrigorik #railsconf

More Related Content

Viewers also liked

Rails 3 And The Real Secret To High Productivity Presentation
Rails 3 And The Real Secret To High Productivity PresentationRails 3 And The Real Secret To High Productivity Presentation
Rails 3 And The Real Secret To High Productivity Presentationrailsconf
 
Quality Code With Cucumber Presentation
Quality Code With Cucumber PresentationQuality Code With Cucumber Presentation
Quality Code With Cucumber Presentationrailsconf
 
J Ruby On Rails Presentation
J Ruby On Rails PresentationJ Ruby On Rails Presentation
J Ruby On Rails Presentationrailsconf
 
Rails Is From Mars Ruby Is From Venus Presentation 1
Rails Is From Mars  Ruby Is From Venus Presentation 1Rails Is From Mars  Ruby Is From Venus Presentation 1
Rails Is From Mars Ruby Is From Venus Presentation 1railsconf
 
Integrating Flex And Rails With Ruby Amf
Integrating Flex And Rails With Ruby AmfIntegrating Flex And Rails With Ruby Amf
Integrating Flex And Rails With Ruby Amfrailsconf
 
TDD (Test Driven Design)
TDD (Test Driven Design)TDD (Test Driven Design)
TDD (Test Driven Design)
nedirtv
 
Below And Beneath Tdd Test Last Development And Other Real World Test Patter...
Below And Beneath Tdd  Test Last Development And Other Real World Test Patter...Below And Beneath Tdd  Test Last Development And Other Real World Test Patter...
Below And Beneath Tdd Test Last Development And Other Real World Test Patter...railsconf
 
Don T Mock Yourself Out Presentation
Don T Mock Yourself Out PresentationDon T Mock Yourself Out Presentation
Don T Mock Yourself Out Presentationrailsconf
 

Viewers also liked (8)

Rails 3 And The Real Secret To High Productivity Presentation
Rails 3 And The Real Secret To High Productivity PresentationRails 3 And The Real Secret To High Productivity Presentation
Rails 3 And The Real Secret To High Productivity Presentation
 
Quality Code With Cucumber Presentation
Quality Code With Cucumber PresentationQuality Code With Cucumber Presentation
Quality Code With Cucumber Presentation
 
J Ruby On Rails Presentation
J Ruby On Rails PresentationJ Ruby On Rails Presentation
J Ruby On Rails Presentation
 
Rails Is From Mars Ruby Is From Venus Presentation 1
Rails Is From Mars  Ruby Is From Venus Presentation 1Rails Is From Mars  Ruby Is From Venus Presentation 1
Rails Is From Mars Ruby Is From Venus Presentation 1
 
Integrating Flex And Rails With Ruby Amf
Integrating Flex And Rails With Ruby AmfIntegrating Flex And Rails With Ruby Amf
Integrating Flex And Rails With Ruby Amf
 
TDD (Test Driven Design)
TDD (Test Driven Design)TDD (Test Driven Design)
TDD (Test Driven Design)
 
Below And Beneath Tdd Test Last Development And Other Real World Test Patter...
Below And Beneath Tdd  Test Last Development And Other Real World Test Patter...Below And Beneath Tdd  Test Last Development And Other Real World Test Patter...
Below And Beneath Tdd Test Last Development And Other Real World Test Patter...
 
Don T Mock Yourself Out Presentation
Don T Mock Yourself Out PresentationDon T Mock Yourself Out Presentation
Don T Mock Yourself Out Presentation
 

Similar to Building A Mini Google High Performance Computing In Ruby

Building A Mini Google High Performance Computing In Ruby Presentation 1
Building A Mini Google  High Performance Computing In Ruby Presentation 1Building A Mini Google  High Performance Computing In Ruby Presentation 1
Building A Mini Google High Performance Computing In Ruby Presentation 1elliando dias
 
It's Mechanize for it. Ruby as a Finder.
It's Mechanize for it. Ruby as a Finder.It's Mechanize for it. Ruby as a Finder.
It's Mechanize for it. Ruby as a Finder.Tomohiro Nishimura
 
Monitoring web application behaviour with cucumber-nagios
Monitoring web application behaviour with cucumber-nagiosMonitoring web application behaviour with cucumber-nagios
Monitoring web application behaviour with cucumber-nagios
Lindsay Holmwood
 
Happy Coding with Ruby on Rails
Happy Coding with Ruby on RailsHappy Coding with Ruby on Rails
Happy Coding with Ruby on Rails
Ochirkhuyag Lkhagva
 
Web application intro
Web application introWeb application intro
Web application intro
Tobias Pfeiffer
 
Google G Data Reading And Writing Data On The Web
Google G Data Reading And Writing Data On The WebGoogle G Data Reading And Writing Data On The Web
Google G Data Reading And Writing Data On The WebQConLondon2008
 
Google G Data Reading And Writing Data On The Web 1
Google G Data Reading And Writing Data On The Web 1Google G Data Reading And Writing Data On The Web 1
Google G Data Reading And Writing Data On The Web 1QConLondon2008
 
SearchMonkey
SearchMonkeySearchMonkey
SearchMonkey
Paul Tarjan
 
Intro To Django
Intro To DjangoIntro To Django
Intro To Django
Udi Bauman
 
Beautiful Java EE - PrettyFaces
Beautiful Java EE - PrettyFacesBeautiful Java EE - PrettyFaces
Beautiful Java EE - PrettyFaces
Lincoln III
 
Rails + Webpack
Rails + WebpackRails + Webpack
Rails + Webpack
Khor SoonHin
 
Web application intro + a bit of ruby (revised)
Web application intro + a bit of ruby (revised)Web application intro + a bit of ruby (revised)
Web application intro + a bit of ruby (revised)
Tobias Pfeiffer
 
Exploiting the newer perl to improve your plugins
Exploiting the newer perl to improve your pluginsExploiting the newer perl to improve your plugins
Exploiting the newer perl to improve your pluginsMarian Marinov
 
Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010
Abhishek Mishra
 
Rapid RIA development with Netzke
Rapid RIA development with NetzkeRapid RIA development with Netzke
Rapid RIA development with Netzkenetzke
 
When To Use Ruby On Rails
When To Use Ruby On RailsWhen To Use Ruby On Rails
When To Use Ruby On Rails
dosire
 
Cucumber
CucumberCucumber
Cucumber
Bachue Zhou
 
Rails bestpractices.com
Rails bestpractices.comRails bestpractices.com
Rails bestpractices.com
Richard Huang
 
Lightweight Webservices with Sinatra and RestClient
Lightweight Webservices with Sinatra and RestClientLightweight Webservices with Sinatra and RestClient
Lightweight Webservices with Sinatra and RestClient
Adam Wiggins
 
More Secrets of JavaScript Libraries
More Secrets of JavaScript LibrariesMore Secrets of JavaScript Libraries
More Secrets of JavaScript Libraries
jeresig
 

Similar to Building A Mini Google High Performance Computing In Ruby (20)

Building A Mini Google High Performance Computing In Ruby Presentation 1
Building A Mini Google  High Performance Computing In Ruby Presentation 1Building A Mini Google  High Performance Computing In Ruby Presentation 1
Building A Mini Google High Performance Computing In Ruby Presentation 1
 
It's Mechanize for it. Ruby as a Finder.
It's Mechanize for it. Ruby as a Finder.It's Mechanize for it. Ruby as a Finder.
It's Mechanize for it. Ruby as a Finder.
 
Monitoring web application behaviour with cucumber-nagios
Monitoring web application behaviour with cucumber-nagiosMonitoring web application behaviour with cucumber-nagios
Monitoring web application behaviour with cucumber-nagios
 
Happy Coding with Ruby on Rails
Happy Coding with Ruby on RailsHappy Coding with Ruby on Rails
Happy Coding with Ruby on Rails
 
Web application intro
Web application introWeb application intro
Web application intro
 
Google G Data Reading And Writing Data On The Web
Google G Data Reading And Writing Data On The WebGoogle G Data Reading And Writing Data On The Web
Google G Data Reading And Writing Data On The Web
 
Google G Data Reading And Writing Data On The Web 1
Google G Data Reading And Writing Data On The Web 1Google G Data Reading And Writing Data On The Web 1
Google G Data Reading And Writing Data On The Web 1
 
SearchMonkey
SearchMonkeySearchMonkey
SearchMonkey
 
Intro To Django
Intro To DjangoIntro To Django
Intro To Django
 
Beautiful Java EE - PrettyFaces
Beautiful Java EE - PrettyFacesBeautiful Java EE - PrettyFaces
Beautiful Java EE - PrettyFaces
 
Rails + Webpack
Rails + WebpackRails + Webpack
Rails + Webpack
 
Web application intro + a bit of ruby (revised)
Web application intro + a bit of ruby (revised)Web application intro + a bit of ruby (revised)
Web application intro + a bit of ruby (revised)
 
Exploiting the newer perl to improve your plugins
Exploiting the newer perl to improve your pluginsExploiting the newer perl to improve your plugins
Exploiting the newer perl to improve your plugins
 
Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010
 
Rapid RIA development with Netzke
Rapid RIA development with NetzkeRapid RIA development with Netzke
Rapid RIA development with Netzke
 
When To Use Ruby On Rails
When To Use Ruby On RailsWhen To Use Ruby On Rails
When To Use Ruby On Rails
 
Cucumber
CucumberCucumber
Cucumber
 
Rails bestpractices.com
Rails bestpractices.comRails bestpractices.com
Rails bestpractices.com
 
Lightweight Webservices with Sinatra and RestClient
Lightweight Webservices with Sinatra and RestClientLightweight Webservices with Sinatra and RestClient
Lightweight Webservices with Sinatra and RestClient
 
More Secrets of JavaScript Libraries
More Secrets of JavaScript LibrariesMore Secrets of JavaScript Libraries
More Secrets of JavaScript Libraries
 

Recently uploaded

Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 

Recently uploaded (20)

Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 

Building A Mini Google High Performance Computing In Ruby

  • 1. Building Mini-Google in Ruby Ilya Grigorik @igrigorik Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 2. postrank.com/topic/ruby The slides… Twitter My blog Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 3. Ruby + Math PageRank Optimization Examples Indexing Misc Fun Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 4. PageRank PageRank + Ruby Tools + Examples Indexing Optimization Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 5. Consume with care… everything that follows is based on released / public domain info Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 6. Search-engine graveyard Google did pretty well… Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 7. Query: Ruby Results 1. Crawl 2. Index 3. Rank Search pipeline 50,000-foot view Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 8. Query: Ruby Results 1. Crawl 2. Index 3. Rank Bah Interesting Fun Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 9. CPU Speed 333Mhz RAM 32-64MB Index 27,000,000 documents Index refresh once a month~ish PageRank computation several days Laptop CPU 2.1Ghz VM RAM 1GB 1-Million page web ~10 minutes circa 1997-1998 Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 10. Creating & Maintaining an Inverted Index DIY and the gotchas within Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 11. require 'set' { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, pages = { quot;aquot;=>#<Set: {quot;3quot;}>, quot;1quot; => quot;it is what it isquot;, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;2quot; => quot;what is itquot;, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;3quot; => quot;it is a bananaquot; quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} } } index = {} pages.each do |page, content| content.split(/s/).each do |word| if index[word] index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 12. require 'set' { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, pages = { quot;aquot;=>#<Set: {quot;3quot;}>, quot;1quot; => quot;it is what it isquot;, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;2quot; => quot;what is itquot;, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;3quot; => quot;it is a bananaquot; quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} } } index = {} pages.each do |page, content| content.split(/s/).each do |word| if index[word] index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 13. require 'set' { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, pages = { quot;aquot;=>#<Set: {quot;3quot;}>, quot;1quot; => quot;it is what it isquot;, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;2quot; => quot;what is itquot;, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;3quot; => quot;it is a bananaquot; quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} } } index = {} pages.each do |page, content| Word => [Document] content.split(/s/).each do |word| if index[word] index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 14. # query: quot;what is bananaquot; p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;] # > #<Set: {}> # query: quot;a bananaquot; p index[quot;aquot;] & index[quot;bananaquot;] # > #<Set: {quot;3quot;}> 1 3 2 # query: quot;what isquot; p index[quot;whatquot;] & index[quot;isquot;] # > #<Set: {quot;1quot;, quot;2quot;}> { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, quot;aquot;=>#<Set: {quot;3quot;}>, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} Querying the index } Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 15. # query: quot;what is bananaquot; p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;] # > #<Set: {}> # query: quot;a bananaquot; p index[quot;aquot;] & index[quot;bananaquot;] # > #<Set: {quot;3quot;}> 1 3 2 # query: quot;what isquot; p index[quot;whatquot;] & index[quot;isquot;] # > #<Set: {quot;1quot;, quot;2quot;}> { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, quot;aquot;=>#<Set: {quot;3quot;}>, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} Querying the index } Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 16. # query: quot;what is bananaquot; p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;] # > #<Set: {}> # query: quot;a bananaquot; p index[quot;aquot;] & index[quot;bananaquot;] # > #<Set: {quot;3quot;}> 1 3 2 # query: quot;what isquot; p index[quot;whatquot;] & index[quot;isquot;] # > #<Set: {quot;1quot;, quot;2quot;}> { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, quot;aquot;=>#<Set: {quot;3quot;}>, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} Querying the index } Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 17. # query: quot;what is bananaquot; p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;] # > #<Set: {}> # query: quot;a bananaquot; p index[quot;aquot;] & index[quot;bananaquot;] # > #<Set: {quot;3quot;}> What order? # query: quot;what isquot; p index[quot;whatquot;] & index[quot;isquot;] [1, 2] or [2,1] # > #<Set: {quot;1quot;, quot;2quot;}> { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, quot;aquot;=>#<Set: {quot;3quot;}>, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} Querying the index } Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 18. require 'set' pages = { quot;1quot; => quot;it is what it isquot;, quot;2quot; => quot;what is itquot;, quot;3quot; => quot;it is a bananaquot; } PDF, HTML, RSS? index = {} Lowercase / Upcase? pages.each do |page, content| Compact Index? Hmmm? content.split(/s/).each do |word| Stop words? if index[word] Persistence? index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 19. Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 20. Ferret is a high-performance, full-featured text search engine library written for Ruby Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 21. require 'ferret' include Ferret index = Index::Index.new() index << {:title => quot;1quot;, :content => quot;it is what it isquot;} index << {:title => quot;2quot;, :content => quot;what is itquot;} index << {:title => quot;3quot;, :content => quot;it is a bananaquot;} index.search_each('content:quot;bananaquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} quot; end > Score: 1.0, 3 Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 22. require 'ferret' include Ferret index = Index::Index.new() index << {:title => quot;1quot;, :content => quot;it is what it isquot;} index << {:title => quot;2quot;, :content => quot;what is itquot;} index << {:title => quot;3quot;, :content => quot;it is a bananaquot;} index.search_each('content:quot;bananaquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} quot; end > Score: 1.0, 3 Hmmm? Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 23. class Ferret::Analysis::Analyzer class Ferret::Search::BooleanQuery class Ferret::Analysis::AsciiLetterAnalyzer class Ferret::Search::ConstantScoreQuery class Ferret::Analysis::AsciiLetterTokenizer class Ferret::Search::Explanation class Ferret::Analysis::AsciiLowerCaseFilter class Ferret::Search::Filter class Ferret::Analysis::AsciiStandardAnalyzer class Ferret::Search::FilteredQuery class Ferret::Analysis::AsciiStandardTokenizer class Ferret::Search::FuzzyQuery class Ferret::Analysis::AsciiWhiteSpaceAnalyzer class Ferret::Search::Hit class Ferret::Analysis::AsciiWhiteSpaceTokenizer class Ferret::Search::MatchAllQuery class Ferret::Analysis::HyphenFilter class Ferret::Search::MultiSearcher class Ferret::Analysis::LetterAnalyzer class Ferret::Search::MultiTermQuery class Ferret::Analysis::LetterTokenizer class Ferret::Search::PhraseQuery class Ferret::Analysis::LowerCaseFilter class Ferret::Search::PrefixQuery class Ferret::Analysis::MappingFilter class Ferret::Search::Query class Ferret::Analysis::PerFieldAnalyzer class Ferret::Search::QueryFilter class Ferret::Analysis::RegExpAnalyzer class Ferret::Search::RangeFilter class Ferret::Analysis::RegExpTokenizer class Ferret::Search::RangeQuery class Ferret::Analysis::StandardAnalyzer class Ferret::Search::Searcher class Ferret::Analysis::StandardTokenizer class Ferret::Search::Sort class Ferret::Analysis::StemFilter class Ferret::Search::SortField class Ferret::Analysis::StopFilter class Ferret::Search::TermQuery class Ferret::Analysis::Token class Ferret::Search::TopDocs class Ferret::Analysis::TokenStream class Ferret::Search::TypedRangeFilter class Ferret::Analysis::WhiteSpaceAnalyzer class Ferret::Search::TypedRangeQuery class Ferret::Search::WildcardQuery class Ferret::Analysis::WhiteSpaceTokenizer Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 24. ferret.davebalmain.com/trac Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 25. Ranking Results 0-60 with PageRank… Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 26. index.search_each('content:quot;the brown cowquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} quot; end > Score: 0.827, 3 > Score: 0.523, 5 Relevance? > Score: 0.125, 4 3 5 4 the 4 3 5 brown 1 3 1 cow 1 4 1 Score 6 10 7 Naïve: Term Frequency Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 27. index.search_each('content:quot;the brown cowquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} quot; end > Score: 0.827, 3 > Score: 0.523, 5 > Score: 0.125, 4 3 5 4 the 4 3 5 Skew brown 1 3 1 cow 1 4 1 Score 6 10 7 Naïve: Term Frequency Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 28. 3 5 4 the 4 3 5 Skew brown 1 3 1 cow 1 4 1 # of docs Score = TF * IDF the 6 TF = # occurrences / # words brown 3 IDF = # docs / # docs with W cow 4 Total # of documents: 10 TF-IDF Term Frequency * Inverse Document Frequency Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 29. 3 5 4 the 4 3 5 brown 1 3 1 cow 1 4 1 Doc # 3 score for ‘the’: # of docs 4/10 * ln(10/6) = 0.204 the 6 Doc # 3 score for ‘brown’: brown 3 1/10 * ln(10/3) = 0.120 cow 4 Doc # 3 score for ‘cow’: 1/10 * ln(10/4) = 0.092 Total # of documents: 10 # words in document: 10 TF-IDF Score = 0.204 + 0.120 + 0.092 = 0.416 Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 30. W1 W2 … … … … … … WN Doc 1 15 23 … Doc 2 24 12 … … … … … … Doc K Size = N * K * size of Ruby object Ouch. Pages = N = 10,000 Words = K = 2,000 Ruby Object = 20+ bytes Frequency Matrix Footprint = 384 MB Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 31. NArray is an Numerical N-dimensional Array class (implemented in C) # create new NArray. initialize with 0. NArray.new(typecode, size, ...) # 1 byte unsigned integer NArray.byte(size,...) # 2 byte signed integer NArray.sint(size,...) # 4 byte signed integer NArray.int(size,...) # single precision float NArray.sfloat(size,...) # double precision float NArray.float(size,...) # single precision complex NArray.scomplex(size,...) # double precision complex NArray.complex(size,...) # Ruby object NArray.object(size,...) NArray http://narray.rubyforge.org/ Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 32. NArray is an Numerical N-dimensional Array class (implemented in C) NArray http://narray.rubyforge.org/ Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 33. Links as votes PageRank the google juice Problem: link gaming Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 34. P = 0.85 Follow link from page he/she is currently on. Teleport to a random location on the web. P = 0.15 Random Surfer powerful abstraction Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 35. Follow link from page he/she is currently on. Page K Teleport to a random location on the web. Page N Page M Surfin’ rinse & repeat, ad naseum Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 36. On Page P, clicks on link to K P = 0.85 On Page K clicks on link to M P = 0.85 On Page M teleports to X P = 0.15 Surfin’ … rinse & repeat, ad naseum Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 37. P = 0.05 P = 0.20 X N P = 0.15 M K P = 0.6 Analyzing the Web Graph extracting PageRank Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 38. What is PageRank? It’s a scalar! Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 39. P = 0.05 P = 0.20 X N P = 0.15 M K P = 0.6 What is PageRank? it’s a probability! Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 40. P = 0.05 P = 0.20 X N P = 0.15 M K P = 0.6 What is PageRank? Higher Pr, Higher Importance? it’s a probability! Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 41. Teleportation? sci-fi fans, … ? Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 42. 1. No in-links! 3. Isolated Web X N K 2. No out-links! M M Reasons for teleportation enumerating edge cases Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 43. •Breadth First Search •Depth First Search •A* Search •Lexicographic Search •Dijkstra’s Algorithm •Floyd-Warshall •Triangulation and Comparability detection require 'gratr/import' dg = Digraph[1,2, 2,3, 2,4, 4,5, 6,4, 1,6] dg.directed? # true dg.vertex?(4) # true dg.edge?(2,4) # true dg.vertices # [5, 6, 1, 2, 3, 4] Exploring Graphs Graph[1,2,1,3,1,4,2,5].bfs # [1, 2, 3, 4, 5] gratr.rubyforge.com Graph[1,2,1,3,1,4,2,5].dfs # [1, 2, 5, 3, 4] Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 44. P(T) = 0.03 P(T) = 0.15 / # of pages P(T) = 0.03 P(T) = 0.03 X N K P(T) = 0.03 M P(T) = 0.03 M P(T) = 0.03 Teleportation probabilities Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 45. Assume the web is N pages big Assume that probability of teleportation (t) is 0.15, and following link (s) is 0.85 Assume that teleportation probability (E) is uniform Assume that you start on any random page (uniform distribution L), then 0.15 ������ ������ = ������ = ⋮ 0.15 ������ Then after one step, the probability your on page X is: ������ ∗ ������������ + ������������ ������ ∗ (0.85 ∗ ������ + 0.15 ∗ ������) PageRank: Simplified Mathematical Def’n cause that’s how we roll Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 46. Link Graph No link from 1 to N 1 2 … … N 1 1 0 … … 0 2 0 1 … … 1 … … … … … … … … … … … … N 0 1 … … 1 G = The Link Graph Huge! ginormous and sparse Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 47. Links to… { quot;1quot; => [25, 26], quot;2quot; => [1], Page quot;5quot; => [123,2], quot;6quot; => [67, 1] } G as a dictionary more compact… Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 48. Follow link from page he/she is currently on. Page K Teleport to a random location on the web. Computing PageRank the tedious way Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 49. Don’t trust me! Verify it yourself! ������1 −1 ⋮ ������ = ������ ������ − ������������ ������ = ������������ Identity matrix Computing PageRank in one swoop Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 50. Enough hand-waving, dammit! show me the code Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 51. Hot, Fast, Awesome Birth of EM-Proxy flash of the obvious Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 52. http://rb-gsl.rubyforge.org/ Hot, Fast, Awesome Click there! … Give yourself a weekend. Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 53. http://ruby-gsl.sourceforge.net/ Click there! … Give yourself a weekend. Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 54. require quot;gslquot; include GSL # INPUT: link structure matrix (NxN) # OUTPUT: pagerank scores def pagerank(g) Verify NxN raise if g.size1 != g.size2 i = Matrix.I(g.size1) # identity matrix p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector s = 0.85 # probability of following a link t = 1-s # probability of teleportation t*((i-s*g).invert)*p end PageRank in Ruby 6 lines, or less Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 55. require quot;gslquot; include GSL # INPUT: link structure matrix (NxN) # OUTPUT: pagerank scores def pagerank(g) Constants… raise if g.size1 != g.size2 i = Matrix.I(g.size1) # identity matrix p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector s = 0.85 # probability of following a link t = 1-s # probability of teleportation t*((i-s*g).invert)*p end PageRank in Ruby 6 lines, or less Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 56. require quot;gslquot; include GSL # INPUT: link structure matrix (NxN) # OUTPUT: pagerank scores def pagerank(g) raise if g.size1 != g.size2 i = Matrix.I(g.size1) # identity matrix p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector s = 0.85 # probability of following a link t = 1-s # probability of teleportation t*((i-s*g).invert)*p end PageRank in Ruby PageRank! 6 lines, or less Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 57. X P = 0.33 P = 0.33 N P = 0.33 K pagerank(Matrix[[0,0,1], [0,0,1], [1,0,0]]) > [0.33, 0.33, 0.33] Ex: Circular Web testing intuition… Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 58. X P = 0.05 P = 0.07 N P = 0.87 K pagerank(Matrix[[0,0,0], [0.5,0,0], [0.5,1,1]]) > [0.05, 0.07, 0.87] Ex: All roads lead to K testing intuition… Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 59. PageRank + Ferret awesome search, ftw! Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 60. 2 P = 0.05 P = 0.07 1 require 'ferret' P = 0.87 3 include Ferret index = Index::Index.new() index << {:title => quot;1quot;, :content => quot;it is what it isquot;, :pr => 0.05 } index << {:title => quot;2quot;, :content => quot;what is itquot;, :pr => 0.07 } index << {:title => quot;3quot;, :content => quot;it is a bananaquot;, :pr => 0.87 } Store PageRank Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 61. index.search_each('content:quot;worldquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})quot; end TF-IDF Search puts quot;*quot; * 50 sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true) index.search_each('content:quot;worldquot;', :sort => sf_pr) do |id, score| puts quot;Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})quot; end # Score: 0.267119228839874, 3 (PR: 0.87) # Score: 0.17807948589325, 1 (PR: 0.05) # Score: 0.17807948589325, 2 (PR: 0.07) # *********************************** # Score: 0.267119228839874, 3, (PR: 0.87) # Score: 0.17807948589325, 2, (PR: 0.07) # Score: 0.17807948589325, 1, (PR: 0.05) Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 62. index.search_each('content:quot;worldquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})quot; end PageRank FTW! puts quot;*quot; * 50 sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true) index.search_each('content:quot;worldquot;', :sort => sf_pr) do |id, score| puts quot;Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})quot; end # Score: 0.267119228839874, 3 (PR: 0.87) # Score: 0.17807948589325, 1 (PR: 0.05) # Score: 0.17807948589325, 2 (PR: 0.07) # *********************************** # Score: 0.267119228839874, 3, (PR: 0.87) # Score: 0.17807948589325, 2, (PR: 0.07) # Score: 0.17807948589325, 1, (PR: 0.05) Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 63. index.search_each('content:quot;worldquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})quot; end puts quot;*quot; * 50 sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true) index.search_each('content:quot;worldquot;', :sort => sf_pr) do |id, score| puts quot;Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})quot; end # Score: 0.267119228839874, 3 (PR: 0.87) # Score: 0.17807948589325, 1 (PR: 0.05) Others # Score: 0.17807948589325, 2 (PR: 0.07) # *********************************** # Score: 0.267119228839874, 3, (PR: 0.87) # Score: 0.17807948589325, 2, (PR: 0.07) Google # Score: 0.17807948589325, 1, (PR: 0.05) Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 64. Search*: Graphs are ubiquitous! PageRank is a general purpose hammer Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 65. Username GitCred ============================== 37signals 10.00 imbriaco 9.76 why 8.74 rails 8.56 defunkt 8.17 technoweenie 7.83 jeresig 7.60 mojombo 7.51 yui 7.34 drnic 7.34 pjhyett 6.91 wycats 6.85 dhh 6.84 http://bit.ly/3YQPU PageRank + Social Graph GitHub Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 66. Hmm… Analyze the social graph: - Filter messages by ‘TwitterRank’ - Suggest users by ‘TwitterRank’ -… PageRank + Social Graph Twitter Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 67. PageRank + Product Graph E-commerce Link items purchased in same cart… Run PR on it. Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 68. PageRank = Powerful Hammer use it! Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 69. Personalization how would you do it? Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 70. 0.15 ������ Teleportation distribution doesn’t ������ = ⋮ have to be uniform! 0.15 ������ yahoo.com is my homepage! PageRank + Personalization customize the teleportation vector Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 71. Make pages with links! Gaming PageRank http://bit.ly/pagerank-spam for fun and profit (I don’t endorse it) Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 72. Slides: http://bit.ly/railsconf-pagerank Ferret: http://bit.ly/ferret RB-GSL: http://bit.ly/rb-gsl PageRank on Wikipedia: http://bit.ly/wp-pagerank Gaming PageRank: http://bit.ly/pagerank-spam Michael Nielsen’s lectures on PageRank: http://michaelnielsen.org/blog Questions? The slides… Twitter My blog Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf