Inside Search Engine
   A case study
        OF
Outline
• WWW of SE (Search Engine)

• A brief overview of SE history

• Getting Started!

• Basic Architecture of SE

• Inside PageRank

• Related work & Future
Motivation
Unedited – anyone can enter content
    – Quality issues; Spam

Varied information types
    – Catalogs, dissertations, news reports, weather,
      pictures, videos…

Different kinds of users
• Lexis-Nexis: paying, professional searchers
• Online catalogs: scholars searching scholarly literature
• Web: every type of person with every type of goal

Scale
• Hundreds of millions of searches/day; billions of docs
Motivation


      What would the situation be
          without an SE?
Motivation

   “Necessity is the mother of invention”
                                            (famous saying)

    So, it’s a KDD (Knowledge Discovery
             from Data) process!
Motivation

             Search Engine Saves Today!

             A search engine helps you find things on the
             Internet, any time anyone looks up anything
             on the Internet!
A brief History

 • Three major categories of SE
   – Full-text search engine
   – Directory search engine (generally speaking)
   – Meta search engine

 • Major issues of SE
   – Understanding search queries
   – Understanding websites & hyperlinks
   – Accuracy & relevance
   – Honesty & anti-spam
Getting Started!

 • Importance of links
   – Internal links (links within your site)
   – Outbound links (sites you link to)
   – Inbound links (sites linking to you)

 • Good websites
   – Key pages reachable within a few clicks
   – User navigation
   – Links that are easy for robots (crawlers) to follow
   – Anchor text
Getting Started!

 • Anchor text (descriptive)
 • Crawler (spider)
    – Main difficulties
    – Graph theory
    – A simple process
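The "simple process" of a crawler can be viewed as a traversal of the link graph. The following is a minimal sketch under stated assumptions: `TOY_WEB`, `crawl`, and `fetch_links` are hypothetical names, and the "web" is an in-memory dictionary rather than real pages fetched over HTTP.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links.
# A real crawler would fetch and parse pages over HTTP instead.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["a.example/", "c.example/"],
    "c.example/": [],
}

def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first traversal of the link graph starting at `seed`."""
    frontier = deque([seed])   # URLs waiting to be visited
    seen = {seed}              # avoids revisiting pages (the graph has cycles)
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

visited = crawl("a.example/", lambda url: TOY_WEB.get(url, []))
print(visited)
```

The `seen` set is what keeps the traversal from looping forever on cyclic links, which is one of the graph-theory concerns a crawler has to handle.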
Getting Started!


 • Inverted indexes (the IR way)
 • How are inverted indexes (I.I.) created?
 • A detailed example with two docs
 • I.I. for web search engines
Getting Started!

How are I.I. files created?

- Periodically rebuilt, static otherwise.
- Docs are parsed to extract tokens. These are saved with the Doc ID.

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term       Doc #
now        1
is         1
the        1
time       1
for        1
all        1
good       1
men        1
to         1
come       1
to         1
the        1
aid        1
of         1
their      1
country    1
it         2
was        2
a          2
dark       2
and        2
stormy     2
night      2
in         2
the        2
country    2
manor      2
the        2
time       2
was        2
past       2
midnight   2
Getting Started!
How are I.I. files created?

- After all documents have been parsed, the inverted file is sorted alphabetically.

Term       Doc #
a          2
aid        1
all        1
and        2
come       1
country    1
country    2
dark       2
for        1
good       1
in         2
is         1
it         2
manor      2
men        1
midnight   2
night      2
now        1
of         1
past       2
stormy     2
the        1
the        1
the        2
the        2
their      1
time       1
time       2
to         1
to         1
was        2
was        2
Getting Started!
How are I.I. files created?

- Multiple term entries for a single document are merged.
- Within-document term frequency information is compiled.

Term       Doc #   Freq
a          2       1
aid        1       1
all        1       1
and        2       1
come       1       1
country    1       1
country    2       1
dark       2       1
for        1       1
good       1       1
in         2       1
is         1       1
it         2       1
manor      2       1
men        1       1
midnight   2       1
night      2       1
now        1       1
of         1       1
past       2       1
stormy     2       1
the        1       2
the        2       2
their      1       1
time       1       1
time       2       1
to         1       2
was        2       2
Getting Started!

How are I.I. files created?
  - Finally, the file can be split into
    • a Dictionary (or Lexicon) file, and
    • a Postings file
Getting Started!

How are I.I. files created?

Dictionary/Lexicon                  Postings
Term       N docs   Tot Freq        Doc #   Freq
a          1        1               2       1
aid        1        1               1       1
all        1        1               1       1
and        1        1               2       1
come       1        1               1       1
country    2        2               1       1
                                    2       1
dark       1        1               2       1
for        1        1               1       1
good       1        1               1       1
in         1        1               2       1
is         1        1               1       1
it         1        1               2       1
manor      1        1               2       1
men        1        1               1       1
midnight   1        1               2       1
night      1        1               2       1
now        1        1               1       1
of         1        1               1       1
past       1        1               2       1
stormy     1        1               2       1
the        2        4               1       2
                                    2       2
their      1        1               1       1
time       2        2               1       1
                                    2       1
to         1        2               1       2
was        1        2               2       2
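The parse, sort, merge, and split steps can be sketched in a few lines of Python for the two example documents. This is only an illustration: the names `tokenize` and `build_index` and the simple regular-expression tokenizer are assumptions, not something the slides specify.

```python
from collections import Counter, defaultdict
import re

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

def tokenize(text):
    # Lowercase and keep alphabetic runs only (an assumed, very simple tokenizer).
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    postings = defaultdict(list)           # term -> [(doc_id, freq), ...]
    for doc_id in sorted(docs):
        # Counter merges repeated terms within one document into a frequency.
        for term, freq in Counter(tokenize(docs[doc_id])).items():
            postings[term].append((doc_id, freq))
    # Dictionary entry per term: (number of docs, total frequency).
    dictionary = {
        term: (len(plist), sum(freq for _, freq in plist))
        for term, plist in postings.items()
    }
    return dictionary, dict(postings)

dictionary, postings = build_index(DOCS)
for term in sorted(dictionary):            # alphabetical order, as in the slides
    print(term, dictionary[term], postings[term])
```

Keeping the dictionary separate from the postings mirrors the split described above: the small dictionary can be searched quickly, and it points to the (potentially much larger) postings lists.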
Getting Started!

Inverted indexes
  - Permit fast search for individual terms

  - For each term, you get a list consisting of:
  • document ID
  • frequency of term in doc (optional)
  • position of term in doc   (optional)

  - These lists can be used to solve Boolean queries:
      – country -> d1, d2
      – manor -> d2
      – country AND manor -> d2
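The Boolean query above can be answered by intersecting postings lists. Below is a minimal sketch assuming each postings list is a sorted list of document IDs; `INDEX`, `intersect`, and `boolean_and` are hypothetical names used only for illustration.

```python
# Postings for two terms from the example, as sorted lists of doc IDs.
INDEX = {
    "country": [1, 2],
    "manor": [2],
}

def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def boolean_and(index, *terms):
    """Documents containing every term; unknown terms have empty postings."""
    lists = sorted((index.get(t, []) for t in terms), key=len)  # shortest first
    if not lists:
        return []
    result = lists[0]
    for plist in lists[1:]:
        result = intersect(result, plist)
    return result

print(boolean_and(INDEX, "country", "manor"))  # [2]
```

Starting with the shortest list keeps the intermediate result small, which is a common choice for AND queries.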

Getting Started!

Inverted Indexes for Web SE
  - Inverted indexes are still used, even though the web is so
      huge.

  - Some systems partition the indexes across different
     machines. Each machine handles different parts of the
     data.

  - Other systems duplicate the data across many machines;
     queries are distributed among the machines.

  - Most do a combination of these.
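A rough sketch of the partitioning idea follows, assuming documents are assigned to machines by document ID and every query is sent to all partitions. The names `build_shard`, `partition_by_doc`, and `search` and the modulo assignment are illustrative assumptions; the slides do not say how any particular engine does it.

```python
from collections import defaultdict

def build_shard(docs):
    """Tiny inverted index (term -> sorted doc IDs) for one partition."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def partition_by_doc(docs, n_shards):
    """Assign each document to a shard by doc ID (hypothetical scheme)."""
    parts = [dict() for _ in range(n_shards)]
    for doc_id, text in docs.items():
        parts[doc_id % n_shards][doc_id] = text
    return [build_shard(p) for p in parts]

def search(shards, term):
    """Scatter the query to every shard and merge the partial results."""
    hits = []
    for shard in shards:
        hits.extend(shard.get(term, []))
    return sorted(hits)

DOCS = {
    1: "now is the time",
    2: "the country manor",
    3: "time was past midnight",
    4: "a dark and stormy night",
}
shards = partition_by_doc(DOCS, n_shards=2)
print(search(shards, "time"))
```

Duplication would correspond to keeping several copies of each shard and sending each query to only one copy, which spreads query load rather than data.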



Basic Web SE Architecture

  1. Crawl the web.
  2. Check for duplicates and store the documents (producing DocIds).
  3. Create an inverted index.
  4. The user query goes to the search engine servers, which use the inverted index.
  5. Show results to the user.
Google’s Architecture
 • Sorted barrels = inverted index
 • PageRank computed from link structure; combined with IR rank
 • IR rank depends on TF, type of “hit”, hit proximity, etc.
 • Billion documents
 • Hundred million queries a day
Inside PageRank

Motivation
 • Web: heterogeneous and unstructured
 • No quality control on the web
 • Commercial interest in manipulating rankings
 • Building an open lab for scientists
Inside PageRank

Motivation
 • Most algorithms come from IR (e.g. the vector space model)
 • They use only page content and neglect the graph structure
Inside PageRank

Related Work
 • Assumption: if the pages pointing to this page
   are good, then this is also a good page.
    – References: Kleinberg 98, Page et al. 98

 • Draws upon earlier research in sociology and
   bibliometrics.
    – Kleinberg’s model includes “authorities” (highly
      referenced pages) and “hubs” (pages containing
      good reference lists).
    – The Google model is a version with no hubs, and is
      closely related to work on influence weights by
      Pinski-Narin (1976).
Inside PageRank

PR: Bringing Order to the Web
  Basic idea
  • Introduce a notion of page authority that is
    independent of the page content
  • Take into account only the topological structure of the web
  • Intuition: a page has high rank if the sum of the ranks
    of its backlinks is high
  • A similar idea can be found in scientific
    citation
Inside PageRank



   • Pages with lots of backlinks are important
   • Backlinks from important pages convey
     more importance to a page

   • Problem: rank sinks (dangling pages)
Inside PageRank




    Problem: this loop will accumulate rank but
        never distribute any rank outside!
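The rank-sink effect can be reproduced on a toy graph. The sketch below assumes a simplified propagation rule without any escape term, in which each page splits its rank evenly among its outlinks; the graph `LINKS` and the function `propagate` are hypothetical.

```python
# Hypothetical 3-page graph: A links to B, and B and C link only to each other.
LINKS = {"A": ["B"], "B": ["C"], "C": ["B"]}

def propagate(rank, links):
    """One step of the simplified rule: each page splits its rank among its outlinks."""
    new_rank = {page: 0.0 for page in links}
    for page, outlinks in links.items():
        share = rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    return new_rank

rank = {page: 1.0 / len(LINKS) for page in LINKS}
for _ in range(10):
    rank = propagate(rank, LINKS)

print(rank)
```

Page A ends with no rank at all, while the B/C loop holds the entire total and never passes any of it back out.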


Inside PageRank




  • Where
  • W = {w_i,j}: the transition matrix
  • w_i,j = 1/h_j if there is a hyperlink from page j
    to page i (h_j is the number of outgoing links of page j),
    and w_i,j = 0 otherwise
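The definition of W can be turned into code directly. This sketch assumes the link structure is given as a dictionary from each page to the pages it links to; the graph below is a hypothetical example, not the one from the slide's figure.

```python
def transition_matrix(links, pages):
    """Build W with W[i][j] = 1/h_j if page j links to page i, else 0."""
    pos = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    W = [[0.0] * n for _ in range(n)]
    for j_page, outlinks in links.items():
        targets = set(outlinks)          # ignore duplicate links
        for i_page in targets:
            W[pos[i_page]][pos[j_page]] = 1.0 / len(targets)
    return W

# Hypothetical example graph (an assumption for illustration).
PAGES = ["A", "B", "C"]
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

W = transition_matrix(LINKS, PAGES)
for row in W:
    print(row)
```

Each column j describes how page j distributes its rank, so every column of a page with at least one outlink sums to 1.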
Inside PageRank




 Consider this simple case: what will the
 transition matrix look like?
Inside PageRank




  • The system is stable and x(t) always
    converges to the stationary solution
  • d is a damping factor, with 0 < d < 1
  • This is just the Jacobi algorithm for solving a
    linear system
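The iteration can be sketched as follows. The update formula is not visible in this text, so the code assumes the commonly used normalized form x(t+1) = d·W·x(t) + (1 − d)/N; the function name `pagerank`, the default d = 0.85, and the example matrix are all assumptions for illustration.

```python
def pagerank(W, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate x(t+1) = d * W x(t) + (1 - d) / N until the change is below `tol`."""
    n = len(W)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        new_x = [
            d * sum(W[i][j] * x[j] for j in range(n)) + (1.0 - d) / n
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    return x

# Transition matrix of a hypothetical 3-page graph:
# A -> B, A -> C, B -> C, C -> A (columns are source pages).
W = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
ranks = pagerank(W)
print(ranks)
```

Because pages do not link to themselves here, the diagonal of W is zero, and this fixed-point update coincides with a Jacobi sweep on the linear system (I − dW)x = (1 − d)/N · 1.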
Inside PageRank



  • PR corresponds to the probability distribution
    of a random walk on the web graph

  • The escape term can be
    personalized!
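One way to read "personalized" is that the random surfer, when escaping, jumps to pages according to a chosen preference distribution instead of uniformly. The sketch below assumes the update x(t+1) = d·W·x(t) + (1 − d)·e, where e is that preference vector; the names and the example graph are hypothetical.

```python
def personalized_pagerank(W, escape, d=0.85, iterations=200):
    """Iterate x(t+1) = d * W x(t) + (1 - d) * escape for a fixed number of steps."""
    n = len(W)
    x = list(escape)
    for _ in range(iterations):
        x = [
            d * sum(W[i][j] * x[j] for j in range(n)) + (1.0 - d) * escape[i]
            for i in range(n)
        ]
    return x

# Hypothetical 3-page graph: A -> B, A -> C, B -> C, C -> A.
W = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
uniform = personalized_pagerank(W, [1 / 3, 1 / 3, 1 / 3])
prefers_b = personalized_pagerank(W, [0.0, 1.0, 0.0])  # surfer always restarts at B
print(uniform)
print(prefers_b)
```

With the escape term concentrated on page B, B's score rises relative to the uniform case, even though the link structure is unchanged.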
Inside PageRank



   A stochastic process is any sequence of
   experiments for which the outcome at any
   stage depends on chance. A Markov process
   is a stochastic process with the following
   properties:

  • The number of possible outcomes or states is finite
  • The probability of the next outcome depends only on
    the previous outcome
  • The probabilities are constant over time
Inside PageRank



Theorem 1:
If λ = 1 is a dominant eigenvalue of a stochastic
matrix A, then the Markov chain with transition
matrix A will converge to a steady state.
The Perron theorem can be used to
show that if the transition matrix A
of a Markov process is positive, then
λ = 1 is a dominant eigenvalue of A.
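The convergence claim can be checked numerically on a small example. This sketch assumes a column-stochastic convention (each column of A sums to 1) and uses a hypothetical positive 2×2 matrix; it is a numerical illustration, not a proof.

```python
def step(A, x):
    """One Markov step: x_next[i] = sum_j A[i][j] * x[j] (columns of A sum to 1)."""
    n = len(A)
    return [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]

def steady_state(A, x, steps=200):
    """Apply the transition matrix repeatedly, starting from distribution x."""
    for _ in range(steps):
        x = step(A, x)
    return x

# Hypothetical positive, column-stochastic transition matrix.
A = [
    [0.9, 0.5],
    [0.1, 0.5],
]
from_first = steady_state(A, [1.0, 0.0])
from_second = steady_state(A, [0.0, 1.0])
print(from_first, from_second)
```

Both starting distributions end up at the same vector, which is the eigenvector of A for the eigenvalue λ = 1 scaled to sum to 1.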
Inside PageRank



Theorem 2:
If A is a positive n×n matrix, then A has a
positive real eigenvalue R with the following
properties:

1. R has a positive eigenvector x
2. If λ is any other eigenvalue of A, then
   |λ| < R
Outside PageRank




References
 • The PageRank Citation Ranking: Bringing Order to the Web
 • The Anatomy of a Large-Scale Hypertextual Web Search Engine
 • Inside PageRank
 • Combating Web Spam with TrustRank
 • Does “Authority” Mean Quality?
 • What Can You Do with a Web in Your Pocket?
 • Modern Information Retrieval (book)
 • Data Mining: Concepts and Techniques (book)




                                                40
41

More Related Content

Viewers also liked (6)

Short history of google
Short history of googleShort history of google
Short history of google
 
Detail History of web 1.0 to 3.0
Detail History of web 1.0 to 3.0Detail History of web 1.0 to 3.0
Detail History of web 1.0 to 3.0
 
Web 1.0, Web 2.0 & Web 3.0
Web 1.0, Web 2.0 & Web 3.0Web 1.0, Web 2.0 & Web 3.0
Web 1.0, Web 2.0 & Web 3.0
 
Web 1.0, 2.0 y 3.0
Web 1.0, 2.0 y 3.0Web 1.0, 2.0 y 3.0
Web 1.0, 2.0 y 3.0
 
History of Google ppt
History of Google pptHistory of Google ppt
History of Google ppt
 
Presentation on-google
Presentation on-googlePresentation on-google
Presentation on-google
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 

Inside Search Engine - A case study

  • 1. Inside Search Engine A case study OF
  • 2. Outline • WWW of SE(SearchEngine) •A brief overview of SE history •Getting Started! •Basic Architecture of SE •Inside PageRank •Related work & Future 2
  • 3. Motivation Unedited – anyone can enter content – Quality issues; Spam Varied information types – Catalogs, dissertations, news reports, weather, pictures,videos… Different kinds of users • Lexis-Nexis: Paying, professional searchers • Online catalogs: Scholars searching scholarly literature • Web: Every type of person with every type of goal Scale • Hundreds of millions of searches/day; billions of docs 3
  • 4. Motivation What’s the situation without SE? 4
  • 5. Motivation “Necessity Is The mother of Invention” famous saying So,it’s a KDD(Knowledge Discovery from Data) process! 5
  • 6. Motivation Search Engine Saves Today! A Search Engine helps you find things on the Internet. Any time anyone looks up anything on the Internet! 6
  • 9. A brief History •Three major categories of SE – Full-text Search Engine – Dictonary Search Engine (generally speaking) – Meta Search Engine •Major Issues of SE - Understanding Search Queries - Understanding Website & Hyperlinks - Accuracy & Relevance - Honesty & Anti-Spam ! 9
  • 11. Getting Started! •Importance of Links – Internal links (links within your site) – Outbound links (sites you link to) – Inbound links (sites linking to you) •Good Websites - Key pages with only a few click - User Navigation - Links easy for Robot - Anchor text 11
  • 12. Getting Started! •Anchor text (descriptive) •Crawler (spider) - Main difficulities - Graph Theory - A simple process 12
  • 13. Getting Started! •Inverted Indexes (The IR Way ) •How I.I are created? •A Detailed Example of two Docs •I.I for Web Search Engines 13
  • 14. Getting Started! How I.I files are created? Term Doc # now 1 is 1 the 1 time 1 - Periodically rebuilt, static otherwise. for all 1 1 good 1 men 1 - Docs are parsed to extract tokens. These to 1 are saved with Doc ID come to 1 1 the 1 aid 1 of 1 Doc 1 Doc 2 their country 1 1 it 2 It was a dark and was a 2 2 Now is the time dark 2 stormy night in and stormy 2 2 for all good men night in 2 2 the country the 2 to come to the aid country manor 2 2 manor. The time the 2 of their country time was 2 2 was past midnight past midnight 2 2 14
  • 15. Getting Started! Term Doc # Term Doc # How I.I files now is 1 1 a aid 2 1 are created? the 1 all 1 time 1 and 2 for 1 come 1 all 1 country 1 good 1 country 2 men 1 dark 2 to 1 for 1 come 1 good 1 - After all documents have to 1 in 2 the 1 is 1 been parsed the inverted aid 1 it 2 file is sorted of their 1 1 manor men 2 1 alphabetically. country it 1 2 midnight night 2 2 was 2 now 1 a 2 of 1 dark 2 past 2 and 2 stormy 2 stormy 2 the 1 night 2 the 1 in 2 the 2 the 2 the 2 country 2 their 1 manor 2 time 1 the 2 time 2 time 2 to 1 was 2 to 1 past 2 was 2 midnight 2 was 15 2
  • 16. Getting Started! Term Doc # How I.I files a aid 2 1 Term a Doc # 2 Freq 1 are created? all 1 aid 1 1 and 2 all 1 1 come 1 and 2 1 country 1 come 1 1 country 2 country 1 1 dark 2 - Multiple term entries for 1 country dark 2 2 1 1 good 1 for a single document in 2 for 1 1 good 1 1 are merged. is it 1 2 in 2 1 manor 2 is 1 1 men 1 it 2 1 - Within-document term midnight night 2 2 manor 2 1 men 1 1 frequency now of 1 1 midnight 2 1 information is past 2 night now 2 1 1 1 stormy 2 compiled. the 1 of 1 1 the 1 past 2 1 the 2 stormy 2 1 the 2 the 1 2 their 1 the 2 2 time 1 their 1 1 time 2 time 1 1 to 1 time 2 1 to 1 to 1 2 was 2 was 2 216 was 2
  • 17. Getting Started! How I.I files are created? - Finally, the file can be split into • A Dictionary or Lexicon file and • A Postings file
  • 18. Getting Started! How I.I files are created? The merged file is split into a Dictionary/Lexicon (Term, N docs, Tot Freq) and Postings (Doc #, Freq):
    Term      N docs  Tot Freq  Postings (Doc #, Freq)
    a         1       1         (2, 1)
    aid       1       1         (1, 1)
    all       1       1         (1, 1)
    and       1       1         (2, 1)
    come      1       1         (1, 1)
    country   2       2         (1, 1) (2, 1)
    dark      1       1         (2, 1)
    for       1       1         (1, 1)
    good      1       1         (1, 1)
    in        1       1         (2, 1)
    is        1       1         (1, 1)
    it        1       1         (2, 1)
    manor     1       1         (2, 1)
    men       1       1         (1, 1)
    midnight  1       1         (2, 1)
    night     1       1         (2, 1)
    now       1       1         (1, 1)
    of        1       1         (1, 1)
    past      1       1         (2, 1)
    stormy    1       1         (2, 1)
    the       2       4         (1, 2) (2, 2)
    their     1       1         (1, 1)
    time      2       2         (1, 1) (2, 1)
    to        1       2         (1, 2)
    was       1       2         (2, 2)
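    The parse, merge and split steps above can be sketched in a few lines of Python. This is a minimal illustration on the two example documents, not how a production indexer works: the names `DOCS`, `tokenize` and `build_index` are hypothetical, the tokenizer is a naive lowercase letter matcher, and `Counter` does the "merge duplicate entries" step directly instead of sorting a flat file first.

    ```python
    import re
    from collections import Counter, defaultdict

    # The two example documents from the slides.
    DOCS = {
        1: "Now is the time for all good men to come to the aid of their country",
        2: "It was a dark and stormy night in the country manor. "
           "The time was past midnight",
    }


    def tokenize(text):
        """Naive tokenizer (an assumption): lowercase runs of letters."""
        return re.findall(r"[a-z]+", text.lower())


    def build_index(docs):
        """Return (dictionary, postings) for a {doc_id: text} mapping.

        postings:   term -> list of (doc_id, within-document frequency)
        dictionary: term -> (number of docs, total frequency)
        """
        postings = defaultdict(list)
        for doc_id in sorted(docs):
            # Counter merges repeated terms of one document into a frequency.
            for term, freq in Counter(tokenize(docs[doc_id])).items():
                postings[term].append((doc_id, freq))
        dictionary = {
            term: (len(plist), sum(freq for _, freq in plist))
            for term, plist in postings.items()
        }
        return dictionary, dict(postings)


    dictionary, postings = build_index(DOCS)
    for term in sorted(dictionary):
        print(term, dictionary[term], postings[term])
    ```

    Keeping the dictionary separate from the postings mirrors the split on the slide: the small dictionary can stay in memory while the longer posting lists are read only for the terms a query mentions.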
  • 19. Getting Started! Inverted indexes - Permit fast search for individual terms - For each term, you get a list consisting of: • document ID • frequency of term in doc (optional) • position of term in doc (optional) - These lists can be used to solve Boolean queries: – country -> d1, d2 – manor -> d2 – country AND manor -> d2
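    A Boolean AND over two terms reduces to intersecting their posting lists. The sketch below shows the usual merge-style intersection of two sorted lists of document IDs; the posting lists come from the two-document example, while the names `POSTINGS`, `intersect` and `boolean_and` are hypothetical.

    ```python
    # Posting lists (sorted document IDs) taken from the two-document example.
    POSTINGS = {
        "country": [1, 2],
        "manor": [2],
        "time": [1, 2],
        "dark": [2],
    }


    def intersect(p1, p2):
        """Intersect two sorted posting lists by walking both in step."""
        i = j = 0
        result = []
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                result.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return result


    def boolean_and(term_a, term_b, postings=POSTINGS):
        """Documents containing both terms; unknown terms match nothing."""
        return intersect(postings.get(term_a, []), postings.get(term_b, []))


    print(boolean_and("country", "manor"))
    ```

    Because both lists are sorted, the intersection takes time proportional to the sum of their lengths, which is why posting lists are typically kept in document-ID order.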
  • 20. Getting Started! Inverted Indexes for Web SE - Inverted indexes are still used, even though the web is so huge. - Some systems partition the indexes across different machines. Each machine handles different parts of the data. - Other systems duplicate the data across many machines; queries are distributed among the machines. - Most do a combination of these.
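    The two strategies can be contrasted with a toy, single-process sketch. Everything here is an assumption for illustration: "machines" are plain dictionaries, partitioning is by document, replication uses simple round-robin, and the names `SHARDS`, `FULL_INDEX`, `search_partitioned` and `search_replicated` are hypothetical.

    ```python
    from itertools import cycle

    # Partitioning: each hypothetical shard indexes a disjoint set of documents.
    SHARDS = [
        {"country": [1], "time": [1]},                 # shard 0 holds doc 1
        {"country": [2], "manor": [2], "time": [2]},   # shard 1 holds doc 2
    ]


    def search_partitioned(term, shards):
        """Ask every shard for the term and merge the partial posting lists."""
        hits = []
        for shard in shards:
            hits.extend(shard.get(term, []))
        return sorted(hits)


    # Replication: identical copies of the full index; each query goes to one.
    FULL_INDEX = {"country": [1, 2], "manor": [2], "time": [1, 2]}
    _replicas = cycle([dict(FULL_INDEX), dict(FULL_INDEX)])


    def search_replicated(term):
        """Send the query to the next replica in round-robin order."""
        return next(_replicas).get(term, [])


    print(search_partitioned("country", SHARDS))
    print(search_replicated("manor"))
    ```

    Partitioning spreads the storage and the per-query work, while replication spreads the query load; combining them means replicating each shard.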
  • 21. Basic Web SE Architecture [Diagram: crawl the web -> check for duplicates, store the documents (DocIds) -> create an inverted index -> inverted index -> search engine servers; user query -> search engine servers -> show results to user]
  • 22. Google’s Architecture • Sorted barrels = inverted index • PageRank is computed from the link structure and combined with the IR rank • IR rank depends on TF (term frequency), type of “hit”, hit proximity, etc. • A billion documents • A hundred million queries a day
  • 23. Inside PageRank Motivation • Web: heterogeneous and unstructured • No quality control on the web • Commercial interest in manipulating rankings • Building an open lab for scientists
  • 24. Inside PageRank Motivation • Most algorithms come from IR (e.g., the vector space model) • They use only the page content and neglect the graph (link) structure
  • 25. Inside PageRank Related Work. Assumption: if the pages pointing to a page are good, then that page is also good. References: Kleinberg 98, Page et al. 98. Draws upon earlier research in sociology and bibliometrics. • Kleinberg’s model includes “authorities” (highly referenced pages) and “hubs” (pages containing good reference lists). • The Google model is a version with no hubs, and is closely related to work on influence weights by Pinski-Narin (1976).
  • 26. Inside PageRank PR: Bringing Order to the Web. Basic idea • Introduce a notion of page authority that is independent of the page content • Take into account only the topological structure of the web • Intuition: a page has high rank if the sum of the ranks of its backlinks is high • A similar idea can be found in scientific citation analysis
  • 27. Inside PageRank • Pages with lots of backlinks are important • Backlinks that come from important pages convey more importance to a page • Problem: rank sinks and dangling pages
  • 28. Inside PageRank Problem: this loop accumulates rank but never distributes any rank outside of it!
  • 30. Inside PageRank Where • W = {w_{i,j}}: the transition matrix • w_{i,j} = 1/h_j if there is a hyperlink from page j to page i, and w_{i,j} = 0 otherwise (h_j is the number of outgoing links of page j)
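    The definition of W translates directly into code. The sketch below builds the matrix for a small link graph; the graph `LINKS` and the function name `transition_matrix` are hypothetical, pages are numbered from 0, and each page is assumed to link to a given target at most once.

    ```python
    # Hypothetical link graph: page j -> list of pages that j links to.
    LINKS = {0: [1, 2], 1: [2], 2: [0]}


    def transition_matrix(links, n):
        """Return W with W[i][j] = 1/h_j if j links to i, and 0 otherwise."""
        W = [[0.0] * n for _ in range(n)]
        for j, targets in links.items():
            h_j = len(targets)          # number of outgoing links of page j
            for i in targets:
                W[i][j] = 1.0 / h_j
        return W


    W = transition_matrix(LINKS, 3)
    for row in W:
        print(row)
    ```

    Each column j of W sums to 1 as long as page j has at least one outgoing link; a page with no outgoing links leaves an all-zero column.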
  • 31. Inside PageRank Consider this simple case: what will the transition matrix look like?
  • 33. Inside PageRank • The system is stable and x(t) always converges to the stationary solution • d is a damping factor with 0 < d < 1 • This is just the Jacobi algorithm for solving a linear system
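    The iteration can be sketched as follows. The update equation itself is not in this transcript, so the form used here, x(t+1) = d·W·x(t) + (1 − d)/N, is an assumption (one common normalization of PageRank); the default d = 0.85, the example matrix and the function name `pagerank` are likewise illustrative choices.

    ```python
    def pagerank(W, d=0.85, tol=1e-10, max_iter=1000):
        """Jacobi-style iteration x <- d*W*x + (1-d)/N (assumed form).

        W is a column-stochastic transition matrix given as a list of rows.
        """
        n = len(W)
        x = [1.0 / n] * n                       # start from the uniform vector
        for _ in range(max_iter):
            new_x = [
                d * sum(W[i][j] * x[j] for j in range(n)) + (1.0 - d) / n
                for i in range(n)
            ]
            # Stop once successive iterates differ very little (L1 distance).
            if sum(abs(a - b) for a, b in zip(new_x, x)) < tol:
                return new_x
            x = new_x
        return x


    # Hypothetical 3-page graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}.
    W = [
        [0.0, 0.0, 1.0],
        [0.5, 0.0, 0.0],
        [0.5, 1.0, 0.0],
    ]
    ranks = pagerank(W)
    print(ranks)
    ```

    Since 0 < d < 1, each step shrinks the distance between successive iterates by at least a factor of d, which is why the iteration converges regardless of the starting vector.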
  • 34. Inside PageRank • PR corresponds to the probability distribution of a random walk on the web graph • The escape term can be personalized!
  • 35. Inside PageRank A stochastic process is any sequence of experiments for which the outcome at any stage depends on chance. A Markov process is a stochastic process with the following properties: • The set of possible outcomes (states) is finite • The probability of the next state depends only on the previous state • The probabilities are constant over time
  • 36. Inside PageRank Theorem 1: If λ = 1 is a dominant eigenvalue of a stochastic matrix A, then the Markov chain with transition matrix A converges to a steady state. The Perron theorem can be used to show that if the transition matrix A of a Markov process is positive, then λ = 1 is a dominant eigenvalue of A.
  • 37. Inside PageRank Theorem 2: If A is a positive n×n matrix, then A has a positive real eigenvalue R with the following properties: 1. R has a positive eigenvector X. 2. If λ is any other eigenvalue of A, then |λ| < R.
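    The convergence claim can be checked numerically on a small case. The sketch below is only an illustration, not a proof: the positive column-stochastic matrix `A` is a made-up example, and `step` and `steady_state` are hypothetical helper names. It applies A repeatedly to two different starting distributions and compares the results.

    ```python
    # Hypothetical positive column-stochastic matrix (each column sums to 1).
    A = [
        [0.5, 0.2, 0.3],
        [0.3, 0.6, 0.3],
        [0.2, 0.2, 0.4],
    ]


    def step(A, x):
        """One step of the Markov chain: return the product A*x."""
        n = len(A)
        return [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]


    def steady_state(A, x0, iters=200):
        """Apply the transition matrix repeatedly to the distribution x0."""
        x = list(x0)
        for _ in range(iters):
            x = step(A, x)
        return x


    s1 = steady_state(A, [1.0, 0.0, 0.0])   # start with all mass on state 0
    s2 = steady_state(A, [0.0, 0.0, 1.0])   # start with all mass on state 2
    print(s1)
    print(s2)
    ```

    Both runs end at the same positive vector, which A leaves unchanged: it is an eigenvector for the eigenvalue 1, and the remaining eigenvalues are smaller in magnitude, so their contributions die out.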
  • 40. References • The PageRank Citation Ranking: Bringing Order to the Web • The Anatomy of a Large-Scale Hypertextual Web Search Engine • Inside PageRank • Combating Web Spam with TrustRank • Does “Authority” Mean Quality? • What Can You Do with a Web in Your Pocket? • Modern Information Retrieval (book) • Data Mining: Concepts and Techniques (book)