Building Googlebot
Youngjin Kim
October 15, 2013
http://www.creditwritedowns.com/2011/07/european-monetary-union-titanic.html
From the web to your query
● Query processing
1. Look up keywords in the index => every relevant page
2. Rank pages and display the result
● Google's index of the web
keyword => { page1, page2, ... }
● Building the index requires processing the current version of all of the pages on the web...
All of the pages on the web!?!
60 Trillion Pages And Counting!
Our local copy of the web
● Crawling
○ Googlebot
● Storage
○ Google File System (GFS), BigTable
● Processing
○ MapReduce
● Data Centers
○ Job control, Fault-Tolerance, High-Speed Networking, Power/Cooling, etc.
Finding every page with googlebot
● Basic discovery crawl
1. Start with the set of known links
2. Crawl every link (pages change!)
3. Extract every new link, repeat
(Diagram: crawl loop with boxes Crawl Pages, Extract Links, Crawl Status, Web Page)
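As a sketch, the three-step discovery crawl above can be written as a breadth-first loop over a frontier of links. Everything here (fetch, extract_links, the page cap) is a hypothetical stand-in for illustration, not Googlebot's actual interfaces:

```python
from collections import deque

def discovery_crawl(seed_links, fetch, extract_links, max_pages=1000):
    """Basic discovery crawl: start with known links, crawl every link,
    extract every new link, repeat until the frontier is empty."""
    frontier = deque(seed_links)
    crawled = {}  # link -> page content (a toy "crawl status" table)
    while frontier and len(crawled) < max_pages:
        link = frontier.popleft()
        if link in crawled:
            continue  # already fetched; a real crawler would instead
                      # schedule a recrawl, since pages change
        page = fetch(link)
        crawled[link] = page
        for new_link in extract_links(page):
            if new_link not in crawled:
                frontier.append(new_link)
    return crawled
```

A real crawler replaces the in-memory dict and deque with a distributed status table and a prioritized queue, as the next slide discusses.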
Key considerations in crawling
● Polite crawling
○ Do not overload websites and DNS (no DoS!)
○ Understand web serving infrastructure
● Prioritize among discovered links
○ Crawl is a giant queuing system
○ Predicting serving capacity
● Do not waste resources
○ Ignore spam/broken links
○ Skip links with duplicate content
Mirrors
● Hosts with exactly the same content
deview.kr
www.deview.kr
● Paths within hosts with the same content
www.cs.unc.edu/Courses/comp426-f09/docs/tools/downloads/tomcat/jakarta-tomcat-4.1.29/webapps/tomcat-docs
www.cs.unc.edu/Courses/comp590-001-f08/docs/tools/downloads/tomcat/jakarta-tomcat-4.1.29/webapps/tomcat-docs
www.cs.unc.edu/Courses/comp590-001-f08/tools/downloads/tomcat/jakarta-tomcat-4.1.29/webapps/tomcat-docs
www.cs.unc.edu/Courses/jbs/tools/downloads/tomcat/jakarta-tomcat/4.1.29/webapps/tomcat-docs
● Unrestricted mirroring across hosts and paths
○ Distributed graph mining
Optimizing our crawling
● Efficient crawling requires duplicate handling
○ Predict whether a newly discovered link points to duplicate content
○ Must happen before crawling
useful(link, status_table) => { yes, no }
Duplicates in Dynamic Pages
● Duplicates are most common in dynamic links
http://foo.com/forum/viewtopic.php?t=3808&sid=126bc5f2
http://foo.com/forum/viewtopic.php?t=3808&sid=d5b8483b
http://foo.com/forum/viewtopic.php?t=3808&sid=3b1a8e27
http://foo.com/forum/viewtopic.php?t=3808&sid=2a21f059
...
● Significance analysis
○ Parameter t is relevant
○ Parameter sid is irrelevant
● Duplicate prediction
http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a => Same Content
Equivalence rules and class names
● Equivalence rule for a cluster
○ Set of relevant parameters
○ Set of irrelevant parameters
● Equivalence class name
○ Remove irrelevant parameters
ECN(link1) = ECN(link2) => Same content!
○ For the previous example
ECN(http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a) = http://foo.com/forum/viewtopic.php?t=3808
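The ECN computation above (drop the irrelevant parameters, keep the rest) can be sketched with Python's standard URL utilities. The function name and the irrelevant_params argument are illustrative, not the deck's actual code:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def equivalence_class_name(link, irrelevant_params):
    """Remove a cluster's irrelevant query parameters from a link.
    Two links that map to the same result are predicted duplicates."""
    parts = urlsplit(link)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in irrelevant_params]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```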
Modified crawl algorithm
● Representative table
○ Equivalence class name => representative link
● Given a new link
1. Identify cluster
2. Look up equivalence rule
3. Apply rule to determine equivalence class name
4. Look up table of representatives
5. Crawl link if no representative found
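Steps 1-5 above can be sketched as a predicate over the representative table. All names here (rules, representatives, cluster_of) are hypothetical stand-ins under the assumption that a rule maps a link to its equivalence class name:

```python
def useful(link, representatives, rules, cluster_of):
    """Decide whether a new link is worth crawling."""
    cluster = cluster_of(link)        # 1. identify cluster
    rule = rules.get(cluster)         # 2. look up equivalence rule
    if rule is None:
        return True                   # no rule yet: crawl it
    ecn = rule(link)                  # 3. apply rule => class name
    if ecn in representatives:        # 4. look up representatives
        return False                  # duplicate predicted: skip
    representatives[ecn] = link       # this link now represents its class
    return True                       # 5. no representative found: crawl
```

Note the design choice that the predicate also registers the first link it sees as the class representative, so later links in the same class are skipped.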
Equivalence rule generation
● Find every crawled link under a cluster
cluster = { link1 : content1, link2 : content2, ... }
● Study evidence
1. Insignificance analysis
2. Significance analysis
3. Parameter classification
4. Equivalence rule construction
rule(cluster) = { param1 : RELEVANT, param2 : IRRELEVANT, param3 : CONFLICT, ... }
1. Insignificance analysis
● Group links by content
content1 = { link11, link12, ... }
content2 = { link21, link22, ... }
...
● For each parameter
○ For each content group with this parameter
■ If parameter values are not all the same, add the number of links to the insignificance index
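A minimal sketch of this step, assuming the cluster is a mapping from link to fetched content (function and helper names are illustrative). If links with identical content disagree on a parameter's value, that parameter evidently does not drive the content, so the group's link count is charged to its insignificance index:

```python
from urllib.parse import urlsplit, parse_qsl

def params_of(link):
    """Query parameters of a link as a dict."""
    return dict(parse_qsl(urlsplit(link).query))

def insignificance_index(cluster):
    """cluster: {link: content} => {parameter: insignificance index}."""
    by_content = {}                      # group links by content
    for link, content in cluster.items():
        by_content.setdefault(content, []).append(link)
    index = {}
    for links in by_content.values():
        all_params = set().union(*(params_of(l) for l in links))
        for p in all_params:
            values = {params_of(l).get(p) for l in links if p in params_of(l)}
            if len(values) > 1:          # same content, different values
                index[p] = index.get(p, 0) + len(links)
    return index
```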
2. Significance analysis
● For each parameter
○ Remove the parameter from every link
■ Group content by remainder link
remainder1 = { content11, content12, ... }
remainder2 = { content21, content22, ... }
...
■ Increment significance index by the number of unique contents minus 1
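The mirror-image step can be sketched the same way (names again illustrative). If links that agree on everything except one parameter still yield several distinct contents, that parameter evidently matters, and each remainder group contributes its unique-content count minus one:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def remainder(link, param):
    """The link with one query parameter removed."""
    parts = urlsplit(link)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    return urlunsplit(parts._replace(query=urlencode(kept)))

def significance_index(cluster):
    """cluster: {link: content} => {parameter: significance index}."""
    all_params = set()
    for link in cluster:
        all_params.update(dict(parse_qsl(urlsplit(link).query)))
    index = {}
    for p in all_params:
        groups = {}                      # remainder link -> set of contents
        for link, content in cluster.items():
            groups.setdefault(remainder(link, p), set()).add(content)
        extra = sum(len(cs) - 1 for cs in groups.values())
        if extra:
            index[p] = extra
    return index
```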
3. Parameter classification
● For each parameter
○ Compute content relevance (or irrelevance) value
Content_Relevance = Significance_Index / (Significance_Index + Insignificance_Index)
Content_Irrelevance = Insignificance_Index / (Significance_Index + Insignificance_Index)
○ Sample criteria: 90/10 rule
■ If relevance > 90% => parameter is RELEVANT
■ If relevance < 10% => parameter is IRRELEVANT
■ Otherwise, parameter is CONFLICT
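The 90/10 rule above is a one-liner once the two indexes are computed; the zero-evidence fallback below is my assumption, not stated on the slide:

```python
def classify(significance, insignificance, hi=0.9, lo=0.1):
    """Classify one parameter from its two evidence counters."""
    total = significance + insignificance
    if total == 0:
        return "CONFLICT"  # assumption: no evidence, treat as unresolved
    relevance = significance / total
    if relevance > hi:
        return "RELEVANT"
    if relevance < lo:
        return "IRRELEVANT"
    return "CONFLICT"
```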
Example: P is content-irrelevant
● Cluster
Content A:
http://foo.com/directory?P=1&Q=3
http://foo.com/directory?P=2&Q=3
Content B:
http://foo.com/directory?P=1&Q=2
http://foo.com/directory?P=2&Q=2
http://foo.com/directory?P=3&Q=2
http://foo.com/directory?P=4&Q=2
● Insignificance analysis of P
○ Content A: 2 links, different Ps => +2
○ Content B: 4 links, different Ps => +4
○ P's insignificance index = 2 + 4 = 6, content-irrelevance value = 100%
● Significance analysis of P
○ Remainder Q=3: 2 links, all Content A; remainder Q=2: 4 links, all Content B
○ P's significance index = 0, content-relevance value = 0%
Example: Q is content-relevant
● Cluster
Content A:
http://foo.com/directory?P=1&Q=3
http://foo.com/directory?P=2&Q=3
Content B:
http://foo.com/directory?P=1&Q=2
http://foo.com/directory?P=2&Q=2
http://foo.com/directory?P=3&Q=2
http://foo.com/directory?P=4&Q=2
● Insignificance analysis of Q
○ Content A: 2 links, same Q; Content B: 4 links, same Q
○ Q's insignificance index = 0, content-irrelevance value = 0%
● Significance analysis of Q
○ Remainder P=1: 2 links, Contents A&B => +1; remainder P=2: 2 links, Contents A&B => +1
○ Q's significance index = 1 + 1 = 2, content-relevance value = 100%
Facing the Real World
● Limitations
○ Co-changing parameters
○ Noisy data
○ Parameters not used in the standard way
○ Need for continuous validation
● State-of-the-art
○ White-box vs black-box
● Search is not solved
○ Not even crawling is solved!
Defining duplicates
● Identical pages
● Identical visible content
● Essentially identical visible content
○ Ignore page generation time
○ Ignore breaking news side bar
○ etc.
● What is the right answer?
Two pages should be considered duplicates if our users would consider them duplicates
● How to translate this notion into a checksum?
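The slide leaves the checksum question open; one naive answer is to normalize the visible text before hashing, stripping volatile fragments. The regex here only removes clock times, a crude stand-in for "page generation time", and the whole function is a sketch of the idea rather than any real Googlebot checksum:

```python
import hashlib
import re

def duplicate_checksum(visible_text):
    """Checksum over 'essentially identical' visible content:
    strip volatile fragments (here just HH:MM[:SS] times, as a stand-in
    for generation timestamps), collapse whitespace, lowercase, hash."""
    text = re.sub(r"\b\d{1,2}:\d{2}(:\d{2})?\b", "", visible_text)
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

The hard part the slide points at is choosing what to strip: too little and mirrors hash differently, too much and distinct pages collide.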
Q&A
Thank You!
