3. From the web to your query
● Query processing
1. Look up keywords in the index => every relevant page
2. Rank pages and display the result
● Google's index of the web
keyword => { page1, page2, ... }
● Building the index requires processing the current version of all of the pages on the web...
6. Our local copy of the web
● Crawling
○ Googlebot
● Storage
○ Google File System (GFS), BigTable
● Processing
○ MapReduce
● Data Centers
○ Job control, Fault-Tolerance, High-Speed Networking, Power/Cooling, etc.
7. Finding every page with googlebot
● Basic discovery crawl
1. Start with the set of known links
2. Crawl every link (pages change!)
3. Extract every new link, repeat
[Diagram: crawl loop — "Crawl Pages" and "Extract Links" cycle between the crawl status table and web pages]
8. Key considerations in crawling
● Polite crawling
○ Do not overload websites and DNS (no DoS!)
○ Understand web serving infrastructure
● Prioritize among discovered links
○ Crawl is a giant queuing system
○ Predicting serving capacity
● Do not waste resources
○ Ignore spam/broken links
○ Skip links with duplicate content
9. Mirrors
● Hosts with exactly the same content
deview.kr
www.deview.kr
● Paths within hosts with the same content
www.cs.unc.edu/Courses/comp426-f09/docs/tools/downloads/tomcat/jakarta-tomcat-4.1.29/webapps/tomcat-docs
www.cs.unc.edu/Courses/comp590-001-f08/docs/tools/downloads/tomcat/jakarta-tomcat-4.1.29/webapps/tomcat-docs
www.cs.unc.edu/Courses/comp590-001-f08/tools/downloads/tomcat/jakarta-tomcat-4.1.29/webapps/tomcat-docs
www.cs.unc.edu/Courses/jbs/tools/downloads/tomcat/jakarta-tomcat/4.1.29/webapps/tomcat-docs
● Unrestricted mirroring across hosts and paths
○ Distributed graph mining
11. Optimizing our crawling
● Efficient crawling requires duplicate handling
○ Predict whether a newly discovered link points to duplicate content
○ Must happen before crawling
useful(link, status_table) => { yes, no }
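The predicate above can be sketched as a status-table lookup. This is a minimal illustration of the signature on the slide; the status labels ("crawled", "spam", "duplicate") are made-up examples, not Google's actual schema:

```python
# Hedged sketch of the useful(link, status_table) predicate from the slide.
# The status labels are illustrative assumptions.
def useful(link, status_table):
    """Return True if a newly discovered link is worth crawling."""
    status = status_table.get(link)
    if status in ("crawled", "spam", "duplicate"):
        return False  # already handled, or known to be worthless
    return True  # unseen link: queue it for crawling
```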
12. Duplicates in Dynamic Pages
● Duplicates are most common in dynamic links
http://foo.com/forum/viewtopic.php?t=3808&sid=126bc5f2
http://foo.com/forum/viewtopic.php?t=3808&sid=d5b8483b
http://foo.com/forum/viewtopic.php?t=3808&sid=3b1a8e27
http://foo.com/forum/viewtopic.php?t=3808&sid=2a21f059
...
● Significance analysis
○ Parameter t is relevant
○ Parameter sid is irrelevant
● Duplicate prediction
http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a => Same Content
13. Equivalence rules and class names
● Equivalence rule for a cluster
○ Set of relevant parameters
○ Set of irrelevant parameters
● Equivalence class name
○ Remove irrelevant parameters
ECN(link1) = ECN(link2) => Same content!
○ For the previous example
ECN(http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a) = http://foo.com/forum/viewtopic.php?t=3808
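As a concrete sketch, the equivalence class name can be computed by stripping the irrelevant parameters from the query string using standard-library URL parsing. The function name `ecn` is ours, taken from the slide's notation:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def ecn(link, irrelevant_params):
    """Equivalence class name: the link with irrelevant parameters removed."""
    parts = urlsplit(link)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in irrelevant_params]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# For the forum example, sid is irrelevant:
# ecn("http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a", {"sid"})
# yields "http://foo.com/forum/viewtopic.php?t=3808"
```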
14. Modified crawl algorithm
● Representative table
○ Equivalence class name => representative link
● Given a new link
1. Identify cluster
2. Lookup equivalence rule
3. Apply rule to determine equivalence class name
4. Lookup table of representatives
5. Crawl link if no representative found
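The five steps might look like the following sketch. Here a cluster is keyed by host plus path (an assumption; the slides do not say how clusters are identified), and an equivalence rule is represented as the set of irrelevant parameter names:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def ecn(link, irrelevant_params):
    """Equivalence class name: the link minus its irrelevant parameters."""
    parts = urlsplit(link)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in irrelevant_params]
    return urlunsplit(parts._replace(query=urlencode(kept)))

def should_crawl(link, rules, representatives):
    """Steps 1-5 of the modified crawl algorithm (illustrative shapes).

    rules: cluster key -> set of irrelevant parameter names
    representatives: equivalence class name -> representative link
    """
    parts = urlsplit(link)
    cluster = parts.netloc + parts.path            # 1. identify cluster
    irrelevant = rules.get(cluster, set())         # 2. look up equivalence rule
    name = ecn(link, irrelevant)                   # 3. apply rule -> class name
    if name in representatives:                    # 4. look up representatives
        return False                               #    duplicate: skip
    representatives[name] = link                   # 5. crawl, record representative
    return True
```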
15. Equivalence rule generation
● Find every crawled link under a cluster
cluster = { link1 : content1, link2 : content2, ... }
● Study evidence
1. Insignificance analysis
2. Significance analysis
3. Parameter classification
4. Equivalence rule construction
rule(cluster) = {
param1 : RELEVANT,
param2 : IRRELEVANT,
param3 : CONFLICT,
...
}
16. 1. Insignificance analysis
● Group links by content
content1 = { link11, link12, ... }
content2 = { link21, link22, ... }
...
● For each parameter
○ For each content group with this parameter
■ If parameter values are not the same, add the number of links to the insignificance index
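A minimal sketch of this analysis, assuming a cluster is given as a map from link to content checksum (the helper names are ours):

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

def query_params(link):
    return dict(parse_qsl(urlsplit(link).query))

def insignificance_index(cluster):
    """cluster maps link -> content checksum. Returns {param: index}."""
    by_content = defaultdict(list)              # group links by content
    for link, content in cluster.items():
        by_content[content].append(link)
    index = defaultdict(int)
    all_params = {p for link in cluster for p in query_params(link)}
    for p in all_params:
        for links in by_content.values():
            with_p = [l for l in links if p in query_params(l)]
            values = {query_params(l)[p] for l in with_p}
            if len(values) > 1:                 # same content despite differing values
                index[p] += len(with_p)
    return dict(index)
```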
17. 2. Significance analysis
● For each parameter
○ Remove the parameter from every link
■ Group content by remainder link
remainder1 = { content11, content12, ... }
remainder2 = { content21, content22, ... }
...
■ Increment significance index by the number of unique contents minus 1
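A matching sketch under the same assumed cluster shape (link -> content checksum); the helper names are ours:

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def strip_param(link, param):
    """The 'remainder link': the link with one query parameter removed."""
    parts = urlsplit(link)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    return urlunsplit(parts._replace(query=urlencode(kept)))

def significance_index(cluster):
    """cluster maps link -> content checksum. Returns {param: index}."""
    all_params = {k for link in cluster
                  for k, _ in parse_qsl(urlsplit(link).query)}
    index = {}
    for p in all_params:
        by_remainder = defaultdict(set)     # remainder link -> distinct contents
        for link, content in cluster.items():
            by_remainder[strip_param(link, p)].add(content)
        # each extra distinct content behind the same remainder is evidence
        # that p changes the content
        index[p] = sum(len(c) - 1 for c in by_remainder.values())
    return index
```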
18. 3. Parameter classification
● For each parameter
○ Compute content relevance (or irrelevance) value
Content_Relevance (%) = 100 × Significance_Index / (Significance_Index + Insignificance_Index)
Content_Irrelevance (%) = 100 × Insignificance_Index / (Significance_Index + Insignificance_Index)
○ Sample criteria: 90/10 rule
■ If relevance > 90 => parameter is RELEVANT
■ If relevance < 10 => parameter is IRRELEVANT
■ Otherwise, parameter is CONFLICT
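Putting the two indices together, the 90/10 rule can be sketched as follows. The tie-break for a parameter with no evidence at all is our assumption, not from the slides:

```python
def classify_parameters(sig, insig, hi=90, lo=10):
    """Apply the 90/10 rule. sig/insig map param -> index value."""
    labels = {}
    for p in set(sig) | set(insig):
        s, i = sig.get(p, 0), insig.get(p, 0)
        if s + i == 0:
            labels[p] = "CONFLICT"   # no evidence either way (our assumption)
            continue
        relevance = 100.0 * s / (s + i)
        if relevance > hi:
            labels[p] = "RELEVANT"
        elif relevance < lo:
            labels[p] = "IRRELEVANT"
        else:
            labels[p] = "CONFLICT"
    return labels
```

With the indices from the worked example on the next slides (P: significance 0, insignificance 6; Q: significance 2, insignificance 0), this labels P as IRRELEVANT and Q as RELEVANT.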
19. Example: P is content-irrelevant
● Cluster
○ Content A
http://foo.com/directory?P=1&Q=3
http://foo.com/directory?P=2&Q=3
○ Content B
http://foo.com/directory?P=1&Q=2
http://foo.com/directory?P=2&Q=2
http://foo.com/directory?P=3&Q=2
http://foo.com/directory?P=4&Q=2
● Insignificance Analysis of P
○ Content A: 2 links, different Ps
○ Content B: 4 links, different Ps
○ P's insignificance index = 2 + 4 = 6
○ P's content-irrelevance value = 100%
● Significance Analysis of P
○ Remainder Q=3: 2 links, Content A
○ Remainder Q=2: 4 links, Content B
○ P's significance index = 0
○ P's content-relevance value = 0%
20. Example: Q is content-relevant
● Cluster
○ Content A
http://foo.com/directory?P=1&Q=3
http://foo.com/directory?P=2&Q=3
○ Content B
http://foo.com/directory?P=1&Q=2
http://foo.com/directory?P=2&Q=2
http://foo.com/directory?P=3&Q=2
http://foo.com/directory?P=4&Q=2
● Insignificance Analysis of Q
○ Content A: 2 links, same Q
○ Content B: 4 links, same Q
○ Q's insignificance index = 0
○ Q's content-irrelevance value = 0%
● Significance Analysis of Q
○ Remainder P=1: 2 links, Contents A & B
○ Remainder P=2: 2 links, Contents A & B
○ Q's significance index = 1 + 1 = 2
○ Q's content-relevance value = 100%
21. Facing the Real World
● Limitations
○ Co-changing parameters
○ Noisy data
○ Parameters not used in the standard way
○ Need for continuous validation
● State-of-the-art
○ White-box vs black-box
● Search is not solved
○ Not even crawling is solved!
22. Defining duplicates
● Identical pages
● Identical visible content
● Essentially identical visible content
○ Ignore page generation time
○ Ignore breaking news side bar
○ etc.
● What is the right answer?
Two pages should be considered duplicates if our users would consider them duplicates
● How to translate this notion into a checksum?
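One illustrative (and deliberately naive) answer: normalize the visible content before hashing, dropping markup and volatile fragments such as timestamps. The specific normalization rules below are assumptions for demonstration, not a production recipe:

```python
import hashlib
import re

def content_checksum(html):
    """Checksum for 'essentially identical visible content' (illustrative only)."""
    text = re.sub(r"<[^>]+>", " ", html)                       # drop markup
    text = re.sub(r"\d{4}-\d{2}-\d{2}[ T]?[\d:]*", "", text)   # drop ISO-style timestamps
    text = re.sub(r"\s+", " ", text).strip().lower()           # normalize whitespace/case
    return hashlib.sha1(text.encode()).hexdigest()
```

Under these rules, two copies of a page that differ only in markup details and a generation timestamp hash to the same value; how far to push such normalization is exactly the open question the slide raises.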