SIGIR 2014
Gold Coast, Australia, 06-11 July 2014
Uncovering the Unarchived Web
Thaer Samar, Hugo Huurdeman, Anat Ben-David, Jaap Kamps, Arjen de Vries
Link Extraction
Input
Dutch Archive (2009-2012)

7 TB (compressed)

76,828 ARC files

147,641,512 documents
Seedlist info:

5,000 websites

Selection dates

Assigned UNESCO codes
Filtering & Deduplication

Focus on links whose source page was archived in 2012

Deduplication: seeds are harvested at different frequencies, so the same link can be extracted more than once

Links are deduplicated on srcUrl, targetUrl, anchorText, and a hash of the source's content
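The deduplication key described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Link` record, the `dedup_links` helper, and the choice of SHA-1 as the content hash are assumptions, since the poster does not name a hash function or data layout.

```python
import hashlib
from typing import Iterable, Iterator, NamedTuple


class Link(NamedTuple):
    """Hypothetical link record; field names mirror the poster's key."""
    src_url: str
    target_url: str
    anchor_text: str
    src_content: bytes  # raw content of the archived source page


def dedup_links(links: Iterable[Link]) -> Iterator[Link]:
    """Yield one link per (srcUrl, targetUrl, anchorText, content hash)."""
    seen = set()
    for link in links:
        # SHA-1 is an assumption; any stable content hash would serve.
        content_hash = hashlib.sha1(link.src_content).hexdigest()
        key = (link.src_url, link.target_url, link.anchor_text, content_hash)
        if key not in seen:
            seen.add(key)
            yield link


links = [
    Link("http://a.nl/", "http://b.nl/x", "page x", b"<html>v1</html>"),
    # Same page harvested again, content unchanged: dropped as a duplicate.
    Link("http://a.nl/", "http://b.nl/x", "page x", b"<html>v1</html>"),
    # Source content changed between harvests: kept as a separate record.
    Link("http://a.nl/", "http://b.nl/x", "page x", b"<html>v2</html>"),
]
unique_links = list(dedup_links(links))
```

Including the content hash in the key means a repeated harvest of an unchanged page collapses to one record, while a changed source page still contributes its links.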
General Framework
Introduction

Web archives contain more than Web pages: they contain page
sources, outlinks, anchor text, and timestamps of archive dates

Outlinks and their anchor text can be used to establish evidence
of pages which existed at crawling time that were not archived
Further Analysis
The TLD distribution of the inter-domain uncovered Web resembles
that of a broad Web crawl (Common Crawl)
TLD distribution of unarchived URLs
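A TLD distribution like the one compared here could be computed along these lines. The helper name `tld_distribution` and the sample URLs are hypothetical, and taking the last label of the hostname as the TLD is a simplifying assumption.

```python
from collections import Counter
from urllib.parse import urlsplit


def tld_distribution(urls):
    """Return the fraction of URLs per top-level domain (last host label)."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if not host or "." not in host:
            continue  # skip URLs without a usable hostname
        tld = host.rsplit(".", 1)[1]
        if tld.isdigit():
            continue  # skip IPv4 addresses, which have no TLD
        counts[tld] += 1
    total = sum(counts.values())
    return {tld: n / total for tld, n in counts.items()} if total else {}


# Hypothetical sample of unarchived target URLs.
sample = ["http://a.nl/", "http://b.nl/x", "http://c.com/", "http://d.org/"]
distribution = tld_distribution(sample)
```

Two such distributions (for example, the uncovered URLs and a broad crawl) can then be compared TLD by TLD.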
Conclusions

Uncovering pages of the Web that were not archived and would
otherwise have been lost forever

Recovering representations of unarchived pages by exploiting the link
graph and anchor text

Aggregating anchor text from all sources linking to the target

Information about the sources linking to the target:
• number of (unique) sources
• source categories based on the assigned UNESCO codes
• indication of whether a source is on the seedlist or not
Unique source URL & anchor word counts (inter-domain links)
Uncovered URL representations
Results
Representation Aggregation
For each link target:

Union all anchor text describing links pointing to one target

Count number of unique sources & UNESCO codes pointing to target

Count number of unique anchor text words used to link to target
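The three aggregation steps above can be sketched as a single pass over the extracted links. This is an illustration under assumptions: the tuple layout, the function name `aggregate_representations`, the whitespace tokenization, and the UNESCO labels in the example are all hypothetical.

```python
from collections import defaultdict


def aggregate_representations(links):
    """Aggregate per-target evidence.

    links: iterable of (src_url, target_url, anchor_text, unesco_code)
    tuples; unesco_code may be None when the source has no assigned code.
    """
    reps = defaultdict(
        lambda: {"anchors": set(), "sources": set(), "unesco": set()}
    )
    for src_url, target_url, anchor_text, unesco_code in links:
        rep = reps[target_url]
        rep["anchors"].add(anchor_text)      # union of anchor texts
        rep["sources"].add(src_url)          # unique linking sources
        if unesco_code is not None:
            rep["unesco"].add(unesco_code)   # unique source categories

    summary = {}
    for target, rep in reps.items():
        # Unique anchor words: lowercase, whitespace-split (an assumption).
        words = {w for text in rep["anchors"] for w in text.lower().split()}
        summary[target] = {
            "anchor_text": " ".join(sorted(rep["anchors"])),
            "n_sources": len(rep["sources"]),
            "n_unesco": len(rep["unesco"]),
            "n_anchor_words": len(words),
        }
    return summary


# Hypothetical links; the category labels are placeholders.
example_links = [
    ("http://a.nl/", "http://t.nl/", "Dutch museum", "Art"),
    ("http://b.nl/", "http://t.nl/", "museum site", "History"),
    ("http://a.nl/p", "http://t.nl/", "Dutch museum", "Art"),
]
summary = aggregate_representations(example_links)
```

In this example the target `http://t.nl/` ends up with three unique sources, two categories, and three unique anchor words.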
Uncovered URLs Analysis

Distinguish between internal & external links

Internal link: source and target have the same domain name
(intra-domain): 8,692,308

External link: source and target have different domain names
(inter-domain): 3,205,354
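The internal/external distinction can be sketched as a hostname comparison. The poster does not say how "domain name" is derived, so comparing lowercase hostnames with a leading "www." removed is an assumption; comparing registered domains would require a public suffix list. The names `domain_name` and `link_type` are hypothetical.

```python
from urllib.parse import urlsplit


def domain_name(url: str) -> str:
    """Return the lowercase hostname without a leading 'www.' (assumption)."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def link_type(src_url: str, target_url: str) -> str:
    """Classify a link as intra-domain (internal) or inter-domain (external)."""
    if domain_name(src_url) == domain_name(target_url):
        return "intra-domain"
    return "inter-domain"


internal = link_type("http://www.example.nl/a", "http://example.nl/b")
external = link_type("http://example.nl/a", "http://other.org/")
```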
Categories of found URLs
1) Intentionally archived pages: pages from the seed list
2) Unintentionally archived pages: pages not from the seed list
(side-effect of crawling)
3) Aura: unarchived pages, known to exist because archived pages
link to them
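The three categories can be expressed as a small decision rule over link targets. This is a sketch under assumptions: `categorize_url`, the set of archived URLs, and matching seeds by hostname are all hypothetical choices, since the poster does not describe how seed membership was tested.

```python
from urllib.parse import urlsplit


def categorize_url(url, archived_urls, seed_hosts):
    """Categorize a URL that was found as a link target in archived pages.

    archived_urls: set of URLs present in the archive.
    seed_hosts: set of hostnames on the seed list (assumed matching rule).
    """
    host = (urlsplit(url).hostname or "").lower()
    if url in archived_urls:
        if host in seed_hosts:
            return "intentionally archived"
        return "unintentionally archived"
    # Not in the archive, but known to exist through inbound links.
    return "aura"


# Hypothetical archive contents and seed list.
archived = {"http://seed.nl/", "http://other.nl/page"}
seeds = {"seed.nl"}
categories = [
    categorize_url("http://seed.nl/", archived, seeds),
    categorize_url("http://other.nl/page", archived, seeds),
    categorize_url("http://lost.nl/", archived, seeds),
]
```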
The number of uncovered pages indirectly collected while crawling
is almost equal to the number of intentionally crawled pages!
