Temporal Anchor Text as Proxy for User Queries
Thaer Samar, Arjen P. de Vries
Web Archiving 1/2
 The Web is a major source of published
information
 Content on the Web evolves and changes
continuously
 Many initiatives aim to archive the Web
 Petabytes of archived data
Web Archiving 2/2
 Web archives are incomplete
 Impossible to include all Web pages due to
crawling limitations; see, e.g., [Masanès06]
 Depth-first crawl: focuses only on selected web sites
 Breadth-first crawl: focuses on the entire domain,
but not in depth
Reconstruct Queries
 Our study: evolution of anchor text over time
to reconstruct what was important in the past
 Information that would be similar to user queries
 Inspiration:
 Document titles can be used as an approximation
of user queries [Jin et al.]
 Anchor text exhibits characteristics similar to user
queries and document titles [Eiron & McCurley]
Queries in the Past
 User queries have usually not been preserved
 Impossible to reconstruct which queries the
user would have used to search the archive
 However, web archives contain more than the
Web page content
 E.g., page source, different timestamps (archive
date, last-modified date), link structure
Link evidence and anchor Text
 Link information represents the source URL,
destination URL, and the anchor text
 Anchor text is a short text describing the
destination page
 Has been shown to improve search effectiveness in a
large number of Information Retrieval studies
Example link: source http://www.cwi.nl, destination http://www.nwo.nl, anchor text ‘NWO’
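The link triple described above (source URL, destination URL, anchor text) can be pulled out of a page with a few lines of Python. This is only a minimal sketch using the standard-library HTML parser, not the tooling used in the study; the names `Link`, `AnchorExtractor`, and `extract_links` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, NamedTuple, Optional, Tuple


class Link(NamedTuple):
    source: str       # URL of the page containing the link
    destination: str  # value of the href attribute
    anchor: str       # visible text between <a> and </a>


class AnchorExtractor(HTMLParser):
    """Collects (destination, anchor text) pairs from <a href=...> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def extract_links(source_url: str, html: str) -> List[Link]:
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    return [Link(source_url, dest, text) for dest, text in parser.links]


page = '<html><a href="http://www.nwo.nl"> NWO </a></html>'
links = extract_links("http://www.cwi.nl", page)
print(links)
```

Relative hrefs are returned as-is here; resolving them against the source URL would be a separate normalization step.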
Data: Dutch Web Archive
 National Library of the Netherlands (KB)
 Depth-first (selective) Web archive
 Since 2007
 10+ TB
 8,000+ websites
 Our snapshot
 2009-2012
Link Processing: example Web archive record
URL: http://www.cwi.nl
Archive-Date: 20091201
Content-Type: text/html
<html>
<a href=http://www.nwo.nl> NWO </a>
</html>
Link Processing
Filtering
 Pages of type text/html
 ~70% of archived objects
Extraction
 Source URL
 Destination URL
 Anchor text
 Crawl-date (YYYYMM)
Cleaning
 URL normalization; get the host of the source
and the destination
 Remove spam, e.g., ‘rolex watches’
Partitioning
 Based on one-year and one-month granularity
Deduplication
 Remove duplicate links caused by crawling
frequency
 Same source, destination, and anchor text
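The cleaning, partitioning, and deduplication steps can be sketched as follows. This is an illustration under assumptions, not the pipeline used in the study: the record types `RawLink` and `HostLink`, the one-word spam list, and the choice to deduplicate within a partition on the full source and destination URLs are all hypothetical.

```python
from typing import Iterable, List, NamedTuple
from urllib.parse import urlsplit


class RawLink(NamedTuple):
    source: str       # source URL
    destination: str  # destination URL
    anchor: str       # anchor text
    crawl_date: str   # YYYYMMDD, as in the Archive-Date header


class HostLink(NamedTuple):
    partition: str    # YYYY or YYYYMM
    source_host: str
    destination_host: str
    anchor: str


SPAM_TERMS = {"rolex"}  # hypothetical blocklist, for illustration only


def host_of(url: str) -> str:
    """Return the lowercased host of a URL, or '' if none can be parsed."""
    return urlsplit(url).hostname or ""


def is_spam(anchor: str) -> bool:
    text = anchor.lower()
    return any(term in text for term in SPAM_TERMS)


def process(links: Iterable[RawLink], granularity: str = "month") -> List[HostLink]:
    width = 6 if granularity == "month" else 4  # YYYYMM vs. YYYY
    seen = set()
    result = []
    for link in links:
        anchor = " ".join(link.anchor.split())
        src, dst = host_of(link.source), host_of(link.destination)
        if not anchor or not src or not dst or is_spam(anchor):
            continue
        partition = link.crawl_date[:width]
        # Assumption: a duplicate is the same source URL, destination URL,
        # and anchor text seen again within the same time partition.
        key = (partition, link.source, link.destination, anchor)
        if key in seen:
            continue
        seen.add(key)
        result.append(HostLink(partition, src, dst, anchor))
    return result


raw = [
    RawLink("http://www.cwi.nl", "http://www.nwo.nl", " NWO ", "20091101"),
    RawLink("http://www.cwi.nl", "http://www.nwo.nl", "NWO", "20091201"),
    RawLink("http://www.cwi.nl", "http://www.nwo.nl", "NWO", "20091215"),
    RawLink("http://spam.example", "http://shop.example", "cheap rolex watches", "20091201"),
]
monthly = process(raw, "month")
yearly = process(raw, "year")
```

With this input, the two December 2009 crawls of the same link collapse into one monthly record, and all three crawls collapse into one yearly record.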
Hosts Evolution
 Important hosts over time
 Aggregate links based on the target host
 Keep unique source hosts
 Multiple pages from the same host linking to the same
target host are counted as one
 Rank hosts based on the number of source hosts
linking to them
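The host ranking described above amounts to counting distinct source hosts per target host. A minimal sketch, assuming the links of one time partition are already reduced to (source host, target host) pairs; the function name `rank_hosts` is hypothetical, and whether links from a host to itself should be dropped is left open here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def rank_hosts(links: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Rank target hosts by the number of distinct source hosts linking to them."""
    sources: Dict[str, Set[str]] = defaultdict(set)
    for source_host, target_host in links:
        # A set keeps each source host once, however many of its pages link.
        sources[target_host].add(source_host)
    # Sort by descending count, then by host name for a stable order.
    return sorted(
        ((target, len(hosts)) for target, hosts in sources.items()),
        key=lambda item: (-item[1], item[0]),
    )


pairs = [
    ("www.cwi.nl", "www.nwo.nl"),
    ("www.cwi.nl", "www.nwo.nl"),   # second page on the same host: counted once
    ("www.kb.nl", "www.nwo.nl"),
    ("www.kb.nl", "www.cwi.nl"),
]
ranking = rank_hosts(pairs)
```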
% of new hosts over the years
[Figure: percentage of new hosts per year; e.g., new hosts in 2012 are those not seen in 2009, 2010, or 2011]
Anchor Text Evolution
 Measure the importance of anchor text a over
time in time-partitioned links
 Aggregate by anchor text
 Compute the archive-based popularity
 Normalize by the maximum
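The slide does not spell out the popularity formula, so the following is a sketch under a stated assumption: the popularity of anchor text a in a partition is taken to be the number of distinct source hosts that use a, divided by the largest such count in that partition. The function name `normalized_popularity` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def normalized_popularity(
    links: Iterable[Tuple[str, str, str]]
) -> Dict[Tuple[str, str], float]:
    """Map (partition, anchor) to a popularity score in (0, 1].

    Each input item is (partition, source_host, anchor).
    Assumption: raw popularity = number of distinct source hosts using the anchor.
    """
    hosts: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for partition, source_host, anchor in links:
        hosts[(partition, anchor)].add(source_host)

    # Largest raw count per partition, used as the normalizer.
    max_count: Dict[str, int] = defaultdict(int)
    for (partition, _), source_hosts in hosts.items():
        max_count[partition] = max(max_count[partition], len(source_hosts))

    return {
        key: len(source_hosts) / max_count[key[0]]
        for key, source_hosts in hosts.items()
    }


scores = normalized_popularity([
    ("2009", "www.cwi.nl", "twitter"),
    ("2009", "www.kb.nl", "twitter"),
    ("2009", "www.cwi.nl", "nwo"),
    ("2010", "www.kb.nl", "nwo"),
])
```

Normalizing within each partition keeps scores comparable across years with different crawl volumes.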
% new anchor text over the years
 Anchor text is new in a specific partition if it does
not appear in the previous partitions
 Based on one-year granularity
 59% new anchor text
 Based on one-month granularity
 34% new anchor text
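The "new in this partition" measure, used for both hosts and anchor text, can be sketched as a running set difference. This assumes partition keys (YYYY or YYYYMM strings) sort chronologically; the function name `percent_new` is hypothetical, and the first partition is trivially 100% new.

```python
from typing import Dict, Iterable, Set


def percent_new(items_by_partition: Dict[str, Iterable[str]]) -> Dict[str, float]:
    """Percentage of items in each partition not seen in any earlier partition."""
    seen: Set[str] = set()
    result: Dict[str, float] = {}
    for partition in sorted(items_by_partition):
        current = set(items_by_partition[partition])
        new = current - seen
        result[partition] = 100.0 * len(new) / len(current) if current else 0.0
        seen |= current
    return result


by_year = {
    "2009": ["amsterdam", "twitter"],
    "2010": ["twitter", "linkedin"],
    "2011": ["amsterdam", "linkedin", "vimeo", "flickr"],
}
new_share = percent_new(by_year)
```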
WikiStats
 Aggregated page-view counts of Wikipedia (WP) pages
 From Jan 2008 to Jan 2015
 We focus on
 Feb 2009 to Dec 2012
 Similar to the period of our snapshot of the Dutch
Web archive
 Keep WP titles viewed >= 1,000 times
Matching anchor text to WP titles
 Pre-process WP titles like the anchor text
 Lowercase
 Stop-word removal
 One-year and one-month granularity partitions
 Collect titles by exact match with the anchors
 Assume anchor popularity equals WP page
popularity
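The matching step can be sketched as below. The stop-word list, the replacement of underscores in WP titles by spaces, and the summing of views when several titles normalize to the same string are assumptions made for the example; the function names are hypothetical. The 1,000-view threshold is the one stated on the WikiStats slide.

```python
from typing import Dict, Iterable

STOP_WORDS = {"de", "het", "een", "the", "a", "of"}  # hypothetical, tiny list


def preprocess(text: str) -> str:
    """Lowercase and drop stop words; underscores are treated as spaces."""
    tokens = text.replace("_", " ").lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def match_anchors(
    anchors: Iterable[str], wp_views: Dict[str, int], min_views: int = 1000
) -> Dict[str, int]:
    """Return normalized anchor text -> WP view count for exact matches."""
    titles: Dict[str, int] = {}
    for title, views in wp_views.items():
        if views < min_views:
            continue
        key = preprocess(title)
        if key:
            titles[key] = titles.get(key, 0) + views

    matches: Dict[str, int] = {}
    for anchor in anchors:
        key = preprocess(anchor)
        if key in titles:
            # The anchor inherits the popularity of the matching WP page.
            matches[key] = titles[key]
    return matches


views = {"Amsterdam": 50000, "De_Volkskrant": 2000, "Obscure_page": 10}
matched = match_anchors(["Amsterdam", "de Volkskrant", "obscure page", "CWI"], views)
```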
Ranked anchor text with WP match
 Different rank cut-offs
 % overlap decreases as the cut-off increases
 ~56% of the top-1k anchor texts have a match
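The overlap at a given rank cut-off is the share of the top-k anchor texts that have a WP match. A minimal sketch, assuming the anchors are already ranked by popularity and the matched set comes from the exact-match step; the function name `overlap_at_cutoffs` and the toy data are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def overlap_at_cutoffs(
    ranked_anchors: List[str], matched: Set[str], cutoffs: Iterable[int]
) -> Dict[int, float]:
    """Percentage of the top-k ranked anchors that have a WP match, per cut-off k."""
    result: Dict[int, float] = {}
    for k in cutoffs:
        top = ranked_anchors[:k]
        hits = sum(1 for anchor in top if anchor in matched)
        result[k] = 100.0 * hits / len(top) if top else 0.0
    return result


ranked = ["amsterdam", "twitter", "home", "contact"]
overlap = overlap_at_cutoffs(ranked, {"amsterdam", "twitter"}, [2, 4])
```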
Examples of popular anchor text (with match)
 Major cities in the Netherlands
 E.g., Amsterdam, Rotterdam, Groningen, and Utrecht
 Social web sites
 E.g., twitter, linkedin, flickr, and vimeo
 Major Dutch daily newspapers
 E.g., de Volkskrant, Telegraaf, and Trouw
 Dutch public broadcasting
 uitzending gemist
 Government web service
 E.g., belastingdienst
Discussion
 Our original goal was to identify historically
trending events from the link evolution
recorded in the archive
 Unfortunately, we found only a few examples
with our current analysis
 E.g., ‘canon’ *
 However, important anchor text provides an
overview of important Dutch entities
* Corresponding to an activity initiated by the government to define
the canonical historic events in Dutch history
Limitations & Future Work
 Exact text matching between anchor text and
WP title
 E.g., ‘filmpje’ does not match WP title ‘filmpje!’
 Additional pre-processing
 Stemming, stopping, generalizing from exact match to
match with low edit distance
 Our analysis is based on a depth-first crawl of
a few thousand Dutch websites
 Breadth-first crawl such as [CommonCrawl]
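The proposed generalization from exact match to low edit distance can be sketched with the standard Levenshtein dynamic program. This illustrates the future-work idea only and is not part of the reported analysis; the threshold of 1 and the function names are assumptions.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


def fuzzy_match(anchor: str, titles: Iterable[str], max_distance: int = 1) -> List[str]:
    """Titles within max_distance edits of the anchor text."""
    return [t for t in titles if edit_distance(anchor, t) <= max_distance]


candidates = fuzzy_match("filmpje", ["filmpje!", "film"])
```

A single-edit threshold recovers the ‘filmpje’ / ‘filmpje!’ case; comparing every anchor with every title is quadratic, so a real run would need some blocking or indexing step.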
References
 [Masanès06] J. Masanès. Web Archiving. Springer, 2006
 [Jin et al.] Rong Jin, Alexander G. Hauptmann, and ChengXiang Zhai.
Title language model for information retrieval. In SIGIR 2002
 [Eiron & McCurley] Nadav Eiron and Kevin S. McCurley. Analysis of
anchor text for web search. In SIGIR 2003
 [CommonCrawl] https://commoncrawl.org/
 [WikiStats] http://wikistats.ins.cwi.nl/
