Temporal Anchor Text as Proxy for User Queries
Thaer Samar, Arjen P. de Vries

Web archives preserve the fast-changing web. While we can archive the web pages, the popularity of queries in the past has usually not been preserved. Previous studies have observed the importance of anchor text for improving the quality of text search, and have shown that anchor text is similar to real user queries and document titles. Other studies have shown that document titles are similar to real user queries. In this paper, we propose an approach to reconstruct the information that would be provided by a query log in the past using temporal anchor text. First, we study the link graph of four years of a Web archive in order to show how the target hosts and anchor text evolve over time. Second, we investigate the importance of anchor text over time. Our approach is to rank anchor text based on its popularity in the archive at a specific time. Then, we check the importance of the top-ranked anchor text in the public Web at the same time. In order to achieve this, we used the WikiStats dataset, which aggregates page views of Wikipedia pages. Using exact string matching between top-ranked anchor text and Wikipedia titles in the WikiStats dataset, we find a high percentage of overlap (approximately 57%). Our data strengthens the hypothesis that anchor text may be used as a proxy for actual query volume.

Published in: Science

  1. Temporal Anchor Text as Proxy for User Queries
     Thaer Samar, Arjen P. de Vries
  2. Web Archiving 1/2
     • The Web is a major source of published information
     • Content on the Web evolves and changes continuously
     • Many initiatives aim to archive the Web
     • Petabytes of archived data
  3. Web Archiving 2/2
     • Web archives are incomplete
     • Impossible to include all Web pages due to crawling limitations, e.g., [Masanès06]
     • Depth-first crawl: focus only on selected web sites
     • Breadth-first crawl: focus on the entire domain, but not in depth
  4. Reconstruct Queries
     • Our study: evolution of anchor text over time to reconstruct what was important in the past
     • Information that would be similar to user queries
     • Inspiration:
       • Document titles can be used as an approximation of user queries [Jin et al.]
       • Anchor text exhibits characteristics similar to user queries and document titles [Eiron & McCurley]
  5. Queries in the Past
     • User queries have usually not been preserved
     • Impossible to reconstruct which queries the user would have used to search the archive
     • However, web archives contain more than the Web page content
     • E.g., page source, different timestamps (archive date, last-modified date), link structure
  6. Link Evidence and Anchor Text
     • Link information represents the source URL, destination URL, and the anchor text
     • Anchor text is a short text describing the destination page
     • Has been shown to improve search effectiveness in a large number of Information Retrieval studies
     • Example: source http://www.cwi.nl, destination http://www.nwo.nl, anchor text ‘NWO’
  7. Data: Dutch Web Archive
     • National Library of the Netherlands (KB)
     • Depth-first (selective) Web archive
     • Since 2007
     • 10+ TB
     • 8,000+ websites
     • Our snapshot: 2009-2012
  8. Link Processing
     Filtering
     • text/html pages
     • ~70% of archived objects
     Web Archive Record
       URL: http://www.cwi.nl
       Archive-Date: 20091201
       Content-Type: text/html
       <html> <a href=http://www.nwo.nl> NWO </a> </html>
  9. Link Processing
     Filtering
     • text/html pages
     • ~70% of archived objects
     Extraction
     • Source URL
     Web Archive Record
       URL: http://www.cwi.nl
       Archive-Date: 20091201
       Content-Type: text/html
       <html> <a href=http://www.nwo.nl> NWO </a> </html>
  10. Link Processing
     Filtering
     • text/html pages
     • ~70% of archived objects
     Extraction
     • Source URL
     • Destination URL
     Web Archive Record
       URL: http://www.cwi.nl
       Archive-Date: 20091201
       Content-Type: text/html
       <html> <a href=http://www.nwo.nl> NWO </a> </html>
  11. Link Processing
     Filtering
     • text/html pages
     • ~70% of archived objects
     Extraction
     • Source URL
     • Destination URL
     • Anchor text
     Web Archive Record
       URL: http://www.cwi.nl
       Archive-Date: 20091201
       Content-Type: text/html
       <html> <a href=http://www.nwo.nl> NWO </a> </html>
  12. Link Processing
     Filtering
     • text/html pages
     • ~70% of archived objects
     Extraction
     • Source URL
     • Destination URL
     • Anchor text
     • Archive-date (YYYYMM)
     Web Archive Record
       URL: http://www.cwi.nl
       Archive-Date: 20091201
       Content-Type: text/html
       <html> <a href=http://www.nwo.nl> NWO </a> </html>
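The filtering and extraction steps on the slide above can be sketched in Python. This is only an illustration, not the authors' pipeline: the record is modelled as a plain dictionary, and the `body` key, the `AnchorExtractor` class, and `extract_links` are hypothetical names. The sample record quotes the `href` value, unlike the slide.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect (destination URL, anchor text) pairs from <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def extract_links(record):
    """Return (source URL, destination URL, anchor text, YYYYMM) tuples."""
    # Filtering: keep only text/html records.
    if not record["Content-Type"].startswith("text/html"):
        return []
    parser = AnchorExtractor()
    parser.feed(record["body"])
    parser.close()
    month = record["Archive-Date"][:6]  # 20091201 -> 200912
    return [(record["URL"], dest, anchor, month) for dest, anchor in parser.links]


# Hypothetical in-memory version of the record shown on the slide.
record = {
    "URL": "http://www.cwi.nl",
    "Archive-Date": "20091201",
    "Content-Type": "text/html",
    "body": '<html> <a href="http://www.nwo.nl"> NWO </a> </html>',
}
links = extract_links(record)
```

For the sample record, `links` holds a single tuple with source http://www.cwi.nl, destination http://www.nwo.nl, anchor text NWO, and month 200912.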
  13. Link Processing
     Filtering
     • Pages of type text/html
     • ~70% of archived objects
     Extraction
     • Source URL
     • Destination URL
     • Anchor text
     • Crawl-date (YYYYMM)
     Cleaning
     • URL normalization; get host of the source and the destination
     • Clean spam, e.g., rolex watches
  14. Link Processing
     Filtering
     • Pages of type text/html
     • ~70% of archived objects
     Extraction
     • Source URL
     • Destination URL
     • Anchor text
     • Crawl-date (YYYYMM)
     Cleaning
     • URL normalization; get host of the source and the destination
     • Clean spam, e.g., rolex watches
     Partitioning
     • Based on one-year and one-month granularity
  15. Link Processing
     Filtering
     • Pages of type text/html
     • ~70% of archived objects
     Extraction
     • Source URL
     • Destination URL
     • Anchor text
     • Crawl-date (YYYYMM)
     Cleaning
     • URL normalization; get host of the source and the destination
     • Clean spam, e.g., rolex watches
     Partitioning
     • Based on one-year and one-month granularity
     Deduplication
     • Remove duplicate links; due to crawling frequency
     • Same source, destination, and anchor text
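A minimal sketch of the cleaning, partitioning, and deduplication steps, under stated assumptions: URL normalization is reduced to lowercasing the host name, spam is detected with a hypothetical substring blocklist (`SPAM_TERMS`), and duplicates are removed within each monthly partition. The slides do not specify any of these details, so `host_of` and `clean_and_deduplicate` are illustrative only.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist; the slides only mention "rolex watches" as an example.
SPAM_TERMS = {"rolex watches"}


def host_of(url):
    """Assumed normalization: take the host name and lowercase it."""
    return (urlsplit(url).hostname or "").lower()


def clean_and_deduplicate(links):
    """links: iterable of (source URL, destination URL, anchor text, YYYYMM)."""
    seen = set()
    cleaned = []
    for source, dest, anchor, month in links:
        anchor = anchor.strip()
        # Cleaning: drop empty anchors and anchors containing a spam term.
        if not anchor or any(term in anchor.lower() for term in SPAM_TERMS):
            continue
        # Deduplication: same source, destination, and anchor text
        # (assumed to apply within one monthly partition).
        key = (source, dest, anchor, month)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({
            "source_host": host_of(source),
            "target_host": host_of(dest),
            "anchor": anchor,
            "month": month,      # one-month partition key
            "year": month[:4],   # one-year partition key
        })
    return cleaned


sample = [
    ("http://www.cwi.nl/a", "http://www.nwo.nl/", "NWO", "200912"),
    ("http://www.cwi.nl/a", "http://www.nwo.nl/", "NWO", "200912"),  # recrawl
    ("http://spam.example/", "http://shop.example/", "cheap rolex watches", "201001"),
]
cleaned = clean_and_deduplicate(sample)
```

Here the recrawled link and the spam link are both dropped, leaving one record in the 2009 partition.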
  16. Hosts Evolution
     • Important hosts over time
     • Aggregate links based on the target host
     • Keep unique source hosts
     • Multiple pages from the same host linking to the same target host are counted as one
     • Rank hosts based on the number of source hosts linking to them
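The host ranking described on this slide can be sketched as follows. The function name `rank_target_hosts` and the tie-breaking by host name are assumptions made for the example.

```python
from collections import defaultdict


def rank_target_hosts(links):
    """links: iterable of (source host, target host) pairs for one partition."""
    sources = defaultdict(set)
    for source_host, target_host in links:
        # A set keeps each source host once per target host, so many pages
        # from the same host linking to the same target count as one.
        sources[target_host].add(source_host)
    ranked = [(target, len(hosts)) for target, hosts in sources.items()]
    # Most linked-to hosts first; ties broken alphabetically (an assumption).
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


ranking = rank_target_hosts([
    ("a.nl", "nwo.nl"),
    ("a.nl", "nwo.nl"),  # second page on a.nl, counted once
    ("b.nl", "nwo.nl"),
    ("a.nl", "kb.nl"),
])
```

In this toy input, nwo.nl is ranked first with two distinct source hosts and kb.nl second with one.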
  17. % of new hosts over the years
     • [Figure] New hosts in 2012: hosts not in {2009, 2010, 2011}
  18. Anchor Text Evolution
     • Measure the importance of anchor text a over time in time-partitioned links
     • Aggregate by anchor text
     • Compute the archive-based popularity
     • Normalize by maximum
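A possible reading of "archive-based popularity, normalized by maximum" is sketched below. The slide does not say how popularity is counted, so this example assumes it is the number of deduplicated links carrying the anchor text within one partition, divided by the largest such count in that partition. `anchor_popularity` is a hypothetical name.

```python
from collections import Counter


def anchor_popularity(anchors):
    """anchors: anchor texts of the deduplicated links in one time partition."""
    counts = Counter(anchors)          # aggregate by anchor text
    if not counts:
        return {}
    top = max(counts.values())         # count of the most frequent anchor
    return {anchor: count / top for anchor, count in counts.items()}


popularity = anchor_popularity(["nwo", "nwo", "kb", "nwo", "kb", "cwi"])
```

With this normalization the most frequent anchor text in a partition always scores 1.0, which makes scores comparable across partitions of different sizes.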
  19. % new anchor text over years
     • Anchor text is new in a specific partition if it does not appear in the previous partitions
     • Based on one-year granularity: 59% new anchor text
     • Based on one-month granularity: 34% new anchor text
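The definition of "new" anchor text can be sketched as a running set difference over chronologically ordered partitions. The function name and the toy data are illustrative and do not reproduce the reported percentages.

```python
def new_anchor_share(partitions):
    """partitions: list of (label, set of anchor texts), in chronological order.

    Returns the percentage of anchor texts in each partition that do not
    appear in any earlier partition. The first partition is skipped because
    it has no predecessors.
    """
    seen = set()
    shares = {}
    for index, (label, anchors) in enumerate(partitions):
        if index > 0 and anchors:
            shares[label] = 100 * len(anchors - seen) / len(anchors)
        seen |= anchors
    return shares


shares = new_anchor_share([
    ("2009", {"a", "b"}),
    ("2010", {"a", "c"}),
    ("2011", {"a", "b", "c", "d"}),
])
```

The same function works for one-month partitions if the labels are YYYYMM strings in chronological order.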
  20. WikiStats
     • Aggregated views of Wikipedia (WP) pages
     • From Jan 2008 to Jan 2015
     • We focus on Feb 2009 to Dec 2012
     • Similar to the period of our snapshot of the Dutch Web archive
     • Keep WP titles viewed >= 1,000 times
  21. Matching anchor text to WP titles
     • Pre-process WP titles like the anchor text
       • Lowercase
       • Stop-word removal
     • One-year and one-month granularity partitions
     • Collect titles by exact match with the anchors
     • Assume anchor popularity equals WP page popularity
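A minimal sketch of the matching step, assuming whitespace tokenization and a tiny illustrative stop-word list (`STOP_WORDS` is hypothetical, as are `preprocess` and `match_anchors`). The view-count threshold of 1,000 comes from the WikiStats slide.

```python
# Illustrative stop words only; the actual list is not given in the slides.
STOP_WORDS = {"de", "het", "een", "the", "a", "of"}


def preprocess(text):
    """Lowercase and remove stop words, applied to anchors and titles alike."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)


def match_anchors(anchors, title_views, min_views=1000):
    """Map each anchor text to the views of the WP title it matches exactly."""
    titles = {}
    for title, views in title_views.items():
        if views >= min_views:               # keep titles viewed >= 1,000 times
            key = preprocess(title)
            if key:
                titles[key] = titles.get(key, 0) + views
    matches = {}
    for anchor in anchors:
        key = preprocess(anchor)
        if key in titles:                    # exact string match
            matches[anchor] = titles[key]
    return matches


matches = match_anchors(
    ["Amsterdam", "de Volkskrant", "filmpje"],
    {"Amsterdam": 50000, "De Volkskrant": 2000, "Filmpje!": 5000, "Obscure": 10},
)
```

In this toy data the anchor filmpje stays unmatched because the pre-processed title still ends in an exclamation mark, which is the exact-match limitation the deck discusses later.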
  22. Ranked anchor text with WP match
     • Different rank cut-offs
     • % overlap decreases as the cut-off increases
     • ~56% of the top 1k have a match
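The overlap at different rank cut-offs could be computed along these lines. The function name `overlap_at_cutoffs` and the toy data are assumptions, and the numbers are not the results reported on the slide.

```python
def overlap_at_cutoffs(ranked_anchors, matched, cutoffs):
    """Percentage of the top-k ranked anchor texts that have a WP title match.

    ranked_anchors: anchor texts sorted by decreasing popularity.
    matched: set of anchor texts with an exact WP title match.
    """
    result = {}
    for k in cutoffs:
        top = ranked_anchors[:k]
        hits = sum(1 for anchor in top if anchor in matched)
        result[k] = 100 * hits / len(top) if top else 0.0
    return result


overlap = overlap_at_cutoffs(
    ["amsterdam", "twitter", "xyz", "rotterdam", "abc"],
    {"amsterdam", "twitter", "rotterdam"},
    cutoffs=[2, 4, 5],
)
```

In the toy ranking the overlap falls from 100% at cut-off 2 to 60% at cut-off 5, mirroring the decreasing trend described on the slide.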
  23. Examples of popular anchor text (with match)
     • Major cities in the Netherlands, e.g., Amsterdam, Rotterdam, Groningen, and Utrecht
     • Social web sites, e.g., twitter, linkedin, flickr, and vimeo
     • Major Dutch daily newspapers, e.g., de Volkskrant, Telegraaf, and Trouw
     • Dutch public broadcasting: uitzending gemist
     • Government web service, e.g., belastingdienst
  24. Discussion
     • Our original goal was to identify historically trending events from the link evolution recorded in the archive
     • Unfortunately, we found only a few examples with our current analysis, e.g., ‘canon’ *
     • However, important anchor text provides an overview of important Dutch entities
     * Corresponding to an activity initiated by the government to define the canonical historic events in Dutch history
  25. Limitations & Future Work
     • Exact text matching between anchor text and WP title
       • E.g., filmpje does not match WP title filmpje!
     • Additional pre-processing
       • Stemming, stopping, generalizing from exact match to match with low edit distance
     • Our analysis is based on a depth-first crawl of a few thousand Dutch websites
       • Breadth-first crawl such as [CommonCrawl]
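The proposed generalization from exact match to a low edit distance can be illustrated with a plain Levenshtein distance. This sketches the idea named as future work, not something the authors report implementing; `fuzzy_match` and the threshold `max_distance=1` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_match(anchor, titles, max_distance=1):
    """Return the titles within max_distance edits of the anchor text."""
    return [t for t in titles if edit_distance(anchor, t) <= max_distance]


candidates = fuzzy_match("filmpje", ["filmpje!", "film"])
```

With a threshold of one edit, the anchor filmpje now matches the title filmpje! from the slide, while the shorter title film is still rejected. A scan over all titles is quadratic in practice, so a real system would need an index or candidate filter first.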
  26. References
     • [Masanès06] J. Masanès. Web Archiving. Springer, 2006
     • [Jin et al.] Rong Jin, Alexander G. Hauptmann, and ChengXiang Zhai. Title language model for information retrieval. In SIGIR 2002
     • [Eiron & McCurley] Nadav Eiron and Kevin S. McCurley. Analysis of anchor text for web search. In SIGIR 2003
     • [CommonCrawl] https://commoncrawl.org/
     • [WikiStats] http://wikistats.ins.cwi.nl/