ArcLink - IIPC GA 2013



ArcLink is an additional API layer for Wayback Machines that extracts, preserves, and provides access to the temporal web graph of a web archive.



  1. 1. ArcLink: Additional API support for Wayback Machines
      Ahmed AlSum, PhD Candidate, Old Dominion University
  2. 2. Introduction
      What is ArcLink?
      • ArcLink is a complete system to extract, preserve, and access the Temporal Web Graph.
      What is the Temporal Web Graph?
      • The link structure through time, including inlinks and outlinks.
      [Figure: web graphs WG @t1 and WG @t2 combined into the temporal web graph (TWG)]
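The definition above can be made concrete with a small data structure. This is only a sketch, not ArcLink's storage model: the class name `TemporalWebGraph`, its methods, and the use of 14-digit timestamp strings as keys are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TemporalWebGraph:
    """Toy temporal web graph: every link keeps the timestamps at which it was seen."""
    # (source, target) -> set of 14-digit timestamps (YYYYMMDDhhmmss); hypothetical layout
    links: dict = field(default_factory=lambda: defaultdict(set))

    def add_link(self, source, target, timestamp):
        self.links[(source, target)].add(timestamp)

    def snapshot(self, start, end):
        """Edges observed at least once in [start, end], i.e. one web graph 'WG @t'."""
        return {edge for edge, stamps in self.links.items()
                if any(start <= stamp <= end for stamp in stamps)}

    def outlinks(self, uri, start, end):
        return sorted(dst for (src, dst) in self.snapshot(start, end) if src == uri)

    def inlinks(self, uri, start, end):
        return sorted(src for (src, dst) in self.snapshot(start, end) if dst == uri)
```

Because the timestamps are fixed-width strings, plain string comparison orders them chronologically, so a snapshot is just a range filter over the observed links.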
  3. 3. Motivations
  4. 4. IIPC Use-cases
      ArcSys displays the linking information (known incoming links within the archive, outgoing links, internal links, etc.).
  5. 5. Serving Robots!
      • AlNoamany [1] reported that, in accesses to the IA Wayback Machine, robots outnumber humans:
        o 10:1 in terms of sessions,
        o 5:4 in terms of raw HTTP accesses,
        o 4:1 in terms of megabytes transferred.
      [1] ALNOAMANY, Y., WEIGLE, M.C. AND NELSON, M.L., 2013. Access Patterns for Robots and Humans in Web Archives. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries. JCDL ‘13.
  6. 6. Solved Questions
      What are the titles for www.vancouver2010.com through time?
      • Get the TimeMap
      • Do page-scraping for each memento
      • Extract the Title header
  7. 7. Solved Questions, but hard
      Date         Title
      02 Dec 1998  2010 Olympic Bid
      05 Apr 2001  Welcome to the Vancouver - Whistler 2010 Bid Corporation Website
      18 Jan 2002  Welcome to the Vancouver/Whistler Winter 2010 Bid Corporation Website
      31 Mar 2002  Welcome to the Vancouver 2010 Bid Corporation Website
      23 Sep 2002  The official site of the Vancouver 2010 Bid Corporation
      04 Feb 2006  Vancouver 2010 - Welcome
      30 Apr 2009  Olympics | 2010 Vancouver Olympic Games Medals Results Schedule Sports
      03 Nov 2009  2010 Vancouver Olympic Games Medals Results Schedule Sports : Vancouver 2010 Winter Olympics
      18 Dec 2009  2010 Vancouver Olympic Games Medals Results Schedule Sports : Vancouver 2010 Winter Olympics and Paralympics
      07 Feb 2010  Vancouver Olympic Games Medals Results Sports : Vancouver 2010 Winter Olympics
      05 Mar 2010  Olympic Games Medals, Results, Sports : Vancouver 2010 Winter Olympics
      02 Feb 2011  Vancouver 2010 Winter Olympics | Olympic Games Photos, Videos, & News - Olympic.org
      11 May 2011  vancouver 2010 Winter Olympics | Olympic Videos, Photos, News
      16 Dec 2011  Jeux olympiques dhiver de vancouver 2010 | vancouver Vidéos, Photos, Media olympique
      21 Dec 2012  Vancouver 2010 Winter Olympics | Olympic Videos, Photos, News, Medals
      08 Jan 2013  Vancouver 2010 Winter Olympics | Olympic Video, Medals, News
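The last two steps of this procedure (scrape each memento, pull out its title) can be sketched with the Python standard library. Fetching the TimeMap and the memento bodies is left out; the function names and the `(datetime, html)` input shape are assumptions for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.done = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.done = True

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    # Collapse line breaks and repeated spaces inside the title.
    return " ".join("".join(parser.title_parts).split())


def titles_through_time(mementos):
    """mementos: iterable of (datetime, html) pairs already fetched from the archive."""
    return [(when, extract_title(html)) for when, html in sorted(mementos)]
```

Every memento has to be downloaded and parsed in full to recover one short string, which is why the slides go on to call this approach solved but hard.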
  8. 8. Unsolved Questions
      What anchor texts pointed to the page through time?
  9. 9. Researchers use Page-scraping
      • Researchers crawled the web archive to build their corpus. For example,
        o Weber [1] built an Internet Archive crawler (HistoryCrawl) to examine the evolution of content and hyperlink networks between websites.
        o Brügger [2] discussed the challenges of crawling web archives as part of his study on Danish parliamentary elections.
      [1] WEBER, M.S., 2012. Newspapers and the Long-Term Implications of Hyperlinking. Journal of Computer-Mediated Communication, 17(2), pp.187–201.
      [2] BRÜGGER, N., 2012. Historical Network Analysis of the Web. Social Science Computer Review.
  10. 10. It’s More than WAT files
       WAT                                                            ArcLink
       Batch process on a set of WARCs                                Batch process on a set of URIs
       For internal use                                               For public use
       No way to integrate with other WAT files in other locations    It could be aggregated with other graphs
       No incremental update                                          Supports incremental update
       Access on WAT file level using Pig                             Access on URI level using Web service
  11. 11. Dataset
       Winter Olympics 2010 collection*
       Size: 700GB+
       From: Nov 2009
       To: Mar 2010
       #URI-R: 6.4M
       #URI-M: 23.7M
  12. 12. System Stages
       Filtering – Extraction – Preservation – Access
  13. 13. System Stages
  14. 14. Filtering
       • Using CDX files to filter the URIs and select the mementos that will contribute to the Web Graph.
       • For example,
         o Exclude non-200 HTTP status codes
         o Exclude images, style sheets, videos, etc.
         o Exclude duplicate mementos
       • Technique: Pig Latin script on CDX files
       • Results: CDX was reduced to 25% of the original size
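The three exclusion rules can be illustrated in a few lines of Python. This is a stand-in for the Pig Latin script the slide mentions, not a port of it. It assumes space-separated CDX lines whose first six fields are URL key, timestamp, original URL, MIME type, status code, and digest, and it treats two captures with the same URL key and digest as duplicates; both choices are assumptions.

```python
# MIME types that cannot contribute links to the web graph (illustrative list).
EXCLUDED_MIME_PREFIXES = ("image/", "video/", "audio/", "text/css")


def filter_cdx(lines):
    """Yield only the CDX lines worth sending to link extraction."""
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # header or malformed line
        urlkey, timestamp, original, mime, status, digest = fields[:6]
        if status != "200":
            continue  # exclude non-200 responses
        if mime.startswith(EXCLUDED_MIME_PREFIXES):
            continue  # exclude images, style sheets, videos, etc.
        if (urlkey, digest) in seen:
            continue  # exclude duplicate mementos (same URL, same content)
        seen.add((urlkey, digest))
        yield line
```

Keeping the `seen` set in memory only works for small inputs; at collection scale the deduplication step is a natural fit for a grouped batch job, which is presumably why the original work used Pig.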
  15. 15. Extraction
       • Technique: Hadoop
       • Step 1: URI-ID generation
         o Canonicalize the URI into SURT format
         o Hash the canonicalized format using SimHash
         o Completely distributed
         Example:
           www.example.org/foo, example.org/foo, www1.example.org/foo → org,example)/foo
           org,example)/foo → ABCD11
       • Step 2: Define data sources
         o WARC
         o Web archive UI
       Tasks  Input source  Map (sec)  Reduce (sec)  Total (sec)
       2      Wayback       21,422     4,194         25,616
       2      WARC          13,327     2,770         16,098 (62%)
       5      Wayback       13,721     2,257         15,978
       5      WARC          8,304      1,746         10,051 (62%)
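Step 1 can be sketched as follows. The canonicalization here is a deliberately simplified SURT (lower-cased host, leading `www`/`www1`-style label dropped, host labels reversed), and SHA-1 is swapped in for the SimHash named on the slide purely to keep the example within the standard library; `to_surt` and `uri_id` are hypothetical names.

```python
import hashlib
import re
from urllib.parse import urlsplit


def to_surt(uri):
    """Simplified SURT form, e.g. www.example.org/foo -> org,example)/foo."""
    if "://" not in uri:
        uri = "http://" + uri  # urlsplit needs a scheme to find the host
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    host = re.sub(r"^www\d*\.", "", host)  # drop www., www1., www2., ...
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return ",".join(reversed(host.split("."))) + ")" + path


def uri_id(uri):
    """Deterministic ID for a URI; SHA-1 stands in for the slide's SimHash."""
    return hashlib.sha1(to_surt(uri).encode("utf-8")).hexdigest()[:16]
```

Since the ID depends only on the URI itself, any mapper can compute it without consulting a shared lookup table, which is one way to read the slide's "completely distributed".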
  16. 16. Storage
       • ArcLink used a database to save the web graph
       [Charts: Insertion Performance, Update Performance]
  17. 17. Access
       > curl "http://localhost:8080/LinkService/linkQuery?"
       <?xml version="1.0"?>
       <rdf:RDF xmlns:rdf="" xmlns:twg="">
         <rdf:Description rdf:about="">
       To be continued…
       [Callouts: curl command, XML clause, RDF clause, URI]
  18. 18. Access – Outlinks
       <rdf:Description rdf:about="">
         …..
         <twg:hasOutlinks rdf:parseType="Collection">
           <rdf:Description rdf:about="">
             <twg:type>href</twg:type>
             <twg:text>News</twg:text>
             <twg:timestamp>
               <rdf:Bag>
                 <rdf:li>20091103011307</rdf:li>
                 <rdf:li>20100130003005</rdf:li> ...
               </rdf:Bag>
             </twg:timestamp>
           </rdf:Description>
           <rdf:Description rdf:about=" http://">
             <twg:type>href</twg:type>
             <twg:text>Cross-Country Skiing</twg:text>
             <twg:timestamp>
               <rdf:Bag>
                 <rdf:li>20091110011557</rdf:li>
                 <rdf:li>20100227081100</rdf:li> ...
               </rdf:Bag>
             </twg:timestamp>
           </rdf:Description>
           .....
         </twg:hasOutlinks>
       To be continued…
       [Callouts on each entry: Outlink To, Anchor Text, Timestamp]
  19. 19. Access – Inlinks
       <rdf:Description rdf:about="">
         …..
         <twg:hasInlinks rdf:parseType="Collection">
           <rdf:Description rdf:about="">
             <twg:type>href</twg:type>
             <twg:text>Official Vancouver Games site</twg:text>
             <twg:timestamp>
               <rdf:Bag>
                 <rdf:li>20100217101229</rdf:li>
               </rdf:Bag>
             </twg:timestamp>
           </rdf:Description>
           <rdf:Description rdf:about="">
             <twg:type>href</twg:type>
             <twg:text>VANOC 2010</twg:text>
             <twg:timestamp>
               <rdf:Bag>
                 <rdf:li>20100220104902</rdf:li>
               </rdf:Bag>
             </twg:timestamp>
           </rdf:Description>
           ....
         </twg:hasInlinks>
       </rdf:Description>
       </rdf:RDF>
       [Callouts on each entry: Inlink From, Anchor Text, Timestamp]
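A client could read a response shaped like the one on these slides with the standard XML parser. The transcript lost the namespace and `rdf:about` values, so the sketch below makes some up: the `twg` namespace URI and the example.org/example.com addresses are hypothetical, and only the `rdf` namespace is the standard W3C one. The embedded sample stands in for a real service response.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"  # standard RDF namespace
TWG = "http://example.org/twg#"                      # hypothetical; elided in the slides
NS = {"rdf": RDF, "twg": TWG}

# Made-up sample in the shape shown on the slides.
SAMPLE = f"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="{RDF}" xmlns:twg="{TWG}">
  <rdf:Description rdf:about="http://example.org/">
    <twg:hasInlinks rdf:parseType="Collection">
      <rdf:Description rdf:about="http://example.com/news">
        <twg:type>href</twg:type>
        <twg:text>Official Vancouver Games site</twg:text>
        <twg:timestamp><rdf:Bag><rdf:li>20100217101229</rdf:li></rdf:Bag></twg:timestamp>
      </rdf:Description>
    </twg:hasInlinks>
  </rdf:Description>
</rdf:RDF>"""


def read_links(xml_text, direction):
    """direction is 'hasInlinks' or 'hasOutlinks'.

    Returns a list of (linked URI, anchor text, [timestamps]) tuples.
    """
    root = ET.fromstring(xml_text)
    links = []
    for item in root.findall(f"rdf:Description/twg:{direction}/rdf:Description", NS):
        uri = item.get(f"{{{RDF}}}about")
        text = item.findtext("twg:text", default="", namespaces=NS)
        stamps = [li.text for li in item.findall("twg:timestamp/rdf:Bag/rdf:li", NS)]
        links.append((uri, text, stamps))
    return links
```

Flattening the result to one row per timestamp and sorting by date gives a date/anchor-text listing without scraping any memento.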
  20. 20. Cost of Scaling Up
       • Filtering
         o Time = (n / 10^6) * (88 / m) sec
         o Reduction = n * 0.3 mementos
       • Extraction
         o Time = (n / 10^6) * (5.5 / m) hrs
       • Storage
         o Size = n * 10%
       Internet Archive*
         Filtering: 58.6 hrs, reduced to 72 * 10^9 mementos
         Extraction: 165 days
         Storage: 500 TB
       *Numbers based on Wayback Machine published statistics on Jan 2013 of 240B mementos with total size 5PB
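The Internet Archive estimates can be reproduced from the formulas with a few lines of arithmetic. The slide does not say what m stands for; the value m = 100 used below is an assumption, chosen because it reproduces the 58.6 hrs and 165 days figures (it may be the number of parallel workers). The function names are illustrative.

```python
def filtering_seconds(n, m):
    """Filtering time in seconds for n mementos."""
    return n / 1e6 * 88 / m


def filtered_mementos(n):
    """Mementos left after filtering."""
    return n * 0.3


def extraction_hours(n, m):
    """Extraction time in hours for n (already filtered) mementos."""
    return n / 1e6 * 5.5 / m


def storage_size(archive_size):
    """Storage needed for the graph, as a share of the archive size."""
    return archive_size * 0.10


M = 100            # assumption: not defined on the slide
MEMENTOS = 240e9   # Wayback Machine, Jan 2013
ARCHIVE_TB = 5000  # 5 PB expressed in TB

kept = filtered_mementos(MEMENTOS)
print(f"filtering:  {filtering_seconds(MEMENTOS, M) / 3600:.1f} hrs")
print(f"kept:       {kept:.3g} mementos")
print(f"extraction: {extraction_hours(kept, M) / 24:.0f} days")
print(f"storage:    {storage_size(ARCHIVE_TB):.0f} TB")
```

Note that extraction runs on the filtered set (72 * 10^9 mementos), not on the full 240B, which is why filtering pays for itself.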
  21. 21. ArcLink as Wayback Ext.
       Technical perspective
       • Both are URI-lookup services
       • Both are built as Java web applications
  22. 22. ArcLink as Wayback Ext.
       User perspective
       [Figure: ArcLink, Memento, Wayback]
  23. 23. Applications
       Let’s solve our “Unsolved Questions”
  24. 24. Time-Indexed Inlinks Information
       Date       Anchor Text
       04-Nov-09  vancouver2010.com
       11-Nov-09  vancouver2010.com
       18-Nov-09  vancouver2010.com
       16-Jan-10  Vancouver 2010 Olympic Games
       16-Jan-10  Vancouver 2010 Olympic Games
       23-Jan-10  vancouver2010.com
       23-Jan-10  2010 Vancouver Olympic Games Medals Results Schedule Sports
       30-Jan-10  2010 Vancouver Olympic Games Medals Results Schedule Sports
       30-Jan-10  vancouver2010.com
       30-Jan-10  Vancouver 2010 Olympic Games
       13-Feb-10  Vancouver 2010 Olympic Winter Games
       15-Feb-10  Vancouver 2010 Olympic Games
       18-Feb-10  Official Vancouver Games site
       19-Feb-10  vancouver2010.com
       20-Feb-10  Official Vancouver Games site
       21-Feb-10  VANOC 2010
  25. 25. Temporal Page Rank
       [Table: top-ranked pages per month (Nov-2009, Dec-2009, Jan-2010, Feb-2010, Mar-2010) and for the whole collection (Nov-09 to Mar-10). Only fragments survive: canadacode.vancouver2010.com, i-credible.nl, vpzschaatsteam.nl, monlibe.liberation.fr, /teamgb/team-behind-team-gb/]
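The slide does not say how the monthly ranks were computed. One plausible reading, shown here strictly as an assumption, is ordinary PageRank applied separately to the links observed in each month. The sketch uses plain power iteration; the function names and the `(source, target, timestamp)` input shape are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a set of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out = {node: [] for node in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[node] for node in nodes if not out[node])
        new = {node: (1 - damping + damping * dangling) / len(nodes) for node in nodes}
        for src in nodes:
            for dst in out[src]:
                new[dst] += damping * rank[src] / len(out[src])
        rank = new
    return rank


def temporal_pagerank(timed_edges):
    """timed_edges: iterable of (source, target, 'YYYYMMDDhhmmss').

    Returns {'YYYYMM': {page: rank}} with one PageRank run per month.
    """
    by_month = {}
    for src, dst, timestamp in timed_edges:
        by_month.setdefault(timestamp[:6], set()).add((src, dst))
    return {month: pagerank(edges) for month, edges in sorted(by_month.items())}
```

Running `pagerank` once over the union of all months would give a whole-collection ranking to compare against the monthly ones.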
  26. 26. How to get it
       • Open source code on Google Code
       • Try it soon
       • @aalsum