Web archiving challenges and opportunities
This is my presentation for a job interview as a web archiving engineer at Stanford University Libraries on Oct 25.

  • The notable exceptions of Japanese  {Bengali, Vietnamese} and German Portuguese

Web archiving challenges and opportunities Presentation Transcript

  • 1. WEB ARCHIVING CHALLENGES & OPPORTUNITIES PRESENTATION FOR WEB ARCHIVING ENGINEERING POSITION Ahmed AlSum PhD Candidate Old Dominion University
  • 2. Outline • Engineering Experience • IBM • Old Dominion University • Internet Archive • Web Archiving Challenges & Opportunities • Selection • Harvesting • Storage • Access • Community • Conclusions
  • 3. Cairo, Egypt 2006 - 2009
  • 4. CCSP Project • An internal IBM support portal that provides client-facing audiences a by-client, holistic view of client situations • Technologies: WebSphere Portal, DB2, deployed on zLinux machines
  • 5. Responsibilities • Software Engineer • Enterprise applications with J2EE platform technologies for front-end (Servlets, JSP, Portlet APIs) and back-end tasks based on EJB • Front-end components based on Web 2.0 technologies (AJAX based on dojo 1.0, and JavaScript) • Lotus Sametime (plugin and bot development) • Software engineer team leader • Support project quality activities • Lead code review and static analysis activities
  • 6. Responsibilities • Administrator • Deploying Portal solutions on WebSphere Portal • WebSphere Portal Administration for standalone and clustered environment • Administration on Linux and Windows OS • DB2 server administration for single instance and multiple instances with HADR support • Customer support team lead • Leading customer support activities
  • 7. Certifications
  • 8. Sharing IBM Internal Solutions with Broader Community
  • 9. Norfolk, VA USA 2009 - 2013
  • 10. Memento • Memento is an HTTP extension to integrate the Past and the Current Web. I. Jacobs and N. Walsh, Architecture of the World Wide Web, Technical report, W3C, 2004. http://www.w3.org/TR/webarch/ Now T1 T2 T3
  • 11. Memento • Developer and administrator for Memento aggregator and proxies
  • 12. Memento Clients • Memento is currently an Internet-Draft (I-D); it is expected to move to RFC soon.
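To make the "HTTP extension" idea concrete, here is a minimal sketch of how a Memento client could express the datetime it wants. It assumes the `Accept-Datetime` request header from the Memento specification, with the date written in the usual HTTP date format. The function name `accept_datetime_headers` and the sample date are illustrative only, and no request is actually sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_headers(when: datetime) -> dict:
    """Build request headers asking for the state of a resource near `when`.

    Hypothetical helper: naive datetimes are assumed to already be in UTC.
    """
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    when_utc = when.astimezone(timezone.utc)
    # usegmt=True renders the zone as "GMT", as HTTP dates require.
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


# Example: ask for a page as it looked on 12 Feb 2010, 18:00 UTC.
headers = accept_datetime_headers(datetime(2010, 2, 12, 18, 0, 0, tzinfo=timezone.utc))
print(headers)
```

A client would send these headers to a Memento-aware endpoint (for example an aggregator or proxy), which then selects the archived copy closest to the requested datetime.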
  • 13. San Francisco, CA USA 2012
  • 14. WAT Extraction • Web Archive Transformation (WAT) is a specification for structuring metadata generated by Web crawls • Technologies:
  • 15. WEB ARCHIVING Challenges and Opportunities
  • 16. Web Archive Life Cycle. Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8
  • 17. Selection • Decide what to capture: everything, any domain; national domains; delegate selection to partners; users’ favorites • We studied what is already captured
  • 18. How Much Of The Web Is Archived? S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL '11, Ottawa, Canada 2011 See also: http://arxiv.org/abs/1212.6177
  • 19. Archive categories • We have 3 categories of archives: • Internet Archive (classic interface) • Search engine • Other archives (figure labels: UK, US; Public Archives, ca. late 2010 / early 2011) Selection
  • 20. 1000 URIs Ordered by First Observation Date Selection See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
  • 21. Memento Distribution, ordered by the first observation date
  • 22. How Much of the Web is Archived? It Depends on Which Web… Selection Including SE cache Excluding SE Cache 90% 79% 97% 68% 88% 19% 35% 16% Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives 2013 95% 92% 23% 26%
  • 23. Profiling Web Archive Coverage For Top-level Domain And Content Language A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013 See also: http://arxiv.org/abs/1309.4008
  • 24. Where is it archived? Selection. Legend: IA = Internet Archive; CAN = Library and Archives Canada; PO = Portuguese Web Archive; CZ = Archive of the Czech Web; LoC = Library of Congress; BL = British Library; CAT = Web Archive of Catalonia; TW = National Taiwan University; IC = Icelandic Web Archive; UK = UK National Library; CR = Croatian Web Archive; AIT = Archive-It
  • 25. Language Coverage Selection (archive legend as on slide 24)
  • 26. Growth Rate Selection (archive legend as on slide 24). Chart annotations: Borrowed Portuguese material from IA; Stopped archiving since 2008; Steady growth; Stopped getting new URIs, but still crawling
  • 27. Selection Research Output • Some portions of the Web, such as India and Africa, are not well archived. • Profiling helps us with Memento query routing. • IIPC proposal with Herbert Van de Sompel (LANL) and David Rosenthal (SUL). Selection
  • 28. Selection at SUL • Focus on the missing parts of the Web • Twitter - Crowdsource: • UK Web Archive: Twittervane • Internet Memory: Collect URIs from Twitter APIs • VA Tech: CTRNET project • Stanford Community • World News collection: 10 news websites from each country • Tools: Selection
  • 29. Web Archive Life Cycle. Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8
  • 30. Harvesting • Services • Archive-It • WAS @ CDLib • Dedicated servers • New tools See also: http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html
  • 31. Special Harvesting Techniques • Borrow old materials from other web archives • Ex.: Stanford WebBase Project* • 260 TB • 7 billion web pages Harvesting *http://www-diglib.stanford.edu/~testbed/doc2/WebBase/
  • 32. Special Harvesting Techniques • Social Media • Focus on shared resources in social media Harvesting. Hany M. SalahEldeen, Michael L. Nelson, Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?, Proceedings of TPDL 2012. http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
  • 33. Special Harvesting Techniques • SiteStory - Transactional Archive Harvesting. Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, Herbert Van de Sompel, Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool, Proceedings of TPDL 2013. SiteStory: http://mementoweb.github.io/SiteStory/
  • 34. Harvesting • Challenges • Ajax and Web 2.0/3.0 • Streaming Media • URI challenges • Mobile Harvesting http://blog.dshr.org/2012/05/harvesting-and-preserving-future-web.html http://netpreserve.org/sites/default/files/resources/OverviewFutureWebWorkshop.pdf
  • 35. Web Archive Life Cycle. Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8
  • 36. Storage (Format) • Flat files: • WARC files (ISO standard) • NoSQL db: • HBase at Internet Memory* • Storage at SUL: • We need to use both Storage *Philippe Rigaux, Understanding HBase - The data model, IM technology blog http://internetmemory.org/en/index.php/synapse/understanding_the_hbase_data_model/
  • 37. Storage (Infrastructure) • Wrong solution could be a disaster Storage
  • 38. Web Archive Life Cycle. Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8
  • 39. Accessing Web Archive: URI-Based Wayback Machine • Textbox to enter the requested URI • BubbleMap to show the available mementos
  • 40. Accessing Web Archive Full-text search • Challenges: Temporal Page Rank, Rank per site or memento, Date filtering
  • 41. Accessing Web Archive • Thumbnail View • Trade-off between building the thumbnail in real time and pre-building it. Also, a trade-off between representing the thumbnail by URI and by embedded binary data. Can we build a partial thumbnail map?
  • 42. Accessing Web Archive • Title View • Trade-off between extracting all the titles in advance and keeping them as metadata about the memento, and extracting the title from the HTML content in real time. Implemented using Simile: http://www.simile-widgets.org/timeline/
  • 43. Accessing Web Archive • Wayback Machine API • XML interface for the list of available Mementos
  • 44. Accessing Web Archive • Web Page Snapshot Replay • URI rewriting, JavaScript, and embedded resources
  • 45. Accessing Web Archive • Page Completeness Degree • The completeness degree could be calculated in real time by using the preserved HTTP status of the embedded resources. See also: http://arxiv.org/abs/1309.5503
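As an illustration of the completeness idea, the sketch below scores a memento by the share of its embedded resources that were preserved with a successful HTTP status. The unweighted ratio and the name `completeness_degree` are assumptions made for this example; the linked paper may define the measure differently.

```python
def completeness_degree(embedded_statuses):
    """Fraction of embedded resources whose preserved HTTP status is 2xx.

    Assumption: every embedded resource counts equally, and a page with
    no embedded resources is treated as fully complete.
    """
    if not embedded_statuses:
        return 1.0
    preserved = sum(1 for status in embedded_statuses if 200 <= status < 300)
    return preserved / len(embedded_statuses)


# Example: three of four embedded resources were captured successfully.
print(completeness_degree([200, 200, 404, 200]))
```

Because the statuses are already stored with the archived responses, a score like this can be computed at replay time without fetching anything from the live Web.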
  • 46. Accessing Web Archive • Reconstructing a web site • The current approach uses the web archive's public interface.
  • 47. Accessing Web Archive • Wayback Annotator • Create collections • Select and save relevant content to their collections • Annotate & mark important parts of archived web pages • Share their work and collaborate on archived content use http://netpreserve.org/sites/default/files/resources/Predstavitev_07.pdf http://netpreserve.org/sites/default/files/resources/Wayback_annotator_06.pdf
  • 48. Accessing Web Archive Collection-Based • In addition to browsing the collection, you can browse the URIs in this collection • Research questions: Collection overview
  • 49. Accessing Web Archive • Collection visualization • Term frequency algorithms should be normalized to take memento density into consideration. http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
  • 50. Accessing Web Archive • Web Archive analytics. See also: http://ilpubs.stanford.edu:8090/1037/1/arcspread.pdf • ArcSpread took a query from the user, extracted related information, and displayed the results in spreadsheet style.
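One way to read the normalization point on slide 49: a year with many mementos will inflate raw term counts, so counts can be divided by the number of mementos in that year. The sketch below assumes yearly buckets and whitespace tokenization; both choices, and the function name, are illustrative and not taken from the cited thesis.

```python
from collections import Counter, defaultdict


def normalized_term_frequency(mementos):
    """Per-year term counts divided by the number of mementos in that year.

    `mementos` is an iterable of (year, text) pairs. Yearly buckets and
    whitespace tokenization are assumptions made for this sketch.
    """
    counts = defaultdict(Counter)
    density = Counter()
    for year, text in mementos:
        density[year] += 1
        counts[year].update(text.lower().split())
    return {
        year: {term: n / density[year] for term, n in counts[year].items()}
        for year in counts
    }


# One memento in 2009, four in 2010: the raw counts differ (1 vs 4),
# but the normalized frequency of "olympics" is the same in both years.
sample = [(2009, "olympics")] + [(2010, "olympics")] * 4
print(normalized_term_frequency(sample))
```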
  • 51. Who And What Links To The Internet Archive Y. Alnoamany, A. AlSum, M. C. Weigle, M. L. Nelson In Proceedings of 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013 (Best Student Paper) See also: http://arxiv.org/abs/1309.4016
  • 52. Serving Robots! • Log file analysis using Apache Pig • In accesses to the IA Wayback Machine, robots outnumber humans: • 10:1 in terms of sessions • 5:4 in terms of raw HTTP accesses • 4:1 in terms of megabytes transferred Access
  • 53. Where do Wayback Machine Users Come From? Access
    Website | Percentage | Description
    en.wikipedia.org | 12.9% | Wikipedia
    archive.org | 11.9% | IA Home Page
    reddit.com | 10.2% | Social News Web Site
    google.TLD | 9.9% | Search Engine
    info-poland.buffalo.edu | 1.5% | Polish Studies
    de.wikipedia.org | 1.4% | Wikipedia
    cracked.com | 1.2% | Humor Site
    snopes.com | 1.1% | Urban Legends Reference Pages
    facebook.com | 0.9% | Social Media
    crochetpatterncentral.com | 0.9% | Crocheting Hobbies
  • 54. Most Languages Self-Link Access
  • 55. ArcLink: Optimization Techniques To Build And Retrieve The Temporal Web Graph A. AlSum, M. L. Nelson IIPC GA 2013, Ljubljana, Slovenia In Proceedings of the 13th international ACM/IEEE joint conference on Digital libraries, JCDL '13, 2013 See also: http://arxiv.org/abs/1305.5959
  • 56. Easy Solved Questions Q: What are the available mementos for vancouver2010.com? Access
  • 57. Solved Questions, but hard Q: What are the HTML titles for vancouver2010.com through time? A: Page scraping for all mementos Access
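The "page scraping for all mementos" answer can be sketched with Python's standard `html.parser`. The sketch assumes the archived HTML of each memento has already been fetched and is passed in as a mapping from a sortable datetime string to HTML; the names and the sample pages are made up for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> ... </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so titles compare cleanly.
    return " ".join("".join(parser.parts).split())


def titles_through_time(mementos):
    """`mementos` maps a sortable datetime string to archived HTML."""
    return [(when, extract_title(html)) for when, html in sorted(mementos.items())]


# Hypothetical archived pages, for illustration only.
sample_pages = {
    "20100212": "<html><head><title>Vancouver 2010</title></head></html>",
    "20090101": "<title> Winter\n Games </title>",
}
print(titles_through_time(sample_pages))
```

This is why the slide calls the question hard rather than impossible: it is answerable, but only by downloading and parsing every memento.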
  • 58. Impossible Questions Q: What are the anchor texts that pointed to www.vancouver2010.com through time? Access … <a href=www.vancouver2010.com > Vancouver Olympics </a> …. … <a href=www.vancouver2010.com > Winter Olympics </a> … … <a href=www.vancouver2010.com > Vancouver 2010 </a> …
  • 59. ArcLink Access Google code: https://code.google.com/p/arcsys/
  • 60. Impossible Questions • Q: What are the anchor texts that pointed to www.vancouver2010.com through time? Access
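For a single archived page, pulling out the anchor text of links to a target is straightforward, as the sketch below shows using the three snippets from slide 58. What makes the question "impossible" through the public interface is finding which archived pages link to the target in the first place, which needs a link graph. The class and function names here are hypothetical, and matching the target by substring of `href` is a simplifying assumption.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects the text of <a> elements whose href contains `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.capturing = False
        self.buffer = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Assumption: a plain substring match is enough for this sketch.
            self.capturing = self.target in href
            self.buffer = []

    def handle_data(self, data):
        if self.capturing:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.capturing:
            self.anchors.append(" ".join("".join(self.buffer).split()))
            self.capturing = False


def anchor_texts(html, target):
    parser = AnchorTextParser(target)
    parser.feed(html)
    parser.close()
    return parser.anchors


page = (
    "<a href=www.vancouver2010.com > Vancouver Olympics </a>"
    "<a href=www.vancouver2010.com > Winter Olympics </a>"
    "<a href=www.example.org > Unrelated </a>"
    "<a href=www.vancouver2010.com > Vancouver 2010 </a>"
)
print(anchor_texts(page, "www.vancouver2010.com"))
```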
  • 61. Thumbnail Summarization Techniques For Web Archives A. AlSum, and M. L. Nelson Submitted for publication.
  • 62. Thumbnails Access Internet Archive UK Web archive
  • 63. Thumbnail Creation Challenges • Scalability in Time • IA may need 361 years to create one thumbnail per memento using one hundred machines • Scalability in Space • IA will need 355 TB to store one thumbnail per memento • Page quality Access
  • 64. How many thumbnails do we need? Access www.unfi.com on the live Web
  • 65. How many thumbnails do we need? Access www.unfi.com on the live Web
  • 66. 40 Thumbnails are good. Access
  • 67. Same technique applied to apple.com Access
  • 68. From 8000 Mementos to 69 Thumbnails. Access
  • 69. iTunes cover application Access
  • 70. Community • I suggest becoming a member of the IIPC • Join the open Wayback Machine team • Join the Winter Olympics 2014 collaborative project, even as an observer
  • 71. Community • Web Archiving Workshops: WAC 2011, Ottawa, Canada; WAC 2012, Stanford, CA, USA; WADL 2013, Indianapolis, IN, USA; TempWeb 2013, Rio de Janeiro, Brazil
  • 72. Tools to SUL Web Archive • Selection • Harvest • Analysis • Access
  • 73. Conclusions • Be Selective: Cover missing parts of the Web • Be Older: Include WebBase • Be Smart: Innovative services • Be Helpful: Researcher Framework/Dataset • Be Active: Participate in the WA communities • Make a difference aalsum@cs.odu.edu @aalsum
  • 74. BACKUP
  • 75. What is missing? (archive legend as on slide 24)
  • 76. Thumbnail Features: SimHash, DOM tree, embedded resources, datetime
  • 77. Clustering technique
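The backup slides name SimHash as one thumbnail feature but do not spell out the clustering step. The sketch below is therefore only one plausible reading: compute a 64-bit SimHash of each memento's text and start a new thumbnail whenever the fingerprint drifts more than a threshold away from the last selected one. The SHA-256 token hash, the threshold of 8 bits, and the greedy selection are all assumptions, not the method from the submitted paper.

```python
import hashlib


def simhash(text, bits=64):
    """SimHash fingerprint of whitespace-separated tokens (sketch)."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def select_thumbnails(mementos, threshold=8):
    """Pick one memento per run of similar pages.

    `mementos` is a time-ordered list of (datetime_str, text) pairs.
    The greedy rule and the default threshold are assumptions.
    """
    selected = []
    last = None
    for when, text in mementos:
        fingerprint = simhash(text)
        if last is None or hamming(fingerprint, last) > threshold:
            selected.append(when)
            last = fingerprint
    return selected


# Three identical captures collapse to a single thumbnail.
captures = [("2010-01", "a b c"), ("2010-02", "a b c"), ("2010-03", "a b c")]
print(select_thumbnails(captures))
```

A fuller version would combine this with the other listed features (DOM tree, embedded resources, datetime) before deciding which mementos deserve their own thumbnail.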