Web Archiving
A Brief Introduction
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk, Virginia - 23529 (USA)
About Me
Sawood Alam
Lexical Signature
Web, Digital Library, Web Archiving, Ruby on Rails, PHP,
XHTML, CSS, JavaScript, ExtJS, Urdu, RTL and Linux.
● BTech, Jamia Millia Islamia, India, 2008
● MSc, Old Dominion University, USA, 2013
● PhD, Old Dominion University, USA, Current
She Calls Me Dad!
Agenda
● Archiving and Web archiving
● Purpose and importance
● Scope of Web archiving
● Issues and challenges
● Tools and techniques
● Memento: Time Travel for the Web
● Archive X-Ray
● Research opportunities in Web archiving
● Our WSDL Research Group
What is an Archive?
● Accumulation of historical records
● Long-term storage and preservation
● Less frequently used
● Physical or digital
What is Web Archiving?
● Periodic snapshots of web pages
● Preserving important events on the Web
● Making archived content accessible
Why Do We Care About Archiving?
Web content decays rapidly!
● To preserve the history
● To tell a story
● For evidence
● For backup
● For personal satisfaction
Issues and Challenges
● Crawling
● Storage
● Retrieval
● Replay
● Accessibility
● Completeness
● Accuracy
● Credibility
Web Archiving Efforts
● Internet Archive
● Archive-It
● Wikipedia
● UK Web Archive
● Various national and non-profit archives
● Film, music and other multimedia archives
● Scholarly archives
● Personal archiving
Tools and Techniques
● Heritrix, PhantomJS, Wget, cURL
● OpenWayback, PyWB
● TimeTravel, MemGator
● CarbonDate, Warrick, Synchronicity
● Preserve Me!
● WARCreate, WAIL, Mink
● Browsertrix
● And many more...
Memento
<http://example.com>; rel="original",
<http://web.archive.org/web/20020120142510/http://example.com/>;
rel="memento";
datetime="Sun, 20 Jan 2002 14:25:10 GMT",
<http://web.archive.org/web/20020328012821/http://www.example.com/>;
rel="memento";
datetime="Thu, 28 Mar 2002 01:28:21 GMT",
<http://webarchive.loc.gov/all/20020803080544/http://www.example.com/>;
rel="memento";
datetime="Sat, 03 Aug 2002 08:05:44 GMT",
<http://wayback.archive-it.org/all/20091213015014/http://www.example.com/>;
rel="memento";
datetime="Sun, 13 Dec 2009 01:50:14 GMT",
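The listing above has the shape of a link-format TimeMap: one original resource followed by its mementos, each with an archival datetime. As a minimal sketch (not the method of any particular Memento tool), the following Python parses such a listing and picks the memento closest to a requested time. The function names `parse_mementos` and `closest` and the regular expressions are illustrative assumptions, and the parser only handles the simple quoted-attribute form shown on the slide.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Sample data copied from the slide, one link per line.
TIMEMAP = """\
<http://example.com>; rel="original",
<http://web.archive.org/web/20020120142510/http://example.com/>; rel="memento"; datetime="Sun, 20 Jan 2002 14:25:10 GMT",
<http://web.archive.org/web/20020328012821/http://www.example.com/>; rel="memento"; datetime="Thu, 28 Mar 2002 01:28:21 GMT",
<http://webarchive.loc.gov/all/20020803080544/http://www.example.com/>; rel="memento"; datetime="Sat, 03 Aug 2002 08:05:44 GMT",
<http://wayback.archive-it.org/all/20091213015014/http://www.example.com/>; rel="memento"; datetime="Sun, 13 Dec 2009 01:50:14 GMT",
"""

# A link is <URI> followed by zero or more ;name="value" attributes.
LINK_RE = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_mementos(text):
    """Return a sorted list of (datetime, URI) pairs for rel="memento" links."""
    found = []
    for uri, params in LINK_RE.findall(text):
        attrs = dict(PARAM_RE.findall(params))
        rels = attrs.get("rel", "").split()
        if "memento" in rels and "datetime" in attrs:
            found.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return sorted(found)


def closest(mementos, target):
    """Pick the memento whose datetime is nearest to the target datetime."""
    return min(mementos, key=lambda m: abs(m[0] - target))


if __name__ == "__main__":
    mementos = parse_mementos(TIMEMAP)
    when = datetime(2002, 3, 1, tzinfo=timezone.utc)
    print(closest(mementos, when)[1])
```

The target passed to `closest` must be timezone-aware, because the parsed datetimes carry a GMT offset.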
Archive X-Ray!
● How much of the Web is archived?
● Profiling various archive services
● Predicting what they contain
● Routing Memento aggregator queries
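To make the routing idea concrete, here is a small sketch in Python, assuming each archive is summarized by the set of domains it is believed to hold. The archive names (`archive-a`, `archive-b`, `archive-c`), the `HOLDINGS` table, and the `route` function are all hypothetical and exist only for illustration. A real aggregator would use richer summaries than a domain set.

```python
from urllib.parse import urlsplit

# Hypothetical summaries: archive name -> domains it is believed to hold.
HOLDINGS = {
    "archive-a": {"cnn.com", "bbc.co.uk"},
    "archive-b": {"bbc.co.uk"},
    "archive-c": {"usatoday.com"},
}


def route(uri, holdings=HOLDINGS):
    """Return the archives whose summary predicts a memento for this URI."""
    host = (urlsplit(uri).hostname or "").lower()
    return [
        name
        for name, domains in holdings.items()
        # Match the domain itself or any of its subdomains.
        if any(host == d or host.endswith("." + d) for d in domains)
    ]


if __name__ == "__main__":
    print(route("http://news.bbc.co.uk/"))
    print(route("http://example.org/"))
```

With such a table, the aggregator would only query the archives returned by `route` instead of broadcasting every lookup to all of them. What to do when the list is empty (skip the lookup or fall back to querying everything) is a policy choice this sketch leaves open.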
Memento Aggregator
Long Tail of Archives
Archive Profile
● High-level summary of an archive
● Predicts presence of mementos
● Provides statistics about the holdings
● Small in size and publicly available
● Easy to update and partially patch
● Useful for Memento query routing and
other things
com,cnn)/ {"frequency": 40, "spread": 2}
uk,co,bbc)/ {"frequency": 20, "spread": 1}
com,usatoday)/ {"frequency": 5, "spread": 1}
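The sample profile entries above are keyed by a reversed, comma-separated host name. The following Python sketch shows how a URI might be mapped to such a key and looked up in a profile. It assumes a simplified canonicalization (lowercase the host, drop a leading `www`, reverse the labels). The helper names `surt_labels`, `profile_key`, and `lookup` are illustrative, and the statistics are treated as opaque values because the slides do not define them.

```python
from urllib.parse import urlsplit

# Profile entries copied from the slide.
PROFILE = {
    "com,cnn)/": {"frequency": 40, "spread": 2},
    "uk,co,bbc)/": {"frequency": 20, "spread": 1},
    "com,usatoday)/": {"frequency": 5, "spread": 1},
}


def surt_labels(uri):
    """Return host labels in reversed order, e.g. ['uk', 'co', 'bbc', 'news']."""
    host = (urlsplit(uri).hostname or "").lower()
    labels = host.split(".") if host else []
    if labels and labels[0] == "www":  # assumption: ignore a leading "www"
        labels = labels[1:]
    return list(reversed(labels))


def profile_key(labels):
    """Join reversed labels into the key form used on the slide."""
    return ",".join(labels) + ")/"


def lookup(profile, uri):
    """Find the most specific profile entry covering the URI, or None."""
    labels = surt_labels(uri)
    while labels:
        key = profile_key(labels)
        if key in profile:
            return key, profile[key]
        labels = labels[:-1]  # drop the most specific label and retry
    return None


if __name__ == "__main__":
    print(lookup(PROFILE, "http://news.bbc.co.uk/"))
    print(lookup(PROFILE, "http://example.org/"))
```

A `None` result suggests the archive probably holds nothing for that URI. A match returns the covering key together with its statistics.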
Research Opportunities
● Information retrieval
● Information visualization
● Client-side and server-side archiving
● Archiving dynamic content
● Distributed archiving
● Discovering alternative long-term archiving
techniques
● Predicting "important" events on the Web
and archiving them in a timely manner
Web Science and Digital
Libraries Research Group
ws-dl.cs.odu.edu
ws-dl.blogspot.com
@WebSciDL
github.com/oduwsdl
flickr.com/photos/124419986@N07
WSDL Research Group
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk, Virginia - 23529 (USA)
salam@cs.odu.edu
ibnesayeed@gmail.com
@ibnesayeed
www.cs.odu.edu/~salam
