Web Archiving 
A Brief Introduction 
Sawood Alam 
Department of Computer Science 
Old Dominion University 
Norfolk, Virginia - 23529
About Me 
● BTech, Jamia Millia Islamia, India, 2008 
● MSc, Old Dominion University, USA, 2013 
● PhD, Old Dominion University, USA, Current 
Lexical Signature 
Web, Digital Library, Web Archiving, Ruby on Rails, PHP, 
XHTML, CSS, JavaScript, ExtJS, Urdu, RTL and Linux. 
Sawood Alam
Agenda 
● What is an archive? 
● What is Web archiving? 
● Why do we care about archiving? 
● Issues and challenges 
● Various archiving efforts 
● Tools and techniques 
● WSDL research group 
● My research: Archive X-Ray! 
● Research opportunities 
● Higher education: how to study abroad?
What is an Archive? 
● Accumulation of historical records 
● Long-term storage and preservation 
● Less frequently used 
● Physical or digital
What is Web Archiving? 
● Periodic snapshots of web pages 
● Preserving important events on the Web 
● Making archived content accessible
Why Do We Care About Archiving? 
● To preserve the history 
● To tell a story 
● For evidence 
● For backup 
● For personal satisfaction 
Web content decays rapidly!
Issues and Challenges 
● Crawling 
● Storage 
● Retrieval 
● Replay 
● Accessibility 
● Completeness 
● Accuracy 
● Credibility
Web Archiving Efforts 
● Internet Archive 
● Archive-It 
● Wikipedia 
● UK Web Archive 
● Various national and non-profit archives 
● Film, music and other multimedia archives 
● Scholarly archives 
● Personal archiving
Tools and Techniques 
● Heritrix 
● Wget 
● cURL 
● OpenWayback 
● Memento 
● CarbonDate 
● Warrick 
● Synchronicity 
● Preserve Me! 
● WARCreate and WAIL
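
Several of the tools above identify a snapshot by its capture time. As a minimal sketch (not taken from the slides), the Python below shows two common ways to express that time: the Accept-Datetime request header that Memento (RFC 7089) uses for datetime negotiation, and the 14-digit timestamp commonly seen in Wayback-style replay URLs. The function names and the sample date are illustrative assumptions, and no network request is made.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Build an Accept-Datetime request header as described by Memento (RFC 7089)."""
    # Memento expects an RFC 1123 style date expressed in GMT.
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


def wayback_style_timestamp(when: datetime) -> str:
    """Format a datetime as a 14-digit YYYYMMDDhhmmss timestamp (UTC)."""
    return when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


# Illustrative date only; any timezone-aware datetime works.
snapshot_time = datetime(2015, 1, 1, 12, 30, 0, tzinfo=timezone.utc)
headers = accept_datetime_header(snapshot_time)
timestamp = wayback_style_timestamp(snapshot_time)
```

The header dictionary could be passed to an HTTP client when querying a Memento TimeGate; which TimeGate to query is left open here.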
WSDL Research Group 
● Web Science and Digital Libraries Research Group 
● Home Page: ws-dl.cs.odu.edu 
● Blog: ws-dl.blogspot.com 
● Twitter: @WebSciDL 
● Flickr: flickr.com/photos/124419986@N07
Archive X-Ray! 
● How much of the Web is archived? 
● Profiling various archive services 
● Predicting what they contain 
● Routing Memento aggregator queries
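
The idea of routing aggregator queries with archive profiles can be illustrated with a toy sketch. It assumes, purely for illustration, that a profile is just the set of top-level domains an archive is believed to hold; the archive names, the profile contents and the `route` function are all hypothetical and do not describe the actual Archive X-Ray design.

```python
from urllib.parse import urlsplit

# Hypothetical profiles: archive name -> TLDs the archive is believed to hold.
PROFILES = {
    "archive-a": {"com", "org", "uk"},
    "archive-b": {"uk"},
    "archive-c": {"in", "pk"},
}


def route(uri: str, profiles: dict) -> list:
    """Return the archives whose profile suggests they may hold the URI."""
    host = urlsplit(uri).hostname or ""
    # Use the last label of the host name as a crude TLD.
    tld = host.rsplit(".", 1)[-1]
    return sorted(name for name, tlds in profiles.items() if tld in tlds)


candidates = route("http://www.bl.uk/", PROFILES)
```

An aggregator using such a profile would only query the archives in `candidates` instead of broadcasting every lookup to all archives; a real profile would need much finer granularity than a TLD.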
Research Opportunities 
● Information retrieval 
● Information visualization 
● Client and server side archiving 
● Archiving dynamic content 
● Distributed archiving 
● Discovering alternative long-term archiving techniques 
● Predicting "important" events on the Web and archiving them in a timely manner
Higher Education Abroad 
● Select your field of interest 
● Find potential universities in your field 
● Approach professors 
● Approach alumni 
● GRE and TOEFL 
● Expenses and funding options 
○ Scholarship 
○ Assistantship and on-campus jobs 
○ Education loan and self financing
Sawood Alam 
Department of Computer Science 
Old Dominion University 
Norfolk, Virginia - 23529 
salam@cs.odu.edu 
ibnesayeed@gmail.com 
Twitter: @ibnesayeed 
www.cs.odu.edu/~salam
