SlideShare a Scribd company logo
1 of 38
Download to read offline
Introducing
Web Archiving and
WSDL Research Group
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk, Virginia - 23529 (USA)
About Me
Sawood Alam
Lexical Signature
Web, Digital Library, Web Archiving, Ruby on Rails, PHP, HTML,
CSS, JavaScript, ExtJS, Go, Urdu, RTL, Docker, and Linux.
● BTech, Jamia Millia Islamia, India, 2008
● MSc, Old Dominion University, USA, 2013
● PhD, Old Dominion University, USA, Current
She Calls Me Dad!
Agenda
● Archiving and Web archiving
● Purpose and importance
● Scope of the web archiving
● Issues and challenges
● Tools and techniques
● Memento: Time Travel for the Web
● Archive X-Ray
● Research opportunities in Web archiving
● Our WSDL Research Group
What is an Archive?
● Accumulation of historical records
● Long term storage and preservation
● Less frequently used
● Physical or digital
What is Web Archiving?
● Periodic snapshots of web pages
● Preserving important events on the Web
● Making archived content accessible
Why do We Care Archiving?
Web contents decay rapidly!
● To preserve the history
● To tell a story
● For evidence
● For backup
● For personal satisfaction
Issues and Challenges
● Crawling
● Storage
● Retrieval
● Replay
● Accessibility
● Completeness
● Accuracy
● Credibility
Web Archiving Efforts
● Internet Archive
● Archive-It
● Wikipedia
● UK Web Archive
● Various national and non-profit archives
● Film, music and other multimedia archives
● Scholarly archives
● Personal archiving
Tools and Techniques
● Heritrix, PhantomJS, WGet, cURL
● OpenWayback, PyWB
● TimeTravel, MemGator
● CarbonDate, Warrick, Synchronicity
● Preserve Me!
● WARCreate,WAIL, Mink
● Browsertrix
● And many more...
Memento
<http://example.com>; rel="original",
<http://web.archive.org/web/20020120142510/http://example.com/>;
rel="memento";
datetime="Sun, 20 Jan 2002 14:25:10 GMT",
<http://web.archive.org/web/20020328012821/http://www.example.com/>;
rel="memento";
datetime="Thu, 28 Mar 2002 01:28:21 GMT",
<http://webarchive.loc.gov/all/20020803080544/http://www.example.com/>;
rel="memento";
datetime="Sat, 03 Aug 2002 08:05:44 GMT",
<http://wayback.archive-it.org/all/20091213015014/http://www.example.com/>;
rel="memento";
datetime="Sun, 13 Dec 2009 01:50:14 GMT",
Archive X-Ray!
● How much of the Web is archived?
● Profiling various archive services
● Predicting what they contain
● Routing Memento aggregator queries
MemGator
https://github.com/oduwsdl/memgator
MemGator
http://memgator.cs.odu.edu:1208/
Memento Aggregator
Memento Aggregator
Memento Aggregator
Memento Aggregator
Memento Aggregator
Memento Aggregator
From: Michael Nelson [mailto:mln@cs.odu.edu]
Sent: Wednesday, December 02, 2015 12:33 PM
To: Jones, Gina
Cc: Rourke, Patrick; Grotke, Abigail
Subject: Re: WebSciDL
Hi Gina, I'll investigate. memgator is software that one my students
wrote, but I suspect the traffic you're seeing is b/c it is deployed in
http://oldweb.today/ can you share the IP addr from where you're
seeing the traffic? I presume the requests are for Memento TimeMaps?
It should not being actually scraping HTML pages.
regards,
Michael
On Wed, 2 Dec 2015, Jones, Gina wrote:
> Hi Michael, we have a slight configuration issue with the current OW
> set up for our webarchives. I think, from looking at the logs, that
> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on
our wayback.
> Do you know who is running this scraper? Itʼs not part of memento is
it?
>
> Gina Jones
> Web Archiving Team
> Library of Congress
From: Ilya Kreymer <ikreymer@gmail.com>
Date: Wed, 2 Dec 2015 10:33:56 -0800
Subject: high traffic on oldweb!
To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam
<ibnesayeed@gmail.com>
Hi Herbert, Sawood,
Herbert: Perhaps you are lucky that I am not using the LANL
aggregator, as the traffic has gotten really high, and also I was asked
to remove an archive due to the traffic it was causing temporarily..
I am thinking that ability to remove source archives quickly is an
important aspect of an aggregator.
Sawood: Hopefully yours will support something like this so I don't
need to restart the container to change the archivelist ;)
Ilya
Broadcasting is Bad
Memento Routing
Long Tail of Archives
While the IA was Down...
$ memgator -f cdxj example.org | cut -c-4 | grep -v "^@" | uniq -c
2 2002
1 2005
1 2008
6 2009
67 2010
17 2011
64 2012
108 2013
108 2014
186 2015
51 2016
Archive Profile
● High-level summary of an archive
● Predicts presence of mementos
● Provides statistics about the holdings
● Small in size and publicly available
● Easy to update and partially patch
● Useful for Memento query routing and
other things
com,cnn)/ {“frequency”: 40, “spread”: 2}
uk,co,bbc)/ {“frequency”: 20, “spread”: 1}
com,usatoday)/ {“frequency”: 5, “spread”: 1}
Research Opportunities
● Information retrieval
● Information visualization
● Client and server side archiving
● Archiving dynamic content
● Distributed archiving
● Discovering alternate long term archiving
techniques
● Predicting “Important” events on the Web
and archiving them timely
Web Science and Digital
Libraries Research Group
ws-dl.cs.odu.edu
ws-dl.blogspot.com
@WebSciDL
github.com/oduwsdl
flickr.com/photos/124419986@N07
ODU Sailing Center
WSDL Feast
WSDL Whiteboards
WSDL Surprise
WSDL Ping Pong Table
WSDL Travels
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk, Virginia - 23529 (USA)
salam@cs.odu.edu
ibnesayeed@gmail.com
@ibnesayeed
www.cs.odu.edu/~salam

More Related Content

What's hot

DBpedia Archive using Memento, Triple Pattern Fragments, and HDT
DBpedia Archive using Memento, Triple Pattern Fragments, and HDTDBpedia Archive using Memento, Triple Pattern Fragments, and HDT
DBpedia Archive using Memento, Triple Pattern Fragments, and HDT
Herbert Van de Sompel
 
IPTC News in JSON Spring 2013
IPTC News in JSON Spring 2013IPTC News in JSON Spring 2013
IPTC News in JSON Spring 2013
Stuart Myles
 

What's hot (20)

InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web ArchivingInterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
 
Avoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerAvoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorker
 
DBpedia Archive using Memento, Triple Pattern Fragments, and HDT
DBpedia Archive using Memento, Triple Pattern Fragments, and HDTDBpedia Archive using Memento, Triple Pattern Fragments, and HDT
DBpedia Archive using Memento, Triple Pattern Fragments, and HDT
 
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred RepresentationsScripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
 
Archive What I See Now: Personal Web Archiving with WARCs
Archive What I See Now: Personal Web Archiving with WARCsArchive What I See Now: Personal Web Archiving with WARCs
Archive What I See Now: Personal Web Archiving with WARCs
 
Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMap
 
The Web We Want
The Web We WantThe Web We Want
The Web We Want
 
Linked Open Data stuff
Linked Open Data stuffLinked Open Data stuff
Linked Open Data stuff
 
Semantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologistsSemantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologists
 
Flagis linked open_data_stijn_goedertier
Flagis linked open_data_stijn_goedertierFlagis linked open_data_stijn_goedertier
Flagis linked open_data_stijn_goedertier
 
A Framework for Aggregating Private and Public Web Archives
A Framework for Aggregating Private and Public Web ArchivesA Framework for Aggregating Private and Public Web Archives
A Framework for Aggregating Private and Public Web Archives
 
Memento 101
Memento 101Memento 101
Memento 101
 
To the Rescue of the Orphans of Scholarly Communication
To the Rescue of the Orphans of Scholarly CommunicationTo the Rescue of the Orphans of Scholarly Communication
To the Rescue of the Orphans of Scholarly Communication
 
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
 
IPTC News in JSON Spring 2013
IPTC News in JSON Spring 2013IPTC News in JSON Spring 2013
IPTC News in JSON Spring 2013
 
It is hard to compute fixity on archived web pages
It is hard to compute fixity on archived web pagesIt is hard to compute fixity on archived web pages
It is hard to compute fixity on archived web pages
 
Creating Pockets of Persistence
Creating Pockets of PersistenceCreating Pockets of Persistence
Creating Pockets of Persistence
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count
 
Achieving Link Integrity for Managed Collections
Achieving Link Integrity for Managed CollectionsAchieving Link Integrity for Managed Collections
Achieving Link Integrity for Managed Collections
 
Web Data Management with RDF
Web Data Management with RDFWeb Data Management with RDF
Web Data Management with RDF
 

Viewers also liked

Visualizing Digital Collections at Archive-It - Jcdl 2012
Visualizing Digital Collections at Archive-It - Jcdl 2012Visualizing Digital Collections at Archive-It - Jcdl 2012
Visualizing Digital Collections at Archive-It - Jcdl 2012
Kalpesh Padia
 
MS Thesis Defense, Aug 2012 - Visualizing Digital Collections at Archive-It
MS Thesis Defense, Aug 2012 - Visualizing Digital Collections at Archive-ItMS Thesis Defense, Aug 2012 - Visualizing Digital Collections at Archive-It
MS Thesis Defense, Aug 2012 - Visualizing Digital Collections at Archive-It
Kalpesh Padia
 

Viewers also liked (6)

Discovering Scholarly Orphans Using ORCID
Discovering Scholarly Orphans Using ORCIDDiscovering Scholarly Orphans Using ORCID
Discovering Scholarly Orphans Using ORCID
 
Local Memory Project
Local Memory ProjectLocal Memory Project
Local Memory Project
 
Visualizing Digital Collections at Archive-It - Jcdl 2012
Visualizing Digital Collections at Archive-It - Jcdl 2012Visualizing Digital Collections at Archive-It - Jcdl 2012
Visualizing Digital Collections at Archive-It - Jcdl 2012
 
MS Thesis Defense, Aug 2012 - Visualizing Digital Collections at Archive-It
MS Thesis Defense, Aug 2012 - Visualizing Digital Collections at Archive-ItMS Thesis Defense, Aug 2012 - Visualizing Digital Collections at Archive-It
MS Thesis Defense, Aug 2012 - Visualizing Digital Collections at Archive-It
 
A Collaborative, Secure, and Private InterPlanetary Wayback Web Archiving Sys...
A Collaborative, Secure, and Private InterPlanetary Wayback Web Archiving Sys...A Collaborative, Secure, and Private InterPlanetary Wayback Web Archiving Sys...
A Collaborative, Secure, and Private InterPlanetary Wayback Web Archiving Sys...
 
Dockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationDockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to Containerization
 

Similar to Introducing Web Archiving and WSDL Research Group

Keeping Web Records Lg Web Network August 2009
Keeping Web Records Lg Web Network August 2009Keeping Web Records Lg Web Network August 2009
Keeping Web Records Lg Web Network August 2009
Cassie Findlay
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital Preservation
Mat Kelly
 
Open (linked) bibliographic data edmund chamberlain (university of cambridge)
Open (linked) bibliographic data   edmund chamberlain (university of cambridge)Open (linked) bibliographic data   edmund chamberlain (university of cambridge)
Open (linked) bibliographic data edmund chamberlain (university of cambridge)
RDTF-Discovery
 

Similar to Introducing Web Archiving and WSDL Research Group (20)

Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingWho Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
 
On the Change in Archivability of Websites Over Time
On the Change in Archivability of Websites Over TimeOn the Change in Archivability of Websites Over Time
On the Change in Archivability of Websites Over Time
 
Art and Science of Web Sites Performance: A Front-end Approach
Art and Science of Web Sites Performance: A Front-end ApproachArt and Science of Web Sites Performance: A Front-end Approach
Art and Science of Web Sites Performance: A Front-end Approach
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Keeping Web Records Lg Web Network August 2009
Keeping Web Records Lg Web Network August 2009Keeping Web Records Lg Web Network August 2009
Keeping Web Records Lg Web Network August 2009
 
Archival Technologies
Archival TechnologiesArchival Technologies
Archival Technologies
 
Practical Data Management - ACRL DCIG Webinar
Practical Data Management - ACRL DCIG WebinarPractical Data Management - ACRL DCIG Webinar
Practical Data Management - ACRL DCIG Webinar
 
Archival Technologies 2014
Archival Technologies 2014Archival Technologies 2014
Archival Technologies 2014
 
DepositMOre: Applying tools to increase full-text content in institutional re...
DepositMOre: Applying tools to increase full-text content in institutional re...DepositMOre: Applying tools to increase full-text content in institutional re...
DepositMOre: Applying tools to increase full-text content in institutional re...
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web Bundles
 
Digital Preservation - ODU
Digital Preservation - ODUDigital Preservation - ODU
Digital Preservation - ODU
 
Digital Preservation at ODU
Digital Preservation at ODUDigital Preservation at ODU
Digital Preservation at ODU
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital Preservation
 
Caching Up and Down the Stack
Caching Up and Down the StackCaching Up and Down the Stack
Caching Up and Down the Stack
 
SPA2015: Hooman Beheshti – The Future of CDNs
SPA2015: Hooman Beheshti – The Future of CDNsSPA2015: Hooman Beheshti – The Future of CDNs
SPA2015: Hooman Beheshti – The Future of CDNs
 
Detecting Off-Topic Web Pages at #CUWARC
Detecting Off-Topic Web Pages at #CUWARCDetecting Off-Topic Web Pages at #CUWARC
Detecting Off-Topic Web Pages at #CUWARC
 
Intro to Semantic Web
Intro to Semantic WebIntro to Semantic Web
Intro to Semantic Web
 
Open (linked) bibliographic data edmund chamberlain (university of cambridge)
Open (linked) bibliographic data   edmund chamberlain (university of cambridge)Open (linked) bibliographic data   edmund chamberlain (university of cambridge)
Open (linked) bibliographic data edmund chamberlain (university of cambridge)
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunities
 

More from Sawood Alam

Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback Machine
Sawood Alam
 

More from Sawood Alam (17)

TrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesTrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web Pages
 
CDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsCDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection Insights
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback Machine
 
Supporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSupporting Web Archiving via Web Packaging
Supporting Web Archiving via Web Packaging
 
MementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkMementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination Framework
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web Archives
 
Archive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkArchive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification Framework
 
Web ARChive (WARC) File Format
Web ARChive (WARC) File FormatWeb ARChive (WARC) File Format
Web ARChive (WARC) File Format
 
MemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoMemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in Go
 
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerClient-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorker
 
TPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesTPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web Archives
 
Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web Archives
 
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
 
Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015
 
Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
 
HTTP Mailbox - Asynchronous RESTful Communication
HTTP Mailbox - Asynchronous RESTful CommunicationHTTP Mailbox - Asynchronous RESTful Communication
HTTP Mailbox - Asynchronous RESTful Communication
 

Recently uploaded

哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
ydyuyu
 
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
ayvbos
 
一比一原版奥兹学院毕业证如何办理
一比一原版奥兹学院毕业证如何办理一比一原版奥兹学院毕业证如何办理
一比一原版奥兹学院毕业证如何办理
F
 
Abu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
Abu Dhabi Escorts Service 0508644382 Escorts in Abu DhabiAbu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
Abu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
Monica Sydney
 
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsRussian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Monica Sydney
 
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi EscortsIndian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Monica Sydney
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
ydyuyu
 
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
ydyuyu
 
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
pxcywzqs
 

Recently uploaded (20)

哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
 
Call girls Service in Ajman 0505086370 Ajman call girls
Call girls Service in Ajman 0505086370 Ajman call girlsCall girls Service in Ajman 0505086370 Ajman call girls
Call girls Service in Ajman 0505086370 Ajman call girls
 
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
 
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac RoomVip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
 
Tadepalligudem Escorts Service Girl ^ 9332606886, WhatsApp Anytime Tadepallig...
Tadepalligudem Escorts Service Girl ^ 9332606886, WhatsApp Anytime Tadepallig...Tadepalligudem Escorts Service Girl ^ 9332606886, WhatsApp Anytime Tadepallig...
Tadepalligudem Escorts Service Girl ^ 9332606886, WhatsApp Anytime Tadepallig...
 
一比一原版奥兹学院毕业证如何办理
一比一原版奥兹学院毕业证如何办理一比一原版奥兹学院毕业证如何办理
一比一原版奥兹学院毕业证如何办理
 
APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53
 
Local Call Girls in Seoni 9332606886 HOT & SEXY Models beautiful and charmin...
Local Call Girls in Seoni  9332606886 HOT & SEXY Models beautiful and charmin...Local Call Girls in Seoni  9332606886 HOT & SEXY Models beautiful and charmin...
Local Call Girls in Seoni 9332606886 HOT & SEXY Models beautiful and charmin...
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
 
Abu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
Abu Dhabi Escorts Service 0508644382 Escorts in Abu DhabiAbu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
Abu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
 
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsRussian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
 
Meaning of On page SEO & its process in detail.
Meaning of On page SEO & its process in detail.Meaning of On page SEO & its process in detail.
Meaning of On page SEO & its process in detail.
 
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi EscortsIndian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
 
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
 
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
 
Mira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
Mira Road Housewife Call Girls 07506202331, Nalasopara Call GirlsMira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
Mira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
 

Introducing Web Archiving and WSDL Research Group

  • 1. Introducing Web Archiving and WSDL Research Group Sawood Alam Department of Computer Science Old Dominion University Norfolk, Virginia - 23529 (USA)
  • 2. About Me Sawood Alam Lexical Signature Web, Digital Library, Web Archiving, Ruby on Rails, PHP, HTML, CSS, JavaScript, ExtJS, Go, Urdu, RTL, Docker, and Linux. ● BTech, Jamia Millia Islamia, India, 2008 ● MSc, Old Dominion University, USA, 2013 ● PhD, Old Dominion University, USA, Current
  • 4. Agenda ● Archiving and Web archiving ● Purpose and importance ● Scope of the web archiving ● Issues and challenges ● Tools and techniques ● Memento: Time Travel for the Web ● Archive X-Ray ● Research opportunities in Web archiving ● Our WSDL Research Group
  • 5. What is an Archive? ● Accumulation of historical records ● Long term storage and preservation ● Less frequently used ● Physical or digital
  • 6. What is Web Archiving? ● Periodic snapshots of web pages ● Preserving important events on the Web ● Making archived content accessible
  • 7. Why do We Care Archiving? Web contents decay rapidly! ● To preserve the history ● To tell a story ● For evidence ● For backup ● For personal satisfaction
  • 8. Issues and Challenges ● Crawling ● Storage ● Retrieval ● Replay ● Accessibility ● Completeness ● Accuracy ● Credibility
  • 9. Web Archiving Efforts ● Internet Archive ● Archive-It ● Wikipedia ● UK Web Archive ● Various national and non-profit archives ● Film, music and other multimedia archives ● Scholarly archives ● Personal archiving
  • 10. Tools and Techniques ● Heritrix, PhantomJS, WGet, cURL ● OpenWayback, PyWB ● TimeTravel, MemGator ● CarbonDate, Warrick, Synchronicity ● Preserve Me! ● WARCreate,WAIL, Mink ● Browsertrix ● And many more...
  • 11. Memento <http://example.com>; rel="original", <http://web.archive.org/web/20020120142510/http://example.com/>; rel="memento"; datetime="Sun, 20 Jan 2002 14:25:10 GMT", <http://web.archive.org/web/20020328012821/http://www.example.com/>; rel="memento"; datetime="Thu, 28 Mar 2002 01:28:21 GMT", <http://webarchive.loc.gov/all/20020803080544/http://www.example.com/>; rel="memento"; datetime="Sat, 03 Aug 2002 08:05:44 GMT", <http://wayback.archive-it.org/all/20091213015014/http://www.example.com/>; rel="memento"; datetime="Sun, 13 Dec 2009 01:50:14 GMT",
  • 12. Archive X-Ray! ● How much of the Web is archived? ● Profiling various archive services ● Predicting what they contain ● Routing Memento aggregator queries
  • 13.
  • 14.
  • 15.
  • 16.
  • 25. From: Michael Nelson [mailto:mln@cs.odu.edu] Sent: Wednesday, December 02, 2015 12:33 PM To: Jones, Gina Cc: Rourke, Patrick; Grotke, Abigail Subject: Re: WebSciDL Hi Gina, I'll investigate. memgator is software that one my students wrote, but I suspect the traffic you're seeing is b/c it is deployed in http://oldweb.today/ can you share the IP addr from where you're seeing the traffic? I presume the requests are for Memento TimeMaps? It should not being actually scraping HTML pages. regards, Michael On Wed, 2 Dec 2015, Jones, Gina wrote: > Hi Michael, we have a slight configuration issue with the current OW > set up for our webarchives. I think, from looking at the logs, that > "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback. > Do you know who is running this scraper? Itʼs not part of memento is it? > > Gina Jones > Web Archiving Team > Library of Congress From: Ilya Kreymer <ikreymer@gmail.com> Date: Wed, 2 Dec 2015 10:33:56 -0800 Subject: high traffic on oldweb! To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam <ibnesayeed@gmail.com> Hi Herbert, Sawood, Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has gotten really high, and also I was asked to remove an archive due to the traffic it was causing temporarily.. I am thinking that ability to remove source archives quickly is an important aspect of an aggregator. Sawood: Hopefully yours will support something like this so I don't need to restart the container to change the archivelist ;) Ilya Broadcasting is Bad
  • 27. Long Tail of Archives
  • 28. While the IA was Down... $ memgator -f cdxj example.org | cut -c-4 | grep -v "^@" | uniq -c 2 2002 1 2005 1 2008 6 2009 67 2010 17 2011 64 2012 108 2013 108 2014 186 2015 51 2016
  • 29. Archive Profile ● High-level summary of an archive ● Predicts presence of mementos ● Provides statistics about the holdings ● Small in size and publicly available ● Easy to update and partially patch ● Useful for Memento query routing and other things com,cnn)/ {“frequency”: 40, “spread”: 2} uk,co,bbc)/ {“frequency”: 20, “spread”: 1} com,usatoday)/ {“frequency”: 5, “spread”: 1}
  • 30. Research Opportunities ● Information retrieval ● Information visualization ● Client and server side archiving ● Archiving dynamic content ● Distributed archiving ● Discovering alternate long term archiving techniques ● Predicting “Important” events on the Web and archiving them timely
  • 31. Web Science and Digital Libraries Research Group ws-dl.cs.odu.edu ws-dl.blogspot.com @WebSciDL github.com/oduwsdl flickr.com/photos/124419986@N07
  • 36. WSDL Ping Pong Table
  • 38. Sawood Alam Department of Computer Science Old Dominion University Norfolk, Virginia - 23529 (USA) salam@cs.odu.edu ibnesayeed@gmail.com @ibnesayeed www.cs.odu.edu/~salam