This presentation was given at the RRAC meeting on 2010-10-20. It summarizes the Digital Preservation research efforts at Old Dominion University.
1. Digital Preservation Research
at Old Dominion University
Justin F. Brunelle
The MITRE Corporation
Old Dominion University
(And hopefully MITRE, soon)
2. Why are we listening?
• Overview of the problem
• BRIEF introduction to ODU WSDL group research
• Memento
• I’ll be skipping around, so don’t hesitate to interrupt me
3. Digital Preservation
• Using the past Web
– Focus of our research
• Temporal Browsing
– Sessions in the past
• Recovering Lost Pages
– Is it really gone?
• 404s
– How to fix broken links?
4. Change on the Web
• Case 1: the same URI maps to the same or very similar content at a later time
• Case 2: the same URI maps to different content at a later time
• Case 3: a different URI maps to the same or very similar content at the same or a later time
• Case 4: the content cannot be found at any URI (404)
[Diagram: (URI, content) pairs such as (U1, C1) observed at times A and B, illustrating each case.]
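The four change cases above can be sketched as a small classifier over (URI, content) observations at two times. This is an illustrative sketch only; the `Change` enum and `classify` function are hypothetical names, and real systems would use content similarity (e.g., lexical signatures) rather than strict equality:

```python
from enum import Enum

class Change(Enum):
    SAME_CONTENT = 1     # case 1: same URI, same (or very similar) content later
    CHANGED_CONTENT = 2  # case 2: same URI, different content later
    MOVED_CONTENT = 3    # case 3: different URI, same content
    LOST = 4             # case 4: content not found at any URI (404)

def classify(uri_a, content_a, uri_b, content_b):
    """Classify how an observation (uri_a, content_a) at time A relates
    to an observation (uri_b, content_b) at a later time B."""
    if content_b is None:
        return Change.LOST
    if uri_a == uri_b:
        return (Change.SAME_CONTENT if content_a == content_b
                else Change.CHANGED_CONTENT)
    # Different URI serving the same content: the page has moved.
    return Change.MOVED_CONTENT
```

For example, `classify("u1", "c1", "u2", "c1")` corresponds to the diagram's third case, where U2 now serves C1.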
5. Time to Talk About Saving Everything?
• Dinner for one or two costs more than a 1 TB disk
• Wikis have popularized versioning
• Cool URIs (http://www.w3.org/Provider/Style/URI.html) are widely adopted, e.g.:
http://news.yahoo.com/s/ap/20100920/ap_on_el_se/us_alaska_senate
http://d.yimg.com/a/p/ap/20100918/capt.67567dbc0a874b689f0b4a5c392f379c-67567dbc0a874b689f0b4a5c392f379c-0.jpg
http://d.yimg.com/a/p/afp/20100918/thumb.photo_1284846332993-1-0.jpg
• Also related projects with a cool URI / permalink focus:
http://www.citability.org/
http://data.gov/
http://data.gov.uk/
6. Fortress Model
• Get a lot of money
• Buy lots of storage
• Hire lots of people
• “Look upon my archive ye Mighty, and despair!”
7. Alternate Methods
• Lazy Preservation (McCown)
– “How much preservation do I get if I do absolutely nothing?”
• Just-In-Time Preservation (Klein)
– Wait for it to disappear, then find a “good ‘nuff” version
• Shared Infrastructure Preservation
– Push content to sites that might preserve it
– arXiv.org, IA, WebCite…
• Server Enhanced Preservation
– Create archival-ready resources
8. And Soon…
• Social Preservation
– Preserving resources using 3rd-party Web Services
– Repository for OAI-ORE ReMs
– Social network feel
– Lazy-esque, server-side reconstruction
9. But I digress…
• Few years away…
• Preliminary research
• And now back to the prior research…
24. Finding Archived Resources
Go to http://www.archive.org/ and search for http://cnn.com.
On http://web.archive.org/web/*/http://cnn.com, select the desired datetime.
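This manual lookup can also be scripted against the Wayback Machine's availability API (a real endpoint at https://archive.org/wayback/available, which returns the closest snapshot to a given timestamp). A minimal sketch; the helper names are hypothetical and the JSON shown in the usage note is a sample response, not live data:

```python
import json
from urllib.parse import urlencode

def wayback_query(url, timestamp):
    """Build an availability-API request URL asking for the snapshot
    of `url` closest to a YYYYMMDD[hhmmss] timestamp."""
    return ("https://archive.org/wayback/available?"
            + urlencode({"url": url, "timestamp": timestamp}))

def closest_snapshot(response_text):
    """Extract the closest archived snapshot's URI from an
    availability-API JSON response, or None if nothing is archived."""
    data = json.loads(response_text)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None
```

Fetching `wayback_query("http://cnn.com", "20101020")` would yield JSON like `{"archived_snapshots": {"closest": {"available": true, "url": "http://web.archive.org/web/20101020000000/http://cnn.com/", ...}}}`, from which `closest_snapshot` pulls the archived URI.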
27. Current and Past Web are Not Integrated
• Current and Past Web are based on the same technology.
• But going from the Current to the Past Web is a matter of (manual) discovery.
• Memento wants to make going from the Current to the Past Web an (HTTP) protocol matter.
• Memento wants to integrate the Current and Past Web.
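Making time travel "an (HTTP) protocol matter" is what Memento (later standardized as RFC 7089) does: a client GETs a TimeGate for the original URI and sends an Accept-Datetime header; the TimeGate redirects to the memento closest to that datetime. A sketch of the client side, assuming the public Memento aggregator TimeGate at timetravel.mementoweb.org (the `memento_request` helper is a hypothetical name):

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def accept_datetime_header(dt):
    """Format an Accept-Datetime value as an RFC 1123 GMT date,
    the form Memento datetime negotiation uses."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)

def memento_request(original_uri, dt,
                    timegate="http://timetravel.mementoweb.org/timegate/"):
    """Sketch the HTTP request a Memento client sends: a GET on a
    TimeGate URI formed from the original resource's URI, carrying
    Accept-Datetime for the desired moment in the past."""
    return ("GET " + timegate + original_uri + " HTTP/1.1",
            "Accept-Datetime: " + accept_datetime_header(dt))
```

The TimeGate would answer with a 302 to a memento and a Memento-Datetime header, so past versions are reached through ordinary HTTP rather than manual archive browsing.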
43. What does it all mean?
• Cutting edge technology
• Existing Infrastructure
• Redefining Web surfing
• MAJOR “real world” implications
44. Closing Thoughts
• Preservation is not for a privileged priesthood
http://doi.acm.org/10.1145/1592761.1592794
http://booktwo.org/notebook/wikipedia-historiography/
• No more hoary stories about format obsolescence:
http://blog.dshr.org/2010/09/reinforcing-my-point.html
• Don’t desiccate resources; leave them on the web
• Endless metadata is not preservation…
• Archiving as a branded service, not infrastructure
http://blog.dshr.org/2010/06/jcdl-2010-keynote.html
45. Acknowledgements
• Slides borrowed from:
• Dr. Michael L. Nelson:
– http://www.slideshare.net/phonedude/my-point-of-view-michael-l-nelson-web-archiving-cooperative
– http://www.slideshare.net/phonedude/review-of-web-archiving
– http://www.slideshare.net/phonedude/memento-time-travel-for-the-web
• Martin Klein:
– http://www.slideshare.net/phonedude/synchronicity-justintime-discovery-of-lost-web-pages