A Research Agenda for "Obsolete Data or Resources"


Michael L. Nelson
@phonedude_mln

Web Archiving Cooperative Workshop, Stanford, June 29, 2012

Biographical Side Note…


Growing Up in Virginia…


First Job: NASA Langley Research Center


My Research Group

Be Lazy
• lazy preservation
• just-in-time preservation

Get Active
• modify server
• enhance objects

Archive Quality Better Tools
• APIs and services • ajax archiving
• object quality • temporal intention
• personal archiving


Why Care About The Past?

From an anonymous WWW 2010 reviewer about our
Memento paper (emphasis mine):

"Is there any statistics to show that many or a good number of Web
users would like to get obsolete data or resources? "


Two Common Misconceptions
about Web Archiving

• Prior = old = obsolete = bad = contaminated
– who cares, old versions are to be removed

• The Internet Archive has every copy of
everything that has ever existed
– who cares, problem solved


Current pages about the past
don't have the same impact as
pages from the past


vs.

(thanks to Michele Weigle for the following Memento selection)

What have we, the archiving community,
done wrong?


Wrong Metaphor for Web Archives


Web Archives Are Not Destinations

This is a destination. This is not a destination.


Possible Metaphor for Web Archives?


Turn Archiving Into A Social Activity

see also: http://xkcd.com/1034/


Pinterest: A First Step?

http://media-cache-ec3.pinterest.com/upload/47639708527755289_AhxhItiQ_c.jpg
is a memento of:
http://3.bp.blogspot.com/_d0vByWRfhvU/S_Ygk_oX4xI/AAAAAAAACCQ/LXgC3S0KYEo/s400/_MG_8091.jpg
but there is no machine-readable indication of this relationship
repins are by-reference


Why doesn't the web have a better
notion of time?


TBL on Generic vs. Specific Resources

http://www.w3.org/DesignIssues/Generic.html


In The Beginning… there was the inode

struct stat {
dev_t st_dev; /* ID of device containing file */
ino_t st_ino; /* inode number */
mode_t st_mode; /* protection */
nlink_t st_nlink; /* number of hard links */
uid_t st_uid; /* user ID of owner */
gid_t st_gid; /* group ID of owner */
dev_t st_rdev; /* device ID (if special file) */
off_t st_size; /* total size, in bytes */
blksize_t st_blksize; /* blocksize for filesystem I/O */
blkcnt_t st_blocks; /* number of blocks allocated */
time_t st_atime; /* time of last access */
time_t st_mtime; /* time of last modification */
time_t st_ctime; /* time of last status change */
};
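
As a rough illustration of how little time information the inode carries, the sketch below reads the three timestamps for a scratch file through Python's os.stat. The helper name file_times and the temporary file are assumptions made for illustration; each field holds only its most recent value, with no history.

```python
import os
import tempfile
from datetime import datetime, timezone


def file_times(path):
    """Return the three inode timestamps as timezone-aware UTC datetimes."""
    st = os.stat(path)
    return {
        "atime": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),  # last access
        "mtime": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),  # last modification
        "ctime": datetime.fromtimestamp(st.st_ctime, tz=timezone.utc),  # last status change
    }


# Hypothetical scratch file, used only so the example has something to stat.
with tempfile.NamedTemporaryFile() as tmp:
    times = file_times(tmp.name)
    for name, value in times.items():
        print(name, value.isoformat())
```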


Limited Time Semantics…
% telnet www.digitalpreservation.gov 80
Trying 140.147.249.7...
Connected to www.digitalpreservation.gov.
Escape character is '^]'.
HEAD /images/ndiipp_header6.jpg HTTP/1.1
Host: www.digitalpreservation.gov
Connection: close

HTTP/1.1 200 OK
Date: Mon, 19 Jul 2010 21:41:04 GMT
Server: Apache
Last-Modified: Thu, 18 Jun 2009 16:25:54 GMT
ETag: "1bc861-10935-dca24880"
Accept-Ranges: bytes
Content-Length: 67893
Connection: close
Content-Type: image/jpeg

Connection closed by foreign host.

Time Semantics Becoming Less,
Not More Available
Trying 140.147.249.7...
HEAD / HTTP/1.1
Connection: close

HTTP/1.1 200 OK
Date: Mon, 19 Jul 2010 21:36:00 GMT
Server: Apache
Connection: close
Content-Type: text/html
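
A minimal sketch of the comparison made on these two slides: given the raw response headers, report which time-related headers are present. The header text is copied from the two responses above; the function names parse_headers and time_semantics are hypothetical.

```python
from email.utils import parsedate_to_datetime

# Response headers that say something about when the representation changed.
TIME_HEADERS = ("last-modified", "etag")

IMAGE_RESPONSE = """\
HTTP/1.1 200 OK
Date: Mon, 19 Jul 2010 21:41:04 GMT
Server: Apache
Last-Modified: Thu, 18 Jun 2009 16:25:54 GMT
ETag: "1bc861-10935-dca24880"
Content-Type: image/jpeg
"""

ROOT_RESPONSE = """\
HTTP/1.1 200 OK
Date: Mon, 19 Jul 2010 21:36:00 GMT
Server: Apache
Connection: close
Content-Type: text/html
"""


def parse_headers(raw):
    """Parse a raw response head into a dict keyed by lowercase header name."""
    headers = {}
    for line in raw.strip().splitlines():
        name, sep, value = line.partition(":")
        if sep:  # the status line has no colon and is skipped
            headers[name.strip().lower()] = value.strip()
    return headers


def time_semantics(raw):
    """Return only the time-related headers found in the response."""
    headers = parse_headers(raw)
    return {name: headers[name] for name in TIME_HEADERS if name in headers}


image = time_semantics(IMAGE_RESPONSE)
root = time_semantics(ROOT_RESPONSE)
print("image:", image)
print("root: ", root)
print("image last modified:", parsedate_to_datetime(image["last-modified"]).date())
```

The static image still exposes Last-Modified and ETag; the dynamically generated root page exposes neither.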



The Past Links to the Present…

explicit HTML link;
no HTTP links;
opaque URI


The Past Links to the Present…

no HTML links;
no HTTP links;
implicit from URI


But the Present Does Not Link to the Past
no hints in HTML,
HTTP, or URI

Trying 140.147.249.7...
HEAD / HTTP/1.1
Connection: close

HTTP/1.1 200 OK
Date: Mon, 19 Jul 2010 21:36:00 GMT
Server: Apache
Connection: close
Content-Type: text/html



Linking the Past and the Present

• Codify existing methods to create linkage from the
past to the present
– easy: an archived version knows for which URI it is an
archived version
• Create a linkage from the present to the past
– hard: solve with a level of indirection from present to past
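
One way to picture the two directions of linkage is as HTTP Link header values. The sketch below is an assumption-laden illustration, not the Memento implementation: the example.org URIs and the function names are hypothetical, and only the rel values "original" and "timegate" are taken from the TimeMap shown later in this deck.

```python
def link_value(uri, rel):
    """Format one Link header value in the <uri>;rel="..." style."""
    return f'<{uri}>;rel="{rel}"'


def links_for_memento(original_uri, timegate_uri):
    """Past to present: an archived version names the URI it is a version of."""
    return ", ".join([
        link_value(original_uri, "original"),
        link_value(timegate_uri, "timegate"),
    ])


def links_for_original(timegate_uri):
    """Present to past: the live resource points at a level of indirection."""
    return link_value(timegate_uri, "timegate")


# Hypothetical URIs, for illustration only.
ORIGINAL = "http://www.example.org/"
TIMEGATE = "http://archive.example.org/timegate/http://www.example.org/"

print("Link:", links_for_memento(ORIGINAL, TIMEGATE))
print("Link:", links_for_original(TIMEGATE))
```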


The Web with Time Dimension added by Memento


The archival record is incomplete


Va Tech Shooting -- Only 3 Mementos

do you remember when
it was thought to be a
domestic disturbance of
limited scope and they
had a suspect in custody?


Palin Crosshairs and takebackthe20.com
This website
was published in
fall of 2010

January 8, 2011: 6 dead, 14 wounded, including a critically injured Giffords
later that day, takebackthe20.com is taken offline
(see: http://huff.to/QnHA6x -- it notes the absence of the page in the Wayback Machine without mention of the 6-12 month quarantine)


What Was The Original Image?

the present web mostly
agrees, but there are
variations on the theme…


Timemap for takebackthe20.com
% curl http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://www.takebackthe20.com/
<http://mementoproxy.cs.odu.edu/aggr/timebundle/http://www.takebackthe20.com/>;rel="timebundle",
<http://www.takebackthe20.com/>;rel="original",
<http://http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://www.takebackthe20.com/>
;rel="timemap";type="application/link-format",
<http://mementoproxy.cs.odu.edu/aggr/timegate/http://www.takebackthe20.com/>;rel="timegate",
<http://api.wayback.archive.org/memento/20100925222153/http://www.takebackthe20.com/>
;rel="first memento";datetime="Sat, 25 Sep 2010 22:21:53 GMT",
<http://api.wayback.archive.org/memento/20100926095121/http://www.takebackthe20.com/>;rel="memento"
;datetime="Sun, 26 Sep 2010 09:51:21 GMT",
;datetime="Fri, 01 Oct 2010 17:53:13 GMT",
[deletion of about 11 mementos]
;datetime="Thu, 02 Dec 2010 22:41:45 GMT",
;datetime="Thu, 02 Dec 2010 23:17:59 GMT",
<http://api.wayback.archive.org/memento/20101206123128/http://www.takebackthe20.com/>
;rel="last memento";datetime="Mon, 06 Dec 2010 12:31:28 GMT"

The last memento is about 1 month before the shooting.
Ironically, we can document the original image, but not
the post-shooting event. www.takebackthe20.com is now
an anti-Palin lapsed domain.
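
A TimeMap in link format can be read with a few lines of code. The sketch below is a minimal parser under simplifying assumptions (every parameter value is double-quoted, URIs contain no ">" character); the function names are hypothetical, and the sample keeps only three entries from the TimeMap above. It also measures the gap between the last memento and January 8, 2011.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# One entry: <uri> followed by zero or more ;name="value" parameters.
LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM = re.compile(r'([\w-]+)="([^"]*)"')

SAMPLE = (
    '<http://www.takebackthe20.com/>;rel="original",\n'
    '<http://api.wayback.archive.org/memento/20100925222153/http://www.takebackthe20.com/>\n'
    ';rel="first memento";datetime="Sat, 25 Sep 2010 22:21:53 GMT",\n'
    '<http://api.wayback.archive.org/memento/20101206123128/http://www.takebackthe20.com/>\n'
    ';rel="last memento";datetime="Mon, 06 Dec 2010 12:31:28 GMT"\n'
)


def parse_timemap(text):
    """Return one dict per link entry, with rel split into a list of relation types."""
    entries = []
    for uri, raw_params in LINK.findall(text):
        params = dict(PARAM.findall(raw_params))
        entry = {"uri": uri, "rel": params.get("rel", "").split()}
        if "datetime" in params:
            entry["datetime"] = parsedate_to_datetime(params["datetime"])
        entries.append(entry)
    return entries


def mementos(entries):
    """Keep only entries whose rel includes 'memento', oldest first."""
    found = [e for e in entries if "memento" in e["rel"]]
    return sorted(found, key=lambda e: e["datetime"])


found = mementos(parse_timemap(SAMPLE))
last = found[-1]["datetime"]
shooting = datetime(2011, 1, 8, tzinfo=timezone.utc)
print("mementos in sample:", len(found))
print("days from last memento to January 8, 2011:", (shooting - last).days)
```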


Reconciling the live web with
what we find in the archives


Richard Grenell Removing His Tweets


2010 Archive of Grenell's Site…


…But the 2008 Content Is Missing


2008 Content on Live Site
But Do You Trust It?


Sci-Fi / Alternate History

http://2012.talkingpointsmemo.com/2012/06/richard-mourdock-obamacare-youtube-accident.php


Sometimes Shared Social Media Persists…


Social media archives are more than
just fodder for The Daily Show…


An Intact Tweet From the Egyptian Revolution

slide from
Hany SalahEldeen
https://twitter.com/miss_amy_qb/status/32477898581483521

These Tweets Have Lost Their Content
and Their Meaning

https://twitter.com/aishes/status/32485352102952960
Missing ?

https://twitter.com/omar_chaaban/status/32203697597452289
slide from Hany SalahEldeen

Estimating Shared Resource Loss in Social Media
for Other Socially Significant Events

to appear in
TPDL 2012

More archives = more better


Wayback Machine

http://web.archive.org/web/20030129185239/http://www4.cnn.com/
http://web.archive.org/web/20030131093102/http://cnn.com/
http://web.archive.org/web/20040102095249/http://www3.cnn.com/
etc.


URI Rewriting Makes for Nice Archives

The link to: http://i.cdn.turner.com/cnn/2009/TRAVEL/10/26/overseas.visitors.travel/c1main.liberty.gi.jpg
using Javascript is dynamically rewritten to:
http://web.archive.org/web/20091027043308/http://i.cdn.turner.com/cnn/2009/TRAVEL/10/26/overseas.visitors.travel/c1main.liberty.gi.jpg
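
The rewriting shown above follows a simple pattern: archive prefix, 14-digit timestamp, original URI. A minimal sketch of both directions, assuming only the URI shape visible on these slides (the function names are hypothetical):

```python
import re

WAYBACK_PREFIX = "http://web.archive.org/web/"
# 14-digit timestamp (YYYYMMDDhhmmss) followed by the original URI.
WAYBACK = re.compile(r"^http://web\.archive\.org/web/(\d{14})/(.+)$")


def to_archive_uri(original_uri, timestamp):
    """Rewrite an original URI so that it points into the archive."""
    return f"{WAYBACK_PREFIX}{timestamp}/{original_uri}"


def from_archive_uri(archive_uri):
    """Split an archive URI into (timestamp, original URI), or None if it does not match."""
    match = WAYBACK.match(archive_uri)
    return (match.group(1), match.group(2)) if match else None


image = "http://i.cdn.turner.com/cnn/2009/TRAVEL/10/26/overseas.visitors.travel/c1main.liberty.gi.jpg"
rewritten = to_archive_uri(image, "20091027043308")
print(rewritten)
print(from_archive_uri("http://web.archive.org/web/20030129185239/http://www4.cnn.com/"))
```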


Many Archives/Caches Do Not Rewrite URIs

Cached version of cnn.com (html only):
http://webcache.googleusercontent.com/search?q=cache%3Acnn.com
But images, for example, are not relative to SE cache; they're still at:
http://i2.cdn.turner.com/cnn/2010/POLITICS/09/23/un.ahmadinejad.walkouts/t1main.ahmadinejad.afp.gi.jpg


Some Web Sites Are Just "scp -r"
(implicit archives!)

http://www.jcdl2007.org/ http://www.jcdl.org/archived-conf-sites/jcdl2007/


URI Rewriting is Great --
Until Something Goes Wrong…

http://web.archive.org/web/20080302121117/http://www.thecribs.com/

http://web.archive.org/web/20100923232312/http://www.thecribs.com/aa/banners/itunes.gif


Where Else Could …/itunes.gif Be?

Paradox: URI rewriting makes archives
useful for interactive browsing, but it
actively inhibits interoperability -- your
session becomes trapped in an archive

How can you escape the gravitational
pull of IA's Wayback Machine and other
large archives? You'd like to start an
archive, but yours will never be as "good"
as theirs…


Long Tail of Archives


More Archives, More Mementos!

1000 URIs sampled from delicious.com; 1 dot = 1 Memento (x-axis=date of Memento,
y-axis=URI of Original Resource); sorted by URI longevity
How Much of the Web is Archived? JCDL 2011

For Some Collections, Still Too Few Mementos To Be Found…

1000 URIs sampled from search engine result pages;
preference for popular pages removed.
note to self: it is better to be popular.
How Much of the Web is Archived? JCDL 2011

More archives reduce
archival uncertainty


No Uncertainty With Self-Archiving Systems
foo.html has <img src=pic.gif>

t0 t1 t2 t3 t4 t5 t6 t7
| | | | | | | |
foo.html foo.html foo.html foo.html

pic.gif pic.gif pic.gif pic.gif


foo.html @ t4

t0 t1 t2 t3 t4 t5 t6 t7
| | | | | | | |


GET /foo.html GET /pic.gif
Accept-Datetime: t4 Accept-Datetime: t4

HTTP/1.1 200 OK HTTP/1.1 200 OK
Memento-Datetime: t4 Memento-Datetime: t0


foo.html @ t4

t0 t1 t2 t3 t4 t5 t6 t7
| | | | | | | |



HTTP/1.1 200 OK HTTP/1.1 200 OK

foo.html correct pic.gif correct
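
A self-archiving system knows every change it made, so answering a request for time t is a lookup of the latest version at or before t. A minimal sketch, assuming integer time slots t0..t7 and a hypothetical change schedule chosen to match the slide (foo.html changed at t4; pic.gif had not changed since t0):

```python
from bisect import bisect_right

# Hypothetical change times, in slots t0..t7; the real schedule is not given.
FOO_CHANGES = [0, 2, 4, 6]
PIC_CHANGES = [0, 5, 6, 7]


def memento_datetime(change_times, accept_datetime):
    """Return the time of the latest version at or before accept_datetime, or None."""
    i = bisect_right(change_times, accept_datetime)
    return change_times[i - 1] if i else None


foo_at_t4 = memento_datetime(FOO_CHANGES, 4)
pic_at_t4 = memento_datetime(PIC_CHANGES, 4)
print(f"foo.html  Accept-Datetime: t4  ->  Memento-Datetime: t{foo_at_t4}")
print(f"pic.gif   Accept-Datetime: t4  ->  Memento-Datetime: t{pic_at_t4}")
```

Because the server recorded every change, the pair (foo@t4, pic@t0) is exactly what a visitor would have seen at t4.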


Uncertainty in Third-Party Archives

t0 t1 t2 t3 t4 t5 t6 t7
| | | | | | | |



Missed Updates

t0 t1 t2 t3 t4 t5 t6 t7
| | | | | | | |
foo.html foo.html foo.html foo.html foo.html

pic.gif pic.gif pic.gif pic.gif pic.gif pic.gif

red italics = missed updates


foo.html @ t4

t0 t1 t2 t3 t4 t5 t6 t7
| | | | | | | |



HTTP/1.1 200 OK HTTP/1.1 200 OK

foo.html correct pic.gif incorrect
(should be t4)


foo.html @ t4

t0 t1 t2 t3 t4 t5 t6 t7
| | | | | | | |



HTTP/1.1 200 OK HTTP/1.1 200 OK

foo.html correct pic.gif incorrect
(should be t4)
this combination (foo@t4, pic@t0) never existed!
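
The failure can be sketched by comparing what actually changed with what the archive happened to capture. The change and capture times below are hypothetical, and the sketch assumes the archive returns its latest capture at or before the requested time; real archives may use a different selection rule.

```python
from bisect import bisect_right

# Hypothetical schedule in slots t0..t7: pic.gif changed six times,
# but the third-party archive captured only three of those versions.
PIC_CHANGES = [0, 1, 4, 5, 6, 7]
PIC_CAPTURES = [0, 5, 7]


def latest_at_or_before(times, t):
    """Return the latest entry of a sorted list that is <= t, or None."""
    i = bisect_right(times, t)
    return times[i - 1] if i else None


def check_embedded(true_changes, captures, t):
    """Compare the version that was live at t with the version the archive serves."""
    expected = latest_at_or_before(true_changes, t)
    served = latest_at_or_before(captures, t)
    return {"expected": expected, "served": served, "coherent": expected == served}


result = check_embedded(PIC_CHANGES, PIC_CAPTURES, 4)
print(result)
```

With these assumed times the archive serves pic@t0 for a page captured at t4, although the version live at t4 was pic@t4.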


Decrease Uncertainty With More Observations?

t0 t1 t2 t3 t4 t5 t6 t7
| | | | | | | |


red italics = missed updates


Reaching Through Time

% grep "^GET /web/20.*HTTP/1.1" cnn-ia-headers | awk -F"/" '{print $3}' | sort -u
20091026133351js_
20091026133356
20091026133359js_
20091026133425
20091026133427
20091026133430js_
20091026133438
20091026133441
20091026133443
20091026133446
20091026133448
…[deletia]…
20091027220018
20091027220027
20091027220237
20091027220248
20091027224745
20100923125259 ???
20100923125330 ???

first was: 2009-10-26 13:33:51
root was: 2009-10-27 04:33:08
end was: 2009-10-27 22:47:45
root - first ~= 15 hours
end - first ~= 33 hours

http://web.archive.org/web/20091027043308/http://www.cnn.com/index.html
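
The spread can be recomputed directly from the 14-digit timestamps. A small sketch, using the first, root, and end values listed above (the helper names are hypothetical; a trailing modifier such as js_ is ignored):

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse a 14-digit YYYYMMDDhhmmss timestamp, ignoring any trailing modifier."""
    return datetime.strptime(value[:14], "%Y%m%d%H%M%S")


def hours_between(earlier, later):
    """Elapsed time between two timestamps, in hours."""
    return (parse_timestamp(later) - parse_timestamp(earlier)).total_seconds() / 3600


FIRST = "20091026133351js_"  # earliest embedded resource
ROOT = "20091027043308"      # the page that was requested
END = "20091027224745"       # latest embedded resource from the same crawl period

root_minus_first = hours_between(FIRST, ROOT)
end_minus_first = hours_between(FIRST, END)
print(f"root - first = {root_minus_first:.1f} hours")
print(f"end - first  = {end_minus_first:.1f} hours")
```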


~33 Hours? How About ~8 Years?

single archive only with multiple archives

see: http://spread.cs.odu.edu/root/http%3A%252F%252Fanthraxinvestigation.com%252Findex.html/


Publishing and archiving are in a race.
Publishing is winning.


Ajax = #noarchive

http://web.archive.org/web/*/http://maps.google.com/
http://web.archive.org/web/20091026210613/http://maps.google.com/
http://web.archive.org/web/20091026210613/http://maps.google.com/?output=html&oi=slow

Reaching Out From the Archive

% grep Host: cnn-ia-headers | wc -l
288
% grep Host: cnn-ia-headers | grep -v archive.org | wc -l
117
% grep Host: cnn-ia-headers | grep -v archive.org | sort -u
Host: ad.doubleclick.net
Host: ads.adsonar.com
Host: ads.cnn.com
Host: aranet.vo.llnwd.net
Host: b.scorecardresearch.com
Host: bs.serving-sys.com
Host: cnn.dyn.cnn.com
Host: ds.serving-sys.com
Host: gdyn.cnn.com
Host: i.cdn.turner.com
Host: i2.cdn.turner.com
Host: js.adsonar.com
Host: metrics.cnn.com
Host: pix04.revsci.net
Host: s0.2mdn.net
Host: symbolcomplete.marketwatch.com
Host: www.adfusion.com
http://web.archive.org/web/20091027043308/http://www.cnn.com/index.html
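
The same count can be sketched in a few lines: collect the Host headers and keep those that fall outside the archive. The sample lines below are a hypothetical stand-in for the cnn-ia-headers file, and the suffix test is a slightly stricter filter than grep -v archive.org.

```python
from collections import Counter

# Hypothetical excerpt standing in for the captured request headers.
SAMPLE_LINES = [
    "GET /web/20091027043308/http://www.cnn.com/index.html HTTP/1.1",
    "Host: web.archive.org",
    "Host: i.cdn.turner.com",
    "Host: ads.cnn.com",
    "Host: web.archive.org",
    "Host: i.cdn.turner.com",
]


def external_hosts(lines, archive_suffix="archive.org"):
    """Count requests whose Host header does not end with the archive's domain."""
    hosts = [
        line.split(":", 1)[1].strip()
        for line in lines
        if line.lower().startswith("host:")
    ]
    return Counter(host for host in hosts if not host.endswith(archive_suffix))


leaks = external_hosts(SAMPLE_LINES)
print("requests leaving the archive:", sum(leaks.values()))
for host in sorted(leaks):
    print("Host:", host)
```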


Embedded Resources

29 Mementos: http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://www.youtube.com/user/wichitarecordings


How Much Of What We Share Is Preservable?

local copy of http://dctheatrescene.com/

same, but with no internet


Social Resources

http://www.flickr.com/photos/mic_n_2_sugars/84882320/
1 Memento: http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://www.flickr.com/photos/mic_n_2_sugars/84882320/
http://farm1.static.flickr.com/37/84882320_67fc8915d5_z.jpg (Last-Modified: 10 Jan 2006…)
0 Mementos: http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://farm1.static.flickr.com/37/84882320_67fc8915d5_z.jpg


Archiving a user experience,
not the user experience.


Personalized Resources

GET / HTTP/1.1
Host: bit.ly
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Cookie: __utma=126736798.4156477295523165000.1251253806.1285119293.1285122783.59;
_bit=4c20df7a-003a5-07baf-91a08fa8;anon_u=cHN1X19jN2MwNjcxZC05MWNiLTQ3MmEtOGIxYy1hZDMyMWRlNzc1OTU=|
1284997489|06ac0cefc8ac369e0f9849b5fdfbbe8d077d0c65; user=cGhvbmVkdWRl|1284997489|
fdb7f02cacb3cb44416f54d83f3237ec0f7bd9b5; __utmz=126736798.1280940647.33.1.utmcsr=(direct)|utmccn=(direct)|
utmcmd=(none); _chartbeat2=ciuph6qrso6tn6w7; _xsrf=49bc661fc02845b3bcbe975d7c2f28de;
__utmb=126736798.3.10.1285122783; __utmc=126736798


Geolocated Resources

% curl -I http://www.craigslist.org
HTTP/1.1 302 Found
Set-Cookie: cl_b=12851300231056905752;path=/;domain=.craigslist.org;expires=01 Jan 2038 00:00:00 GMT
Location: http://geo.craigslist.org/

% curl -I http://geo.craigslist.org/
HTTP/1.1 302 Found
Content-Type: text/html; charset=iso-8859-1
Connection: close
Location: http://norfolk.craigslist.org
Date: Wed, 22 Sep 2010 04:33:56 GMT
Set-Cookie: cl_b=12851300363085180962;path=/;domain=.craigslist.org;expires=01 Jan 2038 00:00:00 GMT
Server: Apache

% traceroute geo.craigslist.org
traceroute to geo.craigslist.org (208.82.236.208), 64 hops max, 40 byte packets
1 ***
2 10.5.120.1 (10.5.120.1) 9.959 ms 23.004 ms 13.208 ms
3 nrfksysr02-atm151208.hr.hr.cox.net (68.10.8.117) 10.056 ms 10.561 ms 19.970 ms
4 nrfkdsrj01-ge500.0.rd.hr.cox.net (68.10.14.13) 11.142 ms 20.618 ms 10.293 ms
5 ashbbprj02-ae4.0.rd.as.cox.net (68.1.1.232) 15.368 ms 68.854 ms 20.153 ms
6 xe-3-0-0.cr2.dca2.above.net (64.125.26.241) 18.963 ms 23.674 ms 32.977 ms
7 xe-2-2-0.cr2.iah1.us.above.net (64.125.30.53) 46.201 ms 56.156 ms 46.783 ms
8 xe-1-1-0.mpr4.phx2.us.above.net (64.125.28.73) 82.616 ms 82.289 ms 84.383 ms
9 * 64.124.178.62.allocated.above.net (64.124.178.62) 80.893 ms 78.786 ms
10 511.ae9.ecore1p.craigslist.org (208.82.239.102) 95.958 ms 86.160 ms 90.115 ms
11 www.craigslist.org (208.82.236.208) 80.968 ms 91.470 ms 80.110 ms


Is there just a single web to archive?


Shadow Web: Mobile

46* Mementos: http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://twitter.com/timoreilly
0 Mementos: http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://mobile.twitter.com/timoreilly
* = 46 mementos in 2010, 22 mementos in 2012

Shadow Web: Mobile

17,000+ Mementos: http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://www.cnn.com/
140+ Mementos: http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://m.cnn.com/


Shadow Web: Linked Data

(this resource intentionally left blank)

http://en.wikipedia.org/wiki/DJ_Shadow http://dbpedia.org/resource/DJ_Shadow

Accept: text/html Accept: application/rdf+xml

http://dbpedia.org/page/DJ_Shadow http://dbpedia.org/data/DJ_Shadow

2 Mementos: http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://dbpedia.org/resource/DJ_Shadow
0 Mementos: http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://dbpedia.org/data/DJ_Shadow
0 Mementos: http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://dbpedia.org/page/DJ_Shadow
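
A simplified sketch of why one resource becomes three URIs to archive: the generic resource redirects to a different representation depending on the Accept header. The mapping and memento counts are copied from the slide; the negotiate function is a hypothetical stand-in that ignores q-values and is not DBpedia's actual logic.

```python
RESOURCE = "http://dbpedia.org/resource/DJ_Shadow"

# Media type -> representation URI, as shown on the slide.
REPRESENTATIONS = {
    "text/html": "http://dbpedia.org/page/DJ_Shadow",
    "application/rdf+xml": "http://dbpedia.org/data/DJ_Shadow",
}

# Memento counts reported on the slide.
MEMENTO_COUNTS = {
    "http://dbpedia.org/resource/DJ_Shadow": 2,
    "http://dbpedia.org/data/DJ_Shadow": 0,
    "http://dbpedia.org/page/DJ_Shadow": 0,
}


def negotiate(accept):
    """Return the representation URI for the first acceptable media type, or None."""
    for item in accept.split(","):
        media_type = item.split(";")[0].strip()
        if media_type in REPRESENTATIONS:
            return REPRESENTATIONS[media_type]
    return None


def unarchived(uris):
    """Return the URIs for which no mementos were found."""
    return [uri for uri in uris if MEMENTO_COUNTS.get(uri, 0) == 0]


browser = negotiate("text/html,application/xhtml+xml")
agent = negotiate("application/rdf+xml")
print("browser gets:", browser)
print("data agent gets:", agent)
print("not archived:", unarchived([RESOURCE, browser, agent]))
```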


A short wish list.


Use Memento-Datetime HTTP Header

% curl -I https://twitter.com/machawk1/status/218015444496416768
HTTP/1.1 200 OK
Date: Fri, 29 Jun 2012 15:50:55 GMT
Status: 200 OK
X-Transaction: 9a209a8deb15f4ba
X-Frame-Options: SAMEORIGIN
ETag: "4b79affd0f77a83019f619428f4ebaa5"
Expires: Tue, 31 Mar 1981 05:00:00 GMT
Last-Modified: Fri, 29 Jun 2012 15:50:55 GMT
Content-Type: text/html; charset=utf-8
X-Runtime: 0.20638
Cache-Control: no-cache, no-store, must-revalidate,
pre-check=0, post-check=0
Content-Length: 80501
Pragma: no-cache
Strict-Transport-Security: max-age=631138519
X-MID: 75f6c6061c2be34447493adc6c33317c61740b5f
Set-Cookie: [cookie stuff deleted]
X-XSS-Protection: 1; mode=block
Vary: Accept-Encoding
Server: tfe
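
In the response above, Last-Modified simply repeats Date, so it says nothing about when the tweet was created. A minimal sketch of that check, with header values copied from this deck (the function name is hypothetical):

```python
from email.utils import parsedate_to_datetime


def last_modified_is_informative(headers):
    """True only if Last-Modified is present and earlier than the response Date."""
    last_modified = headers.get("Last-Modified")
    date = headers.get("Date")
    if last_modified is None:
        return False
    if date is None:
        return True
    return parsedate_to_datetime(last_modified) < parsedate_to_datetime(date)


TWEET = {
    "Date": "Fri, 29 Jun 2012 15:50:55 GMT",
    "Last-Modified": "Fri, 29 Jun 2012 15:50:55 GMT",
}
IMAGE = {
    "Date": "Mon, 19 Jul 2010 21:41:04 GMT",
    "Last-Modified": "Thu, 18 Jun 2009 16:25:54 GMT",
}

print("tweet:", last_modified_is_informative(TWEET))
print("image:", last_modified_is_informative(IMAGE))
```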


Richer APIs for Archives
<?xml version="1.0"?>
<URLEnvelope>
  <URL text="http://www.example.org" />
  <outlinks>
    <timestamp value="20122020202">
      <olink>
        <href>http://www.anotherexample.org</href>
        <atext>click here</atext>
        <window>In the following example click here we show you example</window>
      </olink>
      <olink>
        <href>http://www.anotherexample2.org</href>
        <atext>click here2</atext>
        <window>In the following example click 2 we show you example</window>
      </olink>
    </timestamp>
  </outlinks>
  <inlinks>
    <ilink>
      <href>http://www.myexample.org</href>
      <atext>Good Example</atext>
      <window>In the following example click here we show you interesting example</window>
      <timestamp value="20122020202"/>
    </ilink>
    <ilink>
      <href>http://www.myexample.org</href>
      <atext>Good Example</atext>
      <window>In the following example click here we show you interesting example</window>
      <timestamp value="201240402042"/>
    </ilink>
  </inlinks>
</URLEnvelope>

outlinks, with context, datestamps
inlinks, with context, datestamps
possible now, but there is a bootstrapping problem of proving value to the archive
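
A client of such an API might consume the envelope roughly as sketched below. This is an illustration only: the element names follow the proposal above, the embedded document is a shortened copy of it, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

ENVELOPE = """<URLEnvelope>
  <URL text="http://www.example.org" />
  <outlinks>
    <timestamp value="20122020202">
      <olink>
        <href>http://www.anotherexample.org</href>
        <atext>click here</atext>
        <window>In the following example click here we show you example</window>
      </olink>
    </timestamp>
  </outlinks>
  <inlinks>
    <ilink>
      <href>http://www.myexample.org</href>
      <atext>Good Example</atext>
      <window>In the following example click here we show you interesting example</window>
      <timestamp value="20122020202"/>
    </ilink>
  </inlinks>
</URLEnvelope>"""


def outlinks(root):
    """Yield (timestamp, href, anchor text) for each outlink, grouped under a timestamp."""
    for stamp in root.findall("outlinks/timestamp"):
        for link in stamp.findall("olink"):
            yield stamp.get("value"), link.findtext("href"), link.findtext("atext")


def inlinks(root):
    """Yield (timestamp, href, anchor text) for each inlink, which carries its own timestamp."""
    for link in root.findall("inlinks/ilink"):
        stamp = link.find("timestamp")
        value = stamp.get("value") if stamp is not None else None
        yield value, link.findtext("href"), link.findtext("atext")


root = ET.fromstring(ENVELOPE)
out_rows = list(outlinks(root))
in_rows = list(inlinks(root))
print("URL:", root.find("URL").get("text"))
print("outlinks:", out_rows)
print("inlinks:", in_rows)
```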


Closing thoughts.


Open Problem: Compelling Applications

• UI / usage idioms
– remember "lost in hyperspace"?
• What is the right metaphor?
– VCR controls through versions?
– "Track Changes" controls?
• Search like it's 1999
– yeah, we all want search for the Wayback Machine, but like
a dog chasing a truck…
• If you build the archive APIs, will the archive-based
mashups come?

A lot of times, people don't know what they want until you show it to them.
-- May 25, 1998)


Open Problem: Conceptual Gap

• What archives offer: access by URI +
timestamp
– "what did cnn.com look like on May 31, 2007?"
• What users want: concepts through time
– "how has public opinion about the `affordable
health care act' changed through time?"
• hint: tag clouds aren't enough
• to answer this question, you would likely have to find a
current page that talks about the past


Open Problem: Archival Authenticity

• Right now, we just implicitly trust Brewster and
everyone at the IA
• The only reason the politicians/pundits in previous
examples didn't cheat is because they didn't know it
was an option
– black hat archives
• What happens when there are multiple archives and
they disagree?
– spam archives?
– "soft 404s"?
– resolving archival disputes?
• esp. if different archives can legitimately see different
representations for the same URIs?


Open Problem: Monetizing The Archive

• Until $ can be made, archives will labor in the
shadows
• OTOH, without monetization archives are
relatively free of spammers, lawyers, and
other predators

Rogers' Innovation Curve
http://en.wikipedia.org/wiki/Diffusion_of_innovations


Five Easy Pieces

Preservation not for privileged priesthood
no more hoary stories about format obsolescence:
http://blog.dshr.org/2010/09/reinforcing-my-point.html
http://doi.acm.org/10.1145/1592761.1592794
http://booktwo.org/notebook/wikipedia-historiography/

archiving as branded service,
not infrastructure
http://blog.dshr.org/2010/06/jcdl-2010-keynote.html

Don't desiccate resources; leave them on the web
Endless metadata is not preservation…
http://www.dlib.org/dlib/december05/12contents.html
[too many to list]

