Filling in the Blanks:
Capturing Dynamically Generated Content

Justin F. Brunelle
Old Dominion University
Advisor: Dr. Michael L. Nelson

JCDL ’12 Doctoral Consortium
                        06/10/2012


Problem!
• Which exists in the archive?
  – Probably the unauthenticated version, right?
• What factors created “my” representation?
  – Can I archive “my” representation?
• Am I seeing undead resources?
  – Mix of live and archived content?
• How can we capture, share, and
  archive user experiences?

Which version is in the Internet Archive?




Which version is in WebCite?




Craigslist.org
$ curl -I -L http://www.craigslist.org
HTTP/1.1 302 Found
Set-Cookie: …
Location: http://geo.craigslist.org/

HTTP/1.1 302 Found
Content-Type: text/html; charset=iso-8859-1
Connection: close
Location: http://norfolk.craigslist.org
Date: Thu, 31 May 2012 23:26:27 GMT
Set-Cookie: …
Server: Apache

HTTP/1.1 200 OK
Connection: close
Cache-Control: max-age=3600, public
Last-Modified: Thu, 31 May 2012 23:13:46 GMT
Set-Cookie: …
Transfer-Encoding: chunked
Date: Thu, 31 May 2012 23:13:46 GMT
Vary: Accept-Encoding
Content-Type: text/html; charset=iso-8859-1
X-Frame-Options: Allow-From https://forums.craigslist.org
Expires: Fri, 01 Jun 2012 00:13:46 GMT
Server: Apache
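The chain above is why which copy lands in an archive depends on who crawled the page: each client follows Location headers until it reaches a 200 OK, and the GeoIP redirect picks a different final URI per location. A minimal sketch of that resolution; the response table is a hypothetical stand-in for live HTTP traffic, modeled on the curl output above:

```python
def resolve_chain(start_uri, responses):
    """Follow 3xx Location headers until a non-redirect response is reached."""
    uri = start_uri
    chain = [uri]
    while responses[uri]["status"] in (301, 302):
        uri = responses[uri]["location"]
        chain.append(uri)
    return chain

# What a Norfolk client sees; a crawler elsewhere would be redirected to a
# different regional host (assumption for illustration, not measured data).
norfolk_view = {
    "http://www.craigslist.org": {"status": 302, "location": "http://geo.craigslist.org/"},
    "http://geo.craigslist.org/": {"status": 302, "location": "http://norfolk.craigslist.org"},
    "http://norfolk.craigslist.org": {"status": 200},
}

print(resolve_chain("http://www.craigslist.org", norfolk_view)[-1])
# -> http://norfolk.craigslist.org
```

The archived representation is whichever endpoint the *archive's* crawler was redirected to, not the one a given reader would have reached.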
Live Resource
Accessed from Norfolk




Archived Resource
      Submitted from Norfolk
• Submitted to WebCite from Norfolk




Live Norfolk Interactive Mapper




http://gisapp2.norfolk.gov/interactive_mapper/viewer.htm
Archived Norfolk Interactive Mapper




http://web.archive.org/web/20100924020604/http://gisapp2.norfolk.gov/interactive_mapper/viewer.htm
Web 2.0
• Crawlers aren’t enough
• Relies on interaction/personalization
• Users may want to archive personal
  content
• How do we capture user experiences?
  – Justin’s vs. Dr. Nelson’s experience? Both?
• What about sharing browsing sessions?

Things are better
          (but really worse)
• Better UI, worse archiving
• HTML5
• JavaScript
  – document.write
• Cookies
• User Interaction
• GeoIP

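The document.write problem can be seen in miniature: a crawler archives the page source verbatim, while a browser executes the script and produces content that exists only in the rendered DOM. A sketch using a hypothetical page, simulating only the single document.write call:

```python
import re

# Hypothetical page source, as a crawler would fetch and archive it.
page_source = """<html><body>
<script>document.write("<p>Hello from " + "JavaScript</p>");</script>
</body></html>"""

# The crawler's copy never contains the written element...
assert "<p>Hello from JavaScript</p>" not in page_source

# ...but a browser executes the script. We simulate just this one call by
# evaluating the argument of document.write (safe only because we wrote it).
match = re.search(r'document\.write\((.+?)\);', page_source)
rendered = eval(match.group(1))
assert rendered == "<p>Hello from JavaScript</p>"
```

A real page may run arbitrary script against cookies, geolocation, and Ajax responses, so no static copy of the source can reproduce the rendered representation.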
Traditional Representation Generation

[Diagram: a URI identifies a Resource; dereferencing the URI yields a Representation, which represents the Resource.]

From W3C Web Architecture
Representation through Content Negotiation

[Diagram: a URI identifies a Resource; dereferencing the URI negotiates which Representation is returned.]

From W3C Web Architecture
Web 2.0 Representation Generation

[Diagram: a URI identifies a Resource; dereferencing it returns client-side script, which combines with user interaction to produce the Representation.]
Prior Work
• Capture for Debugging and Security
  – Mickens, 2010; Livshits, 2007, 2009, 2010; Dhawan, 2009
• Crawlers
  – Mesbah, 2008; Duda, 2008; Lowet, 2009
• Caching dynamic content
  – Benson, 2010; Karri, 2009; Boulos, 2010; Periyapatna,
    2009; Sivasubramanian, 2007
  – Walden’s Paths
  – http://www.csdl.tamu.edu/walden/
• IIPC Workshop 2012: Archiving the Future Web
  – http://blog.dshr.org/2012/05/harvesting-and-preserving-future-web.html
Two Current Solutions
• Browser-based crawling
  – Problematic at scale, misses post-render content, no
    session spanning, misses “personal” browsing
  – IA
  – To be released – Heritrix 3.X
• Transactional Web Archiving
  – Impact/depth is unknown, client-side changes are
    missed, must have server/content author buy-in
  – LANL
  – http://theresourcedepot.org/
What can Justin do about it?
• How can we capture THE user
  experience?
  – How much user-shared content is archivable?
  – What defines a dynamic representation?
     • Infinitely Changing?
  – How much dynamic content are archives missing?
  – What tools are required to capture the
    representation?
     • Browser Add-on?
  – How much will users contribute to the archives?
• Is this even possible?
Characteristics of a Potential Solution
• Browser Add-on
• Crowd-sourced
  – User contributions to archives
• Opt-in representation archiving/sharing
• Capture client-side DOM
  – JS, HTML, representation, etc.
• Capture client-side events and resulting
  DOM
  – Includes Ajax and post-render data
• Package and submit to archives
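The capture format such an add-on might package can be sketched as an event log: each user event is stored with the Ajax responses and the post-event DOM it produced. All field names here are hypothetical, not an existing archive API:

```python
import json
from datetime import datetime, timezone

def record_event(log, event_type, dom_snapshot, ajax_responses=()):
    """Append one captured interaction: the event, the Ajax traffic it
    triggered, and the serialized DOM that resulted."""
    log.append({
        "utc": datetime.now(timezone.utc).isoformat(),
        "event": event_type,
        "dom": dom_snapshot,           # serialized post-event DOM
        "ajax": list(ajax_responses),  # XHR payloads seen during the event
    })

session = []
record_event(session, "load", "<html>…initial…</html>")
record_event(session, "doubleclick", "<html>…zoomed…</html>", ["<xml>tiles</xml>"])

# Package the session for submission to an archive.
package = json.dumps(session)
assert json.loads(package)[1]["event"] == "doubleclick"
```

Replaying the log in order would reconstruct each intermediate representation, which a one-shot crawl cannot.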
Dissertation Plan

BEGIN
Background Research
Coursework
Quals
Current State: Prevalence of Unarchivable Resources
Define test datasets (set of dynamic and static test pages)
Define factors/equations of dynamic representations -- what dynamic content can (and cannot) be captured for archiving?
Construction of software solution -- VCR for the Web: Record, Rewind, Replay
Analysis of improved capture -- client-side (human-assisted) capture vs. traditional crawlers vs. headless clients
Explore how personalized archives can be combined with public web archives
PhD Defense
Current Work:
   How much can we archive?
• Sample from Bit.ly URIs from Twitter
• Load page in each environment:
  – Live
  – 3rd Party Archived
     • Submit and load from WebCitation
  – Locally stored
      • wget -k -p and load from local drive
  – Local only
     • Load from local drive – no Internet access
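One way to score each environment is a set difference over the resources the page load requests: whatever the degraded environment fails to fetch is what the archive is missing. A sketch with placeholder URIs (not the actual study data):

```python
# Resources requested by the live page load (placeholder URIs).
live_resources = {f"http://example.com/r{i}" for i in range(78)}

# Resources successfully loaded in the degraded environment; two script/CSS
# files fail, standing in for post-render or query-string-dependent fetches.
local_only_resources = live_resources - {
    "http://example.com/r3",
    "http://example.com/r7",
}

missing = live_resources - local_only_resources
fraction_missing = len(missing) / len(live_resources)
print(f"{len(missing)}/{len(live_resources)} resources missing")
```

Repeating this per environment (live, third-party archived, locally stored, local-only) gives a comparable completeness measure for each.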
Live
http://dctheatrescene.com/2009/06/11/toxic-avengers-john-rando/




Archived (WebCite)
http://www.webcitation.org/685EYfYEK




Locally Stored
http://localhost/dctheatrescene.com/2009/06/11/toxic-avengers-john-rando/




Local Only
          (No Internet)
    http://localhost/dctheatrescene.com/2009/06/11/toxic-avengers-john-rando/



• Missing: 12/78 resources without Internet access
  – dctheatrescene.com/…/uperfish.args.js?e83a2c
  – dctheatrescene.com/…/css/datatables.css?ver=1.9.3
• Small files, big impact




Thought Experiment




Double Click 4x




Click and drag to left




Submit to Archive




Future Research Questions
• What dynamism can (and cannot) be
  captured for archiving?
• Client-side Archiving: Client-side Capture vs.
  Traditional Crawlers
• Client-side contributions to Web Archives:
  Archiving User Experiences




Conclusion
• Is dynamic content
  archivable?
• How much are we
  missing?
• Can you archive
  your experience?
    • For the betterment
      of archives
    • For personal
      capture
Backups




References
•   J. Mickens, J. Elson, and J. Howell. Mugshot: deterministic capture and replay for
    JavaScript applications. In Proceedings of the 7th USENIX conference on Networked
    systems design and implementation, NSDI'10, pages 11-11, Berkeley, CA, USA, 2010.
    USENIX Association.
•   K. Vikram, A. Prateek, and B. Livshits. Ripley: Automatically securing web 2.0 applications
    through replicated execution. In Proceedings of the Conference on Computer and
    Communications Security, November 2009.
•   E. Kiciman and B. Livshits. AjaxScope: A platform for remotely monitoring the client-side
    behavior of web 2.0 applications. In the 21st ACM Symposium on Operating Systems
    Principles (SOSP'07), SOSP '07, 2007.
•   B. Livshits and S. Guarnieri. Gulfstream: Incremental static analysis for streaming
    JavaScript applications. Technical Report MSR-TR-2010-4, Microsoft, January 2010.
•   M. Dhawan and V. Ganapathy. Analyzing information flow in JavaScript-based browser
    extensions. Annual Computer Security Applications Conference, pages 382 - 391, 2009.
•   A. Mesbah, E. Bozdag, and A. van Deursen. Crawling Ajax by inferring user interface state
    changes. In Web Engineering, 2008. ICWE '08. Eighth International Conference on, pages
    122-134, July 2008.
•   C. Duda, G. Frey, D. Kossmann, and C. Zhou. AjaxSearch: crawling, indexing and
    searching Web 2.0 applications. Proc. VLDB Endow., 1:1440-1443, August 2008.
•   D. Lowet and D. Goergen. Co-browsing dynamic web pages. In WWW, pages 941-950, 2009.
References
•   S. Chakrabarti, S. Srivastava, M. Subramanyam, and M. Tiwari. Memex: A browsing
    assistant for collaborative archiving and mining of surf trails. In Proceedings of the 26th
    VLDB Conference, 2000.
•   R. Karri. Client-side page element web-caching, 2009.
•   E. Benson, A. Marcus, D. R. Karger, and S. Madden. Sync Kit: a persistent client-side
    database caching toolkit for data intensive websites. In WWW, pages 121-130, 2010.
•   M. N. K. Boulos, J. Gong, P. Yue, and J. Y. Warren. Web GIS in practice VIII: HTML5 and the
    canvas element for interactive online mapping. International Journal of Health Geographics,
    March 2010.
•   S. Periyapatna. Total recall for Ajax applications firefox extension, 2009.
•   S. Sivasubramanian, G. Pierre, M. van Steen, and G. Alonso. Analysis of caching and
    replication strategies for web applications. IEEE Internet Computing, 11:60-66, 2007.




Web Archives
• “Web archiving is the process of
  collecting portions of the World Wide
  Web and ensuring the collection
  is preserved … for future researchers,
  historians, and the public.”
  -- http://en.wikipedia.org/wiki/Web_archiving


What does this have to do with
                 DLs?
•   Improved coverage
•   NARA regulation
•   Improved “memory”
•   Gathers missing User Experiences
    – Or at least an adequate sub-sample




Envisioned Solution

SELECT PREVIOUS REPRESENTATION TO ARCHIVE:

[Mockup: a timeline of three captured states, each pairing a user event with the Ajax event it triggered --
User Event: Text Entered / Ajax Event: XMLResponse;
User Event: Double Click / Ajax Event: XMLResponse;
User Event: Text Entered, Button Push / Ajax Event: XMLResponse]
Google Maps




Current Web Applications




Web Applications with Session Archiver





