Archive Quality
• SHARC, Quality Conscious Archiving (Spaniol, 2009)
• Quality of archives (Spaniol, 2009, 2009)
• Archive...
Monitoring for Security
• Ripley (Vikram, 2009)
• Mugshot (Mickens, 2010)
• ActionShot (Li, 2010)
• Ajax testing and state...
Publications
Master’s:
• Kyle Dempsey, Justin Brunelle, G. Tanner Jackson, Chutima Boonthum, Irwin
Levinstein, Danielle Mc...
Performance with classifier
121
Mobile Sites in the Archives
122
http://m.espn.go.com/wireless/http://espn.go.com/
“A Method for Identifying Personalized ...
Mobile Sites in the Archives
123
http://m.espn.go.com/wireless/http://espn.go.com/
URI-M:
http://web.archive.org/web/20140...
Collisions in the Archives
124
http://www.cnn.com/
URI-M? URI-T?
http://web.archive.org/web/[DATETIME]/http://www.cnn.com/
Need a better way to index mementos
• URI-R is no longer enough
• Environmental factors:
‒ Content negotiation
‒ Interacti...
Content Negotiation
 Server-side
interpretation of
client-provided
parameters
 Multiple
representations,
single resource...
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations

Justin F. Brunelle's dissertation defense slides.

  1. 1. Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations. Justin F. Brunelle. Dissertation Defense, February 5, 2016. Committee Members: Michael L. Nelson, Michele C. Weigle, Elizabeth J. Vincelette, Irwin B. Levinstein
  2. 2. A simpler time…
  3. 3. Mass hysteria. Human sacrifices. Dogs and cats living together. <iframe><script>…</script></iframe>
  4. 4. t
  5. 5. Missing resources (bad): 2008. Source: http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
  6. 6. Missing resources (bad) and temporal violations (worse): 2008 and 2012. Source: http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
  7. 7. Old ads are interesting
  8. 8. New ones are annoying… for now. “Why are your parents wrestling?”
  9. 9. Today’s ads are missing from the archives 9 http://adserver.adtechus.com/addyn/3.0/5399.1/2394397/0/- 1/QUANTCAST;;size=300x250;target=_blank;alias=p36- 17b4f9us2qmzc8bn;kvp36=p36-17b4f9us2qmzc8bn;sub1=p- 4UZr_j7rCm_Aj;kvl=172802;kvc=794676;kvs=300x250;kvi=c052a80 3d0b5476f0bd2f2043ef237e27cd48019;kva=p- 4UZr_j7rCm_Aj;rdclick=http://exch.quantserve.com/r?a=p- 4UZr_j7rCm_Aj;labels=_qc.clk,_click.adserver.rtb,_click.rand.85854; rtbip=192.184.64.144;rtbdata2=EAQaFUhSQmxvY2tfMjAxNlRheFNlY XNvbiCZiRcogsYKMLTAMDoSaHR0cDovL3d3dy5jbm4uY29tWihUUEh wYlUzM3ZqeFU5LTA1SGZEMk1SXzE0anBVcGU0d0dxTG10STFUdUs2I ECAAb_JicoFoAEBqAGhy7YCugEoVFBIcGJVMzN2anhVOS0wNUhmR DJNUl8xNGpwVXBlNHdHcUxtdEkxVMAB3ed3yAGUp7GUqSraAShjM DUyYTgwM2QwYjU0NzZmMGJkMmYyMDQzZWYyMzdlMjdjZDQ4M DE55QHvEWs- 6AFkmAK2wQqoAgWoAgawAgi6AgTAuECQwAICyAIA0ALe9baMj4Co s-oB
  10. 10. JavaScript is hard to replay. What happens when things are completely lost? http://ws-dl.blogspot.com/2013/11/2013-11-28-replaying-sopa-protest.html
  11. 11. Remember SOPA? And the protest? https://en.wikipedia.org/wiki/Stop_Online_Piracy_Act https://en.wikipedia.org/wiki/Protests_against_SOPA_and_PIPA
  12. 12. http://en.wikipedia.org/wiki/Main_Page January 18th, 2012
  13. 13. http://web.archive.org/web/20120118110520/http://en.wikipedia.org/wiki/Main_Page January 18th, 2012
  14. 14. Problem! The archives contain the Web as seen by crawlers
  15. 15. Why archive? The Internet Archive has everything! Why didn’t you back it up? Participating institutions can hand over their databases.
  16. 16. Crimean Conflict: Russian troops captured the Crimean Center for Investigative Journalism. Gunman: “We will try to agree on the correct truthful coverage of events.” http://gijn.org/2014/03/02/masked-gunmen-seize-crimean-investigative-journalism-center/
  17. 17. Archive-It to the rescue!
  18. 18. How? • Masked gunmen have your servers • Where are your backups? • Transactional archive? Too late! Preservation over HTTP
  19. 19. How? • Masked gunmen have your servers • Where are your backups? • Transactional archive? Too late! Preservation over HTTP
  20. 20. Any future discussion of the 21st century will involve the web and the web archives
  21. 21. Any future discussion of the 21st century will involve the web and the web archives. But JavaScript is hard to archive, resulting in archives of content as seen by crawlers rather than as seen by users
  22. 22. Any future discussion of the 21st century will involve the web and the web archives. But JavaScript is hard to archive, resulting in archives of content as seen by crawlers rather than as seen by users. Goal: Mitigate the impact of JavaScript on the archives by making crawlers behave like users
  23. 23. Outline: Motivating Examples, Background Information, Research Questions, Measuring the Impact of JavaScript, Measuring Memento Quality, Crawling Deferred Representations, Future Work, Conclusions
  24. 24. Some Institutional Archives
  25. 25. Some Page-at-a-time Archivers
  26. 26. Some Archival Tools 1: http://warcreate.com/ 2: http://matkelly.com/wail/
  27. 27. Memento Framework: machine-readable bidirectional link between the past and present web. http://mementoweb.org/guide/rfc/
  28. 28. (no text)
  29. 29. (no text)
  30. 30. • URI-R: Original Resource Identifier (page on the live web) • URI-M: Memento Identifier (archived version of a page) • URI-T: TimeMap Identifier (list of archived pages)
  31. 31. Web Architecture: Dereference a URI, get a representation
  32. 32. JavaScript makes requests for new resources after the initial page load. http://maps.google.com Identifies Represents
  33. 33. Deferred Representation. http://maps.google.com Identifies Represents
  34. 34. JavaScript != Deferred. Deferred: HTTP GET, HTTP GET, HTTP GET, HTTP GET; onload. Nondeferred: HTTP GET
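The deferred/nondeferred distinction on this slide can be illustrated with a minimal sketch. It assumes a request log in which each GET is marked with its issue time and whether JavaScript initiated it; the `Request` type, its fields, and `is_deferred` are hypothetical names, not part of any tool mentioned in the slides. A page that merely includes a script stays nondeferred unless the script requests resources after onload.

```python
from dataclasses import dataclass


@dataclass
class Request:
    url: str
    issued_ms: int     # hypothetical: when the GET was issued, relative to navigation start
    from_script: bool  # hypothetical: True if JavaScript (not the HTML parser) issued it


def is_deferred(requests, onload_ms):
    """Return True if any script-initiated GET is issued after the onload event."""
    return any(r.from_script and r.issued_ms > onload_ms for r in requests)


# A page that includes a script but loads nothing after onload.
nondeferred = [
    Request("http://example.org/", 0, False),
    Request("http://example.org/app.js", 40, False),
]
# The same page, plus a map tile requested by JavaScript after onload.
deferred = nondeferred + [Request("http://example.org/tiles/1.png", 900, True)]
```

Calling `is_deferred(deferred, onload_ms=500)` flags the second log only, which is the sense in which a `<script>` tag alone does not make a representation deferred.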
  35. 35. Web Browsing Process • User-controlled • Interaction • Environment variables → content negotiation • Client-controlled representation changes. Steps: HTTP GET Request for Resource R → HTTP 200 OK Response: R Content → Browser renders and displays R → JavaScript requests embedded resources → Server returns embedded resources → R updates its representation
  36. 36. Web Browsing Process: There is no longer “the” representation. At any given time, users get “a” representation. GeoIP: Washington, D.C., URI-R: http://www.wunderground.com/ vs. GeoIP: Suffolk, VA, URI-R: http://www.wunderground.com/
  37. 37. The Internet Archive got everything, right?
  38. 38. Missing tiles, not interactive
  39. 39. Web Browsing Process: HTTP GET Request for Resource R → HTTP 200 OK Response: R Content → Browser renders and displays R → JavaScript requests embedded resources → Server returns embedded resources → R updates its representation. Archival Tools stop here
  40. 40. Web Browsing Process: HTTP GET Request for Resource R → HTTP 200 OK Response: R Content → Browser renders and displays R → JavaScript requests embedded resources → Server returns embedded resources → R updates its representation. Archival Tools stop here
  41. 41. Web Browsing Process: HTTP GET Request for Resource R → HTTP 200 OK Response: R Content → Browser renders and displays R → JavaScript requests embedded resources → Server returns embedded resources → R updates its representation. Archival Tools stop here. Still not solved!
  42. 42. Outline: Motivating Examples, Background Information, Research Questions, Measuring the Impact of JavaScript, Measuring Memento Quality, Crawling Deferred Representations, Future Work, Conclusions
  43. 43. Research Questions RQ1. To what extent does JavaScript impact archival tools? RQ2. How do we measure memento quality? RQ3. How can we crawl, archive, and play back deferred representations?
  44. 44. Outline: Motivating Examples, Background Information, Research Questions, Measuring the Impact of JavaScript, Measuring Memento Quality, Crawling Deferred Representations, Future Work, Conclusions. Years shown: 2015, 2013
  45. 45. Zombies! 2008 2012
  46. 46. Measuring JavaScript • 1,000 URIs from Twitter • 1,000 URIs from Archive-It. Dataset available at http://www.cs.odu.edu/~jbrunelle/jsDataSet.txt • Capture with tools • Study the archivability. “The impact of JavaScript on archivability”, 2015, International Journal of Digital Libraries
  47. 47. Good
  48. 48. Good
  49. 49. Good
  50. 50. Meh
  51. 51. Meh
  52. 52. Bad
  53. 53. Bad
  54. 54. Bad
  55. 55. Bad
  56. 56. Bad
  57. 57. Leakage by archival tool: Twitter has more leakage than Archive-It
  58. 58. Leakage by archival tool: Wayback reduces leakage the most
  59. 59. Leakage -> Zombies: 12% increase in embedded mementos loaded via JavaScript
  60. 60. Leakage increasing over time: Increased JavaScript -> increases in missing embedded resources
  61. 61. Leakage increasing over time • 73.1% of all missing embedded mementos are loaded via JavaScript • 33% increase in missing embedded mementos from JavaScript between 2005 and 2012
  62. 62. Outline: Motivating Examples, Background Information, Research Questions, Measuring the Impact of JavaScript, Measuring Memento Quality, Crawling Deferred Representations, Future Work, Conclusions. Years shown: 2015, 2014
  63. 63. “Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014, International Journal of Digital Libraries, 2015. VS.
  64. 64. “Live” XKCD • Missing 17% of embedded resources • Looks complete
  65. 65. “Live” XKCD • Take three resources: Logo, Main Comic, Navigation Strip • Relative importance? • All present in “Live” XKCD
  66. 66. Damaging XKCD • Created a local memento • Removed the logo and navigation strip • Now missing 29% of embedded resources • Human assessment: looks OK
  67. 67. Damaging XKCD • From our local memento • Removed the Main Comic • Now missing 24% of embedded resources • Human assessment: Not a usable memento
  68. 68. Damaging XKCD • From our local memento • Removed the Main Comic • Now missing 24% of embedded resources • Human assessment: Not a usable memento • Percent of missing embedded resources is not a suitable metric for memento quality
  69. 69. Image Importance • Size (as percentage of all pixels)
  70. 70. Image Importance • Size • Position (in viewport?)
  71. 71. Image Importance • Size • Position • Centrality (in the vertical or horizontal center?)
  72. 72. Missing CSS • More important than thought • Calculated the amount of content in each vertical third • If >=80% of the content is in the left third and CSS is missing, the CSS is considered important • Only performed if stylesheets are missing
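The vertical-thirds check for missing CSS can be sketched as follows. This is only an illustration under stated assumptions: content is measured as the pixel area of element boxes, each box is assigned to the third containing its horizontal center, and the function names and the `(x, width, height)` box format are hypothetical. The slides give only the 80% threshold and the idea of three vertical thirds.

```python
def content_share_by_third(boxes, page_width):
    """Share of content area in the left, middle, and right vertical thirds.

    boxes: list of (x, width, height) rectangles in pixels (assumed format).
    """
    areas = [0.0, 0.0, 0.0]
    for x, width, height in boxes:
        center = x + width / 2
        third = min(2, int(3 * center / page_width))  # clamp the right edge into the last third
        areas[third] += width * height
    total = sum(areas)
    return [a / total for a in areas] if total else areas


def css_is_important(boxes, page_width, css_missing, threshold=0.8):
    """Apply the check only when stylesheets are missing, as on the slide."""
    if not css_missing:
        return False
    return content_share_by_third(boxes, page_width)[0] >= threshold


# An unstyled page tends to stack everything against the left edge.
collapsed_page = [(0, 300, 900), (700, 100, 100)]
```

With `page_width=900`, the collapsed page has well over 80% of its content area in the left third, so `css_is_important(collapsed_page, 900, css_missing=True)` is true.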
  73. 73. Methodology • Defined Dm and Mm metrics: $M_m = \frac{\text{Missing Embedded Resources}}{\text{All Embedded Resources}}$, $D_m = \frac{\sum_{i=1}^{n_{\text{missing resources}}} w_i}{\sum_{j=1}^{n_{\text{all resources}}} w_j}$ • Used Amazon Mechanical Turk workers to assess web user perception of quality • Assessed Dm versus Mm in manually damaged pages • Assessed Dm versus Mm in the archives
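A small sketch can make the difference between Mm and Dm concrete. The weights below are hypothetical (the slides do not give numeric weights); they simply assume that the main comic matters far more than the logo or navigation strip, in the spirit of the XKCD example.

```python
def percent_missing(resources):
    """Mm: missing embedded resources divided by all embedded resources."""
    missing = sum(1 for r in resources if r["missing"])
    return missing / len(resources)


def weighted_damage(resources):
    """Dm: summed weights of missing resources divided by summed weights of all resources."""
    total = sum(r["weight"] for r in resources)
    lost = sum(r["weight"] for r in resources if r["missing"])
    return lost / total


def page(missing_names):
    # Hypothetical importance weights for three embedded resources.
    weights = {"logo": 1, "main_comic": 8, "navigation_strip": 1}
    return [
        {"name": n, "weight": w, "missing": n in missing_names}
        for n, w in weights.items()
    ]


no_chrome = page({"logo", "navigation_strip"})  # looks OK to a human
no_comic = page({"main_comic"})                 # not a usable memento
```

Under these assumed weights, removing the logo and navigation strip gives a higher Mm (2/3) but a lower Dm (0.2) than removing only the main comic (Mm 1/3, Dm 0.8), which is the mismatch the slides describe between percent missing and perceived quality.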
  74. 74. Turk Results. Figure panels: Live vs Manually Damaged (Dm); Mementos from Internet Archive, Agreement with Dm; Mementos from Internet Archive, Agreement with Mm. Reference line: 50/50 Chance
  75. 75. Damage in the Archives (Internet Archive, WebCite): Mementos with deferred representations have a 13.5% higher damage rating
  76. 76. Outline: Motivating Examples, Background Information, Research Questions, Measuring the Impact of JavaScript, Measuring Memento Quality, Crawling Deferred Representations, Future Work, Conclusions. Years shown: 2015, 2016
  77. 77. Current Workflow • Dereference URI-Rs • Archive representation • Extract embedded URI-Rs • Repeat
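The four workflow steps above can be sketched as a simple frontier loop. This is a toy illustration, not Heritrix: the in-memory `WEB` dictionary stands in for the live web, and the `LINK` pattern and `crawl` function are hypothetical. It also shows why the workflow misses deferred content: a URI that appears only inside a script call is never extracted.

```python
import re
from collections import deque

# Hypothetical extractor: only URIs in src/href attributes are discovered.
LINK = re.compile(r'(?:src|href)="([^"]+)"')


def crawl(seeds, fetch):
    """Dereference each URI-R, archive the representation, extract embedded URI-Rs, repeat."""
    frontier = deque(seeds)
    archive = {}
    while frontier:
        uri_r = frontier.popleft()
        if uri_r in archive:
            continue
        body = fetch(uri_r)  # dereference the URI-R
        if body is None:
            continue
        archive[uri_r] = body  # archive the representation
        frontier.extend(LINK.findall(body))  # extract embedded URI-Rs
    return archive


# Toy stand-in for the live web (all URIs are made up).
WEB = {
    "http://example.org/": '<img src="http://example.org/logo.png">'
                           '<script>loadTile("http://example.org/tile1.png")</script>',
    "http://example.org/logo.png": "PNG",
    "http://example.org/tile1.png": "PNG",
}
archive = crawl(["http://example.org/"], WEB.get)
```

Here `archive` ends up with the page and its logo, but not `tile1.png`, because that URI only exists as an argument to a JavaScript call.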
  78. 78. Two-Tiered Crawling. “Archiving Deferred Representations Using a Two-Tiered Crawling Approach”, iPRES2015. “Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016
  79. 79. <script> tags alone are not indicative of a deferred representation. JavaScript can be played back in the archives! • Current workflow not suitable for deferred representations • Use PhantomJS to run JavaScript, interact with the representation • Two-tiered crawling approach to optimize performance
  80. 80. <script> tags alone are not indicative of a deferred representation. JavaScript can be played back in the archives! • Current workflow not suitable for deferred representations • Use PhantomJS to run JavaScript, interact with the representation • Two-tiered crawling approach to optimize performance • More URI-Rs in the crawl frontier • Runs more slowly but more deeply
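One possible reading of the two-tiered approach is a routing step: every URI-R goes through the fast crawler, and only representations flagged as deferred are sent on to the slower headless browser. The sketch below assumes that reading; `two_tier_crawl`, `looks_deferred`, and the two crawler callables are hypothetical stand-ins, not the Heritrix or PhantomJS APIs.

```python
def two_tier_crawl(uri_rs, crawl_fast, looks_deferred, crawl_headless):
    """Route each URI-R: tier 1 for everything, tier 2 only when flagged as deferred."""
    results = {}
    for uri_r in uri_rs:
        representation = crawl_fast(uri_r)  # tier 1: no JavaScript execution
        tier = 1
        if looks_deferred(representation):
            representation = crawl_headless(uri_r)  # tier 2: run JavaScript, interact
            tier = 2
        results[uri_r] = (tier, representation)
    return results


# Stubs for illustration only.
def fake_fast(uri_r):
    return "ajax-shell" if "maps" in uri_r else "static-html"


def fake_classifier(representation):
    return representation == "ajax-shell"


def fake_headless(uri_r):
    return "rendered-with-tiles"


routed = two_tier_crawl(
    ["http://example.org/about", "http://example.org/maps"],
    fake_fast, fake_classifier, fake_headless,
)
```

The design trade-off is the one stated on the slide: the second tier runs more slowly but more deeply, so it is reserved for representations that need it.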
  81. 81. Comparing Performance • Crawled 10,000 URI-Rs. Dataset available at http://www.cs.odu.edu/~jbrunelle/wsdl/10kuris.txt • Compare crawl speed & discovered frontier size • With and without classifier • Code available at https://github.com/jbrunelle/classifyDeferred/
  82. 82. Performance: Frontier Size. PhantomJS creates a 1.5x larger crawl frontier than Heritrix
  83. 83. Performance: Crawl Speed. Heritrix: ~2 URIs/second. PhantomJS: ~4 seconds/URI
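Taking the two approximate rates at face value gives a rough sense of scale for the 10,000 URI-R dataset. The sketch below is back-of-the-envelope arithmetic only: it ignores frontier growth and parallelism, and the `deferred_fraction` parameter is a hypothetical input, not a figure from the slides.

```python
HERITRIX_URIS_PER_SECOND = 2.0   # "~2 URIs/second" from the slide
PHANTOMJS_SECONDS_PER_URI = 4.0  # "~4 seconds/URI" from the slide


def heritrix_only_seconds(n_uris):
    return n_uris / HERITRIX_URIS_PER_SECOND


def phantomjs_only_seconds(n_uris):
    return n_uris * PHANTOMJS_SECONDS_PER_URI


def two_tier_seconds(n_uris, deferred_fraction):
    """Every URI-R passes tier 1; only the assumed deferred fraction also passes tier 2."""
    tier1 = heritrix_only_seconds(n_uris)
    tier2 = phantomjs_only_seconds(n_uris * deferred_fraction)
    return tier1 + tier2
```

For 10,000 URI-Rs this works out to about 5,000 seconds with Heritrix alone and about 40,000 seconds with PhantomJS alone, an 8x difference. If, say, a quarter of the URI-Rs were deferred, the two-tier estimate would be 15,000 seconds.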
  84. 84. Classifier: We are omitting a discussion about the classifier for deferred vs. nondeferred representations. Please see Section 7.4 in the dissertation for a detailed discussion
  85. 85. Descendants = States of deferred representations reached through client-side events (figure labels: Click, Pan, Zoom)
  86. 86. Crawling descendants • Interactions represented as N-ary tree G • FSM: $M = (S, s_0, \Sigma, \delta)$ ‒ $S$ is the finite set of client states ‒ $s_0 \in S$ is the initial state reached by dereferencing the URI-R and executing the initial onload events ‒ $e \in \Sigma$ defines the client-side event $e$ as a member of the set of all events $\Sigma$ ‒ $\delta : S \times \Sigma \to S$ is the transition function in which a client-side event is executed and leads to a new state. For $s_i, s_j \in S$: $\delta(s_i, e) = s_j$, where $e$ is a client-side event and $j = i + 1$. “Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016
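The state machine above can be explored with a breadth-first walk from s0, applying every event to every state up to a depth limit. This is a minimal sketch with made-up states and events; `crawl_descendants` and the `TRANSITIONS` table are hypothetical, and the default depth of 2 is an assumption chosen to mirror the finding that interaction trees are 2 levels deep.

```python
from collections import deque


def crawl_descendants(s0, events, delta, max_depth=2):
    """Return {state: level} for states reachable from s0 within max_depth events."""
    level = {s0: 0}
    queue = deque([s0])
    while queue:
        state = queue.popleft()
        if level[state] == max_depth:
            continue  # do not expand beyond the depth limit
        for event in events:
            nxt = delta(state, event)  # delta(s_i, e) = s_j
            if nxt is not None and nxt not in level:
                level[nxt] = level[state] + 1
                queue.append(nxt)
    return level


# Toy transition table: (state, event) -> new state. Missing pairs mean "no new state".
TRANSITIONS = {
    ("s0", "click"): "s1a",
    ("s0", "pan"): "s1b",
    ("s1a", "click"): "s2a",
    ("s2a", "click"): "s3a",
}
states = crawl_descendants(
    "s0", ["click", "pan", "zoom"], lambda s, e: TRANSITIONS.get((s, e))
)
```

With the default limit, `states` contains s0 at level 0, two level-1 descendants, and one level-2 descendant; the level-3 state is never visited.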
  87. 87. Interaction Trees are 2 Levels Deep (states s0, s1, s2). http://www.bloomberg.com/bw/articles/2014-06-16/open-plan-offices-for-people-who-hate-open-plan-offices
  88. 88. Interaction Trees are 2 Levels Deep (states s0, s1, s2). http://www.bloomberg.com/bw/articles/2014-06-16/open-plan-offices-for-people-who-hate-open-plan-offices
  89. 89. Interaction Trees are 2 Levels Deep (states s0, s1, s2). http://www.bloomberg.com/bw/articles/2014-06-16/open-plan-offices-for-people-who-hate-open-plan-offices
  90. 90. Interaction Trees are 2 Levels Deep (states s0, s1, s2). http://www.bloomberg.com/bw/articles/2014-06-16/open-plan-offices-for-people-who-hate-open-plan-offices
  91. 91. Interaction Trees are 2 Levels Deep (states s0, s1, s2). http://www.bloomberg.com/bw/articles/2014-06-16/open-plan-offices-for-people-who-hate-open-plan-offices
  92. 92. Expanding the Crawl Frontier: Level s1 provides the greatest benefit to the crawl frontier (series: Nondeferred, Deferred)
  93. 93. Crawling Descendants: New embedded resources at level s1 are largely unarchived
  94. 94. Crawling Descendants: Level s1 has the highest cost-benefit return on investment
  95. 95. Storage Impact of Two-Tiered Crawling • IIPC-proposed JSON metadata of interactions, resulting descendants – Potentially used to resolve URI-M collisions – 16.5KB WARC metadata – 143MB for total dataset • 11.4 times larger for deferred vs nondeferred • Totals 5.12 times more storage per URI-R for total dataset 2013
  96. 96. Outline: Motivating Examples, Background Information, Research Questions, Measuring the Impact of JavaScript, Measuring Memento Quality, Crawling Deferred Representations, Future Work, Conclusions
  97. 97. Future Work • Modeling user interactions, tendencies, and simulation – Form filling – Click and navigation likelihood • Evaluating success of crawling deferred representations – Random walks through the archives – Dm vs Mm of mementos of deferred representations • Archival Halting Problem: How much is enough? – Mapping applications – How many pans and zooms get all the Norfolk, VA Google map tiles? – How many CNN.com pages get all the Google Ads? • Playing back WARCs with IIPC metadata of deferred representations and descendants
  98. 98. Outline: Motivating Examples, Background Information, Research Questions, Measuring the Impact of JavaScript, Measuring Memento Quality, Crawling Deferred Representations, Future Work, Conclusions
  99. 99. RQ1. To what extent does JavaScript impact archival tools? Contributions: • Defined and identified zombie resources • Adoption of JavaScript correlates with missing embedded resources in mementos • Defined deferred representations • Showed that deferred representations have reduced archivability. For more information, reference: 2012: ws-dl.blogspot.com; 2013: TPDL2013; 2015: iPRES2015; 2015: IJDL; 2015: IJDL; Section 4.3; Ch. 5; Ch. 2; Ch. 5
  100. 100. RQ2. How do we measure memento quality? Contributions: • Mm is not accurate (worse than coin-flip) • Created Dm metric • Dm is closer to user perception than Mm • Mementos of deferred representations have higher Dm than nondeferred representations. For more information, reference: 2015: JCDL2015; 2015: IJDL Special Issue; Ch. 6; Section 6.6
  101. 101. RQ3. How can we crawl, archive, and play back deferred representations? Contributions: • Defined a framework for archiving deferred representations • Showed that the framework will crawl more slowly but more thoroughly • Defined descendants, showed that they are 2 levels deep • Showed the storage impact of crawling descendants and deferred representations. For more information, reference: 2015: iPRES2015; 2016: arXiv:1601.05142; Ch. 7; Ch. 7
  102. 102. Summary • Measured the impact of JavaScript on the archives • Quantified damage caused by JavaScript • Measured the cost in time and space to archive JavaScript. Provides policy makers with information to make decisions regarding JavaScript handling in crawling and archiving. Quantified an intuitive understanding of crawling deferred representations at web scale
  103. 103. Backups
  104. 104. Publications
       Year | RQ | Venue | Abbreviated Title | Notes
       2012 | - | JCDL2012 Doctoral Consortium | Capturing Dynamic Web | -
       2013 | - | JCDL2013 | TimeMap Caching | -
       2013 | RQ1 | TPDL2013 | Archivability Over Time | -
       2013 | - | TPDL2013 | Transactional Archiving | -
       2013 | RQ1 | DLib Magazine 19(11/12) | Identifying Mementos | -
       2014 | RQ2 | JCDL2014 | Measuring Memento Damage | Best Student Paper
       2015 | RQ1 | International Journal of Digital Libraries | Measuring Impact of JavaScript | -
       2015 | RQ2 | International Journal of Digital Libraries | Measuring Memento Damage | JCDL2015 Special Issue
       2015 | - | JCDL2015 | Merging Mobile and Desktop | Best Poster
       2015 | RQ3 | iPRES2015 | Two-Tiered Crawling | -
       2016 | RQ3 | Technical Report, arXiv:1601.05142 | Hypercube Model for Archiving | -
       2016 | - | DLib Magazine 22(1/2) | Archiving Corporate Intranets | -
  105. 105. Publications
       • Justin F. Brunelle, “Filling in the Blanks: Capturing the Dynamic Web”, JCDL 2012 Doctoral Consortium
       • Justin F. Brunelle, Michael L. Nelson, “An Evaluation of Caching Policies for Memento TimeMaps”, JCDL 2013
       • Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson, “On the Change in Archivability of Websites Over Time”, TPDL 2013
       • Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, Herbert Van de Sompel, “Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool”, TPDL 2013
       • Mat Kelly, Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine, 19(11/12), 2013
       • Justin F. Brunelle, Mat Kelly, Hany SalahEldeen, Michele C. Weigle, and Michael L. Nelson, “Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014
       • Justin F. Brunelle, Mat Kelly, Michele C. Weigle, and Michael L. Nelson, “The impact of JavaScript on archivability”, IJDL, 2015
       • Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle, and Michael L. Nelson, “Mobile Mink: Merging Mobile and Desktop Archived Webs”, JCDL 2015
       • Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “Archiving Deferred Representations Using a Two-Tiered Crawling Approach”, iPRES2015
       • Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016
       • Justin F. Brunelle, Krista Ferrante, Eliot Wilczek, Michele C. Weigle, and Michael L. Nelson, “Leveraging Heritrix and the Wayback Machine on a corporate intranet: A case study on improving corporate archives”, D-Lib Magazine, 22(1/2), 2016
  106. 106. Mobile Mink: Merging Mobile and Desktop Archived Webs. Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle, Michael L. Nelson. More about Mobile Mink: http://bitly.com/MobileMink/. Desktop URIs are much more prevalent than their mobile counterparts in the archives because crawlers use desktop user-agent strings. Corresponding mobile URIs are archived less frequently even though their representations differ from their desktop counterparts. http://espn.go.com/ and http://m.espn.go.com/: same ESPN, different URIs, different HTML, different TimeMaps. Usage: • Browse to a URI-R • Potential content negotiation from user-agent • Access tool from the “Share” menu. MobileMink merges TimeMaps of http://espn.go.com & http://m.espn.go.com/. Desktop and mobile webs differ and the linkage between them is lost in the archives. Mobile Mink: • Discovers mobile and desktop URI-Rs • Uses Memento to get all available TimeMaps • Provides integrated TimeMap • Offers users ability to submit mobile and desktop URI-Rs to archives • Increases coverage of mobile URI-Rs in the archives. Acknowledgements: This work supported in part by the NEH HK-50181. This work was performed as part of Wesley Jordan’s mentorship at The MITRE Corporation. The author’s affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE’s concurrence with, or support for, the positions, opinions or viewpoints expressed by the author.
  107. 107. HTTP Request $ curl -i -v http://www.cs.odu.edu/ > GET / HTTP/1.1 > User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2 > Host: www.cs.odu.edu > Accept: */* > < HTTP/1.1 200 OK < Server: nginx < Date: Tue, 25 Mar 2014 23:42:38 GMT < Content-Type: text/html < Transfer-Encoding: chunked < Connection: keep-alive < 107
  108. 108. HTTP Response HTTP/1.1 200 OK Server: nginx Date: Tue, 25 Mar 2014 23:40:09 GMT Content-Type: text/html Transfer-Encoding: chunked Connection: keep-alive <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <!-- saved from url=(0036)http://www.cs.odu.edu/newcssite/new/ --> <!-- saved from url=(0019)http://sci.odu.edu/ --> <HTML xmlns:st1 = "urn:schemas-microsoft-com:office:smarttags"> <HEAD> <meta name="verify-v1" content="CXMn8RoyhZpl9fsKpbgxtiFw3kIdHD51r/ntbf1Rrcw=" > <TITLE>Department Of Computer Science</TITLE> 108
  109. 109. Client-side code modifies the DOM 109
  110. 110. Internet Archive URI-M http://web.archive.org/web/20140314130018/http://espn.go.com/ Archive Prefix: http://web.archive.org/web/ Memento-DateTime: 20140314130018 URI-R: http://espn.go.com/ 110
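As a rough sketch of the three-part structure on this slide, the following splits an Internet Archive URI-M into its archive prefix, Memento-Datetime, and URI-R. The regular expression and the function name `split_uri_m` are illustrative assumptions; the sketch only handles the plain 14-digit timestamp form shown here.

```python
import re
from datetime import datetime

# Assumed pattern: archive prefix, 14-digit timestamp, then the original URI-R.
URI_M_PATTERN = re.compile(
    r"^(?P<prefix>https?://web\.archive\.org/web/)"
    r"(?P<timestamp>\d{14})/"
    r"(?P<uri_r>.+)$"
)


def split_uri_m(uri_m):
    """Return (archive prefix, Memento-Datetime, URI-R) for a URI-M."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        raise ValueError("not a recognized Internet Archive URI-M: " + uri_m)
    memento_datetime = datetime.strptime(match.group("timestamp"), "%Y%m%d%H%M%S")
    return match.group("prefix"), memento_datetime, match.group("uri_r")


prefix, memento_datetime, uri_r = split_uri_m(
    "http://web.archive.org/web/20140314130018/http://espn.go.com/"
)
print(prefix)
print(memento_datetime.isoformat())
print(uri_r)
```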
  111. 111. Deferred Representations Representation is incomplete Client-side code execution completes the build of the representation 111
  112. 112. Web Browsing Process 112 Deferred representations
  113. 113. Percent Missing vs. Weighted Damage • $M_M$ = Percent of embedded resources missing: $M_M = \frac{\text{Embedded Resources Missing}}{\text{Total Embedded Resources}}$ • $D_M$ = Damage rating of missing embedded resources: $D_M = \frac{D_{M_{Actual}}}{D_{M_{Potential}}}$, with $D_{M_{Potential}} = \sum_{i=1}^{n_{[I|MM]}} \frac{D_{[I|MM]}(i)}{n_{[I|MM]}} + \sum_{i=1}^{n_{[C]}} \frac{D_{[C]}(i)}{n_{[C]}}$, where $I$ = Image, $MM$ = Multimedia, $C$ = CSS 113
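A small worked example may help contrast the two measures on this slide. It is a sketch under stated assumptions: the per-resource damage values are invented for illustration, the actual damage is supplied directly rather than derived (the slide does not show how it is computed), and the function names are hypothetical. Potential damage is taken, as on the slide, to be the mean damage of image and multimedia resources plus the mean damage of CSS resources.

```python
def percent_missing(missing_count, total_count):
    """M_M: fraction of embedded resources that are missing."""
    return missing_count / total_count


def mean(values):
    """Average of a list, or 0.0 when the list is empty."""
    return sum(values) / len(values) if values else 0.0


def potential_damage(image_multimedia_damage, css_damage):
    """D_M potential: mean image/multimedia damage plus mean CSS damage."""
    return mean(image_multimedia_damage) + mean(css_damage)


def damage_rating(actual_damage, potential):
    """D_M: actual damage as a share of potential damage."""
    return actual_damage / potential if potential else 0.0


# Invented per-resource damage values, for illustration only.
image_multimedia_damage = [0.5, 0.25, 0.75]
css_damage = [0.25, 0.75]

m_m = percent_missing(missing_count=2, total_count=5)
potential = potential_damage(image_multimedia_damage, css_damage)
d_m = damage_rating(actual_damage=0.25, potential=potential)  # assumed actual damage

print("M_M =", m_m)
print("D_M potential =", potential)
print("D_M =", d_m)
```

In this invented case 40% of the embedded resources are missing while the damage rating is 0.25, which shows how the two measures can diverge when the missing resources carry little weight.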
  114. 114. • Measured Internet Archive mementos • Damage generally improves over time • Despite missing more resources over time Damage in the Internet Archive 114
  115. 115. Expanding the crawl frontier 115 Click events lead to the most descendants
  116. 116. Related Work 116
  117. 117. Deep Web • Deferred=Deep (Bergman, 2001) • Mobile requires context (Schneider, 2013) • Static → Dynamic Web (Rosenthal, 2011)(IIPC, 2012) • Crawlers & deep Web (Ast, 2008) (B. He, 2007) (Y. He, 2013) • Google’s deep Web crawler (Madhavan, 2008) • Forms (Ntoulas, 2005) 117
  118. 118. Archive Quality • SHARC, Quality Conscious Archiving (Spaniol, 2009) • Quality of archives (Spaniol, 2009, 2009) • Archiveready (Banos, 2013, 2015) • Acid test (Kelly, 2014) • Block Importance (Ye, 2003) (Fersini, 2008) (Kohlschutter, 2010) 118
  119. 119. Monitoring for Security • Ripley (Vikram, 2009) • Mugshot (Mickens, 2010) • ActionShot (Li, 2010) • Ajax testing and states (Mesbah, 2007, 2008, 2009, 2009, 2012) • Crawling Ajax (Dincturk, 2013, 2014) 119
  120. 120. Publications Master’s: • Kyle Dempsey, Justin Brunelle, G. Tanner Jackson, Chutima Boonthum, Irwin Levinstein, Danielle McNamara. “MiBoard: Multiplayer Interactive Board Game”, AIED2009 • Justin F. Brunelle, Irwin B. Levinstein, Chutima Boonthum. “MiBoard: Metacognitive Training Through Gaming in iSTART”, 2009 VMASC Capstone Conference • Best paper in track • Justin F. Brunelle, Kyle B Dempsey, G. Tanner Jackson, Chutima Boonthum, Irwin B. Levinstein, Danielle S. McNamara. “MiBoard: Metacognitive Training Through Gaming”, SCiP2009 • Justin F. Brunelle, G. Tanner Jackson, Kyle Dempsey, Chutima Boonthum, Irwin B. Levinstein, Danielle S. McNamara. “Analysis of MiBoard as an iSTART Practice Tool”, FLAIRS-24, 2010 • Kyle Dempsey, G. Tanner Jackson, Justin Brunelle, Michael Rowe, Danielle McNamara. “MiBoard: Assessing Collaborative Learning Through Game-Based Practice”, FLAIRS-24, 2010 PhD: • Justin F. Brunelle “Filling in the Blanks: Capturing the Dynamic Web”, JCDL 2012 Doctoral Consortium • Justin F. Brunelle, Michael L. Nelson “An Evaluation of Caching Policies for Memento TimeMaps”, JCDL 2013 • Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson, “On the Change in Archivability of Websites Over Time”, TPDL 2013 • Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, Herbert Van de Sompel, “Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool”, TPDL 2013 • Mat Kelly, Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine, 19(11/12), 2013. • Justin F. Brunelle, Mat Kelly, Hany SalahEldeen, Michele C. Weigle, and Michael L. Nelson “Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014 • Best Student Paper, International Journal of Digital Libraries: JCDL2015 Special Issue • Justin F. Brunelle, Mat Kelly, Michele C. Weigle, and Michael L. Nelson “The impact of JavaScript on archivability”, 2015, IJDL • Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle, and Michael L. Nelson “Mobile Mink: Merging Mobile and Desktop Archived Webs”, JCDL 2015 • Best Poster • Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “Archiving Deferred Representations Using a Two-Tiered Crawling Approach”, iPRES2015 • Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016 • Justin F. Brunelle, Krista Ferrante, Eliot Wilczek, Michele C. Weigle, and Michael L. Nelson, “Leveraging Heritrix and the Wayback Machine on a corporate intranet: A case study on improving corporate archives”, D-Lib Magazine, 2016 120
  121. 121. Performance with classifier 121
  122. 122. Mobile Sites in the Archives 122 http://m.espn.go.com/wireless/ http://espn.go.com/ “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine, 2013
  123. 123. Mobile Sites in the Archives 123 http://m.espn.go.com/wireless/ http://espn.go.com/ URI-M: http://web.archive.org/web/20140330125315/http://espn.go.com/ URI-M: http://web.archive.org/web/20140330125414/http://m.espn.go.com/wireless/
  124. 124. Collisions in the Archives 124 http://www.cnn.com/ URI-M? URI-T? http://web.archive.org/web/[DATETIME]/http://www.cnn.com/
  125. 125. Need a better way to index mementos • URI-R is no longer enough • Environmental factors: ‒ Content negotiation ‒ Interaction ‒ Personalization ‒ GeoIP 125
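To illustrate why indexing on URI-R alone leads to collisions, here is a minimal sketch of a capture key that also records the environmental factors listed on this slide. The class `CaptureKey`, its field names and default values, and the sample captures are all hypothetical; nothing here reflects how any existing archive indexes its holdings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureKey:
    """Hypothetical index key: URI-R and datetime plus environmental factors."""
    uri_r: str
    memento_datetime: str
    user_agent_class: str = "desktop"    # content negotiation
    interaction: str = "none"            # client-side events replayed
    personalization: str = "anonymous"   # logged-in state, cookies
    geoip_region: str = "unknown"        # GeoIP-dependent content


# Two invented captures of the same URI-R at the same datetime.
captures = [
    (CaptureKey("http://www.cnn.com/", "20140330125315",
                user_agent_class="desktop"), "desktop HTML"),
    (CaptureKey("http://www.cnn.com/", "20140330125315",
                user_agent_class="mobile"), "mobile HTML"),
]

by_uri_and_datetime = {}
by_full_key = {}
for key, body in captures:
    # Conventional index: the second capture overwrites the first.
    by_uri_and_datetime[(key.uri_r, key.memento_datetime)] = body
    # Extended index: both captures stay addressable.
    by_full_key[key] = body

print(len(by_uri_and_datetime), "entry under URI-R + datetime")
print(len(by_full_key), "entries under the extended key")
```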
  126. 126. Content Negotiation • Server-side interpretation of client-provided parameters • Multiple representations, single resource [Figure: a URI identifies a resource; Representation 1 and Representation 2 both represent it; content negotiation on the user-agent selects the mobile or desktop representation] 126
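The server-side choice described on this slide can be sketched as a function that maps a client's User-Agent string to one of two representations of the same resource. The token list, the placeholder HTML bodies, and the function name `negotiate` are assumptions made for illustration; production servers use far more elaborate detection.

```python
# Hypothetical substrings that mark a mobile client.
MOBILE_TOKENS = ("Mobile", "iPhone", "Android")

# One resource, two representations (placeholder bodies).
REPRESENTATIONS = {
    "desktop": "<html><!-- desktop layout --></html>",
    "mobile": "<html><!-- mobile layout --></html>",
}


def negotiate(user_agent):
    """Pick a representation of the resource from the User-Agent header."""
    is_mobile = any(token in user_agent for token in MOBILE_TOKENS)
    kind = "mobile" if is_mobile else "desktop"
    return kind, REPRESENTATIONS[kind]


crawler_kind, _ = negotiate("curl/7.19.7 (x86_64-redhat-linux-gnu)")
phone_kind, _ = negotiate(
    "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) Mobile/11A465"
)
print(crawler_kind, phone_kind)
```

A crawler that always sends a desktop user-agent string would only ever receive the desktop branch, which is consistent with the deck's observation that mobile representations are archived less often.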
