This document proposes a framework for aggregating private and public web archives. It introduces two new entities: the Memento Meta Aggregator (MMA) and the Private Web Archive Adapter (PWAA). The MMA allows for dynamic inclusion of archives and recursive construction of archive sets. The PWAA regulates access to private web archives by authenticating requests and relaying results. This framework enables private archives to be included in aggregations while preserving privacy through access control and authentication via the PWAA.
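The core operation such an aggregator performs is merging the per-archive lists of mementos (archived copies) for one original URI into a single ordered TimeMap. A minimal offline sketch of that merge step, with hypothetical archive names and URI-Ms (real aggregators obtain these lists via the Memento HTTP API):

```python
from datetime import datetime, timezone

def aggregate_timemaps(timemaps):
    """Merge per-archive memento lists into one TimeMap,
    ordered by archival datetime; duplicate URI-Ms are kept once."""
    merged, seen = [], set()
    for archive, mementos in timemaps.items():
        for dt, urim in mementos:
            if urim not in seen:
                seen.add(urim)
                merged.append((dt, urim, archive))
    return sorted(merged, key=lambda m: m[0])

# Hypothetical responses from a public and a private archive
# for the same original URI
timemaps = {
    "public-archive": [
        (datetime(2010, 5, 1, tzinfo=timezone.utc),
         "http://public.example/web/2010/http://a.org/"),
    ],
    "private-archive": [
        (datetime(2012, 3, 9, tzinfo=timezone.utc),
         "http://private.example/web/2012/http://a.org/"),
    ],
}
for dt, urim, archive in aggregate_timemaps(timemaps):
    print(dt.date(), archive, urim)
```

In the proposed framework, the private-archive entries would only appear for requests the PWAA has authenticated; the merge itself is unchanged.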
This document introduces the Archival Acid Test, which evaluates how well web archiving tools archive modern webpages that use advanced HTML, JavaScript, and other web technologies. The test is divided into basic tests, JavaScript tests, and advanced features tests to assess different areas. Results show that archiving tools perform well on basic tests but struggle with dynamic content, asynchronous JavaScript, iframes, and other complex features. The goal of the Archival Acid Test is to create a standardized, publicly available way to evaluate how completely archiving tools archive modern webpages and identify areas for improvement.
Facilitation of the A Posteriori Replication of Web Published Satellite Imagery - Mat Kelly
The document proposes using ResourceSync, BitTorrent, and WebRTC to facilitate the a posteriori replication of satellite imagery published on NASA web servers. It describes using a crawler to discover imagery resources and produce metadata, which is then used by adapter software to invoke a BitTorrent-based distribution of image payloads to users. The approach was constructed as a proof-of-concept to distribute data and mitigate reliance on NASA servers as the single source. Evaluation showed it was effective but temporally expensive, and future work could better integrate ResourceSync and utilize the YAML metadata.
Presented by Michele C. Weigle, June 4, 2015
Columbia University Web Archiving Collaboration: New Tools and Models
Work by Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson
Digital Collection Management with CONTENTdm and Omeka - Gena Chattin
This document compares two digital content management systems: CONTENTdm and Omeka. Both systems have costs and require backup and preservation. CONTENTdm is hosted, while Omeka can be hosted or run in-house. CONTENTdm has content and design limits as a hosted system. Omeka has more flexibility but requires local maintenance. The document provides detailed comparisons of features and functionality between the two systems. It concludes that the best system depends on an organization's storage needs, customization requirements, data standards, resources, and budget.
Jabes 2008 - Conférence inaugurale, la grande révélation : penser les ressour... - ABES
Jabes 2008 inaugural lecture, the great revelation: thinking about the library's resources at the scale of the web - Lorcan Dempsey, as part of the Journées Abes 2008.
This document provides an overview of the ResourceSync framework for synchronizing web resources between a source and destinations. It describes the key capabilities a source can provide, including describing available content through resource lists and dumps, describing changes through change lists and dumps, and archiving capability documents. Destinations need baseline and incremental synchronization, and the ability to audit synchronization status. Use cases demonstrate the need for high-volume, low-latency synchronization between sources like arXiv and DBpedia. The framework supports modular capabilities that destinations can use selectively for efficient synchronization aligned with web standards.
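A resource list, the most basic of these capabilities, is a sitemap `<urlset>` extended with a ResourceSync metadata element declaring the capability. A minimal sketch of generating one (the source URL is a placeholder):

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"

def resource_list(resources):
    """Build a minimal ResourceSync Resource List: a sitemap <urlset>
    whose <rs:md> element declares the 'resourcelist' capability."""
    ET.register_namespace("", SM)
    ET.register_namespace("rs", RS)
    urlset = ET.Element(f"{{{SM}}}urlset")
    ET.SubElement(urlset, f"{{{RS}}}md", capability="resourcelist")
    for loc, lastmod in resources:
        url = ET.SubElement(urlset, f"{{{SM}}}url")
        ET.SubElement(url, f"{{{SM}}}loc").text = loc
        ET.SubElement(url, f"{{{SM}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

doc = resource_list([("http://source.example/res1", "2017-01-03")])
```

A change list has the same sitemap shape with `capability="changelist"` and per-resource change metadata, which is what makes incremental synchronization possible.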
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a... - Martin Klein
This document provides an overview of ResourceSync, which is a framework for synchronizing web resources between systems. Some key points:
- ResourceSync was created to address limitations of existing protocols like OAI-PMH by allowing synchronization of any web resource and enabling both one-time and ongoing synchronization.
- It supports various capabilities for synchronization like resource lists, change lists, and notifications. These can be used for initial synchronization or incremental updates.
- Real-world examples are described where ResourceSync has been implemented for projects involving aggregation of digital collections, like Europeana and CLARIAH. It facilitates synchronization between diverse data sources.
- Presentations were given on how ResourceSync could also be useful
Preservation as a Process: MetaArchive and Distributed Digital Preservation - Educopia
This document provides an overview of MetaArchive, a distributed digital preservation cooperative. It discusses MetaArchive's history and practices, including that it was founded in 2004, aims to prevent data loss through distributing copies across multiple institutions, and involves members maintaining control over their own content. The document outlines MetaArchive's membership, which involves annual fees and responsibilities like hosting a cache server. It also reviews MetaArchive's ingest process, which includes preparing content, developing collection plugins, and testing before the collection is replicated across the network.
Mind the gap! Reflections on the state of repository data harvesting - Simeon Warner
A 24x7 presentation at Open Repositories 2017 in Brisbane, Australia.
I start with an opinionated history of the evolution of repository data harvesting from the late 1990s to the present. A conclusion is that we are currently in danger of creating a repository environment with fewer cross-repository services than before, with the potential to reinforce the silos we hope to open. I suggest that the community needs to agree upon a new solution, and further suggest that solution should be ResourceSync.
This document provides an overview of content management systems (CMS) and their use for digital humanities projects. It discusses what a CMS is, popular open-source CMS platforms like WordPress and Omeka, and how to set up and customize WordPress and Omeka sites. The workshop aims to help participants understand the functionality of CMS platforms and how to choose one suitable for their project needs. The agenda includes hands-on exercises for configuring WordPress and Omeka sites.
Metadata Provenance Tutorial Part 2: Interoperable Metadata Provenance - Magnus Pfeffer
Tutorial held at the Semantic Web in Libraries conference in Hamburg, Germany, on November 25, 2013. The tutorial was held together with Kai Eckert, who presented Part 1.
Abstract:
When metadata is distributed, combined, and enriched as Linked Data, the tracking of its provenance becomes a hard issue. Using data encumbered with licenses that require attribution of authorship may eventually become impracticable as more and more data sets are aggregated - one of the main motivations for the call to open data under permissive licenses like CC0. Nonetheless, there are important scenarios where keeping track of provenance information becomes a necessity. A typical example is the enrichment of existing data with automatically obtained data, for instance as a result of automatic indexing. Ideally, the origins, conditions, rules and other means of production of every statement are known and can be used to put it into the right context.
Part 1 - Metadata Provenance in RDF: In RDF, the mere representation of provenance - i.e., statements about statements - is challenging. We explore the possibilities, from the unloved reification and other proposed alternative Linked Data practices through to named graphs and recent developments regarding the upcoming next version of RDF.
Part 2 - Interoperable Metadata Provenance: As with metadata itself, common vocabularies and data models are needed to express basic provenance information in an interoperable fashion. We investigate the PROV model that is currently developed by the W3C Provenance Working Group and compare it to Dublin Core as a representative of a flat, descriptive metadata schema.
We actively encourage participants to present their own use cases and open challenges at this workshop. Please contact the organizers for details.
Prior experience: The workshop is intended for participants who have mastered the basics of linked data and want to delve into expressing provenance. Beyond a basic understanding of RDF, the linked data principles, and the use of ontologies (such as Dublin Core or Bibo) to express bibliographic metadata, no specialised knowledge is required.
Slides for a presentation made at the Archives Association of British Columbia's 2016 Annual Conference, April 15, 2016, held in Vancouver, BC, Canada.
The slides aim to provide users with a basic introduction to some of the key considerations when implementing a digital preservation plan, describing the workflow with a series of cooking-related references.
Web Archive services framework for tighter integration between the past and ... - Ahmed AlSum
This document describes Ahmed AlSum's PhD defense from February 2014 at Old Dominion University. It discusses his proposal for a Web Archive Services Framework to provide tighter integration between past and present web content. The framework includes several proposed services - a Content Service to access archived web pages, a Metadata Service to retrieve metadata like page titles and thumbnails, a URI Service to handle URI lookups across archives using HTTP redirection, and an overarching Archive Service. The goal is to develop standardized APIs and services to make archived web content more programmatically accessible and help researchers analyze trends over time.
This presentation looks back at several efforts, conducted in the past fifteen years, aimed at establishing interoperability for web-based scholarly communication. It tries to characterize the perspectives/approaches taken by these efforts and, based upon that, proposes a HATEOAS-based approach to interlink scholarly nodes on the web. This was first presented at the Research Data Alliance meeting in Paris, France, September 22, 2015.
This document summarizes Claire Knowles' presentation on updates from the Open Repositories 2014 conference regarding DSpace. The conference had over 460 attendees from 38 countries discussing repository topics. DSpace version 4 was recently released with new features, and version 5 is planned for late 2014 focusing on ORCID support, metadata for all objects, and streaming audio/visuals. Jisc is working on a repository shared services project to integrate key repository services and support open access.
This document provides an overview of curation and the Omeka content management system. It discusses how curation involves collecting, organizing and displaying information. Omeka is introduced as a platform developed by the Center for History and New Media to publish digital collections and exhibitions. The document reviews Omeka's core features and functionality, provides examples of how it can be used for education, and gives a brief introduction to Dublin Core metadata standards for cataloging digital objects.
Graph Structure in the Web - Revisited. WWW2014 Web Science Track - Chris Bizer
The document discusses research that revisits the graph structure of the web using a new large crawl from Common Crawl. It finds that the web has become more dense and connected over time, with the largest strongly connected component growing significantly. While previous research found power laws for in- and out-degrees, this data does not fit power laws and instead has heavy-tailed distributions. The shape of the bow-tie structure also depends on the specific crawl used. The authors provide the new crawl data and analysis to enable further research on the evolving structure of the web graph.
Talk at the 3rd DBpedia Community Meeting in Dublin about the integration of the Web Protégé ontology editor into DBpedia by the Corporate Semantic Web group at Freie Universität Berlin.
A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary ... - Robert Meusel
Promoted by major search engines, schema.org has become a widely adopted standard for marking up structured data in HTML web pages. In this paper, we use a series of large-scale Web crawls to analyze the evolution and adoption of schema.org over time. The availability of data from different points in time for both the schema and the websites deploying data allows for a new kind of empirical analysis of standards adoption, which has not been possible before. To conduct our analysis, we compare different versions of the schema.org vocabulary to the data that was deployed on hundreds of thousands of Web pages at different points in time. We measure both top-down adoption (i.e., the extent to which changes in the schema are adopted by data providers) as well as bottom-up evolution (i.e., the extent to which the actually deployed data drives changes in the schema). Our empirical analysis shows that both processes can be observed.
Cataloging Landscape Update: RDA and LC Working Group on the Future of Biblio... - kramsey
The document summarizes the report from the Library of Congress Working Group on the Future of Bibliographic Control. The report recommends that bibliographic control become more collaborative, decentralized, web-based, and international in scope. It suggests making efficiency improvements, enhancing access to special collections, positioning technology and the community for the future, and strengthening the library and information science profession. Key themes are economics, standards, cooperation, users, and research. The LC plans to analyze the recommendations and work with the library community to respond and implement changes over time.
Presented at the International Internet Preservation Consortium (IIPC) Web Archiving Week, University of London, 16 June 2017.
Web archiving has become imperative to ensure that our digital heritage does not disappear forever, yet many institutions have not begun this work. In addition, archived websites are not easily discoverable, which severely limits their use. To address this challenge, OCLC Research has established the OCLC Research Library Partnership Web Archiving Metadata Working Group to develop a data dictionary that will be compatible with library and archives standards. Three reports on this project are available in July 2017, focused on metadata best practices guidelines, user needs and behaviors, and evaluation of web archiving tools.
More information: oc.lc/wam
Contact: Jackie Dooley, dooleyj@oclc.org
Web Archive Profiling Through Fulltext Search - Sawood Alam
Through analyzing full text search results from web archives, the authors developed a method called the Random Searcher Model (RSM) to efficiently generate profiles of web archive collections with low overhead. The profiles accurately predict an archive's likelihood of containing a URI's mementos while minimizing search costs. Different RSM modes allow customization based on collection characteristics. The authors recommend profile policies and RSM modes to balance accuracy, recall, and costs depending on available archive metadata. Future work includes combining profile attributes and evaluating profiles for applications beyond memento routing.
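The routing decision such profiles support can be illustrated with a toy sketch (this is not the authors' Random Searcher Model itself, and the profile data is hypothetical): a profile maps URI-keys to estimated memento counts, and an aggregator consults it before spending a query on the archive.

```python
from urllib.parse import urlparse

def uri_key(uri):
    """Reduce a URI to a simple profile key (the hostname here;
    real archive profiles use finer-grained URI-key policies)."""
    return urlparse(uri).hostname

def should_route(profile, uri, threshold=1):
    """Route a lookup to an archive only if its profile suggests it
    holds at least `threshold` mementos for the URI's key."""
    return profile.get(uri_key(uri), 0) >= threshold

# Hypothetical profile learned for one archive
profile = {"example.org": 120, "news.example.com": 3}
should_route(profile, "http://example.org/page")      # worth querying
should_route(profile, "http://unknown.example/page")  # skip this archive
```

The accuracy/cost trade-off the authors describe corresponds to how the profile was built and how fine-grained its URI-keys are.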
Looks at hyperlinks from the perspective of a managed collection of resources for which link persistence/integrity is considered a quality of service concern. Distinguishes between links into other managed collections and to the web at large. Considers link rot and content drift.
Introduction to Linked Data Platform (LDP) - Hector Correa
The Linked Data Platform (LDP) defines rules for HTTP operations on web resources to provide an architecture for read-write Linked Data on the web. Key concepts include resources, RDF sources, non-RDF sources, and containers. LDP uses HTTP requests and responses to create, retrieve, update, and delete resources. Resources can be contained within different types of containers, including basic, direct, and indirect containers. LDP provides a standard way to manage Linked Data using HTTP.
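The create operation, for instance, is an HTTP POST to a container. A sketch that only assembles such a request as text (the host, container path, and slug are placeholders; actually creating the resource requires a live LDP server):

```python
def ldp_create_request(container_path, slug, turtle_body, host="ldp.example"):
    """Assemble an HTTP POST asking an LDP server to create an RDF
    source inside a container, using the LDP interaction-model
    Link header and a Turtle body."""
    body_bytes = turtle_body.encode("utf-8")
    headers = [
        f"POST {container_path} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: text/turtle",
        f"Slug: {slug}",
        'Link: <http://www.w3.org/ns/ldp#RDFSource>; rel="type"',
        f"Content-Length: {len(body_bytes)}",
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + turtle_body

req = ldp_create_request(
    "/containers/books/", "moby-dick",
    '<> <http://purl.org/dc/terms/title> "Moby Dick" .')
```

On success the server answers 201 Created with a Location header for the new resource; GET, PUT, and DELETE on that location cover the rest of the lifecycle.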
This document describes a web service that analyzes web crawl data to provide contextual information about locations. It extracts topics like weather, healthcare, crime, and employment that are relevant to a given location from common crawl data stored on Amazon S3. The system uses Apache Pig on a Hadoop cluster to analyze the data, builds an index of locations to associated words, and makes the results searchable through Elastic Search. It aims to provide useful information to people moving to new places, policy makers, journalists, and researchers.
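The indexing step described above can be illustrated with a toy in-memory version (the actual system does this at scale with Apache Pig and Elastic Search; the locations and words below are made up):

```python
from collections import defaultdict

def build_location_index(pages):
    """Invert (location, words) pairs extracted from crawl data
    into a location -> word -> frequency index."""
    index = defaultdict(lambda: defaultdict(int))
    for location, words in pages:
        for word in words:
            index[location][word.lower()] += 1
    return index

# Hypothetical extractions from two crawled pages
pages = [
    ("Norfolk, VA", ["weather", "employment", "weather"]),
    ("Dublin", ["healthcare", "crime"]),
]
index = build_location_index(pages)
index["Norfolk, VA"]["weather"]  # -> 2
```

Searching the index then amounts to looking up a location and ranking its associated topic words by frequency.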
Information sharing about Columbia University Library's recent web archiving ... - Anna Perricci
This conference at Columbia University focused on web archiving tools and models. It featured presentations on projects funded by Mellon grants that developed new tools and collaborative models for web archiving. These included projects that expanded access to legal documents, created new platforms for storing and analyzing web archives, and developed tools for curating web archive collections. The conference provided an opportunity for participants to discuss challenges and opportunities for further collaboration in web archiving.
Slides for a workshop session on "Building an Accessible Digital Institution" facilitated by Brian Kelly, Innovation Advocate, Cetis at the Cetis conference held at the University of Bolton on 17-18 June 2014.
See http://www.slideshare.net/Thebriankelly/building-an-accessible-digital-institution
Mind the gap! Reflections on the state of repository data harvestingSimeon Warner
Â
A 24x7 presentation at Open Repositories 2017 in Brisbane, Australia.
I start with an opinionated history of the evolution of repository data harvesting since the late 1990's to the present. A conclusion is that we are currently in danger of creating a repository environment with fewer cross-repository services than before, with the potential to reinforce the silos we hope to open. I suggest that the community needs to agree upon a new solution, and further suggest that solution should be ResourceSync.
This document provides an overview of content management systems (CMS) and their use for digital humanities projects. It discusses what a CMS is, popular open-source CMS platforms like WordPress and Omeka, and how to set up and customize WordPress and Omeka sites. The workshop aims to help participants understand the functionality of CMS platforms and how to choose one suitable for their project needs. The agenda includes hands-on exercises for configuring WordPress and Omeka sites.
Metadata Provenance Tutorial Part 2: Interoperable Metadata ProvenanceMagnus Pfeffer
Â
Tutorial held at the Semantic Web in Libraries conference in Hamburg, Germany, at November 25th 2013. The tutorial was held together with Kai Eckert, who did Part 1.
Abstract:
When metadata is distributed, combined, and enriched as Linked Data, the tracking of its provenance becomes a hard issue. Using data encumbered with licenses that require attribution of authorship may eventually become impracticable as more and more data sets are aggregated - one of the main motivations for the call to open data under permissive licenses like CC0. Nonetheless, there are important scenarios where keeping track of provenance information becomes a necessity. A typical example is the enrichment of existing data with automatically obtained data, for instance as a result of automatic indexing. Ideally, the origins, conditions, rules and other means of production of every statement are known and can be used to put it into the right context.
Part 1 - Metadata Provenance in RDF: In RDF, the mere representation of provenance - i.e., statements about statements - is challenging. We explore the possibilities, from the unloved reification and other proposed alternative Linked Data practices through to named graphs and recent developments regarding the upcoming next version of RDF.
Part 2 - Interoperable Metadata Provenance: As with metadata itself, common vocabularies and data models are needed to express basic provenance information in an interoperable fashion. We investigate the PROV model that is currently developed by the W3C Provenance Working Group and compare it to Dublin Core as a representative of a flat, descriptive metadata schema.
We actively encourage participants to present their own use cases and open challenges at this workshop. Please contact the organizers for details.
Prior experience: The workshop is intended for participants who have mastered the basics of linked data and want to delve into expressing provenance. Beside a basic understanding of RDF, the linked data principles and the use of ontologies (like Dublin Core or Bibo) to express bibliographic metadata no specialised knowledge is required.
Slides for a presentation made at the Archives Association of British Columbia's 2016 Annual Conference, April 15, 2016, held in Vancouver, BC, Canada.
The slides aim to provide users with a basic introduction to some of the key considerations when implementing a digital preservation plan, describing the workflow with a series of cooking-related references.
"Web Archive services framework for tighter integration between the past and ...Ahmed AlSum
Â
This document describes Ahmed AlSum's PhD defense from February 2014 at Old Dominion University. It discusses his proposal for a Web Archive Services Framework to provide tighter integration between past and present web content. The framework includes several proposed services - a Content Service to access archived web pages, a Metadata Service to retrieve metadata like page titles and thumbnails, a URI Service to handle URI lookups across archives using HTTP redirection, and an overarching Archive Service. The goal is to develop standardized APIs and services to make archived web content more programmatically accessible and help researchers analyze trends over time.
This presentation looks back at several efforts, conducted in the past fifteen years, aimed at establishing interoperability for web-based scholarly communication. It tries to characterize the perspectives/approaches taken by these efforts and, based upon that, proposes an HATEOS-based approach to interlink scholarly nodes on the web. This was first presented at the Research Data Alliance meeting in Paris, France, September 22 2015.
This document summarizes Claire Knowles' presentation on updates from the Open Repositories 2014 conference regarding DSpace. The conference had over 460 attendees from 38 countries discussing repository topics. DSpace version 4 was recently released with new features, and version 5 is planned for late 2014 focusing on ORCID support, metadata for all objects, and streaming audio/visuals. Jisc is working on a repository shared services project to integrate key repository services and support open access.
This document provides an overview of curation and the Omeka content management system. It discusses how curation involves collecting, organizing and displaying information. Omeka is introduced as a platform developed by the Center for History and New Media to publish digital collections and exhibitions. The document reviews Omeka's core features and functionality, provides examples of how it can be used for education, and gives a brief introduction to Dublin Core metadata standards for cataloging digital objects.
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackChris Bizer
Â
The document discusses research that revisits the graph structure of the web using a new large crawl from Common Crawl. It finds that the web has become more dense and connected over time, with the largest strongly connected component growing significantly. While previous research found power laws for in- and out-degrees, this data does not fit power laws and instead has heavy-tailed distributions. The shape of the bow-tie structure also depends on the specific crawl used. The authors provide the new crawl data and analysis to enable further research on the evolving structure of the web graph.
Talk at the 3rd DBpedia Community Meeting in Dublin about the integration of the Web ProtÊgÊ ontology editor into DBpedia by the Corporate Semantic Web group at Freie Universität Berlin.
A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary ...Robert Meusel
Â
Promoted by major search engines, schema.org has become a widely adopted standard for marking up structured data in HTML web pages. In this paper, we use a series of largescale Web crawls to analyze the evolution and adoption of schema.org over time. The availability of data from dierent points in time for both the schema and the websites deploying data allows for a new kind of empirical analysis of standards adoption, which has not been possible before. To conduct our analysis, we compare dierent versions of the schema.org vocabulary to the data that was deployed on hundreds of thousands of Web pages at dierent points in time. We measure both top-down adoption (i.e., the extent to which changes in the schema are adopted by data providers) as well as bottom-up evolution (i.e., the extent to which the actually deployed data drives changes in the schema). Our empirical analysis shows that both processes can be observed.
Cataloging Landscape Update: RDA and LC Working Group on the Future of Biblio...kramsey
Â
The document summarizes the report from the Library of Congress Working Group on the Future of Bibliographic Control. The report recommends that bibliographic control become more collaborative, decentralized, web-based, and international in scope. It suggests making efficiency improvements, enhancing access to special collections, positioning technology and the community for the future, and strengthening the library and information science profession. Key themes are economics, standards, cooperation, users, and research. The LC plans to analyze the recommendations and work with the library community to respond and implement changes over time.
Presented at the International Internet Preservation Consortium (IIPC) Web Archiving Week, University of London, 16 June 2017.
Web archiving has become imperative to ensure that our digital heritage does not disappear forever, yet many institutions have not begun this work. In addition, archived websites are not easily discoverable, which severely limits their use. To address this challenge, OCLC Research has established the OCLC Research Library Partnership Web Archiving Metadata Working Group to develop a data dictionary that will be compatible with library and archives standards. Three reports on this project are available in July 2017, focused on metadata best practices guidelines, user needs and behaviors, and evaluation of web archiving tools.
More information: oc.lc/wam
Contact: Jackie Dooley, dooleyj@oclc.org
Web Archive Profiling Through Fulltext SearchSawood Alam
Â
Through analyzing full text search results from web archives, the authors developed a method called the Random Searcher Model (RSM) to efficiently generate profiles of web archive collections with low overhead. The profiles accurately predict an archive's likelihood of containing a URI's mementos while minimizing search costs. Different RSM modes allow customization based on collection characteristics. The authors recommend profile policies and RSM modes to balance accuracy, recall, and costs depending on available archive metadata. Future work includes combining profile attributes and evaluating profiles for applications beyond memento routing.
Looks at hyperlinks from the perspective of a managed collection of resources for which link persistence/integrity is considered a quality of service concern. Distinguishes between links into other managed collections and to the web at large. Considers link rot and content drift.
Introduction to Linked Data Platform (LDP) (Hector Correa)
The Linked Data Platform (LDP) defines rules for HTTP operations on web resources to provide an architecture for read-write Linked Data on the web. Key concepts include resources, RDF sources, non-RDF sources, and containers. LDP uses HTTP requests and responses to create, retrieve, update, and delete resources. Resources can be contained within different types of containers, including basic, direct, and indirect containers. LDP provides a standard way to manage Linked Data using HTTP.
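As a concrete illustration of creating a resource inside a Basic Container, the sketch below assembles the HTTP request an LDP client would send. The container URL, the `create_request` helper, and the Turtle body are hypothetical; the `Link` and `Slug` header conventions follow the LDP specification.

```python
# Sketch of an LDP create operation: POST to a Basic Container to mint a
# new member resource. The server URL and helper name are hypothetical.
LDP_RESOURCE = "http://www.w3.org/ns/ldp#Resource"

def create_request(container_url, slug, turtle_body):
    """Assemble an HTTP POST asking the container to create a new member."""
    return {
        "method": "POST",
        "url": container_url,
        "headers": {
            "Content-Type": "text/turtle",
            "Slug": slug,  # suggested name for the new resource
            "Link": f'<{LDP_RESOURCE}>; rel="type"',
        },
        "body": turtle_body,
    }

req = create_request("http://example.org/container/", "alice",
                     "<> a <http://xmlns.com/foaf/0.1/Person> .")
```

On success the server responds `201 Created` with a `Location` header naming the new resource, which then appears as an `ldp:contains` member of the container.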
This document describes a web service that analyzes web crawl data to provide contextual information about locations. It extracts topics like weather, healthcare, crime, and employment that are relevant to a given location from common crawl data stored on Amazon S3. The system uses Apache Pig on a Hadoop cluster to analyze the data, builds an index of locations to associated words, and makes the results searchable through Elastic Search. It aims to provide useful information to people moving to new places, policy makers, journalists, and researchers.
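The location-to-words index at the core of that pipeline can be sketched in miniature. The crawl snippets and topic vocabulary below are invented; the real system builds this index at scale with Apache Pig and serves it through Elasticsearch.

```python
# Toy sketch of the location-to-topic-words index described above;
# the documents and topic list are invented for illustration.
from collections import defaultdict

DOCS = [
    ("Norfolk, VA", "mild weather and growing healthcare employment"),
    ("Chicago, IL", "cold weather, strong employment, rising crime"),
]
TOPIC_WORDS = {"weather", "healthcare", "crime", "employment"}

def build_index(docs):
    """Map each location to the topic words found in its crawl text."""
    index = defaultdict(set)
    for location, text in docs:
        for word in text.replace(",", " ").split():
            if word in TOPIC_WORDS:
                index[location].add(word)
    return index

index = build_index(DOCS)
print(sorted(index["Chicago, IL"]))  # ['crime', 'employment', 'weather']
```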
Information sharing about Columbia University Library's recent web archiving ... (Anna Perricci)
This conference at Columbia University focused on web archiving tools and models. It featured presentations on projects funded by Mellon grants that developed new tools and collaborative models for web archiving. These included projects that expanded access to legal documents, created new platforms for storing and analyzing web archives, and developed tools for curating web archive collections. The conference provided an opportunity for participants to discuss challenges and opportunities for further collaboration in web archiving.
Slides for a workshop session on "Building an Accessible Digital Institution" facilitated by Brian Kelly, Innovation Advocate, Cetis at the Cetis conference held at the University of Bolton on 17-18 June 2014.
See http://www.slideshare.net/Thebriankelly/building-an-accessible-digital-institution
Aggregating Private and Public Web Archives Using the Mementity Framework (Mat Kelly)
This document outlines Mat Kelly's PhD dissertation defense. The defense will address aggregating private and public web archives using the Mementity framework. Kelly will defend his dissertation to a committee chaired by Michele Weigle on May 7, 2019. The dissertation addresses challenges around capturing and replaying private content from the web, including content behind authentication or that requires special handling when aggregated. It proposes research questions around difficult to archive content types, comparing browser and crawler capabilities, issues with authenticated content, signaling content that needs special handling, and access controls for private archives.
Archiving Web-Based #musetech for Institutional Memory (Samantha Norling)
Museum websites, blog and social media posts, gallery interactives, dashboards and microsites: these and other web-based content created by museum technologists contain a wealth of information about our institutions. Documenting everything from collections and exhibitions to public programs and staff activities, content created and shared on the web forms a vital part of a museum's institutional memory shared by its staff, audiences, and the communities of which it is a part.
While we'd like to think that web-based content and applications will live forever, the reality is that they often have a predetermined (or worse, unexpectedly shortened) active life on the web. Whether tied to a temporary exhibition or event, superseded by more current content, replaced by newer technologies, or fallen to technical obsolescence, retired web-based content can and should be archived for continued access to information in context.
This session will provide an overview of the web archiving landscape (best practices, available tools and resources, relevant initiatives). The web archiving activities of the Newfields Lab, in collaboration with Newfields Archives, will serve as a case study. To date, the Newfields web archives include imamuseum.org, various blogs, the IMA Dashboard, and exhibition-related interactives and microsites: content which now serves a variety of uses as archives.
This is usually a memorandum of understanding between the repository management team and the institution's research office, which library top management uses to assess the quality of the repository and whether the repository is meeting the institution's business or academic objectives.
Information technology and resources are an integral and indispensable part of the contemporary academic enterprise. In particular, technological advances have nurtured a new paradigm of data-intensive research. However, far too much of this activity still takes place in silos, to the detriment of open scholarly inquiry, integrity, and advancement. To counteract this tendency, the University of California Curation Center (UC3) has been developing and deploying a comprehensive suite of curation services that facilitate widespread data management, preservation, publication, sharing, and reuse. Through these services UC3 is engaging with new communities of use: in addition to its traditional stakeholders in cultural heritage memory organizations (e.g., libraries, museums, and archives), the UC3 service suite is now attracting significant adoption by research projects, laboratories, and individual faculty researchers. This webinar will present an introduction to five specific services (DMPTool, DataUp, EZID, Merritt, and the Web Archiving Service (WAS)) applicable to data curation throughout the scholarly lifecycle, two recent initiatives in collaboration with UC campuses (UC Berkeley Research Hub and UC San Francisco DataShare), and the ways in which they encourage and promote new communities of practice and greater transparency in scholarly research.
Capture All the URLs: First Steps in Web Archiving (Kristen Yarmey)
The document summarizes a webinar on getting started with web archiving. It discusses making the case for a web archiving program, selecting content, crawling and scoping websites, providing access to archived content, and building a sustainable program through policies, metadata, quality control, and addressing challenges. The webinar covered lessons learned and next steps such as additional outreach and exploring new technologies and uses for archived web content.
How you and your gateway can benefit from the services of the Science Gateway... (Katherine Lawrence)
January 2017 webinar of the Science Gateways Community Institute. Recording and additional details available at http://sciencegateways.org/upcoming-events/webinars/#previous
Web archiving challenges and opportunities (Ahmed AlSum)
The document discusses challenges and opportunities in web archiving. It outlines the key stages in the web archiving lifecycle including selection of content, harvesting techniques, storage formats and infrastructure, ways to provide access, and the role of community. Specific challenges are discussed such as representing dynamic and social media content, optimizing storage solutions, and addressing limitations of current access interfaces. Opportunities exist in focusing collection efforts on underrepresented regions, leveraging existing archived data, and developing innovative services and tools to support researchers.
The document discusses the challenges of preserving web resources and web services. It notes that the ubiquity of the web, complicated interactions of resources and services, and dynamic nature of web 2.0 technologies like user-generated content pose new preservation challenges. It provides contact information for the JISC PoWR project at UKOLN and ULCC, which aims to address these issues through workshops, reports, and an online blog.
Unleashing library services with web 2.0 (ss) (Dhanashree Date)
This presentation introduces some of the more often used Web 2.0 tools; examples illustrate appropriate use of these tools along with their benefits and downsides. A SWOT analysis provides different perspectives on embracing Web 2.0 in libraries, and the responsibilities that follow from adopting Web 2.0 are highlighted.
METRO Conference 2014: How collaboration can save [more of] the web: recent p... (Anna Perricci)
Note: these slides are very similar to another presentation with the same title presented at the Best Practices Exchange 2013 (some updates on the citation analysis project are in this presentation)
The goals of this presentation are to share case studies of evolving and thriving web archiving programs and inspire further discussion on how web archiving efforts can be strengthened through collaboration.
This document discusses building a software tool to archive websites using web crawling and blockchain technology. It proposes a system that crawls websites, stores web page content and metadata in WARC files, and records this information in a blockchain database with two layers - a domain blockchain to store domain information and a web content blockchain to store WARC files. This approach aims to provide a consistent and secure system for archiving websites while allowing users to monitor and analyze archived web content. The document reviews related work on web archiving and outlines the proposed system architecture and implementation requirements.
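A minimal hash-chained ledger conveys the two-layer idea: each block's hash covers its predecessor's hash and its payload, and a content block can reference a domain block by hash. This is a simplified sketch under invented field names, not the proposed system's implementation.

```python
# Minimal hash-chained ledger sketch for the two-layer design described
# above; the field names and payloads are illustrative.
import hashlib
import json

def make_block(prev_hash, payload):
    """Create a block whose hash covers the previous hash and payload."""
    block = {"prev": prev_hash, "payload": payload}
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block

GENESIS = "0" * 64
# Domain layer: records information about the archived domain.
domain_block = make_block(GENESIS, {"domain": "example.com"})
# Content layer: records the WARC file and points back at the domain block.
content_block = make_block(GENESIS, {"warc": "example-20240101.warc.gz",
                                     "domain_ref": domain_block["hash"]})

def verify(block):
    """Recompute the hash to confirm the block was not tampered with."""
    body = {"prev": block["prev"], "payload": block["payload"]}
    return hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest() == block["hash"]
```

Because each hash covers the previous one, altering any archived WARC's metadata invalidates every later block, which is what gives the archive its tamper-evidence.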
Estermann Wikidata and Heritage Data 20170914 (Beat Estermann)
This document discusses Wikidata and cultural heritage data. It aims to establish Wikidata as a central hub for cultural heritage data by ingesting related data and enhancing it. Key challenges include getting institutions to provide open data, assisting with data scraping, addressing coverage biases, mapping data models during ingestion, and dealing with incorrect data. Maintaining data quality over time through processes like updating and dispute resolution is also challenging. The document explores how Wikidata can better integrate with other databases and cultural heritage organizations to maximize data sharing and reuse.
Slides from our tutorial on Linked Data generation in the energy domain, presented at the Sustainable Places 2014 conference on October 2nd in Nice, France
Web 2.0 refers to second-generation Internet-based services that emphasize online collaboration and sharing among users. It is characterized by dynamic or user-generated content and social media growth. Organizations can benefit from Web 2.0 through reduced costs, enhanced customer loyalty, and effective low-cost marketing. Popular Web 2.0 tools include blogs, wikis, RSS feeds, social bookmarking, social networking, online photo galleries, and audio/video casting. These tools encourage participation, collaboration, and sharing of information online.
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work... (Anna Perricci)
This is the main slide deck for a workshop at iPRES 2018 on human scale web collecting. A primary focus of the presentation was the use of Webrecorder.io, a free, open source web archiving tool available to all.
This document discusses OpenDaylight documentation. It provides an overview of OpenDaylight, an open source SDN project. It describes the OpenDaylight documentation workflow using tools like AsciiDoc, Git and Gerrit. It also explains the process for joining the OpenDaylight documentation community and contributing documentation changes.
BS 8878: Systematic Approaches to Documenting Web Accessibility Policies and ... (lisbk)
Slides for a workshop session on "BS 8878: Systematic Approaches to Documenting Web Accessibility Policies and Practices" facilitated by Brian Kelly at the IWMW 2015 event held at Edge Hill University, Ormskirk on 27 July 2015.
See http://iwmw.org/iwmw2015/talks/systematic-approaches-to-documenting-web-accessibility-policies-and-practices/
Similar to JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Public Web Archives (20)
A Framework for Aggregating Public and Private Web Archives (Mat Kelly)
This document proposes a framework for aggregating private and public web archives. It discusses the current state of memento aggregation and outlines ways to make timemaps and aggregation more expressive. This includes adding attributes to timemaps to provide more information about mementos without requiring full dereferencing, such as status codes, content digests, and indicators of private versus public captures. The framework aims to provide a more comprehensive view of the archived web by incorporating both personal and non-aggregated archives.
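To make the idea of a more expressive TimeMap concrete, the sketch below serializes a single memento entry in `application/link-format` with extra attributes attached. The attribute names `status` and `access`, the helper function, and the URLs are illustrative assumptions, not a finalized specification.

```python
# Sketch of an extended TimeMap entry; the extra attributes (`status`,
# `access`) and the URLs are illustrative, not a finalized spec.
def memento_link(uri_m, datetime_str, **attrs):
    """Serialize one memento as an application/link-format line,
    appending any extra attributes after the standard ones."""
    parts = [f"<{uri_m}>", 'rel="memento"', f'datetime="{datetime_str}"']
    parts += [f'{k}="{v}"' for k, v in attrs.items()]
    return "; ".join(parts)

line = memento_link("http://archive.example/web/20190507/http://a.com/",
                    "Tue, 07 May 2019 12:00:00 GMT",
                    status="200", access="private")
```

A client reading such a TimeMap could, for instance, skip dereferencing entries marked `access="private"` for which it holds no credentials, without issuing a single extra request.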
Exploring Aggregation of Personal, Private, and Institutional Web Archives (Mat Kelly)
Mat Kelly presented a framework for aggregating personal, private, and institutional web archives while maintaining access control. The framework includes separate timemaps for different types of captures that could be aggregated while restricting access to private captures. Kelly sought input on use cases around access control for private web archives and mechanisms for protecting archived web pages. The presentation explored challenges in replaying private archives alongside public ones from institutions and how the framework could address these issues.
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C... (Mat Kelly)
The document describes a system for generating thumbnail summaries of large collections of web archive mementos. The system uses SimHash to identify sufficiently unique mementos based on similarities and differences in HTML markup. It calculates Hamming distance between memento SimHashes to select a subset for the summary that limits redundancy while preserving important captures. The visualizations generated by the system provide an overview of a website's evolution over time using 3-6 representative thumbnails.
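The selection step can be sketched as follows: compute a SimHash signature per memento, then keep a memento only if its Hamming distance to every already-kept signature exceeds a threshold. This is a simplified illustration of the general technique, not the system's actual code; the 64-bit width, MD5 token hashing, and threshold are arbitrary choices.

```python
# Simplified SimHash de-duplication in the spirit of the system above;
# bit width, token hash, and threshold are illustrative choices.
import hashlib

def simhash(tokens, bits=64):
    """Classic SimHash: each token's hash casts a signed vote per bit."""
    vote = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            vote[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if vote[i] > 0)

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

def select_unique(mementos, threshold=3):
    """Keep a memento only if it differs enough from all kept ones;
    returns how many survive. `mementos` is a list of token lists."""
    kept = []
    for tokens in mementos:
        sig = simhash(tokens)
        if all(hamming(sig, s) > threshold for s in kept):
            kept.append(sig)
    return len(kept)
```

Near-duplicate captures hash to nearby signatures, so raising the threshold shrinks the summary while a threshold of zero keeps every distinct capture.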
The document provides an overview of browser-based digital preservation including:
- The current state of digital preservation which relies on web crawlers and archives like the Internet Archive. However, this approach is insufficient for preserving pages that are not popular, behind authentication, or use complex JavaScript.
- The requirements for new software to directly capture and preserve web pages from within the browser in order to address the limitations of current archival approaches.
- A proposed system called "WARCreate" that would leverage the Chrome extension API to capture web pages and resources and generate WARC files for preservation while maintaining the original browsing context.
Archive What I See Now - Archive-It Partner Meeting 2013 (Mat Kelly)
This document summarizes a presentation about enabling individual web archiving. It discusses tools like WARCreate and WAIL that allow users to archive web pages from their browser in WARC format. Issues addressed include timely capture of breaking news, preserving original context like user profiles, and uploading personal archives to institutional archives. Goals of the Archive What I See Now project are to port WARCreate to Firefox, add capabilities to upload WARCs, and implement sequential archiving of linked resources.
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System (Mat Kelly)
This document describes a graph-based visualization system for navigating and predicting box office performance. The system represents movie data as interconnected nodes in a graph layout. Selecting different nodes allows navigation between the movie context and related contexts like actors. Node size and position encode attributes relevant to box office predictions. The system preprocesses and caches external data to make complex predictions accessible through an interactive visual interface.
The document introduces WARCreate and WAIL, tools that make web archiving easier. WARCreate allows users to archive web pages they see in their browser directly as WARC files, preserving context. WAIL packages existing tools like Heritrix and Wayback into a graphical user interface, allowing one-click archiving. Together these tools aim to make web archiving more accessible to personal archivists while still producing outputs compatible with institutional tools and standards.
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving (Mat Kelly)
The document describes a set of tools that make enterprise-level web archiving accessible for personal use. The tools include a crawler (Heritrix), a web archive player (Wayback Machine), and an archive inspector (WARC-Proxy) that are installed locally on a personal machine. The interface provides one-click options to set up crawls, view archived pages in the local Wayback installation, and check archive status. It aims to support personal web archiving through a graphical user interface that allows customizing crawls, starting/stopping services, and works with existing WARC files from other tools on Windows, MacOS, and Linux systems.
An Extensible Framework for Creating Personal Web Archives of Content Behind ... (Mat Kelly)
The document is a thesis that aims to develop an extensible framework for creating personal web archives of content behind authentication barriers. It discusses problems with current tools for personal web archiving, such as them breaking when sites change hierarchies and producing suboptimal archives. The thesis seeks to remedy such issues, preserve more social media content, and make archiving outputs more optimal. It utilizes tools like Archive Facebook and WARCreate to generate navigable archives in a format compatible with replay systems like Wayback Machine.
If Twitter is the "first draft of history," then we should be doing a better job of preserving it. For the one-year anniversary of the Egyptian revolution (2012), we revisited a sample of the shared social media content and found nearly 11% missing from the current web, with only 20% available in public web archives. Spurred by this, we sampled tweets for five other culturally important events from 2009-2012 and found similar rates of archiving and loss.
WARCreate - Create Wayback-Consumable WARC Files from Any Webpage (Mat Kelly)
The Internet Archive's Wayback Machine is the most common way that typical users interact with web archives. The Internet Archive uses the Heritrix web crawler to transform pages on the publicly available web into Web ARChive (WARC) files, which can then be accessed using the Wayback Machine. Because Heritrix can only access the publicly available web, many personal pages (e.g., password-protected pages, social media pages) cannot be easily archived into the standard WARC format. We have created a Google Chrome extension, WARCreate, that allows a user to create a WARC file from any webpage. Using this tool, content that might otherwise have been lost in time can be archived in a standard format by any user. This tool provides a way for casual users to easily create archives of personal online content. It is one of the first steps in resolving issues of long-term storage, maintenance, and access of personal digital assets that have emotional, intellectual, and historical value to individuals.
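The WARC format referenced above is plain enough to sketch by hand. The minimal record below follows the general WARC record layout (version line, named headers, blank line, payload) but abbreviates it: real tools such as WARCreate and Heritrix also emit headers like `WARC-Record-ID` and `WARC-Date`, and the example URI and payload are invented.

```python
# Hand-rolled sketch of a single WARC response record; abbreviated
# (real writers add WARC-Record-ID, WARC-Date, digests, etc.).
def warc_record(target_uri, http_payload: bytes) -> bytes:
    """Build one response record: headers, blank line, payload, trailer."""
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(http_payload)}\r\n"
        "\r\n"
    )
    return headers.encode() + http_payload + b"\r\n\r\n"

rec = warc_record("http://example.com/",
                  b"HTTP/1.1 200 OK\r\n\r\n<html>hi</html>")
```

A WARC file is simply a concatenation of such records, which is why replay systems like the Wayback Machine can consume output from many different capture tools.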
NDIIPP/NDSA 2011 - YouTube Link Restoration (Mat Kelly)
Creating Persistent Links to YouTube Music Videos
The document discusses the problem of links to YouTube videos becoming invalid when videos are removed. It proposes introducing a resolver service that redirects links to alternative copies of videos when the original link returns a 404 error. This service would also retrieve and publish metadata about videos to external websites to help find available copies when the initial link is broken. The goal is to create persistent links to YouTube music videos even if the specific video is removed from YouTube.
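The resolver's core decision can be sketched in a few lines: pass working links through unchanged, and on a 404 fall back to a known alternative copy. The alternatives table, identifiers, and URLs below are hypothetical placeholders for the metadata the service would gather.

```python
# Sketch of the proposed resolver's redirect decision; the alternatives
# table and all identifiers/URLs are hypothetical.
ALTERNATES = {"yt:abc123": ["http://mirror.example/abc123"]}

def resolve(video_id, original_url, status_code):
    """Return the URL a persistent link should redirect to, or None."""
    if status_code != 404:
        return original_url              # original copy still works
    copies = ALTERNATES.get(video_id, [])
    return copies[0] if copies else None  # no known copy: give up

print(resolve("yt:abc123", "http://youtube.example/abc123", 404))
```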
Archive Facebook is an add-on for Mozilla Firefox that allows users to create stand-alone archives of the content on their Facebook account. It preserves the look and feel of Facebook, unlike Facebook's native downloading option. The add-on lets users choose what specific types of content to archive, rather than limiting it to what Facebook allows. This ensures the archive is a true snapshot of the user's Facebook data and history. The add-on provides an easy-to-use interface to navigate and access archived content.
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Public Web Archives
1. A Framework for Aggregating
Private and Public Web Archives
Mat Kelly
Old Dominion University, Norfolk, VA
Advisor: Michele C. Weigle
JCDL 2015 Doctoral Consortium
June 21, 2015
2. The Problem
[Diagram: isolated private archives ("private archive", "other private archive") with no aggregation among them]
3. All Archives Cannot Be Aggregated
[Diagram: a TimeMap aggregates public captures, but the private archives and other private archives remain outside the aggregation]
14. [Timeline diagram: a TimeMap spanning captures from 1 year ago, 2 years ago, 10 years ago, … to 180 years ago]
16. Proactive Preservation
• Just-in-time WARC creation
• Personal and potentially private web archiving
• Mitigates the deferral problem
17. Public vs. Private Web Archiving
• Public Web Archiving
– Relies on deferred capture
– Uses WARC, Memento, etc.
– Integrates with other public web archives
• Private Web Archiving
– Same tools, less overhead, less bureaucracy
– Uses WARC, Memento, etc.
– Does not integrate
18. Typical Web Archive Access
1. Web User Interface
2. Memento
– TimeGate (URI-G)
– TimeMap
– Accept-Datetime (content negotiation)
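The Memento access pattern above relies on datetime content negotiation (RFC 7089): a client asks a TimeGate (URI-G) for the capture of a resource closest to a desired datetime via the `Accept-Datetime` header. A minimal sketch of building such a request, assuming a hypothetical TimeGate endpoint at `example.org`:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def timegate_request(uri_g: str, uri_r: str, when: datetime) -> dict:
    """Build a Memento datetime-negotiation request (RFC 7089).

    The TimeGate (URI-G) would respond 302 with a Location pointing at
    the memento (URI-M) whose datetime is closest to Accept-Datetime.
    """
    return {
        "url": f"{uri_g.rstrip('/')}/{uri_r}",
        "headers": {
            # Desired datetime in RFC 1123 format, GMT, per RFC 7089
            "Accept-Datetime": format_datetime(when, usegmt=True),
        },
    }

req = timegate_request(
    "http://example.org/timegate",   # hypothetical TimeGate base URI
    "http://www.cnn.com/",
    datetime(2015, 3, 5, tzinfo=timezone.utc),
)
print(req["headers"]["Accept-Datetime"])  # Thu, 05 Mar 2015 00:00:00 GMT
```

An actual client would then issue this request with any HTTP library and follow the redirect to the selected memento.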
19. Aggregating Multiple Web Archives
• Memento Aggregator
– Temporally sorted TimeMap combined from multiple archives
– Allows temporal gaps in one archive to be filled in by another
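The aggregator's core operation, merging per-archive TimeMaps into one temporally sorted TimeMap, can be sketched as follows (a simplification with hypothetical archive URIs; a real aggregator exchanges Link-format TimeMaps with many more attributes):

```python
from datetime import datetime

# Each archive contributes (memento URI, capture datetime) pairs.
archive_a = [
    ("http://archive-a.example/web/20150228/http://cnn.com/", datetime(2015, 2, 28)),
    ("http://archive-a.example/web/20150310/http://cnn.com/", datetime(2015, 3, 10)),
]
archive_b = [
    ("http://archive-b.example/web/20150305/http://cnn.com/", datetime(2015, 3, 5)),
]

def aggregate(*timemaps):
    """Merge per-archive TimeMaps into one list sorted by capture
    datetime, so a temporal gap in one archive is filled by another."""
    merged = [memento for tm in timemaps for memento in tm]
    return sorted(merged, key=lambda memento: memento[1])

combined = aggregate(archive_a, archive_b)
# Archive B's March 5 capture fills the gap between A's Feb 28 and Mar 10.
```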
20. Archive Supplementation
• More captures → greater temporal coverage
• Content on the Deep Web
• A large chunk of the Web is not preserved
– Tools' inability
– Inconsistency over time due to personalization
21. Concerns in Aggregating Private Web Archives
• Privacy
• Inconsistency of page representation
– URI is an insufficient key for access
• Archival integrity
– Has the private archive's content been manipulated?
22. Why Individuals Might Want Personalized Aggregations
• Show my private web archive captures
• Concerned about exposing sensitive info to the public
– But still want to view captures temporally inline
• Private/restricted archives are becoming ever more common
24. My Archives Have
What They May Have Missed
25. The Concerns Distilled
• Access Control
– And indicators for PWAs
• Preservation of Private Content
• Interoperability without privacy compromise
26. Web Archive Usage Pattern 1: Direct Access
[Diagram: the user queries a web archive directly, or retrieves its TimeMap]
27. Web Archive Usage Pattern 2: Web Archive Aggregation
• Better results for a URI due to more sources for capture
28. Previous Patterns: Status Quo
• Patterns 1 and 2 are the status quo
– Provided by the framework
• Querying web archives currently only considers public web content
– URI for lookup
• Framework introduces 2 new entities
– Memento Meta Aggregator (MMA)
– Private Web Archive Adapter (PWAA)
29. Memento Meta Aggregator (MMA)
• Functional superset of the Memento Aggregator (MA)
• Can act as an intermediary client to relay MA results to the ultimate user
• Allows just-in-time (JIT) inclusion of archives
– As specified at query time
• Set of archives aggregated can be dynamic
– e.g., results must not include IA
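The JIT inclusion and dynamic exclusion described above could be sketched as query-time selection of per-archive TimeMap endpoints. This is an illustrative sketch, not the author's implementation; the archive names, the `mine` endpoint, and the exact TimeMap URL layouts are assumptions:

```python
# Hypothetical MMA configuration: base TimeMap endpoints per archive.
DEFAULT_ARCHIVES = {
    "ia": "http://web.archive.org/web/timemap/link/",
    "archive-it": "http://wayback.archive-it.org/all/timemap/link/",
}

def mma_endpoints(uri_r, include=None, exclude=()):
    """Resolve the set of per-archive TimeMap endpoints for one query.

    `include` adds archives just-in-time; `exclude` removes them
    dynamically (e.g., 'results must not include IA')."""
    selected = dict(DEFAULT_ARCHIVES)
    if include:
        selected.update(include)      # JIT inclusion of extra archives
    for name in exclude:
        selected.pop(name, None)      # dynamic exclusion
    return [base + uri_r for base in selected.values()]

# Exclude the Internet Archive, include a personal archive:
eps = mma_endpoints(
    "http://cnn.com/",
    include={"mine": "http://users2machine.local/timemap/link/"},
    exclude=["ia"],
)
```

The MMA would then fetch each endpoint, merge the TimeMaps, and relay the aggregate to the user, exactly as a plain MA does for its fixed archive set.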
30. Aggregating My Captures
[Diagram: various public web archives alongside my web archives ("My CNN Captures", "My Bank Captures")]
31. The Current Memento Aggregator
[Diagram: the aggregator draws on public archives holding 100, 30, and 10 captures; "My CNN Captures" and "My Bank Captures" sit outside it]
32. Accessing the Aggregator
[Diagram: the user queries the aggregator, which consults public archives holding 100, 30, and 10 captures]
33. Accessing the Aggregator …does not include our archives
[Diagram: the aggregator returns 140 captures (100 + 30 + 10) from the public archives; "My CNN Captures" and "My Bank Captures" are NOT AGGREGATED]
34. Pattern 3: Aggregator Relay – Access via the Meta Aggregator
[Diagram: the MMA relays the aggregator's 140 captures to the user]
35. Web Archive Usage Pattern 4: Including Additional Archives in Aggregation
[Diagram: access via the Meta Aggregator allows our archives to be included – 140 public captures plus 15 of ours yield 155]
36. MMAs Allow Our Public Captures to be Shared
[Diagram: the MMA's combined TimeMap of 155 captures can be relayed to other users]
37. Web Archive Usage Pattern 5: Recursive MMA Access
[Diagram: MMAs are composed recursively – an inner MMA contains archives C and D, the next contains B, C, and D, and the outermost contains A, B, C, and D (drawing on Bob's public captures and the organization's public captures 1 and 2), with capture counts accumulating at each level (e.g., 15 + 20 = 35; 35 + 15 = 50)]
38. New Framework Entity 1: Memento Meta Aggregator
• Allows a dynamic and JIT set of archives
• Superset can be recursively constructed
• Sets can be shared
"My public captures can be integrated with public web archives'"
39. Private Web Archive Adapter (PWAA)
• Regulates access to Private Web Archives (PWAs)
• Acts as token authorizer
• With credentials OK, relays results as if querying the PWA directly
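The PWAA's two roles (token authorizer, then relay) can be sketched as below. This is an illustrative sketch, not the author's implementation; the class shape and token format are assumptions, mirroring the key/token exchange shown on the following slides:

```python
import secrets

class PWAA:
    """Sketch of a Private Web Archive Adapter: exchanges a pre-shared
    key for a reusable token, then relays TimeMap queries to the PWA
    only when a valid token is presented."""

    def __init__(self, key, archive):
        self._key = key          # credential established with the PWA owner
        self._tokens = set()
        self._archive = archive  # callable: uri_r -> list of mementos

    def get_token(self, key):
        """GET TOKEN for PWA: return a reusable token, or None if denied."""
        if key != self._key:
            return None
        token = secrets.token_hex(8)
        self._tokens.add(token)
        return token

    def query(self, token, uri_r):
        """GET mementos for URI: relay to the PWA only for valid tokens;
        an invalid token yields 0 mementos rather than an error."""
        if token not in self._tokens:
            return []
        return self._archive(uri_r)

pwaa = PWAA("abcd1234", lambda uri: ["m1", "m2", "m3"])  # 3 private captures
tok = pwaa.get_token("abcd1234")
```

Returning an empty result for a bad token (instead of UNAUTHORIZED) is one of the security trade-offs raised later under Future Research Questions.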
40. User Establishes Access with PWA
[Diagram: the user sends "GET TOKEN for PWA, Key: abcd1234"; the private archives hold 3 and 10,000 captures, the public archives 100, 30, and 10]
41. MMA Relays Request
[Diagram: the MMA relays "GET TOKEN for PWA, Key: abcd1234" to the PWAA]
42. PWAA Accepts Request, Generates Reusable Token
[Diagram: the PWAA responds "ACCESS OK, Token: 4f33c64"]
43. User Submits Request for URI-R with Token
[Diagram: the user sends "GET mementos for URI, Token: 4f33c64"]
44. MMA Relays Request (again)
[Diagram: the MMA forwards "GET mementos for URI, Token: 4f33c64" to the PWAA]
45. PWAA Verifies & Relays Request; MA Gets Mementos, per Usual
[Diagram: the PWAA checks the token (OK) and relays "GET mementos for URI" to the private archive, while the MA issues "GET mementos for URI" to the public archives as usual]
46. Archives Return Mementos
[Diagram: with the token verified (OK), the public and private archives return their mementos for the URI]
47. PWAA Relays TimeMap
[Diagram: TimeMaps flow back to the MMA – 140 captures from the public archives plus 3 and 10,000 from the private archives, 10,143 in total]
48. MMA Annotates and Aggregates
[Diagram: the MMA combines the TimeMaps – 140 public captures plus the 3 and 10,000 private captures – into a single aggregate of 10,143]
49. Web Archive Usage Pattern 6: Aggregating Public & Private Archives
[Diagram: the final TimeMap contains 10,143 captures drawn from both public and private archives]
50. Regulated Access Can Be Shared
[Diagram: a second user queries with their own token ("GET mementos for URI, Token: c5463b4"); a third user presenting a bad key ("GET TOKEN for PWA, Key: 2265eef3") gets no/invalid token returned and is denied access or sees 0 mementos]
51. Aggregating Multiple PWAs
[Diagram: the user requests tokens for several PWAs – "Key: abcd1234, Archive: My", "Key: cab45cbf, Archive: Linda", "Key: b0b01b, Archive: Bob" – covering private archives of 3, 5, and 10 captures]
52. Aggregating Multiple PWAs
[Diagram: My archive and Bob's grant access ("Access OK, Token: 7790ca" and "Access OK, Token: b0b01b"); Linda's returns ACCESS DENIED]
53. PWAs Can Then be Aggregated
[Diagram: "GET mementos for URI" with "Token: 7790ca, Archive: My", "Token: null, Archive: Linda", and "Token: b0b01b, Archive: Bob" yields 3 + 10 = 13 mementos; Linda's 5 captures are excluded]
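The multi-PWA aggregation above can be sketched as: one token per private archive, with a null or invalid token simply contributing 0 mementos from that archive. The token values and capture counts mirror the diagram; everything else (function names, data shapes) is an assumption:

```python
def pwa_query(token, valid_token, captures):
    """One PWA behind a PWAA: returns its captures only for a valid token."""
    return captures if token == valid_token else []

archives = {
    "my":    dict(valid="7790ca", captures=["m1", "m2", "m3"]),             # 3
    "linda": dict(valid="5eef31", captures=["l1", "l2", "l3", "l4", "l5"]), # 5
    "bob":   dict(valid="b0b01b", captures=[f"b{i}" for i in range(10)]),   # 10
}
# Linda's PWAA denied access, so no token was obtained for her archive.
tokens = {"my": "7790ca", "linda": None, "bob": "b0b01b"}

combined = [m for name, a in archives.items()
            for m in pwa_query(tokens.get(name), a["valid"], a["captures"])]
# 3 + 10 = 13 mementos; Linda's 5 captures are excluded.
```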
54. Sample TimeMap
...
, <http://web.archive.org/web/20150228155703/https://facebook.com/>;rel="memento"; datetime="Sat, 28 Feb 2015 15:57:03 GMT"
, <http://web.archive.org/web/20150228163939/http://www.facebook.com/>;rel="memento"; datetime="Sat, 28 Feb 2015 16:39:39 GMT"
, <http://web.archive.org/web/20150303162841/https://www.facebook.com/>;rel="memento"; datetime="Tue, 03 Mar 2015 16:28:41 GMT"
, <http://users2machine.local/web/20150305000101/https://www.facebook.com/>;rel="memento"; datetime="Thu, 05 Mar 2015 00:01:00 GMT"; key="e395935019ee467c797034ee410cc91e"
, <//wayback.archive-it.org/all/20150305215922/https://facebook.com/>;rel="memento"; datetime="Tue, 05 Mar 2015 21:59:22 GMT"
, <http://previouslyUnaggregated.org/web/20150306123457/https://www.facebook.com/>;rel="memento"; datetime="Wed, 06 Mar 2015 12:34:57 GMT"
, <http://web.archive.org/web/20150310140721/https://www.facebook.com/>;rel="memento"; datetime="Tue, 10 Mar 2015 14:07:21 GMT"
...
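Each line of the sample TimeMap is a Link-format (RFC 8288) entry; the `key` attribute is not standard Link syntax but the extension this framework adds for private mementos. A sketch of parsing one such entry (the parsing approach is illustrative, not the author's code):

```python
import re

timemap_entry = (
    '<http://users2machine.local/web/20150305000101/https://www.facebook.com/>'
    ';rel="memento"; datetime="Thu, 05 Mar 2015 00:01:00 GMT"'
    '; key="e395935019ee467c797034ee410cc91e"'
)

def parse_entry(entry):
    """Parse one Link-format TimeMap entry into a dict of its
    <URI> target and its name="value" attributes."""
    uri = re.match(r"<([^>]+)>", entry).group(1)
    params = dict(re.findall(r'(\w+)="([^"]*)"', entry))
    params["uri"] = uri
    return params

m = parse_entry(timemap_entry)
# m["key"] identifies this as an access-controlled private memento.
```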
55. Access Token Included in TimeMap
[Same TimeMap as on slide 54; the users2machine.local memento – MY PRIVATE FACEBOOK CAPTURES – carries the access attribute key="e395935019ee467c797034ee410cc91e"]
56. My Public Web Archive, Now Aggregated
[Same TimeMap; the previouslyUnaggregated.org entry – MY PUBLIC FACEBOOK CAPTURES – is now included in the aggregation]
57. Evaluation Plan
• How effective is the framework?
• What are the scalability ramifications of the additional infrastructure?
• Is public-private tokenization the most suitable method for persistent access?
• How can a single archive be sub-divided between private/public and access controlled?
58. Previous Work
PUBLICATIONS
Preservation and Replay
• PDA 2013 - Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
• JCDL 2012 - WARCreate - Create Wayback-Consumable WARC Files from Any Webpage
Evaluating Capture
• IJDL 2015 - Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources
• IJDL 2015 - The Impact of JavaScript on Archivability
• JCDL 2014 - Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources
• JCDL 2014 - The Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript
• D-Lib 2013 - A Method for Identifying Personalized Representations in the Archives
• TPDL 2013 - On the Change in Archivability of Websites Over Time
Archival Integration
• JCDL 2015 - Mobile Mink: Merging Mobile and Desktop Archived Webs
• JCDL 2014 - Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento
SOFTWARE PRODUCTS
• WARCreate – preserve from the browser
• WAIL – private web archiving all-in-one suite
• Mink – integrate the live and archived web
59. Current Work
• Other approaches of archival lookup beyond URI
• Appropriate metadata to indicate private web content in WARC files
• Existing integration attempts by private web archives & individuals
60. Dissertation Plan
✓ Background Research
✓ PhD Requirements (Coursework, Qualifying Exam, etc.)
✓ Build preliminary framework model
☐ JCDL Doctoral Consortium
EXTENDED RESEARCH
• Research prevalence of private web archives
• Research access control methods in web archiving and other domains
• Investigate other access patterns and expound on those defined
• PhD Candidacy Exam describing merit of research plan
• Implement feedback received from candidacy exam committee
• Programmatically implement MMA and PWAA
CASE STUDIES (real-world application)
• Publicly Available Non-Aggregated Archives (e.g., Rhizome)
• Deep web preservation/access (bank account/Facebook feeds)
• DISSERTATION DEFENSE
61. Preliminary Publication Plan
• JCDL 2016 – Evaluation of User Access Patterns for Private Web Archives
• TPDL 2016 – Methods in adding JIT Inclusion of Private Web Archives in Memento
• ACM SACMAT* – Research exploring tokenization and similar methods for archival access establishment
• iPres 2016 – Research investigating URI clash & other needed identifiers for distinguishing archived content from the "deep web" with archived content from the public live web
* Symposium on Access Control Models and Technologies
62. Future Research Questions
• Can a PWAA perform content negotiation [1] on the private-public spectrum?
• What level of security is needed?
– e.g., reporting UNAUTHORIZED vs. 0 mementos
[1] RFC 2295, https://www.ietf.org/rfc/rfc2295.txt
63. Summation
• Why?
– No means exists to integrate private and public web archives.
• How to Evaluate?
– Does this framework fit real-world needs? Is it scalable?
• When will I know I am done?
– Any public/private web archive* can be integrated.
* -compliant
64. References
• D. Abrams, R. Baecker, and M. Chignell. Information Archiving with Bookmarks: Personal Web Space Construction and Archiving. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 41–48, 1998.
• A. AlSum, M. Weigle, M. Nelson, and H. Van de Sompel. Profiling Web Archive Coverage for Top-Level Domain and Content Language. International Journal on Digital Libraries, 14(3-4):149–166, 2014.
• J. F. Brunelle, M. Kelly, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources. In Proceedings of JCDL, pages 321–330, London, England, 2014.
• J. F. Brunelle, M. Kelly, M. C. Weigle, and M. L. Nelson. The Impact of JavaScript on Archivability. International Journal on Digital Libraries, pages 1–23, 2015.
• J. F. Brunelle and M. L. Nelson. An Evaluation of Caching Policies for Memento TimeMaps. In Proceedings of JCDL, pages 267–276, 2013.
• D. Gomes, S. Freitas, and M. J. Silva. Design and Selection Criteria for a National Web Archive. In Research and Advanced Technology for Digital Libraries, pages 196–207. Springer, 2006.
• D. Hardt. The OAuth 2.0 Authorization Framework. IETF RFC 6749, October 2012.
• M. Jones and D. Hardt. The OAuth 2.0 Authorization Framework: Bearer Token Usage. IETF RFC 6750, October 2012.
• M. Kelly, J. F. Brunelle, M. C. Weigle, and M. L. Nelson. A Method for Identifying Personalized Representations in the Archives. D-Lib Magazine, 19(11/12), Nov/Dec 2013.
• M. Kelly, J. F. Brunelle, M. C. Weigle, and M. L. Nelson. On the Change in Archivability of Websites Over Time. In Proceedings of the International Conference on Theory and Practice of Digital Libraries (TPDL), pages 35–47, Valletta, Malta, 2013.
• M. Kelly, M. L. Nelson, and M. C. Weigle. Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving Using XAMPP. Poster and demo presented at Personal Digital Archiving, February 2013.
• M. Kelly, M. L. Nelson, and M. C. Weigle. The Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript. In Proceedings of JCDL, pages 25–28, London, England, September 2014.
65. References
• M. Kelly and M. C. Weigle. WARCreate - Create Wayback-Consumable WARC Files from Any Webpage. In Proceedings of JCDL, pages 437–438, Washington, DC, June 2012.
• C. C. Marshall. Rethinking Personal Digital Archiving, Part 1. D-Lib Magazine, 14(3/4), Mar/Apr 2008.
• C. C. Marshall. Rethinking Personal Digital Archiving, Part 2. D-Lib Magazine, 14(3/4), Mar/Apr 2008.
• J. Niu. Functionalities of Web Archives. D-Lib Magazine, 18(3/4), Mar/Apr 2012.
• M. Phillips. PANDORA, Australia's Web Archive, and the Digital Archiving System that Supports It. http://pandora.nla.gov.au/pandas.html, 2003.
• H. C.-H. Rao, Y.-F. Chen, and M.-F. Chen. A Proxy-based Personal Web Archiving Service. SIGOPS Oper. Syst. Rev., 35(1):61–72, Jan. 2001.
• A. Rauber, M. Kaiser, and B. Wachter. Ethical Issues in Web Archive Creation and Usage – Towards a Research Agenda. In 8th International Web Archiving Workshop (IWAW08), 2008.
• D. Rosenthal. Re-thinking Memento Aggregation. http://blog.dshr.org/2013/03/re-thinking-memento-aggregation.html, 2013.
• T. Schwarz, M. Baker, S. Bassi, B. Baumgart, W. Flagg, C. van Ingen, K. Joste, M. Manasse, and M. Shah. Disk Failure Investigations at the Internet Archive. In Work-in-Progress session, NASA/IEEE Conference on Mass Storage Systems and Technologies (MSST2006), 2006.
• S. Strodl, F. Motlik, K. Stadler, and A. Rauber. Personal & SOHO Archiving. In Proceedings of JCDL, pages 115–123, 2008.
• M. Thelwall and L. Vaughan. A Fair History of the Web? Examining Country Balance in the Internet Archive. Library & Information Science Research, 26(2):162–176, 2004.
• B. Tofel. "Wayback" for Accessing Web Archives. In 7th International Web Archiving Workshop (IWAW07), 2007.
• H. Van de Sompel, M. Nelson, and R. Sanderson. HTTP Framework for Time-Based Access to Resource States – Memento. IETF RFC 7089, December 2013.
• T. Wang, M. Srivatsa, and L. Liu. Fine-Grained Access Control of Personal Data. In Proceedings of the 17th ACM Symposium on Access Control Models and Technologies, pages 145–156, 2012.
66. A Framework for Aggregating
Private and Public Web Archives
Mat Kelly
Old Dominion University, Norfolk, VA
Advisor: Michele C. Weigle
JCDL 2015 Doctoral Consortium
June 21, 2015