A Framework for Aggregating Public and Private Web Archives

Presented at the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) 2018

Published in: Education

  1. A Framework for Aggregating Private and Public Web Archives. Mat Kelly, Michael L. Nelson, and Michele C. Weigle. Old Dominion University, Web Science & Digital Libraries Research Group. {mkelly, mln, mweigle}@cs.odu.edu • @machawk1 • @WebSciDL. Joint Conference on Digital Libraries (JCDL), June 5, 2018, Fort Worth, TX
  2. @machawk1 A Framework for Aggregating Private and Public Web Archives JCDL 2018 • June 5, 2018 • Fort Worth, TX Web Archiving
  3. Personal + Private Web Archiving
  4. Aggregation for a Better Picture: captures at tA, at tC, and at tD→Z
  5. Today’s Memento Aggregation: Archives Queried (A0)
  6. Motivation: Archives Queried (A0) > Include personal archives > Include other non-aggregated archives
  8. Rapidly Changing Pages May Not Be Comprehensively Captured
  9. Missed Changes
  10. Archiving More Archives Provides a Better Picture of the Web
  11. Private & Public Archives May Differ for the Same URI
  12. Should Public Archives Really Capture the Private Web?
  13. A Framework for Aggregating Private and Public Web Archives
  14. Outline ● Background and Related Work ● Memento Aggregation State of the Art ● More Expressive TimeMaps ● Query Precedence and Short-Circuiting ● Mementities & Mementity Dynamics ● Future Work and Conclusions
  16. Background. Source: Memento Guide: Introduction. http://www.mementoweb.org/guide/quick-intro/, January 2015.
  17. Memento Request Example: request cnn.com as of Sept 11, 2001 at 9am EDT. HTTP Request to the TimeGate (URI-G) • GET: http://web.archive.org/web/http://www.cnn.com • Accept-Datetime: Tue, 11 Sep 2001 13:00:00 GMT. HTTP Response (302) • Location: http://web.archive.org/web/20010911200318/http://www.cnn.com:80/ • Memento-Datetime: Tue, 11 Sep 2001 20:03:18 GMT • Link: Diagram: URI-G (TimeGate), URI-R (original resource), URI-T (TimeMap), and URI-M (memento), connected by the original, timegate, timemap, and memento relations.
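The datetime negotiation on this slide can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names `accept_datetime_header` and `memento_offset_seconds` are assumptions made for this example, and no network request is issued. It shows how the Accept-Datetime value is formed and how far the returned memento lies from the requested time.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(moment: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 7089 Accept-Datetime value."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def memento_offset_seconds(requested: str, memento_datetime: str) -> float:
    """Seconds between the requested datetime and the memento's capture time."""
    delta = parsedate_to_datetime(memento_datetime) - parsedate_to_datetime(requested)
    return delta.total_seconds()


# Values taken from the slide: 13:00:00 GMT requested, 20:03:18 GMT returned.
requested = accept_datetime_header(datetime(2001, 9, 11, 13, 0, 0, tzinfo=timezone.utc))
request_headers = {"Accept-Datetime": requested}
offset = memento_offset_seconds(requested, "Tue, 11 Sep 2001 20:03:18 GMT")
```

A client would send `request_headers` with a GET to the URI-G and follow the 302 to the URI-M in `Location`; here the nearest capture is a little over seven hours after the requested time.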
  18. Dereferencing a TimeMap at URI-T: Request URI-T ● Date-based pagination ● Other formats for TimeMap. Diagram: the TimeMap at URI-T (paginated as URI-T 1 ... URI-T t) lists URI-M 1 (first memento) ... URI-M m (memento) ... URI-M n (last memento) and links to URI-G and URI-R.
  19. Background - Privacy and Security ● Web users question trusting institutions to preserve private Web contents [1] ● OAuth 2.0 (RFC 6749) facilitates authentication cohesion of entities. [1] Marshall and Shipman, “On the Institutional Archiving of Social Media”, JCDL 2012
  21. Memento Aggregation State of the Art
  22. Memento Aggregation - MementoWeb. Also available via CLI: $ curl http://timetravel.mementoweb.org/timemap/link/http://nasa.gov
  23. Memento Aggregation - MemGator* ● Open Source Memento Aggregator - github.com/oduwsdl/memgator ● Easy personal/local deployment ● Specify archive list on launch ○ Easily configurable JSON ○ Use default collection if not specified ● TimeMap Formats: ○ Link ○ JSON ○ CDXJ. * Alam and Nelson, “MemGator - A Portable Concurrent Memento Aggregator: Cross-Platform CLI and Server Binaries in Go”, JCDL 2016
  24. CDXJ: An Alternative TimeMap Format. Callouts on both TimeMaps: Original URI (URI-R), Other TimeMaps (URI-Ts), TimeGate (URI-G), Relative Relations. See Alam, “CDXJ: An Object Resource Stream Serialization Format”, 2015
      Link (RFC 7089) TimeMap:
        <http://matkelly.com>; rel="original",
        <http://localhost:1208/timemap/link/http://matkelly.com>; rel="self"; type="application/link-format",
        <http://web.archive.org/web/20060514123511/http://www.matkelly.com:80/>; rel="first memento"; datetime="Sun, 14 May 2006 12:35:11 GMT",
        <http://web.archive.org/web/20060516213852/http://www.matkelly.com/>; rel="memento"; datetime="Tue, 16 May 2006 21:38:52 GMT",
        ...
        <http://web.archive.org/web/20180128152125/http://matkelly.com>; rel="memento"; datetime="Sun, 28 Jan 2018 15:21:25 GMT",
        <http://web.archive.org/web/20180319141920/http://matkelly.com/>; rel="last memento"; datetime="Mon, 19 Mar 2018 14:19:20 GMT",
        <http://localhost:1208/timemap/link/http://matkelly.com>; rel="timemap"; type="application/link-format",
        <http://localhost:1208/timemap/json/http://matkelly.com>; rel="timemap"; type="application/json",
        <http://localhost:1208/timemap/cdxj/http://matkelly.com>; rel="timemap"; type="application/cdxj+ors",
        <http://localhost:1208/timegate/http://matkelly.com>; rel="timegate"
      CDXJ TimeMap:
        !context ["http://tools.ietf.org/html/rfc7089"]
        !id {"uri": "http://localhost:1208/timemap/cdxj/http://matkelly.com"}
        !keys ["memento_datetime_YYYYMMDDhhmmss"]
        !meta {"original_uri": "http://matkelly.com"}
        !meta {"timegate_uri": "http://localhost:1208/timegate/http://matkelly.com"}
        !meta {"timemap_uri": {"link_format": "http://localhost:1208/timemap/link/http://matkelly.com", "json_format": "http://localhost:1208/timemap/json/http://matkelly.com", "cdxj_format": "http://localhost:1208/timemap/cdxj/http://matkelly.com"}}
        20060514123511 {"uri": "http://web.archive.org/web/20060514123511/http://www.matkelly.com:80/", "rel": "first memento", "datetime": "Sun, 14 May 2006 12:35:11 GMT"}
        20060516213852 {"uri": "http://web.archive.org/web/20060516213852/http://www.matkelly.com/", "rel": "memento", "datetime": "Tue, 16 May 2006 21:38:52 GMT"}
        ...
        20180128152125 {"uri": "http://web.archive.org/web/20180128152125/http://matkelly.com", "rel": "memento", "datetime": "Sun, 28 Jan 2018 15:21:25 GMT"}
        20180319141920 {"uri": "http://web.archive.org/web/20180319141920/http://matkelly.com/", "rel": "last memento", "datetime": "Mon, 19 Mar 2018 14:19:20 GMT"}
  26. More Expressive TimeMaps ● Memento Quality (e.g., Damage) [1] ● How Many Captures? [2] ● How Many Are Identical? [2, 3] ● Other Attributes of Mementos... [1] Brunelle et al., JCDL 2014, IJDL 2015 [2] Kelly et al., JCDL 2017 [3] AlSum and Nelson, ECIR 2014
  27. Additional TimeMap Attributes: Content-based Attributes, Derived Attributes, Access Attributes
  28. TimeMap Enrichment: Content-Based Attributes ● Status Code [1] ● Content-Digest ○ In WARC & CDX ○ Not all archives expose CDX ● Would allow more info about mementos without requiring comprehensive dereferencing. [1] Kelly et al., “Impact of URI Canonicalization on Memento Count”, JCDL 2017, arXiv 1703.03302
  29. TimeMap Enrichment: Derived Attributes ● Thumbnails (e.g., via SimHash) [1] ○ Calculation based on root memento’s HTML ● Memento Damage (JCDL 2014, IJDL) [2] ○ Requires dereferencing embedded resources. Figure: apple.com, many duplicate mementos! [1] AlSum and Nelson, “Thumbnail Summarization Techniques for Web Archives”, ECIR 2014, pp. 299-310. [2] Brunelle et al., “The Impact of JavaScript on Archivability,” IJDL, 17(2), pp. 95-117. January 2016.
  30. TimeMap Enrichment: Access Attributes ● How to distinguish private captures in a TimeMap? Diagram: the TimeMap at URI-T (paginated as URI-T 1 ... URI-T t) lists URI-M 1 (first memento) ... URI-M m, URI-M m+1 ... URI-M n (last memento) and links to URI-G and URI-R.
  31. TimeMap Enrichment - in a CDXJ TimeMap (line breaks added for clarity; CDXJ records occupy a single line). Callouts: Content-based attributes (status_code, digest), Derived attributes (damage, simhash), Access attributes (access)
      19981212013921 {
        "uri": "http://localhost:8080/20101116060516/http://facebook.com/",
        "rel": "memento",
        "datetime": "Tue, 16 Nov 2010 06:05:16 GMT",
        "status_code": 200,
        "digest": "sha1:LK26DRRQJ4WATC6LBVF3B3Z4P2CP5ZZ7",
        "damage": 0.24,
        "simhash": "6551110622422153488",
        "content-language": "en-US",
        "access": { "type": "Blake2b", "token": "c6ed419e74907d220c69858614d86...ef0a3a88a41" }
      }
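A minimal sketch of how a client might read such an enriched CDXJ TimeMap, assuming each line is either a `!`-prefixed directive or a sort key followed by a single-line JSON object, as on the slides. The function name `parse_cdxj_line` is hypothetical, and treating the presence of an `access` block as the marker of a private capture is an assumption of this example, not something the slides specify.

```python
import json


def parse_cdxj_line(line: str):
    """Split one CDXJ line into (kind, key, payload).

    Directive lines start with "!" (for example "!meta {...}"); every other
    line is a record: a sort key, a space, then a JSON object.
    """
    head, _, payload = line.strip().partition(" ")
    if head.startswith("!"):
        return ("directive", head[1:], json.loads(payload))
    return ("record", head, json.loads(payload))


# The enriched record from the slide, joined onto one line. The token is
# truncated on the slide and is left truncated here.
record_line = (
    '19981212013921 {"uri": "http://localhost:8080/20101116060516/http://facebook.com/", '
    '"rel": "memento", "datetime": "Tue, 16 Nov 2010 06:05:16 GMT", "status_code": 200, '
    '"digest": "sha1:LK26DRRQJ4WATC6LBVF3B3Z4P2CP5ZZ7", "damage": 0.24, '
    '"simhash": "6551110622422153488", "content-language": "en-US", '
    '"access": {"type": "Blake2b", "token": "c6ed419e74907d220c69858614d86...ef0a3a88a41"}}'
)

kind, key, attrs = parse_cdxj_line(record_line)
needs_token = "access" in attrs  # assumption: an access block marks a private capture
directive = parse_cdxj_line('!keys ["memento_datetime_YYYYMMDDhhmmss"]')
```

Because the extra attributes ride along in the JSON object, a client that only understands `uri`, `rel`, and `datetime` can ignore them without breaking.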
  33. Query Precedence: “Check my archive first, then Carol’s, then all public archives.” ● More control of querying in series and parallel. See Atkins, “Paywalls in the Internet Archive”, March 2018
  37. Query Short-Circuiting: “Check private archives first. Iff you find no captures, only then check the public archives.” ● May give priority to archive relevancy. ● Series halts when threshold is met.
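Query precedence and short-circuiting can be sketched together. The sketch below assumes archives are grouped into ordered tiers: tiers are queried in series, archives within a tier in parallel, and the series halts once a capture threshold is met. The function `query_with_precedence`, the `stub_archive` helper, and the capture identifiers are all hypothetical stand-ins rather than an existing aggregator's API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Sequence

Archive = Callable[[str], List[str]]  # URI-R in, list of URI-Ms out


def query_with_precedence(uri_r: str,
                          tiers: Sequence[Sequence[Archive]],
                          threshold: Optional[int] = None) -> List[str]:
    """Query tiers in series and the archives inside each tier in parallel.

    When threshold is set, stop after the first tier that brings the running
    total of captures up to the threshold (short-circuiting).
    """
    found: List[str] = []
    for tier in tiers:
        with ThreadPoolExecutor(max_workers=max(1, len(tier))) as pool:
            for captures in pool.map(lambda archive: archive(uri_r), tier):
                found.extend(captures)
        if threshold is not None and len(found) >= threshold:
            break
    return found


def stub_archive(name: str, holdings: Dict[str, List[str]]) -> Archive:
    """Hypothetical in-memory archive used only to exercise the sketch."""
    return lambda uri_r: [f"{name}/{ts}/{uri_r}" for ts in holdings.get(uri_r, [])]


alice = stub_archive("alice", {"http://cnn.com": ["20180101000000"]})
carol = stub_archive("carol", {})
public = stub_archive("ia", {"http://cnn.com": ["20010911200318"],
                             "http://nasa.gov": ["20170101000000"]})

# "Check my archive first, then Carol's, then all public archives."
in_order = query_with_precedence("http://cnn.com", [[alice], [carol], [public]])
# "Check private archives first. Iff you find no captures, check the public archives."
private_hit = query_with_precedence("http://cnn.com", [[alice, carol], [public]], threshold=1)
fell_through = query_with_precedence("http://nasa.gov", [[alice, carol], [public]], threshold=1)
```

Setting `threshold=1` reproduces the "iff you find no captures" behavior; a larger threshold would keep descending through tiers until enough captures are collected.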
  40. Mementities ● Memento + Entity (entity term already overused). Figure legend: conventional Memento mementities (e.g., TimeGate) versus those introduced in this framework
  41. Memento Meta-Aggregator (MMA) (figure label: functional ⊇)
  42. MMA: Archive Selection
  43. MMA: User-Driven Archival Specification
  44. MMA Aggregation sources ● MMAα: from MA2, MA1 and WA6 ● MMAβ: from WA7 and WA8 ● MMAγ: from MMAβ, MA5, and WA1
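Since an MMA can list another MMA among its sources, the set of endpoints it ultimately queries can be found by expanding the chain. The sketch below uses the three MMAs from the slide; the dictionary layout and the function `expand_sources` are assumptions for illustration, and the Memento aggregators (MA) are treated as opaque leaves because the slide does not list what they aggregate.

```python
from typing import Dict, List, Tuple

# Source lists from the slide, keyed by hypothetical ASCII names.
SOURCES: Dict[str, List[str]] = {
    "MMA_alpha": ["MA2", "MA1", "WA6"],
    "MMA_beta": ["WA7", "WA8"],
    "MMA_gamma": ["MMA_beta", "MA5", "WA1"],
}


def expand_sources(name: str,
                   sources: Dict[str, List[str]],
                   path: Tuple[str, ...] = ()) -> List[str]:
    """Return the leaf sources reachable from name, in first-seen order.

    Anything without an entry in sources is treated as a leaf. A chain that
    loops back on itself raises ValueError instead of recursing forever.
    """
    if name in path:
        raise ValueError("cycle in MMA chain: " + " -> ".join(path + (name,)))
    if name not in sources:
        return [name]
    leaves: List[str] = []
    for child in sources[name]:
        for leaf in expand_sources(child, sources, path + (name,)):
            if leaf not in leaves:
                leaves.append(leaf)
    return leaves


gamma_leaves = expand_sources("MMA_gamma", SOURCES)
```

The cycle check matters in practice: if two parties each add the other's MMA as a source, a naive aggregator would query in a loop.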
  45. MMA Dynamics By-Example ● Personal Archive Aggregation ● MMA Chaining ● Client-Side Aggregation Preference. Cast: Alice, Bob, Carol, Mallory
  46. MMA Dynamics - Personal Archive Aggregation: FB, bank, flickr, bbc, homepage, public videos
  47. Personal Archive Aggregation: Alice Saves the Web
  48. Personal Archive Aggregation: Alice Wants to See Her Captures Temporally Inline (at tA, at tD→Z)
  49. Personal Archive Aggregation: Mementity Dynamics - Alice & Her Archives (WAA)
  50. Alice Deploys MMAA
  51. MMA Chaining: Carol Asks MMAA for CNN
  52. MMA Chaining: MMAA returns CNN Memento {MA, MIA}
  53. MMA Chaining: Carol Wants to Aggregate Her Own Captures (M(WAC); at tC, at tD→Z,A)
  54. MMA Chaining: Carol Creates MMAC to Access WAC and MMAA
  55. MMA Chaining: Carol Asks MMAC For CNN
  56. MMAC returns CNN Memento {MA, MIA, MC}
  57. Client-Side Aggregation Preference: Bob May Request M(CNN) From MMAA or MMAC
  58. Client-Side Aggregation Preference: Bob Prefers to Exclude IA Captures ...and does not want to set up his own MMA
  59. Client-Side Aggregation Preference: Bob Requests Supported Archives (GET /archives/)
  60. Client-Side Aggregation Preference: Bob Customizes the Set in the JSON
  61. Client-Side Aggregation Preference: Bob Requests CNN for His Custom Set
  62. Client-Side Aggregation Preference: MMA Complies or Ignores Preference
  63. Hooray, Aggregation!
  65. Hooray, Aggregation! HTTP 401: Consult URI-P
  66. Private Web Archive Adapter (PWAA) ● Auth layer to encourage private Web archive aggregation ● Typical OAuth 2.0 flow ● Auth role cohesive to PWAA ● Persistent access through tokenization
  67. PWAA - Sharing Tokens
  68. PWAA - Previously Authorized
  69. PWAA - Unauthorized Request
  70. PWAA - Sharing Tokens
  71. Alice Passes Associative Token to MMA
  72. MMA requests URI-R... ...relays token where applicable
  73. Private Archive Validates with PWAA
  74. PWAA Confirms Token
  75. Private Archive Returns Captures
  76. MMA Aggregates, Associates Token
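The token relay walked through on the last few slides can be approximated with an in-memory simulation. Everything here is a hypothetical stand-in: the classes `PWAA` and `PrivateArchive`, the function `mma_aggregate`, and the token format are assumptions, and a real deployment would use an OAuth 2.0 flow over HTTP rather than direct method calls.

```python
import secrets
from typing import Dict, List, Optional


class PWAA:
    """Hypothetical adapter that issues tokens and confirms them for an archive."""

    def __init__(self) -> None:
        self._grants: Dict[str, str] = {}  # token -> archive id

    def issue_token(self, archive_id: str) -> str:
        token = secrets.token_hex(16)
        self._grants[token] = archive_id
        return token

    def confirm(self, token: str, archive_id: str) -> bool:
        return self._grants.get(token) == archive_id


class PrivateArchive:
    """Hypothetical private archive that validates every request with its PWAA."""

    def __init__(self, archive_id: str, pwaa: PWAA, holdings: Dict[str, List[str]]) -> None:
        self.archive_id, self.pwaa, self.holdings = archive_id, pwaa, holdings

    def timemap(self, uri_r: str, token: Optional[str] = None) -> List[str]:
        if token is None or not self.pwaa.confirm(token, self.archive_id):
            raise PermissionError("HTTP 401: consult the PWAA")
        return list(self.holdings.get(uri_r, []))


def mma_aggregate(uri_r: str, public_captures: List[str],
                  private_archives: List[PrivateArchive],
                  token: Optional[str] = None) -> List[str]:
    """Merge public captures with whatever the relayed token unlocks."""
    merged = list(public_captures)
    for archive in private_archives:
        try:
            merged.extend(archive.timemap(uri_r, token))
        except PermissionError:
            continue  # unauthorized: fall back to the public captures only
    return merged


pwaa = PWAA()
alice_archive = PrivateArchive("WA_A", pwaa, {"http://facebook.com/": ["private-capture-1"]})
alice_token = pwaa.issue_token("WA_A")
with_token = mma_aggregate("http://facebook.com/", ["public-capture-1"], [alice_archive], alice_token)
without_token = mma_aggregate("http://facebook.com/", ["public-capture-1"], [alice_archive])
```

Skipping an archive that rejects the token keeps the aggregate useful for unauthenticated clients, who simply see the public subset.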
  78. StarGate (TimeGate functional ⊆ StarGate) ● Content negotiation in Web archives beyond time ● “Star” ~ wildcard (*) → any dimension of negotiation ● Allow for queries like: Only show me mementos… ○ That are not redirects (content-based attribute: HTTP Status ≠ 3XX) ○ Of a sufficient quality (derived attribute: Memento Damage < 0.4) ○ That are from personal Web archives (access attribute indicates Facebook.com memento is not a login page)
  79. Implicit Filtering via MMA or Directly (a la TG)
  80. Negotiation in the Privacy Dimension (via short circuiting): Get URI-Ms for URI-R only from personal Web archives (privateOnly; access attribute)
  81. Negotiation on Content-Based or Derived Attributes (with response filtering): Get URI-Ms for URI-R of good quality that are unique (MD < 0.25, unique(simhash); derived attribute & content-based attribute). Retrieved TimeMap → abbreviated TimeMap with filtering applied
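The response filtering on this slide (MD < 0.25, unique(simhash)) can be sketched as a pass over already-parsed TimeMap records. The function `filter_timemap`, the record dictionaries, and the memento labels are hypothetical; dropping records that lack a `damage` value and keeping the first memento for each SimHash are choices made for this example.

```python
from typing import Dict, List


def filter_timemap(records: List[Dict], max_damage: float = 0.25) -> List[Dict]:
    """Keep mementos with damage below max_damage, one per distinct simhash.

    Records with no damage value are dropped, and the first memento seen for
    each simhash wins; both policies are assumptions of this sketch.
    """
    seen_hashes = set()
    kept = []
    for record in records:
        damage = record.get("damage")
        if damage is None or damage >= max_damage:
            continue
        simhash = record.get("simhash")
        if simhash in seen_hashes:
            continue
        seen_hashes.add(simhash)
        kept.append(record)
    return kept


retrieved = [
    {"uri": "memento-1", "damage": 0.24, "simhash": "6551110622422153488"},
    {"uri": "memento-2", "damage": 0.10, "simhash": "6551110622422153488"},  # duplicate content
    {"uri": "memento-3", "damage": 0.40, "simhash": "1"},                    # too damaged
    {"uri": "memento-4", "damage": 0.05, "simhash": "2"},
]
abbreviated = [record["uri"] for record in filter_timemap(retrieved)]
```

The same function could run inside an MMA or in the client, which mirrors the slide on filtering implicitly via the MMA or directly.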
  82. Future Work and Conclusions • Aggregation with private web archives • Client-side archive specification • TimeMap Caching Ramifications • Authentication layer to systematically interface with private Web archives • Password-less approaches • Archival negotiation in dimensions beyond time • Time/Space Complexity • Elegance of Expression
  83. Ongoing Research Supported By... ❖ NEH grant #HK-50181-14 ❖ IMLS grant #RE-33-16-0107-16 ❖ SIGIR Travel Grant. Some artwork based on Agata Krych CC BY-SA 4.0, derivatives available at https://github.com/machawk1/jcdl2018-artwork