Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving

1,164 views

Published on

InterPlanetary Wayback (IPWB) facilitates permanence and collaboration in web archives by disseminating the contents of WARC files into the InterPlanetary File System (IPFS) network. IPFS is a peer-to-peer content-addressable file system that inherently allows deduplication and facilitates opt-in replication. IPWB splits the header and payload of WARC response records before disseminating into IPFS to leverage the deduplication, builds a CDXJ index with references to the IPFS hashes returns, and combines the headers and payload from IPFS at the time of replay. We also explore the possibility of an index-free, fully decentralized collaborative web archiving system as the next step.

Published in: Science
  • Be the first to comment

  • Be the first to like this

InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving

  1. 1. Sawood Alam, Mat Kelly, Michele C. Weigle, Michael L. Nelson Web Science and Digital Libraries Research Group Old Dominion University Norfolk, Virginia, USA @WebSciDL InterPlanetary Wayback The Next Step Towards Decentralized Web Archiving IPFS Lab Day, Decentralized Web Summit, 2018 San Francisco, CA (USA) August 3, 2018 http://github.com/oduwsdl/ipwb
  2. 2. Content Addressing http://foo.com/spaceDog.jpg http://example.org/yuri.jpg QmZAD4xeeNeYF3TmwWgBXypLKTiCGwGRMXHW7MtheWKtw4 QmZAD4xeeNeYF3TmwWgBXypLKTiCGwGRMXHW7MtheWKtw4 === $ ipfs cat QmZAD4xeeNeYF3TmwWgBXypLKTiCGwGRMXHW7MtheWKtw4 > doge.jpg 2@ibnesayeed
  3. 3. Rendered HTML vs. Source Code 3@ibnesayeed
  4. 4. HTTP Response vs. WARC Record 4 HTTP headers Payload WARC headers @ibnesayeed
  5. 5. How Wayback Machine Works? Archival Indexer Archival Index (e.g., CDXJ) Replay Engine processes Outputs references reads (file, offset) read archived content Present WARC content to user 5@ibnesayeed
  6. 6. Memento: Time Dimension to the Web https://tools.ietf.org/html/rfc7089 6@ibnesayeed
  7. 7. Why IPWB? ● Persistence of archived web dependent on resilience of organizations ● Availability of data is subject to censorship ● Redundancy in web archive files of exact duplicate content ● Lack of public participation in web archiving ● Discoverability issue of small web archives 7@ibnesayeed
  8. 8. Indexing 8@ibnesayeed
  9. 9. WARC Creation HTTP Header & Payload Extraction Push to IPFS Generate CDXJ WARC-CDXJ Correspondence 9@ibnesayeed
  10. 10. HTTP HEADER BLOCK HTTP PAYLOAD BLOCK WARC Creation HTTP Header & Payload Extraction Push to IPFS Generate CDXJ WARC-CDXJ Correspondence 10@ibnesayeed
  11. 11. QmcN9eWwRF73dZj5BgT4x8jeEcFr xurX1hot8QwCbMi9PB Qmczh9YnB4U1ptPeqxcaTZA4aMm uNUswTLTWzXntvbp9sL HEADER DIGEST PAYLOAD DIGEST WARC Creation HTTP Header & Payload Extraction Push to IPFS Generate CDXJ WARC-CDXJ Correspondence 11 HTTP HEADER BLOCK HTTP PAYLOAD BLOCK @ibnesayeed
  12. 12. QmcN9eWwRF73dZj5BgT4x8jeEcFrxurX1hot8QwCb Mi9PB Qmczh9YnB4U1ptPeqxcaTZA4aMmuNUswTLTWzX ntvbp9sL HEADER DIGEST com,example,ipwb)/ 20180802012013 { "locator":"urn:ipfs/QmcN9eWwRF73dZj5BgT4x8jeEcFrxurX1hot8QwCbMi9PB/Qmczh9YnB4U1ptPeqxca TZA4aMmuNUswTLTWzXntvbp9sL", "mime_type": "text/html", "status_code": 200, “other_fields”: “other values...” } // * This is a single-line record, line breaks and indentation are added for readability only PAYLOAD DIGEST CDXJ: http://ws-dl.blogspot.com/2015/09/2015-09-10-cdxj-object-resource-stream.html WARC Creation HTTP Header & Payload Extraction Push to IPFS Generate CDXJ Record WARC-CDXJ Correspondence 12@ibnesayeed
  13. 13. edu,odu,cs)/~salam/dweb/ 20180802012013 { "locator": "urn:ipfs/QmcN9eWwRF73dZj5.../Qmczh9YnB4U1ptPe...", "mime_type": "text/html", "status_code": "200" } edu,odu,cs)/~salam/dweb/style.css 20180802012013 { "locator": "urn:ipfs/QmU1k71bT6ibZBSd.../QmbvUAo9U31wSdvA...", "mime_type": "text/css", "status_code": "200" } edu,odu,cs)/~salam/dweb/wsdl-logo.png 20180802012013 { "locator": "urn:ipfs/QmTjfMxFGvbP4nwF.../QmYMKZbnk53kuPJi...", "mime_type": "image/png", "status_code": "200" } WARC Creation HTTP Header & Payload Extraction Push to IPFS Generate CDXJ WARC-CDXJ Correspondence 13@ibnesayeed
  14. 14. Replay 14@ibnesayeed
  15. 15. https://www.cs.odu.edu/~salam/dweb/ 15 edu,odu,cs)/~salam/dweb/ 20180802012013 { "locator": "urn:ipfs/QmcN9eWwRF.../Qmczh9YnB4...", "mime_type": "text/html", "status_code": "200" } edu,odu,cs)/~salam/dweb/style.css 20180802012013 { "locator": "urn:ipfs/QmU1k71bT6.../QmbvUAo9U3...", "mime_type": "text/css", "status_code": "200" } edu,odu,cs)/~salam/dweb/wsdl-logo.png 20180802012013 { "locator": "urn:ipfs/QmTjfMxFGv.../QmYMKZbnk5...", "mime_type": "image/png", "status_code": "200" } Fetch from IPFS Reroute ResourcesReconstruct ResponseLookup in CDXJ @ibnesayeed
  16. 16. edu,odu,cs)/~salam/dweb/ 20180802012013 { "locator": "urn:ipfs/QmcN9eWwRF73dZj5.../Qmczh9YnB4U1ptPe...", "mime_type": "text/html", "status_code": "200" } 16 HTTP HEADER BLOCK HTTP PAYLOAD BLOCK Fetch from IPFS Reroute ResourcesReconstruct ResponseLookup in CDXJ @ibnesayeed
  17. 17. 17 HTTP HEADER BLOCK HTTP PAYLOAD BLOCK Fetch from IPFS Reroute ResourcesReconstruct ResponseLookup in CDXJ Reconstruct @ibnesayeed
  18. 18. Fetch from IPFS Reroute Resources 18 ● https://oduwsdl.github.io/Reconstructive/ ● http://ws-dl.blogspot.com/2018/01/2018-01-08-introducing-reconstructive.html ● Avoids zombies (live-leakage) ● Adds an unobtrusive archival banner (Custom HTML Element) Reconstruct ResponseLookup in CDXJ @ibnesayeed
  19. 19. 19 IPWB Indexing and Replay @ibnesayeed
  20. 20. Decentralization 20@ibnesayeed
  21. 21. Current Issues ● IPFS is permanent, but not persistent ● DHT-based IPNS is history-unaware ● CDXJ index, a critical piece of replay, is centralized 21@ibnesayeed
  22. 22. Persistence ● Data persistence is critical for web archiving ● A decentralized storage with sufficient replication is needed ● Memory organizations should contribute storage infrastructure ● Qri, Filecoin, IPFS-Cluster, IPFS-Sync etc. can be helpful 22@ibnesayeed
  23. 23. IPNS: InterPlanetary Naming System URI IPFS Hash http://example.org/yuri.jpg http://example.com/style.css http://example.com/logo.png http://example.com/style.css How about changes and history? 23@ibnesayeed
  24. 24. IPNS Blockchain ● URI → Latest hash ● URI + DateTime → A historical hash ● URI → List historical hashes with times https://github.com/oduwsdl/IPNS-Blockchain Owner URI Time Hash PrevBlock Pub K1 URI1 T1 H1 1234567... Pub K2 URI2 T2 H2 0000000... Pub K3 URI3 T3 H3 9876543... Owner URI Time Hash PrevBlock Pub K1 URI1 T4 H5 5463728... Pub K3 URI4 T5 H6 0000000... IPNS + Blockchain + Memento 24@ibnesayeed
  25. 25. Lazy Relationship Evaluation /<namespace>/about/<URI> IP-LD to the rescue? https://github.com/oduwsdl/ipwb/issues/61 Memento Memento Memento MementoOf (Active Relation) HasMemento (Lazy Evaluation) 25@ibnesayeed
  26. 26. Evaluation 26@ibnesayeed
  27. 27. ● Reported IPFS slowness https://github.com/ipfs/go-ipfs/issues/1216 ○ Has since been fixed, but we did not evaluate again 570 files per minute~10% overhead 27 Storage Space and Time Overhead @ibnesayeed
  28. 28. Replay Time ● 600 requests in 222 seconds ● Slower than PyWB (which took 5.26 seconds) ● File vs. rich object based retrieval ● Never expiring cache 28 https://github.com/ibnesayeed/ipfsapi-concurrency-test @ibnesayeed
  29. 29. Future Works ● Evaluate the improved IPFS on large dataset ● Evaluate deduplication ● Implement an index-free collaborative archiving system ● Utilize IPNS to reference URI-Rs with datetime 29@ibnesayeed
  30. 30. Conclusions ● A proof of concept system to leverage a novel approach to archiving and retrieval ● Storage and time costs evaluation and qualitative analysis ● It can only work for small archives in its current state ● A path to answer “who will archive the archives?” ● More work to be done to make it a truly decentralized archiving system 30@ibnesayeed
  31. 31. InterPlanetary Wayback The Next Step Towards Decentralized Web Archiving @WebSciDL http://github.com/oduwsdl/ipwb Supported in part by Protocol Labs, AMF 11600663, and NSF IIS-1526700 Sawood Alam, Mat Kelly, Michele C. Weigle, Michael L. Nelson

×