Sawood Alam, Mat Kelly, Michele C. Weigle, Michael L. Nelson
Web Science and Digital Libraries Research Group
Old Dominion University
Norfolk, Virginia, USA
@WebSciDL
InterPlanetary Wayback
The Next Step Towards Decentralized Web Archiving
IPFS Lab Day, Decentralized Web Summit, 2018
San Francisco, CA (USA)
August 3, 2018
http://github.com/oduwsdl/ipwb
Content Addressing
http://foo.com/spaceDog.jpg
http://example.org/yuri.jpg
QmZAD4xeeNeYF3TmwWgBXypLKTiCGwGRMXHW7MtheWKtw4
QmZAD4xeeNeYF3TmwWgBXypLKTiCGwGRMXHW7MtheWKtw4
===
$ ipfs cat QmZAD4xeeNeYF3TmwWgBXypLKTiCGwGRMXHW7MtheWKtw4 > doge.jpg
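The idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain SHA-256 hex digest stands in for an IPFS hash (real IPFS hashes are multihash-encoded and computed over chunked data), and the image bytes are made up.

```python
import hashlib

def content_address(data: bytes) -> str:
    """Return an address derived only from the bytes (stand-in for an IPFS hash)."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical: the same image bytes served from two different URLs.
image_bytes = b"\xff\xd8\xff\xe0 made-up image bytes"
served = {
    "http://foo.com/spaceDog.jpg": image_bytes,
    "http://example.org/yuri.jpg": image_bytes,
}

addresses = {url: content_address(body) for url, body in served.items()}

# Both locations map to one address, so the content needs to be stored only once.
unique_addresses = set(addresses.values())
print(len(unique_addresses))
```

Location-based names (URLs) differ, but the content-derived address is identical, which is what makes deduplication automatic.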
@ibnesayeed
Rendered HTML vs. Source Code
HTTP Response vs. WARC Record
● HTTP response: HTTP headers + payload
● WARC record: WARC headers + HTTP headers + payload
How Does the Wayback Machine Work?
● The indexer processes archival (WARC) files and outputs references to an archival index (e.g., CDXJ)
● The replay engine reads (file, offset) references from the index, reads the archived content, and presents the WARC content to the user
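A minimal sketch of offset-based replay, assuming an in-memory byte stream in place of a WARC file and a plain dictionary in place of the archival index. The URIs and record contents are hypothetical, and a real index entry also carries a datetime.

```python
import io

# Hypothetical stand-in for a WARC file: records concatenated in one byte stream.
uris = ["http://example.com/a", "http://example.com/b"]
records = [b"record-about-page-a", b"record-about-page-b"]

archive = io.BytesIO()
index = {}  # URI -> (offset, length)

# Indexing: remember where each record starts and how long it is.
for uri, record in zip(uris, records):
    index[uri] = (archive.tell(), len(record))
    archive.write(record)

def replay(uri: str) -> bytes:
    """Look up the reference in the index, then read the archived content."""
    offset, length = index[uri]
    archive.seek(offset)
    return archive.read(length)

print(replay("http://example.com/b"))
```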
Memento: Time Dimension to the Web
https://tools.ietf.org/html/rfc7089
Why IPWB?
● Persistence of the archived web depends on the resilience of organizations
● Availability of data is subject to censorship
● Web archive files store exact duplicate content redundantly
● Lack of public participation in web archiving
● Small web archives are hard to discover
Indexing
WARC Creation → HTTP Header & Payload Extraction → Push to IPFS → Generate CDXJ → WARC-CDXJ Correspondence
HTTP HEADER BLOCK
HTTP PAYLOAD BLOCK
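The extraction step can be sketched as splitting a raw HTTP response at the first blank line. This is a simplified illustration, not the ipwb implementation; the function name and the sample response are hypothetical.

```python
from typing import Tuple

def split_http_response(raw: bytes) -> Tuple[bytes, bytes]:
    """Split a raw HTTP response into its header block and payload block."""
    header, separator, payload = raw.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no blank line separating header and payload")
    return header, payload

# Hypothetical archived response.
raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html>hi</html>"
)

header_block, payload_block = split_http_response(raw_response)
print(header_block)
print(payload_block)
```

Storing the two blocks separately means identical payloads served with different headers (for example, different `Date` values) can still share one payload object.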
HEADER DIGEST: QmcN9eWwRF73dZj5BgT4x8jeEcFrxurX1hot8QwCbMi9PB
PAYLOAD DIGEST: Qmczh9YnB4U1ptPeqxcaTZA4aMmuNUswTLTWzXntvbp9sL
HEADER DIGEST: QmcN9eWwRF73dZj5BgT4x8jeEcFrxurX1hot8QwCbMi9PB
PAYLOAD DIGEST: Qmczh9YnB4U1ptPeqxcaTZA4aMmuNUswTLTWzXntvbp9sL
com,example,ipwb)/ 20180802012013 {
"locator": "urn:ipfs/QmcN9eWwRF73dZj5BgT4x8jeEcFrxurX1hot8QwCbMi9PB/Qmczh9YnB4U1ptPeqxcaTZA4aMmuNUswTLTWzXntvbp9sL",
"mime_type": "text/html",
"status_code": "200",
"other_fields": "other values..."
}
// * This is a single-line record; line breaks and indentation are added for readability only
CDXJ: http://ws-dl.blogspot.com/2015/09/2015-09-10-cdxj-object-resource-stream.html
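Generating such a record can be sketched as follows. This is an assumption-laden illustration rather than ipwb's code: `surt_key` is a simplified, hypothetical SURT routine (real canonicalization handles more cases), and the host `ipwb.example.com` is inferred from the key shown on the slide.

```python
import json
from urllib.parse import urlsplit

def surt_key(uri: str) -> str:
    """Simplified SURT: reversed host labels, then ')' and the path."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]  # the slides show keys without a leading www
    reversed_host = ",".join(reversed(host.split(".")))
    return f"{reversed_host}){parts.path or '/'}"

def cdxj_line(uri, datetime14, header_digest, payload_digest, mime_type, status_code):
    """Build one single-line CDXJ record: key, 14-digit datetime, JSON block."""
    fields = {
        "locator": f"urn:ipfs/{header_digest}/{payload_digest}",
        "mime_type": mime_type,
        "status_code": status_code,
    }
    return f"{surt_key(uri)} {datetime14} {json.dumps(fields)}"

line = cdxj_line(
    "http://ipwb.example.com/",
    "20180802012013",
    "QmcN9eWwRF73dZj5BgT4x8jeEcFrxurX1hot8QwCbMi9PB",
    "Qmczh9YnB4U1ptPeqxcaTZA4aMmuNUswTLTWzXntvbp9sL",
    "text/html",
    "200",
)
print(line)
```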
edu,odu,cs)/~salam/dweb/ 20180802012013 {
"locator": "urn:ipfs/QmcN9eWwRF73dZj5.../Qmczh9YnB4U1ptPe...",
"mime_type": "text/html",
"status_code": "200"
}
edu,odu,cs)/~salam/dweb/style.css 20180802012013 {
"locator": "urn:ipfs/QmU1k71bT6ibZBSd.../QmbvUAo9U31wSdvA...",
"mime_type": "text/css",
"status_code": "200"
}
edu,odu,cs)/~salam/dweb/wsdl-logo.png 20180802012013 {
"locator": "urn:ipfs/QmTjfMxFGvbP4nwF.../QmYMKZbnk53kuPJi...",
"mime_type": "image/png",
"status_code": "200"
}
Replay
https://www.cs.odu.edu/~salam/dweb/
edu,odu,cs)/~salam/dweb/ 20180802012013 {
"locator": "urn:ipfs/QmcN9eWwRF.../Qmczh9YnB4...",
"mime_type": "text/html",
"status_code": "200"
}
edu,odu,cs)/~salam/dweb/style.css 20180802012013 {
"locator": "urn:ipfs/QmU1k71bT6.../QmbvUAo9U3...",
"mime_type": "text/css",
"status_code": "200"
}
edu,odu,cs)/~salam/dweb/wsdl-logo.png 20180802012013 {
"locator": "urn:ipfs/QmTjfMxFGv.../QmYMKZbnk5...",
"mime_type": "image/png",
"status_code": "200"
}
Lookup in CDXJ → Fetch from IPFS → Reconstruct Response → Reroute Resources
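The lookup step can be sketched as a binary search over sorted CDXJ lines. This is a simplified illustration: it matches an exact datetime only (a real replay system would pick the closest capture), and the truncated digests are copied from the slide as-is.

```python
import json
from bisect import bisect_left

# CDXJ lines from the slide, collapsed to single-line records and kept sorted.
CDXJ = sorted([
    'edu,odu,cs)/~salam/dweb/ 20180802012013 {"locator": "urn:ipfs/QmcN9eWwRF.../Qmczh9YnB4...", "mime_type": "text/html", "status_code": "200"}',
    'edu,odu,cs)/~salam/dweb/style.css 20180802012013 {"locator": "urn:ipfs/QmU1k71bT6.../QmbvUAo9U3...", "mime_type": "text/css", "status_code": "200"}',
    'edu,odu,cs)/~salam/dweb/wsdl-logo.png 20180802012013 {"locator": "urn:ipfs/QmTjfMxFGv.../QmYMKZbnk5...", "mime_type": "image/png", "status_code": "200"}',
])

def lookup(surt: str, datetime14: str):
    """Return (header_digest, payload_digest, fields) or None if not indexed."""
    prefix = f"{surt} {datetime14} "
    i = bisect_left(CDXJ, prefix)
    if i == len(CDXJ) or not CDXJ[i].startswith(prefix):
        return None
    fields = json.loads(CDXJ[i][len(prefix):])
    # The locator has the form urn:ipfs/<header digest>/<payload digest>.
    header_digest, payload_digest = fields["locator"].split("/")[1:3]
    return header_digest, payload_digest, fields

print(lookup("edu,odu,cs)/~salam/dweb/style.css", "20180802012013"))
```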
edu,odu,cs)/~salam/dweb/ 20180802012013 {
"locator": "urn:ipfs/QmcN9eWwRF73dZj5.../Qmczh9YnB4U1ptPe...",
"mime_type": "text/html",
"status_code": "200"
}
HTTP HEADER BLOCK
HTTP PAYLOAD BLOCK
Reconstruct the HTTP response from the HTTP HEADER BLOCK and the HTTP PAYLOAD BLOCK
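Fetching and reconstruction can be sketched together. Everything here is a stand-in: `fake_ipfs` is a hypothetical in-memory dictionary used instead of a real IPFS node, and the digests are placeholder strings rather than real hashes.

```python
# Hypothetical in-memory stand-in for IPFS: digest -> stored bytes.
fake_ipfs = {
    "header-digest": b"HTTP/1.1 200 OK\r\nContent-Type: text/html",
    "payload-digest": b"<html>archived page</html>",
}

def fetch(digest: str) -> bytes:
    """Fetch one block by its digest (a real system would ask an IPFS node)."""
    return fake_ipfs[digest]

def reconstruct_response(locator: str) -> bytes:
    """Rebuild the HTTP response from a urn:ipfs/<header>/<payload> locator."""
    scheme, header_digest, payload_digest = locator.split("/")
    if scheme != "urn:ipfs":
        raise ValueError(f"unsupported locator: {locator}")
    # The blank line dropped at extraction time is restored here.
    return fetch(header_digest) + b"\r\n\r\n" + fetch(payload_digest)

response = reconstruct_response("urn:ipfs/header-digest/payload-digest")
print(response)
```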
Reroute Resources
● https://oduwsdl.github.io/Reconstructive/
● http://ws-dl.blogspot.com/2018/01/2018-01-08-introducing-reconstructive.html
● Avoids zombies (live-leakage)
● Adds an unobtrusive archival banner (Custom HTML Element)
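Reconstructive itself is a ServiceWorker module written in JavaScript; the Python below only sketches the routing decision that prevents live-leakage. The replay prefix and the `/memento/<datetime>/<URI>` URL layout are assumptions made for this example, and `reroute` is a hypothetical helper.

```python
from urllib.parse import urljoin

# Assumed replay endpoint layout: <prefix><14-digit datetime>/<original URI>
ARCHIVE_PREFIX = "http://localhost:5000/memento/"

def reroute(request_url: str, referrer: str) -> str:
    """Map a request made by an archived page back into the archive."""
    if request_url.startswith(ARCHIVE_PREFIX):
        return request_url  # already pointing at the archive
    # Recover the datetime and the original page URI from the referring memento.
    rest = referrer[len(ARCHIVE_PREFIX):]
    datetime14, _, original_page = rest.partition("/")
    # Resolve the requested resource against the original page URI.
    original_resource = urljoin(original_page, request_url)
    return f"{ARCHIVE_PREFIX}{datetime14}/{original_resource}"

memento = ARCHIVE_PREFIX + "20180802012013/https://www.cs.odu.edu/~salam/dweb/"
print(reroute("https://www.cs.odu.edu/~salam/dweb/style.css", memento))
```

Without this step, the stylesheet request would go to the live web (a "zombie"), mixing current content into the archived page.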
IPWB Indexing and Replay
Decentralization
Current Issues
● IPFS is permanent (a content address never changes), but not persistent (content that no node keeps can disappear)
● DHT-based IPNS is history-unaware
● CDXJ index, a critical piece of replay, is centralized
Persistence
● Data persistence is critical for web archiving
● A decentralized storage with sufficient replication is needed
● Memory organizations should contribute storage infrastructure
● Qri, Filecoin, IPFS-Cluster, IPFS-Sync, etc. can help
IPNS: InterPlanetary Naming System
URI IPFS Hash
http://example.org/yuri.jpg
http://example.com/style.css
http://example.com/logo.png
http://example.com/style.css
How about changes and history?
IPNS Blockchain
● URI → Latest hash
● URI + DateTime → A historical hash
● URI → List historical hashes with times
https://github.com/oduwsdl/IPNS-Blockchain
Owner    URI    Time   Hash   PrevBlock
Pub K1   URI1   T1     H1     1234567...
Pub K2   URI2   T2     H2     0000000...
Pub K3   URI3   T3     H3     9876543...

Owner    URI    Time   Hash   PrevBlock
Pub K1   URI1   T4     H5     5463728...
Pub K3   URI4   T5     H6     0000000...
IPNS + Blockchain + Memento
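The three lookups listed on this slide can be sketched with an in-memory history. This is only a sketch: it omits the blockchain itself (owners, signatures, PrevBlock links), `NameHistory` is a hypothetical class, and the symbolic values T1..T5 and H1..H6 are taken from the table and compared as plain strings.

```python
from bisect import bisect_right

class NameHistory:
    """History-aware naming: every published (time, hash) pair is retained."""

    def __init__(self):
        self._times = {}   # URI -> sorted list of times
        self._hashes = {}  # URI -> hashes, parallel to _times

    def publish(self, uri, when, ipfs_hash):
        times = self._times.setdefault(uri, [])
        hashes = self._hashes.setdefault(uri, [])
        i = bisect_right(times, when)
        times.insert(i, when)
        hashes.insert(i, ipfs_hash)

    def latest(self, uri):
        """URI -> latest hash."""
        hashes = self._hashes.get(uri, [])
        return hashes[-1] if hashes else None

    def at(self, uri, when):
        """URI + DateTime -> the hash that was current at that time."""
        i = bisect_right(self._times.get(uri, []), when)
        return self._hashes[uri][i - 1] if i else None

    def history(self, uri):
        """URI -> list of (time, hash) pairs."""
        return list(zip(self._times.get(uri, []), self._hashes.get(uri, [])))

names = NameHistory()
for uri, when, h in [("URI1", "T1", "H1"), ("URI2", "T2", "H2"),
                     ("URI3", "T3", "H3"), ("URI1", "T4", "H5"),
                     ("URI4", "T5", "H6")]:
    names.publish(uri, when, h)

print(names.latest("URI1"), names.at("URI1", "T3"), names.history("URI1"))
```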
Lazy Relationship Evaluation
/<namespace>/about/<URI>
IPLD to the rescue?
https://github.com/oduwsdl/ipwb/issues/61
[Diagram: three Mementos; MementoOf is an active relation, HasMemento is evaluated lazily]
Evaluation
Storage Space and Time Overhead
● Reported IPFS slowness: https://github.com/ipfs/go-ipfs/issues/1216
○ Has since been fixed, but we did not evaluate again
● 570 files per minute
● ~10% overhead
Replay Time
● 600 requests in 222 seconds
● Slower than PyWB (which took 5.26 seconds)
● File vs. rich object based retrieval
● Never expiring cache
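The gap between the two replay systems can be worked out directly from the numbers above. The calculation assumes the PyWB time of 5.26 seconds was measured for the same 600 requests, which the slide implies but does not state.

```python
requests = 600
ipwb_seconds = 222
pywb_seconds = 5.26  # assumed to cover the same 600 requests

ipwb_rate = requests / ipwb_seconds      # requests per second for IPWB
pywb_rate = requests / pywb_seconds      # requests per second for PyWB
slowdown = ipwb_seconds / pywb_seconds   # how many times slower IPWB was

print(f"IPWB: {ipwb_rate:.2f} req/s, PyWB: {pywb_rate:.1f} req/s, "
      f"slowdown: {slowdown:.1f}x")
# prints "IPWB: 2.70 req/s, PyWB: 114.1 req/s, slowdown: 42.2x"
```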
https://github.com/ibnesayeed/ipfsapi-concurrency-test
Future Work
● Evaluate the improved IPFS on a large dataset
● Evaluate deduplication
● Implement an index-free collaborative archiving system
● Utilize IPNS to reference URI-Rs with datetime
Conclusions
● A proof-of-concept system that leverages a novel approach to archiving and retrieval
● Evaluation of storage and time costs, plus qualitative analysis
● It can only work for small archives in its current state
● A path to answering “who will archive the archives?”
● More work is needed to make it a truly decentralized archiving system
Supported in part by Protocol Labs, AMF 11600663, and NSF IIS-1526700