Transactional Web Archives


                                    Robert Sanderson
                                    Lyudmila Balakireva
                                      Harihar Shankar
                                   Herbert Van de Sompel

                                 Los Alamos National Laboratory
                                        Research Library


                                   Web Archive Globalization
                                         JCDL 2011, Ottawa
                                          Jun 16-17, 2011


This research is funded by the Library of Congress

             Memento: Transactional Archiving
    Web Archive Globalization Workshop, Jun 16-17 2011
Transactional Web Archiving


  Transactional Archiving?

  Server Side Capture
       Submission, Storage, Access

  Browser Side Capture
       Submission, Storage, Access

  Memento for Access



Transactional Archiving?

  Current web archives actively crawl the web

  For example, Heritrix from the Internet Archive and the many
  archives that use it




Transactional Archiving?

  Transactional archives passively accept submitted HTTP
  transactions between browser and server

  For example, TTApache, PageVault and Everlast.




Why Transactional Archiving?

  Issues with crawler-based archiving:
        Can be rejected       (robots.txt, by user-agent, by host IP)
        Can be deceived       (cloaking: geo-location, by user-agent)
        Can be trapped        (infinite auto-generated pages)
        Don't necessarily capture well-used resources
        Require constant and massive bandwidth

  None of these apply to Transactional Archiving …

  … but it has its own, different set of challenges




Transactional Archiving?

  Need to record transactions between browser and server
        Server side: Servers to be archived must cooperate
        Browser side: Many browsers must cooperate

  Need to transfer data to archive: either batch mode or real-time
  Archive must trust submission to be authentic

  Deduplication challenges, as the archive can't control what will be submitted:
        Aliases: Different URL, same response
        Negotiation: Same URL, different response
        Determine "significant" change in response
        Other factors for what to archive/throw away?
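
One way to tell the alias and negotiation cases above apart is to key every submitted response on a digest of its body. The following is a minimal sketch under that assumption; the class name `DedupIndex`, its methods, the status labels, and the choice of SHA-256 are all hypothetical and are not taken from either implementation in these slides.

```python
import hashlib
from collections import defaultdict


class DedupIndex:
    """In-memory index of submitted responses, keyed by URL and body digest."""

    def __init__(self):
        self.urls_by_digest = defaultdict(set)   # digest -> URLs that returned it
        self.digests_by_url = defaultdict(set)   # URL -> digests seen for it

    def submit(self, url, body):
        """Record a response and report how it relates to earlier submissions."""
        digest = hashlib.sha256(body).hexdigest()
        if digest in self.digests_by_url[url]:
            status = "duplicate"   # same URL, same response: nothing new
        elif self.urls_by_digest[digest]:
            status = "alias"       # different URL, same response
        elif self.digests_by_url[url]:
            status = "variant"     # same URL, different response
        else:
            status = "new"
        self.urls_by_digest[digest].add(url)
        self.digests_by_url[url].add(digest)
        return status
```

A "variant" here may be a negotiated representation or a genuine change over time; the digest alone cannot distinguish them, so the response headers would also need to be inspected.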



Server Side Capture

  Approach:

        Willing server records the request and response headers and
         response body just before returning to the browser
        Server sends to an archive for storage
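
The approach above can be illustrated with a small Python WSGI middleware. This is only an analogue of the idea, not the capture code described in these slides; the class name `CaptureMiddleware` and the `sink` callable (standing in for the step that sends the record to the archive) are assumptions made for the sketch.

```python
class CaptureMiddleware:
    """WSGI middleware that records each transaction just before it is returned."""

    def __init__(self, app, sink):
        self.app = app
        self.sink = sink  # hypothetical callable receiving one record dict per transaction

    def __call__(self, environ, start_response):
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            # Remember the status line and response headers, then pass them on.
            captured["status"] = status
            captured["response_headers"] = list(headers)
            return start_response(status, headers, exc_info)

        result = self.app(environ, recording_start_response)
        try:
            # Buffers the whole body in memory; acceptable for a sketch only.
            body = b"".join(result)
        finally:
            if hasattr(result, "close"):
                result.close()

        # WSGI exposes request headers as HTTP_* keys in the environ dict.
        request_headers = {
            key[5:].replace("_", "-").title(): value
            for key, value in environ.items()
            if key.startswith("HTTP_")
        }
        query = environ.get("QUERY_STRING")
        self.sink({
            "method": environ.get("REQUEST_METHOD"),
            "url": environ.get("PATH_INFO", "") + ("?" + query if query else ""),
            "request_headers": request_headers,
            "status": captured.get("status"),
            "response_headers": captured.get("response_headers"),
            "body": body,
        })
        return [body]
```

Wrapping an application as `CaptureMiddleware(app, records.append)` collects the records in a list; a real sink would forward them to the archive instead.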




Server Side Capture/Submission

  Developer: Luda Balakireva

  Capture Implementation
        Apache connection filter module implemented in C to trap
         URL, headers and response body
        Module POSTs to a configurable URL in real time

  Submission Implementation
        Java/Grizzly+Jersey for handling submission interface
              Can also be deployed under Tomcat or GlassFish
        BerkeleyDB for storing metadata
        Headers and response body data stored in file system



Server Side Capture

  Direct server-to-server upload, in real time:
         Most configurations will have server/archive in close network
          proximity
         Reduces wait time between observation and being
          discoverable in archive




Server Side Capture: Issues



  If archive is not local, network latency may be an issue
            But could be amortized by batch upload

  Size of dataset could be very large for dynamically generated pages
            But could be reduced by better detection of high-value
             changes compared to counters, timestamps, etc.

  Content Negotiation problematic!

  Capture of pages with “attack vector” query params
            index.html?f=/etc/passwd
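
The point above about telling high-value changes apart from counters and timestamps could be approached by masking volatile tokens before comparing digests. This is a minimal sketch, assuming that regular expressions are good enough to spot such tokens; the patterns and function names are hypothetical and would need tuning for real pages.

```python
import hashlib
import re

# Hypothetical patterns for volatile content; not taken from the slides.
VOLATILE_PATTERNS = [
    re.compile(rb"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?"),  # ISO-like timestamps
    re.compile(rb"\d{1,2}:\d{2}(:\d{2})?"),                      # bare clock times
    re.compile(rb"(?i)(visitor|hit|view)s?:\s*\d+"),             # labelled counters
]


def normalized_digest(body):
    """Digest of the body with volatile tokens replaced by a placeholder."""
    for pattern in VOLATILE_PATTERNS:
        body = pattern.sub(b"#", body)
    return hashlib.sha256(body).hexdigest()


def is_significant_change(old_body, new_body):
    """True when the bodies differ in more than the masked tokens."""
    return normalized_digest(old_body) != normalized_digest(new_body)
```

Two responses that differ only in a generation timestamp or a hit counter then compare as unchanged, so only one copy needs to be stored.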



Browser Side Capture

  Approach:

        Willing browser records the request and response headers
         and response body after receiving from server
        Browser sends to an archive for storage




Browser Side Capture/Submission

  Developer: Rob Sanderson

  Capture Implementation
        Firefox add-on captures headers and body and writes to
         temporary storage on local disk
        After a configurable amount of data is stored, module compresses
         and moves to a shared Dropbox folder for batch upload
        (Limited) Ability to detect and ignore private data

  Submission Implementation
        Dropbox used as transfer, temporary storage mechanism
        Python monitor system on top of Dropbox
        Cassandra (NoSQL hash store) for storing metadata
        Response body and headers stored in pair-tree file system
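
The spool-then-batch flow described above can be sketched in Python as follows. This is an illustration only and not the add-on's code: the class name `BatchSpool`, the JSON-lines record format, the file naming, and the default threshold are all assumptions; `shared_dir` stands in for the shared Dropbox folder.

```python
import gzip
import json
import os
import shutil


class BatchSpool:
    """Append captured records to a local file, then hand off compressed batches."""

    def __init__(self, spool_dir, shared_dir, max_bytes=1_000_000):
        self.spool_path = os.path.join(spool_dir, "pending.jsonl")
        self.shared_dir = shared_dir   # e.g. a folder synchronised by Dropbox
        self.max_bytes = max_bytes     # configurable flush threshold
        self.batches = 0

    def add(self, url, headers, body):
        """Store one record; return the batch path if this triggered a flush."""
        # latin-1 maps every byte to one code point, so the body round-trips.
        record = {"url": url, "headers": headers, "body": body.decode("latin-1")}
        with open(self.spool_path, "a", encoding="utf-8") as spool:
            spool.write(json.dumps(record) + "\n")
        if os.path.getsize(self.spool_path) >= self.max_bytes:
            return self.flush()
        return None

    def flush(self):
        """Compress the pending file and move it into the shared folder."""
        if not os.path.exists(self.spool_path):
            return None
        self.batches += 1
        compressed = self.spool_path + ".gz"
        with open(self.spool_path, "rb") as src, gzip.open(compressed, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(self.spool_path)
        name = "batch-%d-%d.jsonl.gz" % (os.getpid(), self.batches)
        target = os.path.join(self.shared_dir, name)
        shutil.move(compressed, target)
        return target
```

Compressing locally before the move means that only the finished batch file appears in the shared folder, never a partially written one.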

Browser Side Submission

  Reasons for Dropbox rather than direct upload:
       Batch upload via existing infrastructure reduces bandwidth
       Increases Firefox responsiveness
       Batch processing can be scheduled as needed




Browser Side Capture/Submission

  [Screenshot callouts: Upload Preferences; Public/Private Status Icon]




Browser Side: Issues

  Privacy! Privacy! Privacy!
        Difficult to determine if resource should be captured or not
        Current approach:
              No HTTPS
              Check for “log out”, “sign out”, etc. in body
              Check for usernames, personal name in body, headers
              Blacklist for domains

  Bandwidth

        Slow-down while uploading batch file noticeable on home
         connections
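
The four privacy checks listed under "Current approach" can be combined into a single predicate. This is a minimal sketch and not the add-on's code: the function name, its parameters, and the exact marker strings (including the unspaced "logout" and "signout" variants) are assumptions.

```python
from urllib.parse import urlsplit

# Assumed marker strings; the slides only name "log out" and "sign out".
LOGOUT_MARKERS = ("log out", "logout", "sign out", "signout")


def should_capture(url, headers, body, personal_terms=(), blocked_domains=()):
    """Return False when any heuristic suggests the response is private."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        return False                  # no HTTPS
    host = (parts.hostname or "").lower()
    if any(host == d or host.endswith("." + d) for d in blocked_domains):
        return False                  # blacklisted domain
    text = body.lower()
    if any(marker in text for marker in LOGOUT_MARKERS):
        return False                  # a log-out link implies a signed-in session
    header_text = " ".join("%s: %s" % kv for kv in headers.items()).lower()
    for term in personal_terms:
        term = term.lower()
        if term and (term in text or term in header_text):
            return False              # username or personal name present
    return True
```

For example, `should_capture("http://example.org/", {}, "<a>Sign Out</a>")` returns `False`, because the body contains a sign-out marker. Plain substring matching produces false positives (a news article that mentions "log out" is skipped too), which fits the slide's description of the detection as limited.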




Memento in One Slide




Access via Memento

  Both archives provide Memento TimeGates for access

  TimeGates can be used with MementoFox:
         Endorsed Firefox add-on:             http://bit.ly/memfox
         Must be configured with Dropbox archive TimeGate
         Processes every HTTP request, not just HTML page


  Distributed access is an intentional design feature
         Possible to construct views from multiple archives:
          Server side will have most authentic copy, but may embed
          image from another server, only in Dropbox archive
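
A TimeGate's central task is to map the datetime a client asks for (sent in the Memento `Accept-Datetime` request header) onto one archived copy. The sketch below assumes a "closest in time" policy and a sorted in-memory list of holdings; the function names, the example URIs, and that policy are assumptions, not a description of either TimeGate in these slides.

```python
from bisect import bisect_left
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def select_memento(mementos, accept_datetime):
    """Pick the (datetime, uri) pair closest to the Accept-Datetime value.

    `mementos` must be sorted by datetime and use timezone-aware datetimes;
    `accept_datetime` is an HTTP date such as "Thu, 16 Jun 2011 12:00:00 GMT".
    """
    if not mementos:
        return None
    wanted = parsedate_to_datetime(accept_datetime)
    times = [when for when, _ in mementos]
    i = bisect_left(times, wanted)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = mementos[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda m: abs(m[0] - wanted))


def timegate_redirect(mementos, accept_datetime):
    """Status line and headers redirecting the client to the chosen copy."""
    chosen = select_memento(mementos, accept_datetime)
    if chosen is None:
        return "404 Not Found", []
    return "302 Found", [("Location", chosen[1]), ("Vary", "accept-datetime")]


# Hypothetical holdings for one original resource.
EXAMPLE_MEMENTOS = [
    (datetime(2011, 1, 1, tzinfo=timezone.utc), "http://archive.example.org/20110101/page"),
    (datetime(2011, 6, 1, tzinfo=timezone.utc), "http://archive.example.org/20110601/page"),
    (datetime(2011, 12, 1, tzinfo=timezone.utc), "http://archive.example.org/20111201/page"),
]
```

Because each TimeGate answers only for its own holdings, a client can query several of them per embedded resource and combine the results, which is the distributed access described above.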




Server Side Archive: Access

  Access to archive via Memento TimeGate
        Implemented in Grizzly server using Jersey library
  Original Server uses HTTP Link header to point to archive


  Export functionality to WARC format is also available, to extract
  data in batch mode
        By datetime of last update
        By URL
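
The Link header mentioned above, by which the original server points to the archive, can be sketched as follows. The `timegate` relation type is the one Memento defines; the helper names, the example URIs, and the simplified parsing (which only handles `rel` as the first parameter) are assumptions for illustration.

```python
import re


def timegate_link_header(timegate_uri):
    """Value of a Link header advertising a TimeGate for the requested resource."""
    return '<%s>; rel="timegate"' % timegate_uri


# Simplified: matches <uri>; rel="..." pairs only, ignoring other parameters.
LINK_PATTERN = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def find_timegates(link_header):
    """Extract the URIs whose rel value includes "timegate"."""
    return [uri for uri, rel in LINK_PATTERN.findall(link_header)
            if "timegate" in rel.split()]
```

A client that receives such a header on an ordinary response can then send its datetime request to the advertised TimeGate without any prior configuration for that server.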




Browser Side Archive: Access

  Apache/Python Memento TimeGate for access
       Archive provides combined, anonymous TimeGate
       Also provides per-user TimeGates to see own archive
       Per-user TimeGates currently secure only through obscurity
       Export functionality is yet to be implemented




Access via Memento




  [Screenshot callouts: Experimental Transactional Archive; TimeGate Preferences]




Summary

  Implemented and tested two types of Transactional Archive:
        Server Side
        Browser Side
  Transactional Archives lack many of the challenges of Crawler-based
  Archives (but have their own)

  Implemented Memento TimeGates for Transactional Archives:
        Does not require rewriting URIs for self-contained-ness
        Works well with automated, distributed access patterns


  Access via Browser add-on is fast and seamless
  Server and Browser archiving code will be released



Memento wants to make Navigating the Web’s Past Easy




Learn:          http://www.mementoweb.org/
Talk:    http://groups.google.com/group/memento-dev
Use:                   http://bit.ly/memfox

                          Memento: Transactional Archiving
                 Web Archive Globalization Workshop, Jun 16-17 2010   22

More Related Content

Viewers also liked

OAC Technical Summary
OAC Technical SummaryOAC Technical Summary
OAC Technical Summary
Robert Sanderson
 
Linked Data and Images: Building Blocks for Cultural Heritage
Linked Data and Images: Building Blocks for Cultural HeritageLinked Data and Images: Building Blocks for Cultural Heritage
Linked Data and Images: Building Blocks for Cultural Heritage
Robert Sanderson
 
Open Annotation Core Data Model (tutorial)
Open Annotation Core Data Model (tutorial)Open Annotation Core Data Model (tutorial)
Open Annotation Core Data Model (tutorial)
Robert Sanderson
 
Parker Keio 2011: Interoperable Manuscript Framework
Parker Keio 2011: Interoperable Manuscript FrameworkParker Keio 2011: Interoperable Manuscript Framework
Parker Keio 2011: Interoperable Manuscript Framework
Robert Sanderson
 
Open Repositories 2014: Crowdsourced Transcription via IIIF
Open Repositories 2014: Crowdsourced Transcription via IIIFOpen Repositories 2014: Crowdsourced Transcription via IIIF
Open Repositories 2014: Crowdsourced Transcription via IIIF
Robert Sanderson
 
IIIF: Shared Canvas 2.0
IIIF: Shared Canvas 2.0IIIF: Shared Canvas 2.0
IIIF: Shared Canvas 2.0
Robert Sanderson
 
IIIF and JSON-LD: LODLAM Training Day
IIIF and JSON-LD: LODLAM Training DayIIIF and JSON-LD: LODLAM Training Day
IIIF and JSON-LD: LODLAM Training Day
Robert Sanderson
 
IIIF Foundational Specifications
IIIF Foundational SpecificationsIIIF Foundational Specifications
IIIF Foundational Specifications
Robert Sanderson
 
OAC Presentation at CNI 09 Fall Forum
OAC Presentation at CNI 09 Fall ForumOAC Presentation at CNI 09 Fall Forum
OAC Presentation at CNI 09 Fall Forum
Robert Sanderson
 
Globalization agents flows networks
Globalization agents flows networksGlobalization agents flows networks
Globalization agents flows networks
Compagnonjbd
 
Globalization
GlobalizationGlobalization
Globalization
rakesh30682
 
Update on Memento (IIPC 2011 Plenary)
Update on Memento (IIPC 2011 Plenary)Update on Memento (IIPC 2011 Plenary)
Update on Memento (IIPC 2011 Plenary)
Robert Sanderson
 
SharedCanvas: Collaborative Digital Facsimiles of Medieval Manuscripts
SharedCanvas: Collaborative Digital Facsimiles of Medieval ManuscriptsSharedCanvas: Collaborative Digital Facsimiles of Medieval Manuscripts
SharedCanvas: Collaborative Digital Facsimiles of Medieval Manuscripts
Robert Sanderson
 
Open Annotation: Annotating High Energy Physics on the Web
Open Annotation: Annotating High Energy Physics on the WebOpen Annotation: Annotating High Energy Physics on the Web
Open Annotation: Annotating High Energy Physics on the Web
Robert Sanderson
 
Multiplicity and Publishing in Open Annotation (tutorial)
Multiplicity and Publishing in Open Annotation (tutorial)Multiplicity and Publishing in Open Annotation (tutorial)
Multiplicity and Publishing in Open Annotation (tutorial)
Robert Sanderson
 
Globalization project
Globalization projectGlobalization project
Globalization project
Jesus Portillo
 
IIIF Presentation API
IIIF Presentation API IIIF Presentation API
IIIF Presentation API
Robert Sanderson
 
Books in Browsers / SharedCanvas: Collaborative Facsimiles
Books in Browsers / SharedCanvas: Collaborative FacsimilesBooks in Browsers / SharedCanvas: Collaborative Facsimiles
Books in Browsers / SharedCanvas: Collaborative Facsimiles
Robert Sanderson
 
Globalization ppt
Globalization pptGlobalization ppt
Globalization ppt
Mary Brown
 
Linked Data Snowball, or Why We Need Reconciliation
Linked Data Snowball, or Why We Need ReconciliationLinked Data Snowball, or Why We Need Reconciliation
Linked Data Snowball, or Why We Need Reconciliation
Robert Sanderson
 

Viewers also liked (20)

OAC Technical Summary
OAC Technical SummaryOAC Technical Summary
OAC Technical Summary
 
Linked Data and Images: Building Blocks for Cultural Heritage
Linked Data and Images: Building Blocks for Cultural HeritageLinked Data and Images: Building Blocks for Cultural Heritage
Linked Data and Images: Building Blocks for Cultural Heritage
 
Open Annotation Core Data Model (tutorial)
Open Annotation Core Data Model (tutorial)Open Annotation Core Data Model (tutorial)
Open Annotation Core Data Model (tutorial)
 
Parker Keio 2011: Interoperable Manuscript Framework
Parker Keio 2011: Interoperable Manuscript FrameworkParker Keio 2011: Interoperable Manuscript Framework
Parker Keio 2011: Interoperable Manuscript Framework
 
Open Repositories 2014: Crowdsourced Transcription via IIIF
Open Repositories 2014: Crowdsourced Transcription via IIIFOpen Repositories 2014: Crowdsourced Transcription via IIIF
Open Repositories 2014: Crowdsourced Transcription via IIIF
 
IIIF: Shared Canvas 2.0
IIIF: Shared Canvas 2.0IIIF: Shared Canvas 2.0
IIIF: Shared Canvas 2.0
 
IIIF and JSON-LD: LODLAM Training Day
IIIF and JSON-LD: LODLAM Training DayIIIF and JSON-LD: LODLAM Training Day
IIIF and JSON-LD: LODLAM Training Day
 
IIIF Foundational Specifications
IIIF Foundational SpecificationsIIIF Foundational Specifications
IIIF Foundational Specifications
 
OAC Presentation at CNI 09 Fall Forum
OAC Presentation at CNI 09 Fall ForumOAC Presentation at CNI 09 Fall Forum
OAC Presentation at CNI 09 Fall Forum
 
Globalization agents flows networks
Globalization agents flows networksGlobalization agents flows networks
Globalization agents flows networks
 
Globalization
GlobalizationGlobalization
Globalization
 
Update on Memento (IIPC 2011 Plenary)
Update on Memento (IIPC 2011 Plenary)Update on Memento (IIPC 2011 Plenary)
Update on Memento (IIPC 2011 Plenary)
 
SharedCanvas: Collaborative Digital Facsimiles of Medieval Manuscripts
SharedCanvas: Collaborative Digital Facsimiles of Medieval ManuscriptsSharedCanvas: Collaborative Digital Facsimiles of Medieval Manuscripts
SharedCanvas: Collaborative Digital Facsimiles of Medieval Manuscripts
 
Open Annotation: Annotating High Energy Physics on the Web
Open Annotation: Annotating High Energy Physics on the WebOpen Annotation: Annotating High Energy Physics on the Web
Open Annotation: Annotating High Energy Physics on the Web
 
Multiplicity and Publishing in Open Annotation (tutorial)
Multiplicity and Publishing in Open Annotation (tutorial)Multiplicity and Publishing in Open Annotation (tutorial)
Multiplicity and Publishing in Open Annotation (tutorial)
 
Globalization project
Globalization projectGlobalization project
Globalization project
 
IIIF Presentation API
IIIF Presentation API IIIF Presentation API
IIIF Presentation API
 
Books in Browsers / SharedCanvas: Collaborative Facsimiles
Books in Browsers / SharedCanvas: Collaborative FacsimilesBooks in Browsers / SharedCanvas: Collaborative Facsimiles
Books in Browsers / SharedCanvas: Collaborative Facsimiles
 
Globalization ppt
Globalization pptGlobalization ppt
Globalization ppt
 
Linked Data Snowball, or Why We Need Reconciliation
Linked Data Snowball, or Why We Need ReconciliationLinked Data Snowball, or Why We Need Reconciliation
Linked Data Snowball, or Why We Need Reconciliation
 

Similar to Transactional Archiving (Web Archive Globalization Workshop)

your browser, your storage
your browser, your storageyour browser, your storage
your browser, your storage
Francesco Fullone
 
Your browser, your storage (extended version)
Your browser, your storage (extended version)Your browser, your storage (extended version)
Your browser, your storage (extended version)
Francesco Fullone
 
Pithos - Architecture and .NET Technologies
Pithos - Architecture and .NET TechnologiesPithos - Architecture and .NET Technologies
Pithos - Architecture and .NET Technologies
Panagiotis Kanavos
 
your browser, my storage
your browser, my storageyour browser, my storage
your browser, my storage
Francesco Fullone
 
Krug Fat Client
Krug Fat ClientKrug Fat Client
Krug Fat Client
Paul Klipp
 
Caching and Its Main Types
Caching and Its Main TypesCaching and Its Main Types
Caching and Its Main Types
HTS Hosting
 
Web browser architecture.pptx
Web browser architecture.pptxWeb browser architecture.pptx
Web browser architecture.pptx
BabarHussain607332
 
Local Storage for Web Applications
Local Storage for Web ApplicationsLocal Storage for Web Applications
Local Storage for Web Applications
Markku Laine
 
Introduction to IPFS & Filecoin - longer version
Introduction to IPFS & Filecoin - longer versionIntroduction to IPFS & Filecoin - longer version
Introduction to IPFS & Filecoin - longer version
TinaBregovi
 
Your browser, my storage
Your browser, my storageYour browser, my storage
Your browser, my storage
Francesco Fullone
 
Introduction to IPFS & Filecoin
Introduction to IPFS & FilecoinIntroduction to IPFS & Filecoin
Introduction to IPFS & Filecoin
TinaBregovi
 
02 intro
02   intro02   intro
02 intro
babak mehrabi
 
Globo.com & Varnish
Globo.com & VarnishGlobo.com & Varnish
Globo.com & Varnish
lokama
 
thinking in key value stores
thinking in key value storesthinking in key value stores
thinking in key value stores
Bhasker Kode
 
F/LOSS in Norwegian libraries
F/LOSS in Norwegian librariesF/LOSS in Norwegian libraries
F/LOSS in Norwegian libraries
Libriotech
 
Introduction to Filecoin
Introduction to Filecoin   Introduction to Filecoin
Introduction to Filecoin
Vanessa Lošić
 
Browser Security
Browser SecurityBrowser Security
Browser Security
Roberto Suggi Liverani
 
Whats new in alfresco community 3.4
Whats new in alfresco community 3.4Whats new in alfresco community 3.4
Whats new in alfresco community 3.4
Alfresco Software
 
Clientside/Offline (onefile) Lecture Player in a Web Browser
Clientside/Offline (onefile) Lecture Player in a Web BrowserClientside/Offline (onefile) Lecture Player in a Web Browser
Clientside/Offline (onefile) Lecture Player in a Web Browser
Tokyo University of Science
 
Local storage
Local storageLocal storage
Local storage
Adam Crabtree
 

Similar to Transactional Archiving (Web Archive Globalization Workshop) (20)

your browser, your storage
your browser, your storageyour browser, your storage
your browser, your storage
 
Your browser, your storage (extended version)
Your browser, your storage (extended version)Your browser, your storage (extended version)
Your browser, your storage (extended version)
 
Pithos - Architecture and .NET Technologies
Pithos - Architecture and .NET TechnologiesPithos - Architecture and .NET Technologies
Pithos - Architecture and .NET Technologies
 
your browser, my storage
your browser, my storageyour browser, my storage
your browser, my storage
 
Krug Fat Client
Krug Fat ClientKrug Fat Client
Krug Fat Client
 
Caching and Its Main Types
Caching and Its Main TypesCaching and Its Main Types
Caching and Its Main Types
 
Web browser architecture.pptx
Web browser architecture.pptxWeb browser architecture.pptx
Web browser architecture.pptx
 
Local Storage for Web Applications
Local Storage for Web ApplicationsLocal Storage for Web Applications
Local Storage for Web Applications
 
Introduction to IPFS & Filecoin - longer version
Introduction to IPFS & Filecoin - longer versionIntroduction to IPFS & Filecoin - longer version
Introduction to IPFS & Filecoin - longer version
 
Your browser, my storage
Your browser, my storageYour browser, my storage
Your browser, my storage
 
Introduction to IPFS & Filecoin
Introduction to IPFS & FilecoinIntroduction to IPFS & Filecoin
Introduction to IPFS & Filecoin
 
02 intro
02   intro02   intro
02 intro
 
Globo.com & Varnish
Globo.com & VarnishGlobo.com & Varnish
Globo.com & Varnish
 
thinking in key value stores
thinking in key value storesthinking in key value stores
thinking in key value stores
 
F/LOSS in Norwegian libraries
F/LOSS in Norwegian librariesF/LOSS in Norwegian libraries
F/LOSS in Norwegian libraries
 
Introduction to Filecoin
Introduction to Filecoin   Introduction to Filecoin
Introduction to Filecoin
 
Browser Security
Browser SecurityBrowser Security
Browser Security
 
Whats new in alfresco community 3.4
Whats new in alfresco community 3.4Whats new in alfresco community 3.4
Whats new in alfresco community 3.4
 
Clientside/Offline (onefile) Lecture Player in a Web Browser
Clientside/Offline (onefile) Lecture Player in a Web BrowserClientside/Offline (onefile) Lecture Player in a Web Browser
Clientside/Offline (onefile) Lecture Player in a Web Browser
 
Local storage
Local storageLocal storage
Local storage
 

More from Robert Sanderson

Understanding Linked Art
Understanding Linked ArtUnderstanding Linked Art
Understanding Linked Art
Robert Sanderson
 
LUX - Cross Collections Cultural Heritage at Yale
LUX - Cross Collections Cultural Heritage at YaleLUX - Cross Collections Cultural Heritage at Yale
LUX - Cross Collections Cultural Heritage at Yale
Robert Sanderson
 
Zoom as a Paradigm for Linked Open Usable Data
Zoom as a Paradigm for Linked Open Usable DataZoom as a Paradigm for Linked Open Usable Data
Zoom as a Paradigm for Linked Open Usable Data
Robert Sanderson
 
Provenance and Uncertainty in Linked Art
Provenance and Uncertainty in Linked ArtProvenance and Uncertainty in Linked Art
Provenance and Uncertainty in Linked Art
Robert Sanderson
 
Data is our Product: Thoughts on LOD Sustainability
Data is our Product: Thoughts on LOD SustainabilityData is our Product: Thoughts on LOD Sustainability
Data is our Product: Thoughts on LOD Sustainability
Robert Sanderson
 
A Perspective on Wikidata: Ecosystems, Trust, and Usability
A Perspective on Wikidata: Ecosystems, Trust, and UsabilityA Perspective on Wikidata: Ecosystems, Trust, and Usability
A Perspective on Wikidata: Ecosystems, Trust, and Usability
Robert Sanderson
 
Linked Art: Sustainable Cultural Knowledge through Linked Open Usable Data
Linked Art: Sustainable Cultural Knowledge through Linked Open Usable DataLinked Art: Sustainable Cultural Knowledge through Linked Open Usable Data
Linked Art: Sustainable Cultural Knowledge through Linked Open Usable Data
Robert Sanderson
 
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open DataIllusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data
Robert Sanderson
 
Structural Metadata in RDF (IS575)
Structural Metadata in RDF (IS575)Structural Metadata in RDF (IS575)
Structural Metadata in RDF (IS575)
Robert Sanderson
 
Sanderson CNI 2020 Keynote - Cultural Heritage Research Data Ecosystem
Sanderson CNI 2020 Keynote - Cultural Heritage Research Data EcosystemSanderson CNI 2020 Keynote - Cultural Heritage Research Data Ecosystem
Sanderson CNI 2020 Keynote - Cultural Heritage Research Data Ecosystem
Robert Sanderson
 
Tiers of Abstraction and Audience in Cultural Heritage Data Modeling
Tiers of Abstraction and Audience in Cultural Heritage Data ModelingTiers of Abstraction and Audience in Cultural Heritage Data Modeling
Tiers of Abstraction and Audience in Cultural Heritage Data Modeling
Robert Sanderson
 
The Importance of being LOUD
The Importance of being LOUDThe Importance of being LOUD
The Importance of being LOUD
Robert Sanderson
 
Introduction to Linked Art Model
Introduction to Linked Art ModelIntroduction to Linked Art Model
Introduction to Linked Art Model
Robert Sanderson
 
Standards and Communities: Connected People, Consistent Data, Usable Applicat...
Standards and Communities: Connected People, Consistent Data, Usable Applicat...Standards and Communities: Connected People, Consistent Data, Usable Applicat...
Standards and Communities: Connected People, Consistent Data, Usable Applicat...
Robert Sanderson
 
Strong Opinions, Weakly Held
Strong Opinions, Weakly HeldStrong Opinions, Weakly Held
Strong Opinions, Weakly Held
Robert Sanderson
 
IIIF Discovery Walkthrough
IIIF Discovery WalkthroughIIIF Discovery Walkthrough
IIIF Discovery Walkthrough
Robert Sanderson
 
Linked Art: An Art Museum Profile for CIDOC-CRM
Linked Art: An Art Museum Profile for CIDOC-CRMLinked Art: An Art Museum Profile for CIDOC-CRM
Linked Art: An Art Museum Profile for CIDOC-CRM
Robert Sanderson
 
Euromed2018 Keynote: Usability over Completeness, Community over Committee
Euromed2018 Keynote: Usability over Completeness, Community over CommitteeEuromed2018 Keynote: Usability over Completeness, Community over Committee
Euromed2018 Keynote: Usability over Completeness, Community over Committee
Robert Sanderson
 
Linked Art - Our Linked Open Usable Data Model
Linked Art - Our Linked Open Usable Data ModelLinked Art - Our Linked Open Usable Data Model
Linked Art - Our Linked Open Usable Data Model
Robert Sanderson
 
EuropeanaTech Keynote: Shout it out LOUD
EuropeanaTech Keynote: Shout it out LOUDEuropeanaTech Keynote: Shout it out LOUD
EuropeanaTech Keynote: Shout it out LOUD
Robert Sanderson
 

More from Robert Sanderson (20)

Understanding Linked Art
Understanding Linked ArtUnderstanding Linked Art
Understanding Linked Art
 
LUX - Cross Collections Cultural Heritage at Yale
LUX - Cross Collections Cultural Heritage at YaleLUX - Cross Collections Cultural Heritage at Yale
LUX - Cross Collections Cultural Heritage at Yale
 
Zoom as a Paradigm for Linked Open Usable Data
Zoom as a Paradigm for Linked Open Usable DataZoom as a Paradigm for Linked Open Usable Data
Zoom as a Paradigm for Linked Open Usable Data
 
Provenance and Uncertainty in Linked Art
Provenance and Uncertainty in Linked ArtProvenance and Uncertainty in Linked Art
Provenance and Uncertainty in Linked Art
 
Data is our Product: Thoughts on LOD Sustainability
Data is our Product: Thoughts on LOD SustainabilityData is our Product: Thoughts on LOD Sustainability
Data is our Product: Thoughts on LOD Sustainability
 
A Perspective on Wikidata: Ecosystems, Trust, and Usability
A Perspective on Wikidata: Ecosystems, Trust, and UsabilityA Perspective on Wikidata: Ecosystems, Trust, and Usability
A Perspective on Wikidata: Ecosystems, Trust, and Usability
 
Linked Art: Sustainable Cultural Knowledge through Linked Open Usable Data
Linked Art: Sustainable Cultural Knowledge through Linked Open Usable DataLinked Art: Sustainable Cultural Knowledge through Linked Open Usable Data
Linked Art: Sustainable Cultural Knowledge through Linked Open Usable Data
 
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open DataIllusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data
 
Structural Metadata in RDF (IS575)
Structural Metadata in RDF (IS575)Structural Metadata in RDF (IS575)
Structural Metadata in RDF (IS575)
 
Sanderson CNI 2020 Keynote - Cultural Heritage Research Data Ecosystem
Sanderson CNI 2020 Keynote - Cultural Heritage Research Data EcosystemSanderson CNI 2020 Keynote - Cultural Heritage Research Data Ecosystem
Sanderson CNI 2020 Keynote - Cultural Heritage Research Data Ecosystem
 
Tiers of Abstraction and Audience in Cultural Heritage Data Modeling
Tiers of Abstraction and Audience in Cultural Heritage Data ModelingTiers of Abstraction and Audience in Cultural Heritage Data Modeling
Tiers of Abstraction and Audience in Cultural Heritage Data Modeling
 
The Importance of being LOUD
The Importance of being LOUDThe Importance of being LOUD
The Importance of being LOUD
 
Introduction to Linked Art Model
Introduction to Linked Art ModelIntroduction to Linked Art Model
Introduction to Linked Art Model
 
Standards and Communities: Connected People, Consistent Data, Usable Applicat...
Standards and Communities: Connected People, Consistent Data, Usable Applicat...Standards and Communities: Connected People, Consistent Data, Usable Applicat...
Standards and Communities: Connected People, Consistent Data, Usable Applicat...
 
Strong Opinions, Weakly Held
Strong Opinions, Weakly HeldStrong Opinions, Weakly Held
Strong Opinions, Weakly Held
 
IIIF Discovery Walkthrough
IIIF Discovery WalkthroughIIIF Discovery Walkthrough
IIIF Discovery Walkthrough
 
Linked Art: An Art Museum Profile for CIDOC-CRM
Linked Art: An Art Museum Profile for CIDOC-CRMLinked Art: An Art Museum Profile for CIDOC-CRM
Linked Art: An Art Museum Profile for CIDOC-CRM
 
Euromed2018 Keynote: Usability over Completeness, Community over Committee
Euromed2018 Keynote: Usability over Completeness, Community over CommitteeEuromed2018 Keynote: Usability over Completeness, Community over Committee
Euromed2018 Keynote: Usability over Completeness, Community over Committee
 
Linked Art - Our Linked Open Usable Data Model
Linked Art - Our Linked Open Usable Data ModelLinked Art - Our Linked Open Usable Data Model
Linked Art - Our Linked Open Usable Data Model
 
EuropeanaTech Keynote: Shout it out LOUD
EuropeanaTech Keynote: Shout it out LOUDEuropeanaTech Keynote: Shout it out LOUD
EuropeanaTech Keynote: Shout it out LOUD
 


Transactional Archiving (Web Archive Globalization Workshop)

  • 1. Transactional Web Archives
      Robert Sanderson, Lyudmila Balakireva, Harihar Shankar, Herbert Van de Sompel
      Los Alamos National Laboratory, Research Library
      Web Archive Globalization, JCDL 2011, Ottawa, Jun 16-17, 2011
      This research is funded by the Library of Congress
      Memento: Transactional Archiving, Web Archive Globalization Workshop, Jun 16-17 2010
  • 2. Transactional Web Archiving
      Transactional Archiving?
      Server Side Capture
          Submission, Storage, Access
      Browser Side Capture
          Submission, Storage, Access
      Memento for Access
  • 3. Transactional Archiving?
      Current web archives actively crawl the web
      For example, Heritrix from the Internet Archive and the many archives that use it
  • 4. Transactional Archiving?
      Transactional archives passively accept submitted HTTP transactions between browser and server
      For example, TTApache, PageVault and Everlast
  • 5. Why Transactional Archiving?
      Issues with crawler-based archiving:
          Can be rejected (robots.txt, by user-agent, by host IP)
          Can be deceived (cloaking: geo-location, by user-agent)
          Can be trapped (infinite auto-generated pages)
          Don't necessarily capture well-used resources
          Require constant and massive bandwidth
      None of these are true for Transactional Archiving …
      … but it has its own, different set of challenges
  • 6. Transactional Archiving?
      Need to record transactions between browser and server
          Server side: Servers to be archived must cooperate
          Browser side: Many browsers must cooperate
      Need to transfer data to archive: either batch mode or real-time
      Archive must trust submission to be authentic
      Deduplication challenges, as the archive can't control what will be submitted:
          Aliases: Different URL, same response
          Negotiation: Same URL, different response
          Determine "significant" change in response
      Other factors for what to archive/throw away?
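
The deduplication challenges on slide 6 can be illustrated with a minimal sketch. It is not the authors' implementation: the names `VOLATILE`, `fingerprint` and `DedupIndex` are hypothetical, and the two regular expressions are only an assumed stand-in for deciding which changes are "significant". The idea is to hash each response body after stripping volatile fragments, then index the digests both by content (to spot aliases) and by URL (to allow several negotiated or changed variants).

```python
import hashlib
import re

# Hypothetical patterns for fragments assumed NOT to be a significant change.
VOLATILE = [
    re.compile(rb"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"),  # ISO-like timestamps
    re.compile(rb"(?i)(visitor|hit)s?\s*:\s*\d+"),           # simple hit counters
]


def fingerprint(body: bytes) -> str:
    """Digest of the body with volatile fragments removed."""
    for pattern in VOLATILE:
        body = pattern.sub(b"", body)
    return hashlib.sha256(body).hexdigest()


class DedupIndex:
    """Tracks which (URL, content) pairs the archive has already stored."""

    def __init__(self):
        self.by_digest = {}  # digest -> first URL seen with that content
        self.by_url = {}     # URL -> set of digests (negotiated variants, changes)

    def submit(self, url: str, body: bytes) -> str:
        digest = fingerprint(body)
        seen_for_url = self.by_url.setdefault(url, set())
        if digest in seen_for_url:
            return "duplicate"   # same URL, same significant content
        seen_for_url.add(digest)
        if digest in self.by_digest:
            return "alias"       # different URL, same response
        self.by_digest[digest] = url
        return "new"             # first time this content is seen
```

An archive using this approach would store the body only for `"new"` results and record a pointer for `"alias"`; what counts as volatile would need tuning per site.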
  • 7. Server Side Capture
      Approach:
          Willing server records the request and response headers and response body just before returning to the browser
          Server sends to an archive for storage
  • 8. Server Side Capture/Submission
      Developer: Luda Balakireva
      Capture Implementation
          Apache connection filter module implemented in C to trap URL, headers and response body
          Module POSTs to a configurable URL in real time
      Submission Implementation
          Java/Grizzly+Jersey for handling submission interface
          Can also be deployed under Tomcat or GlassFish
          BerkeleyDB for storing metadata
          Headers and response body data stored in file system
  • 9. Server Side Capture
      Direct server to server upload, in real time:
          Most configurations will have server/archive in close network proximity
          Reduces wait time between observation and being discoverable in archive
  • 10. Server Side Capture: Issues
      If archive is not local, network latency may be an issue
          But could be amortized by batch upload
      Size of dataset could be very large for dynamically generated pages
          But could be reduced by better detection of high-value changes compared to counters, timestamps, etc.
      Content Negotiation problematic!
      Capture of pages with “attack vector” query params
          index.html?f=/etc/passwd
  • 11. Browser Side Capture
      Approach:
          Willing browser records the request and response headers and response body after receiving from server
          Browser sends to an archive for storage
  • 12. Browser Side Capture/Submission
      Developer: Rob Sanderson
      Capture Implementation
          Firefox add-on captures headers and body and writes to temporary storage on local disk
          After a configurable amount of data is stored, module compresses it and moves it to a shared Dropbox folder for batch upload
          (Limited) ability to detect and ignore private data
      Submission Implementation
          Dropbox used as transfer and temporary storage mechanism
          Python monitor system on top of Dropbox
          Cassandra (NoSQL hash store) for storing metadata
          Response body and headers stored in pair-tree file system
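
As an illustration of the pair-tree storage mentioned on slide 12, the sketch below maps a URL to a nested directory path built from two-character segments. This is a simplification under stated assumptions: it hashes the URL first rather than applying the character-escaping rules of the Pairtree specification, and the function name `pair_path`, the root name `archive` and the depth of four pairs are all hypothetical choices, not details from the slides.

```python
import hashlib
from pathlib import PurePosixPath


def pair_path(url: str, root: str = "archive") -> PurePosixPath:
    """Return a storage path whose directories are 2-character pairs.

    Assumption: the URL is hashed with SHA-1 so every identifier yields
    filesystem-safe, evenly distributed directory names.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    pairs = [digest[i:i + 2] for i in range(0, 8, 2)]  # first four pairs
    return PurePosixPath(root, *pairs, digest)
```

Splitting into short pairs keeps any single directory from accumulating a very large number of entries, which is the usual motivation for this layout.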
  • 13. Browser Side Submission
      Reasons for Dropbox rather than direct upload:
          Batch upload via existing infrastructure reduces bandwidth
          Increases Firefox responsiveness
          Batch processing can be scheduled as needed
  • 14. Browser Side Capture/Submission Upload Preferences Public/Private Status Icon
  • 15. Browser Side: Issues
      Privacy! Privacy! Privacy!
          Difficult to determine if resource should be captured or not
          Current approach:
              No HTTPS
              Check for “log out”, “sign out”, etc. in body
              Check for usernames, personal name in body, headers
              Blacklist for domains
      Bandwidth
          Slow-down while uploading batch file noticeable on home connections
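
The privacy heuristics listed on slide 15 could be combined roughly as follows. This is a sketch only, not the add-on's actual code (which the slides do not show): the function name `should_capture` and its parameters `personal_terms` and `blocked_domains` are hypothetical, and the regular expression is an assumed reading of “log out”, “sign out”, etc.

```python
import re
from urllib.parse import urlsplit

# Assumed markers suggesting the page belongs to a logged-in session.
LOGOUT_MARKERS = re.compile(r"log\s?out|sign\s?out", re.IGNORECASE)


def should_capture(url, headers, body, personal_terms=(), blocked_domains=()):
    """Return True only if none of the privacy heuristics fire."""
    parts = urlsplit(url)
    if parts.scheme.lower() != "http":           # no HTTPS
        return False
    host = (parts.hostname or "").lower()
    if any(host == d or host.endswith("." + d) for d in blocked_domains):
        return False                             # domain is on the block list
    if LOGOUT_MARKERS.search(body):              # looks like a logged-in page
        return False
    header_text = "\n".join(f"{k}: {v}" for k, v in headers.items())
    haystack = (body + "\n" + header_text).lower()
    # Usernames or personal names appearing in the body or headers
    return not any(term.lower() in haystack for term in personal_terms)
```

For example, `should_capture("https://example.org/", {}, "")` returns `False` because of the scheme alone. Heuristics like these are conservative and will still miss private data, which is consistent with the slide describing the ability as limited.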
  • 16. Memento in One Slide
  • 17. Access via Memento
      Both archives provide Memento TimeGates for access
      TimeGates can be used with MementoFox:
          Endorsed Firefox add-on: http://bit.ly/memfox
          Must be configured with Dropbox archive TimeGate
          Processes every HTTP request, not just HTML page
      Distributed access is an intentional design feature
          Possible to construct views from multiple archives: the server side will have the most authentic copy, but a page may embed an image from another server that is only in the Dropbox archive
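
To make the TimeGate role concrete, here is a minimal sketch of datetime negotiation: given the datetime a client asks for, pick the stored capture closest to it and redirect there. It is not the code of either archive. The function names, the `captures` mapping and the example URIs are hypothetical; the `Accept-Datetime` request header and the redirect with `Vary: accept-datetime` follow the Memento protocol as commonly described, which should be checked against the specification before relying on it.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def select_memento(accept_datetime: str, captures: dict):
    """Pick the capture closest in time to the requested datetime.

    captures maps timezone-aware capture datetimes to memento URIs.
    accept_datetime is an HTTP-date such as 'Thu, 16 Jun 2011 12:00:00 GMT'.
    """
    target = parsedate_to_datetime(accept_datetime)
    best = min(captures, key=lambda dt: abs(dt - target))
    return captures[best], best


def timegate_response(original_uri: str, accept_datetime: str, captures: dict):
    """Return a (status, headers) pair redirecting to the chosen memento."""
    uri, _when = select_memento(accept_datetime, captures)
    headers = {
        "Location": uri,
        "Vary": "accept-datetime",
        "Link": f'<{original_uri}>; rel="original"',
    }
    return 302, headers
```

A client such as MementoFox would send the desired datetime with every request and follow the redirect, which is how embedded resources can be resolved from different archives.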
  • 18. Server Side Archive: Access
      Access to archive via Memento TimeGate
          Implemented in Grizzly server using Jersey library
          Original server uses HTTP Link header to point to archive
      Export functionality to WARC format also available, to extract data in batch mode:
          By datetime of last update
          By URL
  • 19. Browser Side Archive: Access
      Apache/Python Memento TimeGate for access
          Archive provides combined, anonymous TimeGate
          Also provides per-user TimeGates to see own archive
          Per-user TimeGates currently secured only through obscurity
      Export functionality yet to be implemented
  • 20. Access via Memento Experimental Transactional Archive TimeGate Preferences
  • 21. Summary
      Implemented and tested two types of Transactional Archive:
          Server Side
          Browser Side
      Transactional Archives lack many of the challenges of crawler-based archives (but have their own)
      Implemented Memento TimeGates for Transactional Archives:
          Does not require rewriting URIs to make the archive self-contained
          Works well with automated, distributed access patterns
          Access via browser add-on is fast and seamless
      Server and Browser archiving code will be released
  • 22. Memento wants to make Navigating the Web’s Past Easy
      Learn: http://www.mementoweb.org/
      Talk: http://groups.google.com/group/memento-dev
      Use: http://bit.ly/memfox