SlideShare a Scribd company logo
1 of 43
Visualizing Digital Collections
of Web Archives
Mat Kelly, Michael L. Nelson, Michele C. Weigle
Old Dominion University
Web Archiving Collaboration: New Tools and Models
Columbia University, New York, NY
June 4, 2015
@machawk1http://ws-dl.cs.odu.edu
Motivation for Thumbnail
Summarization
• Change over time - aboutness
Apple.com has > 17k mementos
Many Nearly Identical
(apple.com)
Methods of Summarization
• Including all mementos
– many redundant thumbnails
– temporally/spatially/cognitively expensive
• Naively excluding images
– missing important captures in summary
• Compare image thumbnails
– temporally expensive for identifying unique
thumbnails
Comparing mementos’ markup can identify
sufficiently unique mementos
Analyzing Markup
<title>Apple</title>
<meta property="analytics-
track" content="Apple - Index/Tab"
/>
<meta property="analytics-s-
channel" content="homepage" />
<meta property="analytics-s-
bucket-0"
content="appleglobal,applehome"
/>
<meta property="analytics-s-
bucket-1"
content="apple{COUNTRY_CODE}gl
obal,apple{COUNTRY_CODE}home"
/>
8664ee964799c38c156d8f0
39dae8330
apple.com at Mar 17, 2008 HTML for memento SimHash for HTML
SimHash?
HTML snippet for
memento
First k characters of
markup
Second k characters of
markup
64th k characters of
markup
63rd k characters of
markup
markup length
64
k =
…
Hash to a
character
Hash to a
character
Hash to a
character
Hash to a
character
c
3
9
f
…
SimHash vs. Other Hashes
• md5(“aaaaaaaaaaaaaaa”)
12f9cf6998d52dbe773b06f848bb3608
• md5(“aaaaaaabaaaaaaa”)
e984cee68697eb77577717b532171493
• simhash(“aaaaaaaaaaaaaaa”)
8664ee964799c38c156d8f039dae8330
• simhash(“aaaaaaabaaaaaaa”)
8664ee964799a48c156d8f039dae8330
Why SimHash?
• SimHash identifies similarities between
documents
• Conventional hashing algorithms are for
identifying differences
– Drastically different output from similar content
• To remove redundancies, we want to detect
when temporally adjacent mementos are
sufficiently dissimilar
SimHashes for Mementos
HTML of apple.com
March 3, 2008
HTML of apple.com
March 5, 2008
HTML of apple.com
April 12, 2008
HTML of apple.com
October 4, 2008
c39f0abc...b9
c39d0abc...c9
c39d0abc...b9
c770ad1b...b9
Identifying Similarity by
Calculating Hamming Distance
HTML of apple.com
March 3, 2008
HTML of apple.com
March 5, 2008
HTML of apple.com
April 12, 2008
HTML of apple.com
October 4, 2008
c39f0abc...b9
c39d0abc...c9
c39d0abc...b9
c770ad1b...b9
HAMMING DISTANCE
2
1
7
N/A
pivot
Sliding Hamming Distance
• Selection based on previously selected memento
• Sliding pivot
ΔM3 ΔM3
ΔM3
ΔM6
ΔM6
ΔM6
ΔM6
ΔM0
ΔM0
ΔM0
Project Goals
Develop tools that implement thumbnail
summarization for TimeMaps
• Web Service
– Allows anyone to view TimeMap using thumbnail
summarization
• Wayback add-on
– Allows any archive using wayback to provide this
service to users
• Embeddable version
– Allow web page authors to embed overview of
past versions of page on live web page
AlSummarization
• SimHash-based summarization scheme
created by Ahmed AlSum
• AlSum + Summarization = AlSummarization
A. AlSum, and M. L. Nelson. “Thumbnail Summarization Techniques for Web Archives.” In
Proceedings of the 36TH European Conference on Information Retrieval, ECIR 2014, 2014.
Dr. Nelson’s Homepage
• URI-R: http://www.cs.odu.edu/~mln
• Append onto service URI for summary
– http://service/http://www.cs.odu.edu/~mln
Anatomy of the Visualization
3 presentations of the Summary
Temporally sorted
mementos
Memento metadata
Additional (optional) Endpoint
Parameters
• Access – tailors user interface
– Interactive, Embed, Wayback
• Strategy – to use alternative summarization
– alSummarization, yearly, skipListed, random
• http://service/?
o access=wayback&URI-
R=http://www.cs.odu.edu/~mln
o access=wayback&strategy=random&URI-
R=http://www.cs.odu.edu/~mln
Programmatic Flow
User’s Browser Thumbnails
Service
Memento-Compliant
Archive
User Requests URI-R Summary
User’s Browser Thumbnails
Service
Memento-Compliant
Archive
Service Relays URI-R to Archive
User’s Browser Thumbnails
Service
Memento-Compliant
Archive
Service queries archive for all mementos
for URI-R
URI-Ms returned to Service
User’s Browser Thumbnails
Service
Memento-Compliant
Archive
Archive returns TimeMap with URI-Ms to
thumbnail service
TM
Service fetches HTML for each
Memento
Thumbnails
Service
Service generates SimHash for
Each Mementos’ HTML
Thumbnails
Service
c39f0abc...b9
c39d0abc...c9
c39d0abc...b9
c770ad1b...b9
c770ad1b...b9
Service Calculates
Hamming Distance
Thumbnails
Service
Mementos in summary selected based on
hamming distance
c39f0abc...b9
c39d0abc...c9
c39d0abc...b9
c770ad1b...b9
c770ad1b...b9
Hd()
2
1
7
0
Preliminary UI returned to user
User’s Browser Thumbnails
Service
Templated HTML interface is returned to
user with placeholders for thumbnails
HTML
interface
c39d0abc...c9
c39d0abc...b9
c770ad1b...b9
2
1
0
Service Generates Thumbnails for
Mementos in Summary
Thumbnails
Service
c39f0abc...b9
c770ad1b...b9
Hd()
7
Thumbnails Served to User
User’s Browser Thumbnails
Service
Asynchronous polling from HTML pages
populates placeholder images once available
HTML
interface
Core Implementation
•
• for thumbnail generation
• abstractions preserved for code reuse
and extensibility
• Code documented to facilitate extensibility,
usage, and fixes
http://github.com/machawk1/ArchiveThumbnails
Initializing the service
$ npm install
$ node alSummarization.js
* Local resource (css, js,etc.) server
listening on Port 1338...
* Thumbnails service started on Port 15421
> Try localhost:15421/?URI-
R=http://matkelly.com in your web browser
for sample execution.
User/Service Administrator simply enters:
Service responds and is ready for query:
Online vs. Offline Generation
• Online Thumbnail Summarization
– Fetch each mementos’ HTML
– Calculate SimHashes
– Calculate Hamming Distance (HD)
– Select Mementos That Pass HD threshold
– Generate Thumbnails of Mementos
• Offline Thumbnail Summarization
– All of the above performed a priori
– Data potentially updated on access
Adaptive Strategies
• Very large TimeMaps are temporally
expensive to generate
• Default behavior:
if(timeRequirement == tooLong):
use(naiveStrategy)
• User can explicitly override behavior
Other Summarization Strategies
• Random Selection
– k mementos, uniform selection
• Interval
– every mth memento, m = n/k
• Temporal Interval
– One memento/year, reverse chronological
monthly back-fill
• Temporally Uniform Trimming when k > 15
Grid View
AlSummarization vs Random
Dr. Nelson’s Homepage
Random Strategy
Dr. Nelson’s Homepage
AlSummarization Strategy
Grid View
AlSummarization vs Interval
Dr. Nelson’s Homepage
Interval Strategy
Dr. Nelson’s Homepage
AlSummarization Strategy
Grid View
AlSummarization vs Temporal Interval
Dr. Nelson’s Homepage
Temporal Interval Strategy
Dr. Nelson’s Homepage
AlSummarization Strategy
Asynchronous Polling
Server-side SimHash Caching
Four Summarization Strategies
OpenWayback Integration
Service Embedding
• <object data=http://service/http://yoururl.com”
type=“text/html”>
</object>
-or-
• <iframe src=“http://service/http://yoururl.com”>
</iframe>
Visualizing Digital Collections
of Web Archives
• Codebase:
– github.com/machawk1/ArchiveThumbnails
• Service URI:
– http://wsdl-docker.cs.odu.edu:15421
Live
Demo

More Related Content

What's hot

Carpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSyncCarpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSyncnisohq
 
ResourceSync: Web-Based Resource Synchronization
ResourceSync: Web-Based Resource SynchronizationResourceSync: Web-Based Resource Synchronization
ResourceSync: Web-Based Resource SynchronizationHerbert Van de Sompel
 
ResourceSync: Web-based Resource Synchronization
ResourceSync: Web-based Resource SynchronizationResourceSync: Web-based Resource Synchronization
ResourceSync: Web-based Resource SynchronizationSimeon Warner
 
Publishing "5 star" data: the case for RDF
Publishing "5 star" data: the case for RDFPublishing "5 star" data: the case for RDF
Publishing "5 star" data: the case for RDFPeterWinstanley1
 
Tune-up electronic resources in Alma for better discovery_May_06_2016
Tune-up electronic resources in Alma for better discovery_May_06_2016Tune-up electronic resources in Alma for better discovery_May_06_2016
Tune-up electronic resources in Alma for better discovery_May_06_2016mahongzn
 
Search Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureSearch Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureChris Bizer
 

What's hot (11)

Carpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSyncCarpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSync
 
ResourceSync: Web-Based Resource Synchronization
ResourceSync: Web-Based Resource SynchronizationResourceSync: Web-Based Resource Synchronization
ResourceSync: Web-Based Resource Synchronization
 
NISO ResourceSync Training Session
NISO ResourceSync Training SessionNISO ResourceSync Training Session
NISO ResourceSync Training Session
 
Today's forecast for your campus: BLUEcloud
 Today's forecast for your campus: BLUEcloud Today's forecast for your campus: BLUEcloud
Today's forecast for your campus: BLUEcloud
 
ResourceSync: Web-based Resource Synchronization
ResourceSync: Web-based Resource SynchronizationResourceSync: Web-based Resource Synchronization
ResourceSync: Web-based Resource Synchronization
 
Publishing "5 star" data: the case for RDF
Publishing "5 star" data: the case for RDFPublishing "5 star" data: the case for RDF
Publishing "5 star" data: the case for RDF
 
ResourceSync Overview
ResourceSync OverviewResourceSync Overview
ResourceSync Overview
 
Tune-up electronic resources in Alma for better discovery_May_06_2016
Tune-up electronic resources in Alma for better discovery_May_06_2016Tune-up electronic resources in Alma for better discovery_May_06_2016
Tune-up electronic resources in Alma for better discovery_May_06_2016
 
Search Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureSearch Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited Lecture
 
OpenGLAM: LOD and American Art
OpenGLAM: LOD and American ArtOpenGLAM: LOD and American Art
OpenGLAM: LOD and American Art
 
Intro cOMPUTERS
Intro cOMPUTERSIntro cOMPUTERS
Intro cOMPUTERS
 

Similar to Visualizing Digital Collections of Web Archives from Columbia Web Archiving Collaboration Conference

Architecture Patterns - Open Discussion
Architecture Patterns - Open DiscussionArchitecture Patterns - Open Discussion
Architecture Patterns - Open DiscussionNguyen Tung
 
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...Tony Erwin
 
JavaOne: Efficiently building and deploying microservices
JavaOne: Efficiently building and deploying microservicesJavaOne: Efficiently building and deploying microservices
JavaOne: Efficiently building and deploying microservicesBart Blommaerts
 
Tech talk-live-alfresco-drupal
Tech talk-live-alfresco-drupalTech talk-live-alfresco-drupal
Tech talk-live-alfresco-drupalAlfresco Software
 
Cloud Services Powered by IBM SoftLayer and NetflixOSS
Cloud Services Powered by IBM SoftLayer and NetflixOSSCloud Services Powered by IBM SoftLayer and NetflixOSS
Cloud Services Powered by IBM SoftLayer and NetflixOSSaspyker
 
How Responsive Do You Want Your Website?
How Responsive Do You Want Your Website?How Responsive Do You Want Your Website?
How Responsive Do You Want Your Website?IWMW
 
Guide to Application Performance: Planning to Continued Optimization
Guide to Application Performance: Planning to Continued OptimizationGuide to Application Performance: Planning to Continued Optimization
Guide to Application Performance: Planning to Continued OptimizationMuleSoft
 
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...Ram G Athreya
 
Monitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with PrometheusMonitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with PrometheusFabian Reinartz
 
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...Tony Erwin
 
Web technologies course, an introduction
Web technologies course, an introductionWeb technologies course, an introduction
Web technologies course, an introductionPiero Fraternali
 
HTML5 on Mobile(For Designer)
HTML5 on Mobile(For Designer)HTML5 on Mobile(For Designer)
HTML5 on Mobile(For Designer)Adam Lu
 
Membase Introduction
Membase IntroductionMembase Introduction
Membase IntroductionMembase
 
e-Learning Delivery System : The Challenges
e-Learning Delivery System : The Challengese-Learning Delivery System : The Challenges
e-Learning Delivery System : The ChallengesDenpong Soodphakdee
 
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...Big Data Value Association
 
4. Web programming MVC.pptx
4. Web programming  MVC.pptx4. Web programming  MVC.pptx
4. Web programming MVC.pptxKrisnaBayu41
 
2014 09-12 lambda-architecture-at-indix
2014 09-12 lambda-architecture-at-indix2014 09-12 lambda-architecture-at-indix
2014 09-12 lambda-architecture-at-indixYu Ishikawa
 
TrueReusableCode-BigDataCodeCamp2016
TrueReusableCode-BigDataCodeCamp2016TrueReusableCode-BigDataCodeCamp2016
TrueReusableCode-BigDataCodeCamp2016Eduard Lazar
 
Semantic technologies in practice - KULeuven 2016
Semantic technologies in practice - KULeuven 2016Semantic technologies in practice - KULeuven 2016
Semantic technologies in practice - KULeuven 2016Aad Versteden
 

Similar to Visualizing Digital Collections of Web Archives from Columbia Web Archiving Collaboration Conference (20)

Architecture Patterns - Open Discussion
Architecture Patterns - Open DiscussionArchitecture Patterns - Open Discussion
Architecture Patterns - Open Discussion
 
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
 
JavaOne: Efficiently building and deploying microservices
JavaOne: Efficiently building and deploying microservicesJavaOne: Efficiently building and deploying microservices
JavaOne: Efficiently building and deploying microservices
 
Tech talk-live-alfresco-drupal
Tech talk-live-alfresco-drupalTech talk-live-alfresco-drupal
Tech talk-live-alfresco-drupal
 
Cloud Services Powered by IBM SoftLayer and NetflixOSS
Cloud Services Powered by IBM SoftLayer and NetflixOSSCloud Services Powered by IBM SoftLayer and NetflixOSS
Cloud Services Powered by IBM SoftLayer and NetflixOSS
 
How Responsive Do You Want Your Website?
How Responsive Do You Want Your Website?How Responsive Do You Want Your Website?
How Responsive Do You Want Your Website?
 
Guide to Application Performance: Planning to Continued Optimization
Guide to Application Performance: Planning to Continued OptimizationGuide to Application Performance: Planning to Continued Optimization
Guide to Application Performance: Planning to Continued Optimization
 
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
 
Monitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with PrometheusMonitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with Prometheus
 
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
 
Web technologies course, an introduction
Web technologies course, an introductionWeb technologies course, an introduction
Web technologies course, an introduction
 
HTML5 on Mobile(For Designer)
HTML5 on Mobile(For Designer)HTML5 on Mobile(For Designer)
HTML5 on Mobile(For Designer)
 
Membase Introduction
Membase IntroductionMembase Introduction
Membase Introduction
 
e-Learning Delivery System : The Challenges
e-Learning Delivery System : The Challengese-Learning Delivery System : The Challenges
e-Learning Delivery System : The Challenges
 
Salesforce Performance hacks - Client Side
Salesforce Performance hacks - Client SideSalesforce Performance hacks - Client Side
Salesforce Performance hacks - Client Side
 
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
 
4. Web programming MVC.pptx
4. Web programming  MVC.pptx4. Web programming  MVC.pptx
4. Web programming MVC.pptx
 
2014 09-12 lambda-architecture-at-indix
2014 09-12 lambda-architecture-at-indix2014 09-12 lambda-architecture-at-indix
2014 09-12 lambda-architecture-at-indix
 
TrueReusableCode-BigDataCodeCamp2016
TrueReusableCode-BigDataCodeCamp2016TrueReusableCode-BigDataCodeCamp2016
TrueReusableCode-BigDataCodeCamp2016
 
Semantic technologies in practice - KULeuven 2016
Semantic technologies in practice - KULeuven 2016Semantic technologies in practice - KULeuven 2016
Semantic technologies in practice - KULeuven 2016
 

More from Mat Kelly

Aggregating Private and Public Web Archives Using the Mementity Framework
Aggregating Private and Public Web Archives Using the Mementity FrameworkAggregating Private and Public Web Archives Using the Mementity Framework
Aggregating Private and Public Web Archives Using the Mementity FrameworkMat Kelly
 
Client-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer HeaderClient-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer HeaderMat Kelly
 
A Framework for Aggregating Public and Private Web Archives
A Framework for Aggregating Public and Private Web ArchivesA Framework for Aggregating Public and Private Web Archives
A Framework for Aggregating Public and Private Web ArchivesMat Kelly
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Mat Kelly
 
Exploring Aggregation of Personal, Private, and Institutional Web Archives
Exploring Aggregation of Personal, Private, and Institutional Web ArchivesExploring Aggregation of Personal, Private, and Institutional Web Archives
Exploring Aggregation of Personal, Private, and Institutional Web ArchivesMat Kelly
 
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...
JCDL 2015 Doctoral Consortium - A Framework for AggregatingPrivate and Publi...JCDL 2015 Doctoral Consortium - A Framework for AggregatingPrivate and Publi...
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...Mat Kelly
 
Facilitation of the A Posteriori Replication of Web Published Satellite Imagery
Facilitation of the A Posteriori Replication of Web Published Satellite ImageryFacilitation of the A Posteriori Replication of Web Published Satellite Imagery
Facilitation of the A Posteriori Replication of Web Published Satellite ImageryMat Kelly
 
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...Mat Kelly
 
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014Mat Kelly
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital PreservationMat Kelly
 
Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013Mat Kelly
 
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction SystemIEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction SystemMat Kelly
 
Digital Preservation 2013
Digital Preservation 2013Digital Preservation 2013
Digital Preservation 2013Mat Kelly
 
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Making Enterprise-Level Archive Tools Accessible for Personal Web ArchivingMaking Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Making Enterprise-Level Archive Tools Accessible for Personal Web ArchivingMat Kelly
 
An Extensible Framework for Creating Personal Web Archives of Content Behind ...
An Extensible Framework for Creating Personal Web Archives of Content Behind ...An Extensible Framework for Creating Personal Web Archives of Content Behind ...
An Extensible Framework for Creating Personal Web Archives of Content Behind ...Mat Kelly
 
The Revolution Will Not Be Archived
The Revolution Will Not Be ArchivedThe Revolution Will Not Be Archived
The Revolution Will Not Be ArchivedMat Kelly
 
WARCreate - Create Wayback-Consumable WARC Files from Any Webpage
WARCreate - Create Wayback-Consumable WARC Files from Any WebpageWARCreate - Create Wayback-Consumable WARC Files from Any Webpage
WARCreate - Create Wayback-Consumable WARC Files from Any WebpageMat Kelly
 
NDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - YouTube Link RestorationNDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - YouTube Link RestorationMat Kelly
 
NDIIPP/NDSA 2011 - Archive Facebook
NDIIPP/NDSA 2011 - Archive FacebookNDIIPP/NDSA 2011 - Archive Facebook
NDIIPP/NDSA 2011 - Archive FacebookMat Kelly
 

More from Mat Kelly (20)

Aggregating Private and Public Web Archives Using the Mementity Framework
Aggregating Private and Public Web Archives Using the Mementity FrameworkAggregating Private and Public Web Archives Using the Mementity Framework
Aggregating Private and Public Web Archives Using the Mementity Framework
 
Client-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer HeaderClient-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer Header
 
A Framework for Aggregating Public and Private Web Archives
A Framework for Aggregating Public and Private Web ArchivesA Framework for Aggregating Public and Private Web Archives
A Framework for Aggregating Public and Private Web Archives
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count
 
Exploring Aggregation of Personal, Private, and Institutional Web Archives
Exploring Aggregation of Personal, Private, and Institutional Web ArchivesExploring Aggregation of Personal, Private, and Institutional Web Archives
Exploring Aggregation of Personal, Private, and Institutional Web Archives
 
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...
JCDL 2015 Doctoral Consortium - A Framework for AggregatingPrivate and Publi...JCDL 2015 Doctoral Consortium - A Framework for AggregatingPrivate and Publi...
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...
 
Facilitation of the A Posteriori Replication of Web Published Satellite Imagery
Facilitation of the A Posteriori Replication of Web Published Satellite ImageryFacilitation of the A Posteriori Replication of Web Published Satellite Imagery
Facilitation of the A Posteriori Replication of Web Published Satellite Imagery
 
Slides
SlidesSlides
Slides
 
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
 
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital Preservation
 
Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013
 
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction SystemIEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
 
Digital Preservation 2013
Digital Preservation 2013Digital Preservation 2013
Digital Preservation 2013
 
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Making Enterprise-Level Archive Tools Accessible for Personal Web ArchivingMaking Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
 
An Extensible Framework for Creating Personal Web Archives of Content Behind ...
An Extensible Framework for Creating Personal Web Archives of Content Behind ...An Extensible Framework for Creating Personal Web Archives of Content Behind ...
An Extensible Framework for Creating Personal Web Archives of Content Behind ...
 
The Revolution Will Not Be Archived
The Revolution Will Not Be ArchivedThe Revolution Will Not Be Archived
The Revolution Will Not Be Archived
 
WARCreate - Create Wayback-Consumable WARC Files from Any Webpage
WARCreate - Create Wayback-Consumable WARC Files from Any WebpageWARCreate - Create Wayback-Consumable WARC Files from Any Webpage
WARCreate - Create Wayback-Consumable WARC Files from Any Webpage
 
NDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - YouTube Link RestorationNDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - YouTube Link Restoration
 
NDIIPP/NDSA 2011 - Archive Facebook
NDIIPP/NDSA 2011 - Archive FacebookNDIIPP/NDSA 2011 - Archive Facebook
NDIIPP/NDSA 2011 - Archive Facebook
 

Recently uploaded

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 

Recently uploaded (20)

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 

Visualizing Digital Collections of Web Archives from Columbia Web Archiving Collaboration Conference

  • 1. Visualizing Digital Collections of Web Archives Mat Kelly, Michael L. Nelson, Michele C. Weigle Old Dominion University Web Archiving Collaboration: New Tools and Models Columbia University, New York, NY June 4, 2015 @machawk1http://ws-dl.cs.odu.edu
  • 2. Motivation for Thumbnail Summarization • Change over time - aboutness
  • 3. Apple.com has > 17k mementos
  • 5. Methods of Summarization • Including all mementos – many redundant thumbnails – temporally/spatially/cognitively expensive • Naively excluding images – missing important captures in summary • Compare image thumbnails – temporally expensive for identifying unique thumbnails Comparing mementos’ markup can identify sufficiently unique mementos
  • 6. Analyzing Markup <title>Apple</title> <meta property="analytics- track" content="Apple - Index/Tab" /> <meta property="analytics-s- channel" content="homepage" /> <meta property="analytics-s- bucket-0" content="appleglobal,applehome" /> <meta property="analytics-s- bucket-1" content="apple{COUNTRY_CODE}gl obal,apple{COUNTRY_CODE}home" /> 8664ee964799c38c156d8f0 39dae8330 apple.com at Mar 17, 2008 HTML for memento SimHash for HTML
  • 7. SimHash? HTML snippet for memento First k characters of markup Second k characters of markup 64th k characters of markup 63rd k characters of markup markup length 64 k = … Hash to a character Hash to a character Hash to a character Hash to a character c 3 9 f …
  • 8. SimHash vs. Other Hashes • md5(“aaaaaaaaaaaaaaa”) 12f9cf6998d52dbe773b06f848bb3608 • md5(“aaaaaaabaaaaaaa”) e984cee68697eb77577717b532171493 • simhash(“aaaaaaaaaaaaaaa”) 8664ee964799c38c156d8f039dae8330 • simhash(“aaaaaaabaaaaaaa”) 8664ee964799a48c156d8f039dae8330
  • 9. Why SimHash? • SimHash identifies similarities between documents • Conventional hashing algorithms are for identifying differences – Drastically different output from similar content • To remove redundancies, we want to detect when temporally adjacent mementos are sufficiently dissimilar
  • 10. SimHashes for Mementos HTML of apple.com March 3, 2008 HTML of apple.com March 5, 2008 HTML of apple.com April 12, 2008 HTML of apple.com October 4, 2008 c39f0abc...b9 c39d0abc...c9 c39d0abc...b9 c770ad1b...b9
  • 11. Identifying Similarity by Calculating Hamming Distance HTML of apple.com March 3, 2008 HTML of apple.com March 5, 2008 HTML of apple.com April 12, 2008 HTML of apple.com October 4, 2008 c39f0abc...b9 c39d0abc...c9 c39d0abc...b9 c770ad1b...b9 HAMMING DISTANCE 2 1 7 N/A pivot
  • 12.
  • 13. Sliding Hamming Distance • Selection based on previously selected memento • Sliding pivot ΔM3 ΔM3 ΔM3 ΔM6 ΔM6 ΔM6 ΔM6 ΔM0 ΔM0 ΔM0
  • 14. Project Goals Develop tools that implement thumbnail summarization for TimeMaps • Web Service – Allows anyone to view TimeMap using thumbnail summarization • Wayback add-on – Allows any archive using wayback to provide this service to users • Embeddable version – Allow web page authors to embed overview of past versions of page on live web page
  • 15. AlSummarization • SimHash-based summarization scheme created by Ahmed AlSum • AlSum + Summarization = AlSummarization A. AlSum, and M. L. Nelson. “Thumbnail Summarization Techniques for Web Archives.” In Proceedings of the 36TH European Conference on Information Retrieval, ECIR 2014, 2014.
  • 16. Dr. Nelson’s Homepage • URI-R: http://www.cs.odu.edu/~mln • Append onto service URI for summary – http://service/http://www.cs.odu.edu/~mln
  • 17. Anatomy of the Visualization 3 presentations of the Summary Temporally sorted mementos Memento metadata
  • 18. Additional (optional) Endpoint Parameters • Access – tailors user interface – Interactive, Embed, Wayback • Strategy – to use alternative summarization – alSummarization, yearly, skipListed, random • http://service/? o access=wayback&URI- R=http://www.cs.odu.edu/~mln o access=wayback&strategy=random&URI- R=http://www.cs.odu.edu/~mln
  • 19. Programmatic Flow User’s Browser Thumbnails Service Memento-Compliant Archive
  • 20. User Requests URI-R Summary User’s Browser Thumbnails Service Memento-Compliant Archive
  • 21. Service Relays URI-R to Archive User’s Browser Thumbnails Service Memento-Compliant Archive Service queries archive for all mementos for URI-R
  • 22. URI-Ms returned to Service User’s Browser Thumbnails Service Memento-Compliant Archive Archive returns TimeMap with URI-Ms to thumbnail service TM
  • 23. Service fetches HTML for each Memento Thumbnails Service
  • 24. Service generates SimHash for Each Mementos’ HTML Thumbnails Service c39f0abc...b9 c39d0abc...c9 c39d0abc...b9 c770ad1b...b9 c770ad1b...b9
  • 25. Service Calculates Hamming Distance Thumbnails Service Mementos in summary selected based on hamming distance c39f0abc...b9 c39d0abc...c9 c39d0abc...b9 c770ad1b...b9 c770ad1b...b9 Hd() 2 1 7 0
  • 26. Preliminary UI returned to user User’s Browser Thumbnails Service Templated HTML interface is returned to user with placeholders for thumbnails HTML interface
  • 27. c39d0abc...c9 c39d0abc...b9 c770ad1b...b9 2 1 0 Service Generates Thumbnails for Mementos in Summary Thumbnails Service c39f0abc...b9 c770ad1b...b9 Hd() 7
  • 28. Thumbnails Served to User User’s Browser Thumbnails Service Asynchronous polling from HTML pages populates placeholder images once available HTML interface
  • 29. Core Implementation • • for thumbnail generation • abstractions preserved for code reuse and extensibility • Code documented to facilitate extensibility, usage, and fixes http://github.com/machawk1/ArchiveThumbnails
  • 30. Initializing the service $ npm install $ node alSummarization.js * Local resource (css, js,etc.) server listening on Port 1338... * Thumbnails service started on Port 15421 > Try localhost:15421/?URI- R=http://matkelly.com in your web browser for sample execution. User/Service Administrator simply enters: Service responds and is ready for query:
  • 31. Online vs. Offline Generation • Online Thumbnail Summarization – Fetch each mementos’ HTML – Calculate SimHashes – Calculate Hamming Distance (HD) – Select Mementos That Pass HD threshold – Generate Thumbnails of Mementos • Offline Thumbnail Summarization – All of the above performed a priori – Data potentially updated on access
  • 32. Adaptive Strategies • Very large TimeMaps are temporally expensive to generate • Default behavior: if(timeRequirement == tooLong): use(naiveStrategy) • User can explicitly override behavior
  • 33. Other Summarization Strategies • Random Selection – k mementos, uniform selection • Interval – every mth memento, m = n/k • Temporal Interval – One memento/year, reverse chronological monthly back-fill • Temporally Uniform Trimming when k > 15
  • 34. Grid View AlSummarization vs Random Dr. Nelson’s Homepage Random Strategy Dr. Nelson’s Homepage AlSummarization Strategy
  • 35. Grid View AlSummarization vs Interval Dr. Nelson’s Homepage Interval Strategy Dr. Nelson’s Homepage AlSummarization Strategy
  • 36. Grid View AlSummarization vs Temporal Interval Dr. Nelson’s Homepage Temporal Interval Strategy Dr. Nelson’s Homepage AlSummarization Strategy
  • 41. Service Embedding • <object data=http://service/http://yoururl.com” type=“text/html”> </object> -or- • <iframe src=“http://service/http://yoururl.com”> </iframe>
  • 42. Visualizing Digital Collections of Web Archives • Codebase: – github.com/machawk1/ArchiveThumbnails • Service URI: – http://wsdl-docker.cs.odu.edu:15421