Thumbnail Summarization
Techniques For Web Archives
Ahmed AlSum*
Stanford University Libraries
Stanford CA, USA
aalsum@stanford.edu
Michael L. Nelson
Old Dominion University
Norfolk VA, USA
mln@cs.odu.edu
The 36th European Conference on Information Retrieval.
ECIR 2014, Amsterdam, Netherlands, 2014
* The research was conducted while Ahmed AlSum was at Old Dominion University
ECIR 2014 Amsterdam, Netherlands
What is a Web Archive?
http://www.cs.odu.edu
Memento Terminology
• Original Resource (URI-R, R): http://www.amazon.com
• Memento (URI-M, M): http://web.archive.org/web/20110411070244/http://amazon.com
• TimeMap (URI-T, TM)
Thumbnails in Web Archive
[Figure: thumbnail views in the Internet Archive and the UK Web Archive]
Thumbnail Creation Challenges
• Scalability in Time
  • The Internet Archive (IA) may need 361 years to create a thumbnail for each memento using one hundred machines.
• Scalability in Space
  • IA will need 355 TB to store one thumbnail per memento.
• Page quality
Thumbnail Usage Challenges
• This is a partial view of the first 700 thumbnails out of 10,500 available mementos for www.apple.com
From 10,500 Mementos to 69 Thumbnails.
How many thumbnails do we need?
www.unfi.com on the live Web
40 Thumbnails are good.
METHODOLOGY
Visual Similarity and Text Similarity
[Figure labels: Similar, Different; HTML Text]
Correlation between
Visual Similarity and Text Similarity
• Text Similarity
  • SimHash
  • DOM Tree
  • Embedded resources
  • Memento Datetime (Capture time)
• Visual Similarity
  • Number of different pixels
Text Similarity
SimHash
• Compute 64-bit SimHash fingerprints with k = 4 for the two pages, then calculate the distance between them using the Hamming distance.
Example:
• 12 Sep 2012 - 00:12:27, SimHash: 147EDAA9977E9400
• 12 Sep 2012 - 19:54:05, SimHash: 157EFAAC97189100
• Distance: 12 bits
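The SimHash comparison can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokenization (whitespace-separated words), the use of MD5 as the per-token hash, and the function names `simhash64` and `hamming_distance` are all assumptions made for the example, and the slide's parameter k is not modeled here.

```python
import hashlib


def simhash64(text: str) -> int:
    """Build a 64-bit SimHash fingerprint from whitespace-separated tokens.

    Assumption: each token is hashed with MD5 and truncated to 64 bits.
    """
    counts = [0] * 64
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(64):
            counts[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(fp_a ^ fp_b).count("1")
```

A small Hamming distance between two fingerprints suggests that the underlying HTML text changed little between the two captures.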
Text Similarity
DOM Tree
• Transform each web page into a DOM tree.
• Calculate the difference between the two trees using the Levenshtein distance.
• Levenshtein distance: the minimum number of insert, update, and delete operations needed to turn one tree into the other.
Pawlik, M., & Augsten, N. (2011). RTED: a robust algorithm for the tree edit distance. Proceedings of the VLDB Endowment, 5(4), 334–345.
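To make the edit-distance idea concrete, here is a simplified sketch. It is an assumption-laden stand-in, not RTED: instead of a true tree edit distance it flattens each page into its sequence of opening tags and applies the classic sequence Levenshtein distance. The names `TagCollector`, `tag_sequence`, and `levenshtein` are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect opening tag names in document order (a flattened DOM view)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html: str) -> list:
    collector = TagCollector()
    collector.feed(html)
    return collector.tags


def levenshtein(a, b) -> int:
    """Minimum number of insert, update, and delete operations turning a into b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete item_a
                current[j - 1] + 1,                    # insert item_b
                previous[j - 1] + (item_a != item_b),  # update (free if equal)
            ))
        previous = current
    return previous[-1]
```

Flattening loses nesting information, so a real comparison of DOM trees would use a tree edit distance algorithm such as the one cited above.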
Text Similarity
Embedded resources
• Extract the embedded resources from each page
• Calculate the total number of new resources that have
been added and the resources that have been removed.
Example, comparing the mementos captured on 12 Sep 2012 - 00:12:27 and 12 Sep 2012 - 19:54:05:
          Addition   Removal
Total     4          11
Images    1          9
JS        1          0
CSS       2          2
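The two steps above (extract, then count additions and removals) might look like the sketch below. Which tags count as embedded resources is an assumption here (images, external scripts, and stylesheet links), and the names `ResourceCollector`, `extract_resources`, and `resource_changes` are hypothetical.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect (kind, URL) pairs for images, scripts, and stylesheets."""

    def __init__(self):
        super().__init__()
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower()
        if tag == "img" and attributes.get("src"):
            self.resources.add(("image", attributes["src"]))
        elif tag == "script" and attributes.get("src"):
            self.resources.add(("js", attributes["src"]))
        elif tag == "link" and attributes.get("href") and "stylesheet" in rel:
            self.resources.add(("css", attributes["href"]))


def extract_resources(html: str) -> set:
    collector = ResourceCollector()
    collector.feed(html)
    return collector.resources


def resource_changes(old_html: str, new_html: str) -> tuple:
    """Return (number of added resources, number of removed resources)."""
    old = extract_resources(old_html)
    new = extract_resources(new_html)
    return len(new - old), len(old - new)
```

Counting per resource kind, as in the table above, would only require grouping the set differences by the first element of each pair.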
Text Similarity
Memento datetime
• Calculate the difference between the record capture time
for both pages in seconds.
Example: 12 Sep 2012 - 00:12:27 and 12 Sep 2012 - 19:54:05
Difference: 70942 sec
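A minimal sketch of the capture-time difference follows. It assumes the capture time can be read from a Wayback-style URI-M of the form `.../web/<YYYYMMDDhhmmss>/<URI-R>`, as in the Memento Terminology slide; other archives may expose the Memento-Datetime differently. The function names are hypothetical.

```python
from datetime import datetime


def memento_datetime(uri_m: str) -> datetime:
    """Parse the 14-digit capture timestamp from a Wayback-style URI-M (assumed format)."""
    timestamp = uri_m.split("/web/", 1)[1].split("/", 1)[0]
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def capture_gap_seconds(uri_m_a: str, uri_m_b: str) -> int:
    """Absolute difference between two capture times, in seconds."""
    delta = memento_datetime(uri_m_b) - memento_datetime(uri_m_a)
    return int(abs(delta.total_seconds()))
```

For instance, two captures of the same page taken exactly one hour apart would yield a gap of 3600 seconds.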
Visual Similarity
• Visual similarity is measured by the number of different pixels between two thumbnails. We resize the thumbnails to several dimensions (e.g., 64x64 and 128x128) and calculate the Manhattan distance between each pair.
Example: 12 Sep 2012 - 00:12:27 and 12 Sep 2012 - 19:54:05
Distance: 0.65
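The pixel comparison can be illustrated with the sketch below. It makes several assumptions: thumbnails are already resized to the same dimensions by an imaging library, pixels are grayscale values in 0..255 stored as nested lists, and the distance is normalized to the range 0..1 (the slides do not state the normalization). The function name is hypothetical.

```python
def manhattan_distance(thumb_a, thumb_b) -> float:
    """Normalized Manhattan (L1) distance between two equally sized grayscale thumbnails."""
    same_shape = len(thumb_a) == len(thumb_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(thumb_a, thumb_b)
    )
    if not same_shape:
        raise ValueError("thumbnails must have identical dimensions")
    total = sum(
        abs(pixel_a - pixel_b)
        for row_a, row_b in zip(thumb_a, thumb_b)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )
    pixel_count = sum(len(row) for row in thumb_a)
    # Divide by the largest possible total so the result lies in 0..1.
    return total / (pixel_count * 255) if pixel_count else 0.0
```

Under this normalization, 0.0 means the two thumbnails are pixel-identical and 1.0 means every pixel differs by the maximum amount.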
EXPERIMENT
Calculate the correlation between Visual Similarity and
Text Similarity
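As an illustration of the correlation step, the sketch below computes Pearson's r between one text-similarity feature and the visual distance over a set of memento pairs. The slides do not say which correlation coefficient was used, so Pearson's r is an assumption, as is the function name.

```python
import math


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(variance_x * variance_y)
```

Here `xs` could hold, say, SimHash distances and `ys` the matching visual distances for the same memento pairs; a value near 1 would indicate that the text feature tracks visual change closely.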
Dataset: Fortune 500
• 499,540 mementos from 488 TimeMaps.
• For each memento, we download the HTML and capture the thumbnail using PhantomJS.
Correlation between
Visual Similarity and Text Similarity
[Figure: four panels, one per text-similarity feature: SimHash, DOM tree, Embedded resources, Memento Datetime]
SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]
SELECTION ALGORITHMS
Using text similarity features to predict the visual
similarity.
#1: Threshold Grouping
#2: Clustering technique
• Input:
  • TimeMap with n mementos
  • A set of features.
    • For example, F = {SimHash, Memento-Datetime}
• Task:
  • Cluster n mementos into K clusters.
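The clustering task can be sketched with a small K-medoids loop. This is a simplified illustration under stated assumptions, not the cited algorithm in full: the initial medoids are simply spread evenly over the TimeMap order, a single feature (SimHash, compared with Hamming distance) is used, and the names `hamming` and `k_medoids` are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two SimHash fingerprints."""
    return bin(a ^ b).count("1")


def k_medoids(items, k, distance, max_iter=100):
    """Cluster items into k groups; returns {medoid index: [member indices]}."""
    n = len(items)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of items")
    dist = [[distance(items[i], items[j]) for j in range(n)] for i in range(n)]

    def assign(medoids):
        clusters = {m: [] for m in medoids}
        for i in range(n):
            # A medoid always stays in its own cluster, so no cluster is empty.
            nearest = i if i in clusters else min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        return clusters

    # Assumption: start with medoids spread evenly over the TimeMap order.
    medoids = [i * n // k for i in range(k)]
    for _ in range(max_iter):
        clusters = assign(medoids)
        # New medoid: the member with the smallest total distance to its cluster.
        updated = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        ]
        if set(updated) == set(medoids):
            break
        medoids = updated
    return assign(medoids)
```

One natural reading is that the medoid of each cluster is the memento whose thumbnail represents that cluster, which gives K thumbnails for the whole TimeMap.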
[Figure: two panels, "SimHash Feature" and "SimHash and Datetime Features"]
Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.
#3: Time Normalization
Selection Algorithms Comparison
                         Threshold Grouping   K clustering   Time Normalization
TimeMap Reduction        27%                  9% to 12%      23%
Image Loss               28                   78 - 101       109
# Features               1 feature            1 or more      1 feature
Preprocessing required   Yes                  Yes            No
Efficient processing     Medium               Extensive      Light
Incremental              Yes                  No             Yes
Online/offline           Both                 Both           Both
Generalization outside the Web Archive
• Summarize a website of n pages with only k thumbnails
Conclusions
• We explored the similarity between the text and visual
appearance of the web page.
• We found that the SimHash distance between the HTML text and the Levenshtein distance between the HTML DOM trees have the highest correlation with visual similarity.
• We presented three algorithms to select k thumbnails
from n mementos per TimeMap.
aalsum@stanford.edu
@aalsum


Editor's Notes

  1. Verbally show this is the end. Explain this is an initial step in this area.