Thumbnail Summarization Techniques For Web Archives

This presentation discusses general techniques for summarizing a web archive TimeMap in order to generate thumbnails. The techniques rely on similarity features computed from the HTML text, such as SimHash and DOM tree distance.

    1. Thumbnail Summarization Techniques For Web Archives. Ahmed AlSum* (Stanford University Libraries, Stanford, CA, USA, aalsum@stanford.edu) and Michael L. Nelson (Old Dominion University, Norfolk, VA, USA, mln@cs.odu.edu). The 36th European Conference on Information Retrieval (ECIR 2014), Amsterdam, Netherlands, 2014. * The research was conducted while Ahmed AlSum was at Old Dominion University.
    2. What is a Web Archive? http://www.cs.odu.edu
    3. Memento Terminology • Original Resource (URI-R, R): http://www.amazon.com • Memento (URI-M, M): http://web.archive.org/web/20110411070244/http://amazon.com • TimeMap (URI-T, TM)
    4. Thumbnails in Web Archive • Internet Archive • UK Web Archive
    5. Thumbnail Creation Challenges • Scalability in Time • IA may need 361 years to create a thumbnail for each memento using one hundred machines. • Scalability in Space • IA will need 355 TB to store one thumbnail per memento. • Page quality
    6. Thumbnail Usage Challenges • This is a partial view of the first 700 thumbnails out of 10,500 available mementos for www.apple.com
    7. From 10,500 Mementos to 69 Thumbnails.
    8. How many thumbnails do we need? www.unfi.com on the live Web
    9. How many thumbnails do we need? www.unfi.com on the live Web
    10. 40 Thumbnails are good.
    11. METHODOLOGY
    12. Visual Similarity and Text Similarity • Similar • Different • HTML Text
    13. Correlation between Visual Similarity and Text Similarity • Text Similarity • SimHash • DOM Tree • Embedded resources • Memento Datetime (Capture time) • Visual Similarity • Number of different pixels
    14. Text Similarity: SimHash • Compute 64-bit SimHash fingerprints with k = 4 for the two pages, then calculate the distance between them using the Hamming distance. Example pair: 12 Sep 2012 - 00:12:27 and 12 Sep 2012 - 19:54:05 • SimHash: 147EDAA9977E9400 • SimHash: 157EFAAC97189100 • Distance: 12 bits
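The SimHash comparison on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `simhash64` is a toy fingerprint over whitespace-separated tokens hashed with BLAKE2b (both choices are assumptions, and the slide's k = 4 parameter is not modeled), while `hamming_distance` counts the bits in which two fingerprints differ.

```python
import hashlib


def simhash64(text: str) -> int:
    """Toy 64-bit SimHash over whitespace tokens (illustrative assumption)."""
    votes = [0] * 64
    for token in text.split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    old = simhash64("welcome to our store new deals every day")
    new = simhash64("welcome to our store fresh deals every week")
    print(f"{old:016X} vs {new:016X}: {hamming_distance(old, new)} bits")
```

A small distance suggests the two captures have nearly the same text, which is the property the later selection algorithms rely on.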
    15. Text Similarity: DOM Tree • Transform each web page into a DOM tree • Calculate the difference using the Levenshtein distance • Levenshtein distance: the number of insert, update, and delete operations needed to turn one tree into the other. Pawlik, M., & Augsten, N. (2011). RTED: a robust algorithm for the tree edit distance. Proceedings of the VLDB Endowment, 5(4), 334–345.
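To make the edit-distance idea concrete, here is a simplified sketch. It does not implement RTED or any true tree edit distance: as a stated simplification, it flattens each page into its sequence of opening tags using the standard-library `html.parser` and computes the classic Levenshtein distance between the two sequences. The helper names are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html: str) -> list:
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def levenshtein(a, b) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    old = "<html><body><p>Hello</p></body></html>"
    new = "<html><body><div><p>Hello</p></div></body></html>"
    print(levenshtein(tag_sequence(old), tag_sequence(new)))
```

A sequence-based distance ignores nesting, so it can only approximate what a tree edit distance measures.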
    16. Text Similarity: Embedded resources • Extract the embedded resources from each page • Calculate the total number of resources that have been added and the number that have been removed. Example pair: 12 Sep 2012 - 00:12:27 and 12 Sep 2012 - 19:54:05
            Type    Addition  Removal
            Total   4         11
            Images  1         9
            JS      1         0
            CSS     2         2
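The embedded-resource comparison amounts to two set differences. The sketch below assumes that "embedded resources" means images (`img src`), scripts (`script src`), and stylesheets (`link rel="stylesheet" href`), matching the three rows of the slide's table; the function names are hypothetical and the extraction rules are a simplification.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collects (kind, url) pairs for images, scripts, and stylesheets."""

    def __init__(self):
        super().__init__()
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.resources.add(("image", attrs["src"]))
        elif tag == "script" and attrs.get("src"):
            self.resources.add(("js", attrs["src"]))
        elif (tag == "link" and attrs.get("href")
              and "stylesheet" in (attrs.get("rel") or "").lower()):
            self.resources.add(("css", attrs["href"]))


def extract_resources(html: str) -> set:
    parser = ResourceCollector()
    parser.feed(html)
    parser.close()
    return parser.resources


def resource_diff(old_html: str, new_html: str):
    """Return (added, removed) resource sets between two captures."""
    old, new = extract_resources(old_html), extract_resources(new_html)
    return new - old, old - new


if __name__ == "__main__":
    old = '<img src="a.png"><script src="app.js"></script>'
    new = '<img src="a.png"><img src="b.png">'
    added, removed = resource_diff(old, new)
    print(len(added), "added,", len(removed), "removed")
```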
    17. Text Similarity: Memento datetime • Calculate the difference, in seconds, between the capture times of the two pages. Example pair: 12 Sep 2012 - 00:12:27 and 12 Sep 2012 - 19:54:05 • Difference: 70942 sec
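As a small sketch of the datetime feature, the code below parses the 14-digit timestamps that appear in Wayback-style memento URIs (such as `20110411070244` in the URI-M shown on slide 3) and returns their absolute difference in seconds. Treating the timestamps as UTC is an assumption, and the second timestamp in the usage example is hypothetical.

```python
from datetime import datetime, timezone


def parse_memento_timestamp(ts: str) -> datetime:
    """Parse a 14-digit YYYYMMDDhhmmss timestamp, assumed to be UTC."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def datetime_difference_seconds(ts_a: str, ts_b: str) -> int:
    """Absolute difference between two capture times, in seconds."""
    delta = parse_memento_timestamp(ts_b) - parse_memento_timestamp(ts_a)
    return abs(int(delta.total_seconds()))


if __name__ == "__main__":
    # Second timestamp is a made-up capture exactly one day later.
    print(datetime_difference_seconds("20110411070244", "20110412070244"))
```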
    18. Visual Similarity • The number of different pixels between two thumbnails. We resize the thumbnails to different dimensions (e.g., 64x64 and 128x128) and calculate the Manhattan distance between each pair. Example pair: 12 Sep 2012 - 00:12:27 and 12 Sep 2012 - 19:54:05 • Distance: 0.65
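A hedged sketch of the pixel comparison follows. It assumes the thumbnails have already been resized to the same dimensions and converted to grayscale grids (lists of rows of 0 to 255 values), and it normalizes the Manhattan distance to the range 0 to 1. The normalization is an assumption made so that values are comparable across sizes; the slide does not state how its 0.65 figure was scaled.

```python
def manhattan_distance(img_a, img_b, max_value=255) -> float:
    """Normalized Manhattan (L1) distance between two equal-sized pixel grids."""
    if len(img_a) != len(img_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(img_a, img_b)
    ):
        raise ValueError("thumbnails must have the same dimensions")
    total = 0
    count = 0
    for row_a, row_b in zip(img_a, img_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    # 0.0 means identical, 1.0 means every pixel differs maximally.
    return total / (count * max_value) if count else 0.0


if __name__ == "__main__":
    black = [[0, 0], [0, 0]]
    half = [[255, 255], [0, 0]]
    print(manhattan_distance(black, half))
```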
    19. EXPERIMENT: Calculate the correlation between Visual Similarity and Text Similarity
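The experiment pairs one text-distance value with one visual-distance value for each pair of mementos and asks how strongly the two move together. The slides do not say which correlation coefficient was used, so the sketch below assumes Pearson's r purely for illustration; the sample numbers are made up.

```python
import math


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-pair measurements: SimHash bits vs. pixel distance.
    simhash_bits = [2, 5, 12, 20, 31]
    pixel_distance = [0.05, 0.10, 0.40, 0.55, 0.90]
    print(round(pearson(simhash_bits, pixel_distance), 3))
```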
    20. Dataset: Fortune 500 • 499,540 mementos from 488 TimeMaps. • For each memento, we download the HTML and capture the thumbnail using PhantomJS.
    21. Correlation between Visual Similarity and Text Similarity • SimHash • DOM tree • Embedded resources • Memento Datetime. SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]
    22. SELECTION ALGORITHMS: Using text similarity features to predict the visual similarity.
    23. #1: Threshold Grouping
    24. #1: Threshold Grouping
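The Threshold Grouping slides carry only a title in this transcript, so the following is one plausible reading rather than the authors' exact procedure: walk the TimeMap in capture order, keep the first memento, and select a new thumbnail whenever the SimHash distance to the most recently selected memento exceeds a threshold. The function name, the threshold value, and the comparison against the last selected memento are all assumptions.

```python
def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def threshold_grouping(fingerprints, threshold):
    """Return indices of mementos selected as thumbnails.

    fingerprints: SimHash integers in capture-time order.
    threshold: maximum distance (in bits) still treated as "same group".
    """
    selected = []
    for index, fingerprint in enumerate(fingerprints):
        if not selected:
            selected.append(index)  # the first memento always opens a group
            continue
        last = fingerprints[selected[-1]]
        if hamming_distance(fingerprint, last) > threshold:
            selected.append(index)  # page changed enough: start a new group
    return selected


if __name__ == "__main__":
    timemap = [0b0000, 0b0001, 0b1111, 0b1110, 0b0000]
    print(threshold_grouping(timemap, threshold=2))
```

Because each decision depends only on the last selected memento, new captures can be processed as they arrive, which is consistent with the "Incremental: Yes" entry in the comparison on slide 28.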
    25. #2: Clustering technique • Input: • TimeMap with n mementos • A set of features, for example F = {SimHash, Memento-Datetime} • Task: • Cluster the n mementos into K clusters.
    26. #2: Clustering technique • SimHash Feature • SimHash and Datetime Features. Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.
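The clustering step can be illustrated with a small K-medoids loop over a precomputed distance matrix. This sketch follows the general assign-then-update pattern of K-medoids but is not a faithful reproduction of the Park and Jun algorithm cited on the slide: in particular, it starts from the first k items as medoids, which is a naive assumption. The medoid of each final cluster would be the memento chosen as that cluster's thumbnail.

```python
def assign(dist, medoids):
    """Map each medoid index to the list of item indices closest to it."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        nearest = min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters


def k_medoids(dist, k, max_iter=100):
    """Cluster items given a symmetric distance matrix; returns (medoids, clusters)."""
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of items")
    medoids = list(range(k))  # naive initialization (assumption)
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        # New medoid: the member with the smallest total distance to its cluster.
        updated = sorted(
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            if members else m
            for m, members in clusters.items()
        )
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(dist, medoids)


if __name__ == "__main__":
    # Hypothetical one-dimensional feature values for six mementos.
    values = [0, 1, 2, 10, 11, 12]
    dist = [[abs(a - b) for b in values] for a in values]
    print(k_medoids(dist, k=2))
```

With several features, such as SimHash and Memento-Datetime together, the distance matrix would combine the per-feature distances; how to weight them is left open here.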
    27. #3: Time Normalization
    28. Selection Algorithms Comparison
            Criterion               Threshold Grouping  K clustering  Time Normalization
            TimeMap Reduction       27%                 9% to 12%     23%
            Image Loss              28                  78 - 101      109
            # Features              1 feature           1 or more     1 feature
            Preprocessing required  Yes                 Yes           No
            Efficient processing    Medium              Extensive     Light
            Incremental             Yes                 No            Yes
            Online/offline          Both                Both          Both
    29. Generalization outside the Web Archive • Summarize a website of n pages with only k thumbnails
    30. Conclusions • We explored the similarity between the text and the visual appearance of a web page. • We found that the SimHash difference between the HTML text and the Levenshtein distance between the HTML DOM trees have the highest correlation with visual similarity. • We presented three algorithms to select k thumbnails from n mementos per TimeMap. aalsum@stanford.edu @aalsum