Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Visualizing Digital Collections
of Web Archives
Mat Kelly, Michael L. Nelson, Michele C. Weigle
Old Dominion University
We...
Motivation for Thumbnail
Summarization
• Change over time - aboutness
Apple.com has > 17k mementos
Many Nearly Identical
(apple.com)
Methods of Summarization
• Including all mementos
– many redundant thumbnails
– temporally/spatially/cognitively expensive...
Analyzing Markup
<title>Apple</title>
<meta property="analytics-
track" content="Apple - Index/Tab"
/>
<meta property="ana...
SimHash?
HTML snippet for
memento
First k characters of
markup
Second k characters of
markup
64th k characters of
markup
6...
SimHash vs. Other Hashes
• md5(“aaaaaaaaaaaaaaa”)
12f9cf6998d52dbe773b06f848bb3608
• md5(“aaaaaaabaaaaaaa”)
e984cee68697...
Why SimHash?
• SimHash identifies similarities between
documents
• Conventional hashing algorithms are for
identifying dif...
SimHashes for Mementos
HTML of apple.com
March 3, 2008
HTML of apple.com
March 5, 2008
HTML of apple.com
April 12, 2008
HT...
Identifying Similarity by
Calculating Hamming Distance
HTML of apple.com
March 3, 2008
HTML of apple.com
March 5, 2008
HTM...
Sliding Hamming Distance
• Selection based on previously selected memento
• Sliding pivot
ΔM3 ΔM3
ΔM3
ΔM6
ΔM6
ΔM6
ΔM6
ΔM0
...
Project Goals
Develop tools that implement thumbnail
summarization for TimeMaps
• Web Service
– Allows anyone to view Time...
AlSummarization
• SimHash-based summarization scheme
created by Ahmed AlSum
• AlSum + Summarization = AlSummarization
A. A...
Dr. Nelson’s Homepage
• URI-R: http://www.cs.odu.edu/~mln
• Append onto service URI for summary
– http://service/http://ww...
Anatomy of the Visualization
3 presentations of the Summary
Temporally sorted
mementos
Memento metadata
Additional (optional) Endpoint
Parameters
• Access – tailors user interface
– Interactive, Embed, Wayback
• Strategy – to ...
Programmatic Flow
User’s Browser Thumbnails
Service
Memento-Compliant
Archive
User Requests URI-R Summary
User’s Browser Thumbnails
Service
Memento-Compliant
Archive
Service Relays URI-R to Archive
User’s Browser Thumbnails
Service
Memento-Compliant
Archive
Service queries archive for al...
URI-Ms returned to Service
User’s Browser Thumbnails
Service
Memento-Compliant
Archive
Archive returns TimeMap with URI-Ms...
Service fetches HTML for each
Memento
Thumbnails
Service
Service generates SimHash for
Each Mementos’ HTML
Thumbnails
Service
c39f0abc...b9
c39d0abc...c9
c39d0abc...b9
c770ad1b......
Service Calculates
Hamming Distance
Thumbnails
Service
Mementos in summary selected based on
hamming distance
c39f0abc...b...
Preliminary UI returned to user
User’s Browser Thumbnails
Service
Templated HTML interface is returned to
user with placeh...
c39d0abc...c9
c39d0abc...b9
c770ad1b...b9
2
1
0
Service Generates Thumbnails for
Mementos in Summary
Thumbnails
Service
c3...
Thumbnails Served to User
User’s Browser Thumbnails
Service
Asynchronous polling from HTML pages
populates placeholder ima...
Core Implementation
•
• for thumbnail generation
• abstractions preserved for code reuse
and extensibility
• Code document...
Initializing the service
$ npm install
$ node alSummarization.js
* Local resource (css, js,etc.) server
listening on Port ...
Online vs. Offline Generation
• Online Thumbnail Summarization
– Fetch each mementos’ HTML
– Calculate SimHashes
– Calcula...
Adaptive Strategies
• Very large TimeMaps are temporally
expensive to generate
• Default behavior:
if(timeRequirement == t...
Other Summarization Strategies
• Random Selection
– k mementos, uniform selection
• Interval
– every mth memento, m = n/k
...
Grid View
AlSummarization vs Random
Dr. Nelson’s Homepage
Random Strategy
Dr. Nelson’s Homepage
AlSummarization Strategy
Grid View
AlSummarization vs Interval
Dr. Nelson’s Homepage
Interval Strategy
Dr. Nelson’s Homepage
AlSummarization Strate...
Grid View
AlSummarization vs Temporal Interval
Dr. Nelson’s Homepage
Temporal Interval Strategy
Dr. Nelson’s Homepage
AlSu...
Asynchronous Polling
Server-side SimHash Caching
Four Summarization Strategies
OpenWayback Integration
Service Embedding
• <object data=http://service/http://yoururl.com”
type=“text/html”>
</object>
-or-
• <iframe src=“http:/...
Visualizing Digital Collections
of Web Archives
• Codebase:
– github.com/machawk1/ArchiveThumbnails
• Service URI:
– http:...
Live
Demo
Visualizing Digital Collections of Web Archives from Columbia Web Archiving Collaboration Conference
Upcoming SlideShare
Loading in …5
×

Visualizing Digital Collections of Web Archives from Columbia Web Archiving Collaboration Conference

1,648 views

Published on

Published in: Technology
  • DOWNLOAD THAT BOOKS INTO AVAILABLE FORMAT (2019 Update) ......................................................................................................................... ......................................................................................................................... Download Full PDF EBOOK here { http://bit.ly/2m6jJ5M } ......................................................................................................................... Download Full EPUB Ebook here { http://bit.ly/2m6jJ5M } ......................................................................................................................... Download Full doc Ebook here { http://bit.ly/2m6jJ5M } ......................................................................................................................... Download PDF EBOOK here { http://bit.ly/2m6jJ5M } ......................................................................................................................... Download EPUB Ebook here { http://bit.ly/2m6jJ5M } ......................................................................................................................... Download doc Ebook here { http://bit.ly/2m6jJ5M } ......................................................................................................................... ......................................................................................................................... ................................................................................................................................... eBook is an electronic version of a traditional print book that can be read by using a personal computer or by using an eBook reader. (An eBook reader can be a software application for use on a computer such as Microsoft's free Reader application, or a book-sized computer that is used solely as a reading device such as Nuvomedia's Rocket eBook.) Users can purchase an eBook on diskette or CD, but the most popular method of getting an eBook is to purchase a downloadable file of the eBook (or other reading material) from a Web site (such as Barnes and Noble) to be read from the user's computer or reading device. Generally, an eBook can be downloaded in five minutes or less ......................................................................................................................... .............. Browse by Genre Available eBooks .............................................................................................................................. Art, Biography, Business, Chick Lit, Children's, Christian, Classics, Comics, Contemporary, Cookbooks, Manga, Memoir, Music, Mystery, Non Fiction, Paranormal, Philosophy, Poetry, Psychology, Religion, Romance, Science, Science Fiction, Self Help, Suspense, Spirituality, Sports, Thriller, Travel, Young Adult, Crime, Ebooks, Fantasy, Fiction, Graphic Novels, Historical Fiction, History, Horror, Humor And Comedy, ......................................................................................................................... ......................................................................................................................... .....BEST SELLER FOR EBOOK RECOMMEND............................................................. ......................................................................................................................... Blowout: Corrupted Democracy, Rogue State Russia, and the Richest, Most Destructive Industry on Earth,-- The Ride of a Lifetime: Lessons Learned from 15 Years as CEO of the Walt Disney Company,-- Call Sign Chaos: Learning to Lead,-- StrengthsFinder 2.0,-- Stillness Is the Key,-- She Said: Breaking the Sexual Harassment Story That Helped Ignite a Movement,-- Atomic Habits: An Easy &amp; Proven Way to Build Good Habits &amp; Break Bad Ones,-- Everything Is Figureoutable,-- What It Takes: Lessons in the Pursuit of Excellence,-- Rich Dad Poor Dad: What the Rich Teach Their Kids About Money That the Poor and Middle Class Do Not!,-- The Total Money Makeover: Classic Edition: A Proven Plan for Financial Fitness,-- Shut Up and Listen!: Hard Business Truths that Will Help You Succeed, ......................................................................................................................... .........................................................................................................................
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Copas Url to Read PDF Format === http://ebookdfsrewsa.justdied.com/8416268231-memento-practico-concursal-abogacia-del-estado-2015-mementos-practicos.html
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Be the first to like this

Visualizing Digital Collections of Web Archives from Columbia Web Archiving Collaboration Conference

  1. 1. Visualizing Digital Collections of Web Archives Mat Kelly, Michael L. Nelson, Michele C. Weigle Old Dominion University Web Archiving Collaboration: New Tools and Models Columbia University, New York, NY June 4, 2015 @machawk1http://ws-dl.cs.odu.edu
  2. 2. Motivation for Thumbnail Summarization • Change over time - aboutness
  3. 3. Apple.com has > 17k mementos
  4. 4. Many Nearly Identical (apple.com)
  5. 5. Methods of Summarization • Including all mementos – many redundant thumbnails – temporally/spatially/cognitively expensive • Naively excluding images – missing important captures in summary • Compare image thumbnails – temporally expensive for identifying unique thumbnails Comparing mementos’ markup can identify sufficiently unique mementos
  6. 6. Analyzing Markup <title>Apple</title> <meta property="analytics- track" content="Apple - Index/Tab" /> <meta property="analytics-s- channel" content="homepage" /> <meta property="analytics-s- bucket-0" content="appleglobal,applehome" /> <meta property="analytics-s- bucket-1" content="apple{COUNTRY_CODE}gl obal,apple{COUNTRY_CODE}home" /> 8664ee964799c38c156d8f0 39dae8330 apple.com at Mar 17, 2008 HTML for memento SimHash for HTML
  7. 7. SimHash? HTML snippet for memento First k characters of markup Second k characters of markup 64th k characters of markup 63rd k characters of markup markup length 64 k = … Hash to a character Hash to a character Hash to a character Hash to a character c 3 9 f …
  8. 8. SimHash vs. Other Hashes • md5(“aaaaaaaaaaaaaaa”) 12f9cf6998d52dbe773b06f848bb3608 • md5(“aaaaaaabaaaaaaa”) e984cee68697eb77577717b532171493 • simhash(“aaaaaaaaaaaaaaa”) 8664ee964799c38c156d8f039dae8330 • simhash(“aaaaaaabaaaaaaa”) 8664ee964799a48c156d8f039dae8330
  9. 9. Why SimHash? • SimHash identifies similarities between documents • Conventional hashing algorithms are for identifying differences – Drastically different output from similar content • To remove redundancies, we want to detect when temporally adjacent mementos are sufficiently dissimilar
  10. 10. SimHashes for Mementos HTML of apple.com March 3, 2008 HTML of apple.com March 5, 2008 HTML of apple.com April 12, 2008 HTML of apple.com October 4, 2008 c39f0abc...b9 c39d0abc...c9 c39d0abc...b9 c770ad1b...b9
  11. 11. Identifying Similarity by Calculating Hamming Distance HTML of apple.com March 3, 2008 HTML of apple.com March 5, 2008 HTML of apple.com April 12, 2008 HTML of apple.com October 4, 2008 c39f0abc...b9 c39d0abc...c9 c39d0abc...b9 c770ad1b...b9 HAMMING DISTANCE 2 1 7 N/A pivot
  12. 12. Sliding Hamming Distance • Selection based on previously selected memento • Sliding pivot ΔM3 ΔM3 ΔM3 ΔM6 ΔM6 ΔM6 ΔM6 ΔM0 ΔM0 ΔM0
  13. 13. Project Goals Develop tools that implement thumbnail summarization for TimeMaps • Web Service – Allows anyone to view TimeMap using thumbnail summarization • Wayback add-on – Allows any archive using wayback to provide this service to users • Embeddable version – Allow web page authors to embed overview of past versions of page on live web page
  14. 14. AlSummarization • SimHash-based summarization scheme created by Ahmed AlSum • AlSum + Summarization = AlSummarization A. AlSum, and M. L. Nelson. “Thumbnail Summarization Techniques for Web Archives.” In Proceedings of the 36TH European Conference on Information Retrieval, ECIR 2014, 2014.
  15. 15. Dr. Nelson’s Homepage • URI-R: http://www.cs.odu.edu/~mln • Append onto service URI for summary – http://service/http://www.cs.odu.edu/~mln
  16. 16. Anatomy of the Visualization 3 presentations of the Summary Temporally sorted mementos Memento metadata
  17. 17. Additional (optional) Endpoint Parameters • Access – tailors user interface – Interactive, Embed, Wayback • Strategy – to use alternative summarization – alSummarization, yearly, skipListed, random • http://service/? o access=wayback&URI- R=http://www.cs.odu.edu/~mln o access=wayback&strategy=random&URI- R=http://www.cs.odu.edu/~mln
  18. 18. Programmatic Flow User’s Browser Thumbnails Service Memento-Compliant Archive
  19. 19. User Requests URI-R Summary User’s Browser Thumbnails Service Memento-Compliant Archive
  20. 20. Service Relays URI-R to Archive User’s Browser Thumbnails Service Memento-Compliant Archive Service queries archive for all mementos for URI-R
  21. 21. URI-Ms returned to Service User’s Browser Thumbnails Service Memento-Compliant Archive Archive returns TimeMap with URI-Ms to thumbnail service TM
  22. 22. Service fetches HTML for each Memento Thumbnails Service
  23. 23. Service generates SimHash for Each Mementos’ HTML Thumbnails Service c39f0abc...b9 c39d0abc...c9 c39d0abc...b9 c770ad1b...b9 c770ad1b...b9
  24. 24. Service Calculates Hamming Distance Thumbnails Service Mementos in summary selected based on hamming distance c39f0abc...b9 c39d0abc...c9 c39d0abc...b9 c770ad1b...b9 c770ad1b...b9 Hd() 2 1 7 0
  25. 25. Preliminary UI returned to user User’s Browser Thumbnails Service Templated HTML interface is returned to user with placeholders for thumbnails HTML interface
  26. 26. c39d0abc...c9 c39d0abc...b9 c770ad1b...b9 2 1 0 Service Generates Thumbnails for Mementos in Summary Thumbnails Service c39f0abc...b9 c770ad1b...b9 Hd() 7
  27. 27. Thumbnails Served to User User’s Browser Thumbnails Service Asynchronous polling from HTML pages populates placeholder images once available HTML interface
  28. 28. Core Implementation • • for thumbnail generation • abstractions preserved for code reuse and extensibility • Code documented to facilitate extensibility, usage, and fixes http://github.com/machawk1/ArchiveThumbnails
  29. 29. Initializing the service $ npm install $ node alSummarization.js * Local resource (css, js,etc.) server listening on Port 1338... * Thumbnails service started on Port 15421 > Try localhost:15421/?URI- R=http://matkelly.com in your web browser for sample execution. User/Service Administrator simply enters: Service responds and is ready for query:
  30. 30. Online vs. Offline Generation • Online Thumbnail Summarization – Fetch each mementos’ HTML – Calculate SimHashes – Calculate Hamming Distance (HD) – Select Mementos That Pass HD threshold – Generate Thumbnails of Mementos • Offline Thumbnail Summarization – All of the above performed a priori – Data potentially updated on access
  31. 31. Adaptive Strategies • Very large TimeMaps are temporally expensive to generate • Default behavior: if(timeRequirement == tooLong): use(naiveStrategy) • User can explicitly override behavior
  32. 32. Other Summarization Strategies • Random Selection – k mementos, uniform selection • Interval – every mth memento, m = n/k • Temporal Interval – One memento/year, reverse chronological monthly back-fill • Temporally Uniform Trimming when k > 15
  33. 33. Grid View AlSummarization vs Random Dr. Nelson’s Homepage Random Strategy Dr. Nelson’s Homepage AlSummarization Strategy
  34. 34. Grid View AlSummarization vs Interval Dr. Nelson’s Homepage Interval Strategy Dr. Nelson’s Homepage AlSummarization Strategy
  35. 35. Grid View AlSummarization vs Temporal Interval Dr. Nelson’s Homepage Temporal Interval Strategy Dr. Nelson’s Homepage AlSummarization Strategy
  36. 36. Asynchronous Polling
  37. 37. Server-side SimHash Caching
  38. 38. Four Summarization Strategies
  39. 39. OpenWayback Integration
  40. 40. Service Embedding • <object data=http://service/http://yoururl.com” type=“text/html”> </object> -or- • <iframe src=“http://service/http://yoururl.com”> </iframe>
  41. 41. Visualizing Digital Collections of Web Archives • Codebase: – github.com/machawk1/ArchiveThumbnails • Service URI: – http://wsdl-docker.cs.odu.edu:15421
  42. 42. Live Demo

×