MS Thesis Defense, Aug 2012 - Visualizing Digital Collections at Archive-It
 


Advisor: Dr. Michele C. Weigle.
Committee: Dr. Michael L. Nelson, Dr. Ravi Mukkamala

Slides 32-34 contain embedded videos, which are included in this SlideShare as YouTube videos.
Direct links to videos:
Treemap: http://www.youtube.com/watch?v=BJDrxQEEFYM
Timecloud: http://www.youtube.com/watch?v=YYkI6aBO0to
Bubble Chart, Image Plot and Timeline: http://www.youtube.com/watch?v=j94clxqKQk8

Abstract:
Archive-It, a subscription service from the Internet Archive, allows users to create,
maintain, and view digital collections of web resources. The current interface of
Archive-It is largely text-based, supporting drill-down navigation using lists of URIs.
While this interface provides good searching capabilities, it is not very efficient for
browsing. In the absence of keywords, a user has to spend a large amount of time trying
to locate a webpage of interest. In order to provide a better visual experience to
the user, we have studied the underlying characteristics of Archive-It collections and
implemented six different visualizations (treemap, time cloud, bubble chart, image
plot, timeline and wordle), each highlighting one or more of the underlying characteristics
of the collection. Archive-It supports grouping of webpages into categories;
however, it does not enforce this grouping. As a result, there are many collections with
missing or improper grouping. For such collections, we present a method of grouping
webpages based on a set of pre-defined rules.

Usage Rights

© All Rights Reserved

  • Collections used for development: present a good mix of various metrics
  • Many Archive-It collections are not curated well: they lack categorization or are improperly categorized. Suggest categorization for organizing collections and for properly categorizing articles with existing categorization
  • If the domain is a news web site, such as cnn, abc, or bbc, put the page into the news web site category
  • Domains and subdomains
  • Number of articles, or sites
  • Extend stacked bar charts to represent independent values as images. All values in a sample are given equal weight. The height of each stack represents the size of each sample. Categories are samples; articles are values. Hover over each article to reveal the number of mementos, the timespan, and a wordle summarizing the article's content

Presentation Transcript

  • Visualizing Digital Collections at Archive-It. Kalpesh Padia. Director: Michele C. Weigle. Committee: Michael L. Nelson, Ravi Mukkamala. 7/20/2012, MS Thesis - August 2012 (slide 1)
  • Agenda: Introduction; Motivation; Related Work; Collection Retrieval and Processing; Visualizations; Case Studies; Future Work; Conclusion (slide 2)
  • INTRODUCTION AND MOTIVATION (slide 3)
  • Digital Archives: http://www.loc.gov/index.html and http://digitalcollections.library.yale.edu/ (slide 4)
  • Archive-It: http://archive-it.org/ (slide 5)
  • Archive-It Collection Hierarchy (slide 6)
    Root: Collection Title
    Level 1: Category 1 to Category n
    Level 2: Web page 1 to Web page n
    Level 3 (Leaf Nodes): Archived Version 1 to Archived Version n
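
The four-level hierarchy on slide 6 maps naturally onto a nested structure. The following is a minimal sketch of that idea, not the format used in the thesis: the field names (`name`, `children`, `timestamp`), the helper functions, and the sample data are all assumptions made for illustration.

```python
import json

def build_hierarchy(title, categories):
    """Nest a collection as root -> categories -> web pages -> archived versions.

    `categories` maps a category name to {page URI: [capture timestamps]}.
    All field names are illustrative assumptions, not Archive-It's schema.
    """
    return {
        "name": title,
        "children": [
            {
                "name": category,
                "children": [
                    {
                        "name": uri,
                        "children": [{"timestamp": ts} for ts in sorted(captures)],
                    }
                    for uri, captures in pages.items()
                ],
            }
            for category, pages in categories.items()
        ],
    }

def count_leaves(node):
    """Count archived versions (nodes without a `children` key) under a node."""
    children = node.get("children")
    if children is None:
        return 1
    return sum(count_leaves(child) for child in children)

# Made-up sample data; not taken from any real Archive-It collection.
sample = build_hierarchy(
    "Example Collection",
    {
        "News": {"http://example.com/news": ["20100901", "20100815"]},
        "Blogs": {"http://example.org/blog": ["20100820"]},
    },
)
print(json.dumps(sample, indent=2))
```

Keeping every level under the same `children` key means one recursive function can walk the whole tree, which is convenient for hierarchical layouts such as a treemap.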
  • Exploring Archive-It Collections: http://archive-it.org/collections/1068 (slides 7-9)
  • Exploring Archive-It Collections: http://wayback.archive-it.org/1068/*/http://acda.co/ (slide 10)
  • Exploring Archive-It Collections: http://archive-it.org/collections/1068 (slide 11)
  • Exploring Archive-It Collections: http://archive-it.org/collections/2836 (slide 12)
  • Drawbacks: no visual feedback; discovering individual pages is difficult; optional metadata and categorization; collection structure known only to curator (slide 13)
  • Contribution: interactive visualizations (treemap, time cloud, bubble chart, image plot, wordle, timeline); temporal exploration of collections; uncover collection structure (slide 14)
  • RELATED WORK (slide 15)
  • Microsoft Pivot: http://www.microsoft.com/silverlight/pivotviewer/ (slide 16)
  • Page History Explorer: A. Jatowt, Y. Kawai, and K. Tanaka, “Visualizing Historical Content of Web Pages,” in Proceedings of the 17th International Conference on World Wide Web, 2008 (slide 17)
  • 3D Wall: http://www.webarchive.org.uk/ukwa/wall/Blogs (slide 18)
  • Treemap: Johnson and Shneiderman, “Space-Filling Approach to the Visualization of Hierarchical Information Structures,” in Proceedings of the 2nd Conference on Visualization '91 (slide 19)
  • Series Browser: M. Whitelaw, “Visualising Archival Collections: The Visible Archive Project,” in Archives and Manuscripts, vol. 37, issue 2, 2009 (slide 20)
  • A1 Explorer: M. Whitelaw, “Visualising Archival Collections: The Visible Archive Project,” in Archives and Manuscripts, vol. 37, issue 2, 2009 (slide 21)
  • EASY: Scharnhorst et al., “Looking at a digital research data archive - Visual interfaces to EASY,” in CoRR, 2012, http://arxiv.org/abs/1204.3200 (slide 22)
  • Wordle: Jonathan Feinberg, http://wordle.net/, Dogear (slide 23)
  • DATA RETRIEVAL AND PROCESSING (slide 24)
  • 11 Collections, 2K+ Web pages, 70K+ Mementos (slide 25)
  • Data Retrieval & Processing (slide 26)
    Retrieval: screen scrape; copy collection hierarchy; store page content
    Processing: calculate TF and TF-IDF; generate screenshots; generate wordles; rule-based categorization; construct JSON
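
Slide 26 lists TF and TF-IDF among the processing steps, but the transcript does not say which weighting variant was used. As a rough illustration only, the sketch below assumes length-normalized term frequency and the plain logarithmic inverse document frequency, log(N / df). The function names and the tokenized sample documents are made up.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """TF: each term's count divided by the number of tokens in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def tf_idf(documents):
    """Score every term of every document.

    `documents` is a list of token lists. IDF is log(N / df), an assumed
    variant; a term that occurs in every document therefore scores 0.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in documents:
        tf = term_frequencies(tokens)
        scores.append(
            {term: freq * math.log(n_docs / doc_freq[term]) for term, freq in tf.items()}
        )
    return scores

# Made-up token lists standing in for the text of three archived pages.
docs = [
    ["flood", "relief", "pakistan"],
    ["flood", "news"],
    ["flood", "relief", "fund"],
]
scores = tf_idf(docs)
# Terms sorted from most to least distinctive for the first document.
top_terms = sorted(scores[0], key=scores[0].get, reverse=True)
```

Scores like these could be used to pick the words shown in a wordle or time cloud, since terms shared by every page in a collection drop to zero.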
  • No categorization: http://www.archive-it.org/collections/2836 (slide 27)
  • Improper Categorization: http://www.archive-it.org/collections/2323 (slide 28)
  • Rule-based categorization: News Web pages; Blogs; Social Media; Videos. http://www.archive-it.org/collections/2836 (slide 29)
  • Special URI and TLD based categorization: Pakistani news web pages. http://www.archive-it.org/collections/2836 (slide 30)
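
Rule-based categorization of this kind can be sketched as an ordered list of checks on a page's host name. The example below is a guess at the general shape, not the thesis's rule set: the category names come from slides 29-30 and the cnn/abc/bbc rule from the speaker notes, while the other domain lists, the `.pk` test, and the `Uncategorized` fallback are assumptions.

```python
from urllib.parse import urlparse

NEWS_LABELS = {"cnn", "abc", "bbc"}        # examples named in the speaker notes
BLOG_LABELS = {"blogspot", "wordpress"}    # assumed blog hosts
SOCIAL_LABELS = {"twitter", "facebook"}    # assumed social media hosts
VIDEO_LABELS = {"youtube", "vimeo"}        # assumed video hosts

def categorize(uri):
    """Return a category for a URI using pre-defined domain and TLD rules."""
    host = (urlparse(uri).hostname or "").lower()
    labels = set(host.split("."))
    tld = host.rsplit(".", 1)[-1]
    # TLD-based rule (assumed form): a .pk host that mentions "news".
    if tld == "pk" and "news" in host:
        return "Pakistani news web pages"
    if labels & NEWS_LABELS:
        return "News Web pages"
    if labels & BLOG_LABELS:
        return "Blogs"
    if labels & SOCIAL_LABELS:
        return "Social Media"
    if labels & VIDEO_LABELS:
        return "Videos"
    return "Uncategorized"

# Hypothetical URIs for illustration.
for uri in ["http://www.cnn.com/world", "http://news.example.pk/", "http://example.org/"]:
    print(uri, "->", categorize(uri))
```

The order of the checks matters: the more specific TLD rule runs first so that a Pakistani news site is not absorbed by a broader category.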
  • VISUALIZATIONS (slide 31)
  • Treemap (slide 32)
  • Time Cloud (slide 33)
  • Bubble Chart, Image Plot & Timeline (slide 34)
  • CASE STUDIES (slide 35)
  • 1. Collection Building and Growth (slide 36)
  • 2. Re-Categorization (Pakistan Flood: no categorization) (slide 37)
  • 2. Re-Categorization (Pakistan Flood: after categorization) (slide 38)
  • 3. Collection Synopsis (slides 39-43)
  • 4. Theme Tracking (slides 44-47)
  • Informal User Evaluation: Alex Thurman, Columbia University Libraries (slide 48)
    Feedback on: ease of browsing and obtaining information; user-friendliness of the interface; whether they prefer textual or graphical interface; most effective visualization; effectiveness of the rule-based categorization in exploring archives
  • Feedback (slide 49)
    Effective visualizations: Treemap (color coding useful for identifying newer additions); Image plot (screenshots with mouse-over wordles allow for good navigation); Timeline (useful for visualizing development of groups in collection)
    Suggestions: broader timescale for treemaps; include stop words from other languages
  • FUTURE WORK AND CONCLUSION (slide 50)
  • Future Work: N-gram wordles; term expansion; Krovetz stemmer (dictionary-based stemmer); integration with Archive-It; detailed user evaluation; implementation for other archives (slide 51)
  • Conclusion: identified metrics for collections; visualizations (treemap, time cloud, bubble chart, image plot, wordle, timeline); rule-based categorization (slides 52-59, one item added per slide)
  • BACKUP (slide 60)
  • Time Span: Small: 1 day - 2 weeks; Medium: 2 weeks - 4 months; Large: > 4 months. http://wayback.archive-it.org/1068/*/http://amigosdemujeres.org/ (slide 61)
  • Groups: Small: 1; Medium: 2 - 5; Large: > 5. http://www.archive-it.org/collections/1068 (slide 62)
  • URI Domains: Small: 1 - 10; Medium: 11 - 20; Large: > 20. http://www.archive-it.org/collections/2836 (slide 63)
  • Number of Web Pages: Small: 1 - 10; Medium: 11 - 99; Large: > 99. http://www.archive-it.org/collections/2836 (slide 64)
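
The size classes on slides 61-64 amount to a small lookup table. The sketch below encodes them under two assumptions that the slides leave open: a month is taken as 30 days, and a time span of exactly 2 weeks counts as Small. The metric keys and the sample values are invented for the example.

```python
# Inclusive upper bounds for "Small" and "Medium", read off slides 61-64.
THRESHOLDS = {
    "time_span_days": (14, 120),  # 2 weeks and 4 months (assuming 30-day months)
    "groups": (1, 5),
    "uri_domains": (10, 20),
    "web_pages": (10, 99),
}

def size_class(metric, value):
    """Classify one metric value as Small, Medium, or Large."""
    small_max, medium_max = THRESHOLDS[metric]
    if value <= small_max:
        return "Small"
    if value <= medium_max:
        return "Medium"
    return "Large"

def profile(collection_metrics):
    """Classify every metric of a collection at once."""
    return {metric: size_class(metric, value) for metric, value in collection_metrics.items()}

# Made-up metric values for a hypothetical collection.
example = profile(
    {"time_span_days": 200, "groups": 1, "uri_domains": 15, "web_pages": 150}
)
print(example)
```

A profile like this gives a quick way to check that a set of development collections covers a mix of small, medium, and large values for each metric.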
  • Jigsaw: Stasko et al., IEEE VAST 2007 (slide 65)
  • Themeriver: Wei et al., in SIGKDD, 2010 (slide 66)
  • Time Cloud (slide 68)
  • Bubble Chart: http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068 (slide 69)
  • Image Plot with Wordle: http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068 (slide 70)
  • Timeline: http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068 (slide 71)