MS Thesis Defense, Aug 2012 - Visualizing Digital Collections at Archive-It

Advisor: Dr. Michele C. Weigle.
Committee: Dr. Michael L. Nelson, Dr. Ravi Mukkamala

Slides 32-34 contain embedded videos, which are included in this SlideShare as YouTube videos.
Direct links to videos:
Treemap: http://www.youtube.com/watch?v=BJDrxQEEFYM
Timecloud: http://www.youtube.com/watch?v=YYkI6aBO0to
Bubble Chart, Image Plot and Timeline: http://www.youtube.com/watch?v=j94clxqKQk8

Abstract:
Archive-It, a subscription service from the Internet Archive, allows users to create,
maintain, and view digital collections of web resources. The current interface of
Archive-It is largely text-based, supporting drill-down navigation using lists of URIs.
While this interface provides good searching capabilities, it is not very efficient for
browsing. In the absence of keywords, a user has to spend a large amount of time trying
to locate a webpage of interest. In order to provide a better visual experience to
the user, we have studied the underlying characteristics of Archive-It collections and
implemented six different visualizations (treemap, time cloud, bubble chart, image
plot, timeline and wordle), each highlighting one or more of the underlying characteristics
of the collection. Archive-It supports grouping of webpages into categories;
however, it does not enforce this grouping. As a result, there are many collections with
missing or improper grouping. For such collections, we present a method of grouping
webpages based on a set of pre-defined rules.
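
The rule set itself is not given in this transcript, so the following is only a minimal sketch of what first-match, URI-based grouping could look like. The category names follow slide 29 as read from the transcript; every host name, keyword hint, and the `categorize` and `host_matches` helpers are illustrative assumptions, not the rules used in the thesis.

```python
from urllib.parse import urlparse

# Hypothetical rules: the hosts and keyword hints below are illustrative
# assumptions, not the rule set used in the thesis.
SOCIAL_MEDIA_HOSTS = ("twitter.com", "facebook.com")
VIDEO_HOSTS = ("youtube.com", "vimeo.com")
BLOG_HOSTS = ("blogspot.com", "wordpress.com")
NEWS_HINTS = ("news", "times", "post")


def host_matches(host, domains):
    """True if host equals one of the domains or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def categorize(uri):
    """Return a category label for a URI using ordered, first-match rules."""
    parsed = urlparse(uri)
    host = (parsed.hostname or "").lower()
    if host_matches(host, SOCIAL_MEDIA_HOSTS):
        return "Social Media"
    if host_matches(host, VIDEO_HOSTS):
        return "Videos"
    if host_matches(host, BLOG_HOSTS) or "/blog" in parsed.path.lower():
        return "Blogs"
    if any(hint in host for hint in NEWS_HINTS):
        # TLD-based refinement: a country-code TLD narrows the news group.
        if host.endswith(".pk"):
            return "Pakistani News"
        return "News"
    # Fallback when no rule matches.
    return "Web pages"
```

Because the rules are ordered, a URI that matches several rules lands in the first matching group; reordering the checks changes the grouping, which is one reason a real rule set would need to be tuned per collection.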

Published in: Education, Technology

MS Thesis Defense, Aug 2012 - Visualizing Digital Collections at Archive-It

  1. Visualizing Digital Collections at Archive-It. Kalpesh Padia. Director: Michele C. Weigle. Committee: Michael L. Nelson, Ravi Mukkamala. 7/20/2012, MS Thesis - August 2012.
  2. Agenda: Introduction; Motivation; Related Work; Collection Retrieval and Processing; Visualizations; Case Studies; Future Work; Conclusion
  3. INTRODUCTION AND MOTIVATION
  4. Digital Archives: http://www.loc.gov/index.html, http://digitalcollections.library.yale.edu/
  5. Archive-It: http://archive-it.org/
  6. Archive-It Collection Hierarchy. Root: collection title. Level 1: Category 1 ... Category n. Level 2: Web page 1 ... Web page n. Level 3 (leaf nodes): Archived Version 1 ... Archived Version n.
  7. Exploring Archive-It Collections: http://archive-it.org/collections/1068
  8. Exploring Archive-It Collections: http://archive-it.org/collections/1068
  9. Exploring Archive-It Collections: http://archive-it.org/collections/1068
  10. Exploring Archive-It Collections: http://wayback.archive-it.org/1068/*/http://acda.co/
  11. Exploring Archive-It Collections: http://archive-it.org/collections/1068
  12. Exploring Archive-It Collections: http://archive-it.org/collections/2836
  13. Drawbacks: no visual feedback; discovering individual pages is difficult; optional metadata and categorization; collection structure known only to curator
  14. Contribution: interactive visualizations (treemap, time cloud, bubble chart, image plot, wordle, timeline); temporal exploration of collections; uncover collection structure
  15. RELATED WORK
  16. Microsoft Pivot: http://www.microsoft.com/silverlight/pivotviewer/
  17. Page History Explorer: A. Jatowt, Y. Kawai, and K. Tanaka, “Visualizing Historical Content of Web Pages,” in Proceedings of the 17th International Conference on World Wide Web, 2008.
  18. 3D Wall: http://www.webarchive.org.uk/ukwa/wall/Blogs
  19. Treemap: Johnson and Shneiderman, “Space-Filling Approach to the Visualization of Hierarchical Information Structures,” in Proceedings of the 2nd Conference on Visualization '91.
  20. Series Browser: M. Whitelaw, “Visualising Archival Collections: The Visible Archive Project,” in Archives and Manuscripts, vol. 37, issue 2, 2009.
  21. A1 Explorer: M. Whitelaw, “Visualising Archival Collections: The Visible Archive Project,” in Archives and Manuscripts, vol. 37, issue 2, 2009.
  22. EASY: Scharnhorst et al., “Looking at a digital research data archive - Visual interfaces to EASY,” in CoRR, 2012, http://arxiv.org/abs/1204.3200
  23. Wordle: Jonathan Feinberg, http://wordle.net/, Dogear
  24. DATA RETRIEVAL AND PROCESSING
  25. 11 Collections, 2K+ Web pages, 70K+ Mementos
  26. Data Retrieval & Processing. Retrieval: screen scrape; copy collection hierarchy; store page content. Processing: calculate TF and TF-IDF; generate screenshots; generate wordles; rule-based categorization; construct JSON.
  27. No categorization: http://www.archive-it.org/collections/2836
  28. Improper Categorization: http://www.archive-it.org/collections/2323
  29. Rule-based categorization: News; Web pages; Blogs; Social Media; Videos. http://www.archive-it.org/collections/2836
  30. Special URI and TLD based categorization: Pakistani news web pages. http://www.archive-it.org/collections/2836
  31. VISUALIZATIONS
  32. Treemap
  33. Time Cloud
  34. Bubble Chart, Image Plot & Timeline
  35. CASE STUDIES
  36. Case study 1: Collection Building and Growth
  37. Case study 2: Re-Categorization (Pakistan Flood: no categorization)
  38. Case study 2: Re-Categorization (Pakistan Flood: after categorization)
  39. Case study 3: Collection Synopsis
  40. Case study 3: Collection Synopsis
  41. Case study 3: Collection Synopsis
  42. Case study 3: Collection Synopsis
  43. Case study 3: Collection Synopsis
  44. Case study 4: Theme Tracking
  45. Case study 4: Theme Tracking
  46. Case study 4: Theme Tracking
  47. Case study 4: Theme Tracking
  48. Informal User Evaluation: Alex Thurman, Columbia University Libraries. Feedback on: ease of browsing and obtaining information; user-friendliness of the interface; whether they prefer textual or graphical interface; most effective visualization; effectiveness of the rule-based categorization in exploring archives.
  49. Feedback. Effective visualizations: Treemap (color coding useful for identifying newer additions); Image plot (screenshots with mouse-over wordles allow for good navigation); Timeline (useful for visualizing development of groups in collection). Suggestions: broader timescale for treemaps; include stop words from other languages.
  50. FUTURE WORK AND CONCLUSION
  51. Future Work: n-gram wordles; term expansion; Krovetz stemmer (dictionary-based stemmer); integration with Archive-It; detailed user evaluation; implementation for other archives
  52. Conclusion: identified metrics for collections
  53. Conclusion: identified metrics for collections; visualizations (treemap)
  54. Conclusion: identified metrics for collections; visualizations (treemap, time cloud)
  55. Conclusion: identified metrics for collections; visualizations (treemap, time cloud, bubble chart)
  56. Conclusion: identified metrics for collections; visualizations (treemap, time cloud, bubble chart, image plot)
  57. Conclusion: identified metrics for collections; visualizations (treemap, time cloud, bubble chart, image plot, wordle)
  58. Conclusion: identified metrics for collections; visualizations (treemap, time cloud, bubble chart, image plot, wordle, timeline)
  59. Conclusion: identified metrics for collections; visualizations (treemap, time cloud, bubble chart, image plot, wordle, timeline); rule-based categorization
  60. BACKUP
  61. Time Span: Small = 1 day - 2 weeks; Medium = 2 weeks - 4 months; Large = more than 4 months. http://wayback.archive-it.org/1068/*/http://amigosdemujeres.org/
  62. Groups: Small = 1; Medium = 2-5; Large = more than 5. http://www.archive-it.org/collections/1068
  63. URI Domains: Small = 1-10; Medium = 11-20; Large = more than 20. http://www.archive-it.org/collections/2836
  64. Number of Web Pages: Small = 1-10; Medium = 11-99; Large = more than 99. http://www.archive-it.org/collections/2836
  65. Jigsaw: Stasko et al., IEEE VAST 2007
  66. Themeriver: Wei et al., in SIGKDD, 2010
  67. Time Cloud
  68. Bubble Chart: http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
  69. Image Plot with Wordle: http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
  70. Timeline: http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
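
Slide 26 lists calculating TF and TF-IDF as a processing step but does not say which weighting variant was used. The sketch below is one common formulation, assuming whitespace tokenization, a tiny illustrative stop-word list, raw-count TF, and IDF computed as log(N / df); the function names are hypothetical and the thesis may well use a different scheme.

```python
import math
from collections import Counter

# Assumed, illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "of", "in", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def term_frequencies(text):
    """Raw term counts (TF) for a single page."""
    return Counter(tokenize(text))


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tfs = [term_frequencies(doc) for doc in documents]
    n_docs = len(documents)
    doc_freq = Counter()
    for tf in tfs:
        # Count each term once per document it appears in.
        doc_freq.update(tf.keys())
    return [
        {term: count * math.log(n_docs / doc_freq[term]) for term, count in tf.items()}
        for tf in tfs
    ]
```

With this weighting, a term that occurs in every page of a collection gets weight zero, so the terms that stand out in a word cloud are the ones that distinguish a page from its neighbors.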

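The Small/Medium/Large thresholds on backup slides 61-64 can be read as a simple bucketing of four collection metrics. The sketch below encodes them under stated assumptions: 2 weeks is taken as 14 days and 4 months as 120 days, a value exactly on a boundary goes to the smaller bucket, and the names `bucket` and `classify_collection` are hypothetical.

```python
def bucket(value, small_max, medium_max):
    """Map a metric value to Small, Medium, or Large using inclusive upper bounds."""
    if value <= small_max:
        return "Small"
    if value <= medium_max:
        return "Medium"
    return "Large"


def classify_collection(span_days, groups, uri_domains, web_pages):
    """Bucket the four collection metrics listed on slides 61-64."""
    return {
        # Assumption: 2 weeks = 14 days and 4 months = 120 days.
        "time_span": bucket(span_days, 14, 120),
        # Small = 1 group, Medium = 2-5, Large = more than 5.
        "groups": bucket(groups, 1, 5),
        # Small = 1-10 domains, Medium = 11-20, Large = more than 20.
        "uri_domains": bucket(uri_domains, 10, 20),
        # Small = 1-10 pages, Medium = 11-99, Large = more than 99.
        "web_pages": bucket(web_pages, 10, 99),
    }
```

For example, `classify_collection(30, 1, 25, 50)` describes a collection spanning 30 days with one group, 25 URI domains, and 50 web pages.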