MS Thesis Defense, Aug 2012 - Visualizing Digital Collections at Archive-It

Advisor: Dr. Michele C. Weigle.
Committee: Dr. Michael L. Nelson, Dr. Ravi Mukkamala

Slides 32-34 contain embedded videos, which have been embedded as YouTube videos in this SlideShare.
Direct links to videos:
Treemap: http://www.youtube.com/watch?v=BJDrxQEEFYM
Timecloud: http://www.youtube.com/watch?v=YYkI6aBO0to
Bubble Chart, Image Plot and Timeline: http://www.youtube.com/watch?v=j94clxqKQk8

Abstract:
Archive-It, a subscription service from the Internet Archive, allows users to create,
maintain, and view digital collections of web resources. The current interface of
Archive-It is largely text-based, supporting drill-down navigation using lists of URIs.
While this interface provides good searching capabilities, it is not very efficient for
browsing. In the absence of keywords, a user has to spend a large amount of time trying
to locate a webpage of interest. In order to provide a better visual experience to
the user, we have studied the underlying characteristics of Archive-It collections and
implemented six different visualizations (treemap, time cloud, bubble chart, image
plot, timeline and wordle), each highlighting one or more of the underlying characteristics
of the collection. Archive-It supports grouping of webpages into categories;
however, it does not enforce their use. As a result, there are many collections with
missing or improper grouping. For such collections, we present a method of grouping
webpages based on a set of pre-defined rules.
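
The rule-based grouping mentioned at the end of the abstract could be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the category names News, Blogs, Social Media and Videos appear in the slides, but the domain lists, the `.pk` top-level-domain rule, the rule order, and the names `DOMAIN_RULES`, `domain_suffixes` and `categorize` are assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical rule table. The category names come from the slides;
# the domain lists are illustrative guesses, not the thesis rules.
DOMAIN_RULES = {
    "News": {"cnn.com", "abcnews.go.com", "bbc.co.uk"},
    "Blogs": {"blogspot.com", "wordpress.com"},
    "Social Media": {"twitter.com", "facebook.com"},
    "Videos": {"youtube.com", "vimeo.com"},
}


def domain_suffixes(host):
    """Return every dot-separated suffix of a host name.

    "www.cnn.com" yields {"www.cnn.com", "cnn.com", "com"}, so a rule
    written for "cnn.com" also matches its subdomains.
    """
    parts = host.lower().split(".")
    return {".".join(parts[i:]) for i in range(len(parts))}


def categorize(uri):
    """Assign a web page URI to a group using the rules above."""
    host = urlparse(uri).hostname or ""
    suffixes = domain_suffixes(host)
    for category, domains in DOMAIN_RULES.items():
        if suffixes & domains:
            return category
    # Assumed TLD-based fallback: treat .pk hosts as Pakistani pages.
    if host.endswith(".pk"):
        return "Pakistani Web Pages"
    return "Uncategorized"
```

Matching on domain suffixes rather than on the full host name keeps the rule table short, at the cost of occasionally grouping unrelated subdomains together.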

  • Collections used for development. Present a good mix of various metrics.
  • Many Archive-It collections are not curated well: they lack categorization or are improperly categorized. Suggest categorization for organizing collections and for properly categorizing articles in collections with existing categorization.
  • If the domain is a news web site, such as cnn, abc, or bbc, put the page into the news web site group.
  • Domains and subdomains
  • Number of articles or sites
  • Extend stacked bar charts to represent independent values as images. All values in a sample are given equal weight. The height of each stack represents the size of each sample. Categories are samples; articles are values. Hover over each article to reveal its number of mementos, its time span, and a wordle summarizing the article's content.
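
The notes mention a wordle summarizing each article's content, and the slides list TF and TF-IDF calculation among the processing steps. A minimal sketch of one common TF-IDF formulation is given here. The tokenizer, the tiny stop-word list, the exact weighting variant, and the names `STOP_WORDS`, `tokenize` and `tf_idf` are assumptions, since the slides do not specify how the weights were computed or how they feed the wordles.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list only; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    TF is the term count divided by the document length, and IDF is
    log(N / document frequency). This is one common variant among several.
    """
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)

    # Count in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty pages
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

With this variant, a term that occurs in every page of a collection receives a weight of zero, so it would not dominate a word cloud built from the weights.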

    1. Visualizing Digital Collections at Archive-It. Kalpesh Padia. Director: Michele C. Weigle. Committee: Michael L. Nelson, Ravi Mukkamala. 7/20/2012, MS Thesis - August 2012
    2. Agenda: Introduction; Motivation; Related Work; Collection Retrieval and Processing; Visualizations; Case Studies; Future Work; Conclusion
    3. INTRODUCTION AND MOTIVATION
    4. Digital Archives: http://www.loc.gov/index.html, http://digitalcollections.library.yale.edu/
    5. Archive-It: http://archive-it.org/
    6. Archive-It Collection Hierarchy. Root: Collection Title. Level 1: Category 1 to Category n. Level 2: Web page 1 to Web page n. Level 3 (Leaf Nodes): Archived Version 1 to Archived Version n
    7. Exploring Archive-It Collections: http://archive-it.org/collections/1068
    8. Exploring Archive-It Collections: http://archive-it.org/collections/1068
    9. Exploring Archive-It Collections: http://archive-it.org/collections/1068
    10. Exploring Archive-It Collections: http://wayback.archive-it.org/1068/*/http://acda.co/
    11. Exploring Archive-It Collections: http://archive-it.org/collections/1068
    12. Exploring Archive-It Collections: http://archive-it.org/collections/2836
    13. Drawbacks: No visual feedback; Discovering individual pages is difficult; Optional metadata and categorization; Collection structure known only to curator
    14. Contribution: Interactive visualizations (Treemap, Time cloud, Bubble chart, Image plot, Wordle, Timeline); Temporal exploration of collections; Uncover collection structure
    15. RELATED WORK
    16. Microsoft Pivot: http://www.microsoft.com/silverlight/pivotviewer/
    17. Page History Explorer: A. Jatowt, Y. Kawai, and K. Tanaka, “Visualizing Historical Content of Web Pages,” in Proceedings of the 17th International Conference on World Wide Web, 2008.
    18. 3D Wall: http://www.webarchive.org.uk/ukwa/wall/Blogs
    19. Treemap: Johnson and Shneiderman, “Space-Filling Approach to the Visualization of Hierarchical Information Structures,” in Proceedings of the 2nd Conference on Visualization '91.
    20. Series Browser: M. Whitelaw, “Visualising Archival Collections: The Visible Archive Project,” in Archives and Manuscripts, vol. 37, issue 2, 2009.
    21. A1 Explorer: M. Whitelaw, “Visualising Archival Collections: The Visible Archive Project,” in Archives and Manuscripts, vol. 37, issue 2, 2009.
    22. EASY: Scharnhorst et al., “Looking at a digital research data archive - Visual interfaces to EASY,” in CoRR, 2012, http://arxiv.org/abs/1204.3200
    23. Wordle: Jonathan Feinberg, http://wordle.net/, Dogear
    24. DATA RETRIEVAL AND PROCESSING
    25. 11 Collections, 2K+ Web pages, 70K+ Mementos
    26. Data Retrieval & Processing. Retrieval: Screen scrape; Copy collection hierarchy; Store page content. Processing: Calculate TF and TF-IDF; Generate screenshots; Generate wordles; Rule-based categorization; Construct JSON
    27. No categorization: http://www.archive-it.org/collections/2836
    28. Improper Categorization: http://www.archive-it.org/collections/2323
    29. Rule-based categorization: News Web pages; Blogs; Social Media; Videos. http://www.archive-it.org/collections/2836
    30. Special URI and TLD based categorization: Pakistani news web pages. http://www.archive-it.org/collections/2836
    31. VISUALIZATIONS
    32. Treemap
    33. Time Cloud
    34. Bubble Chart, Image Plot & Timeline
    35. CASE STUDIES
    36. 1. Collection Building and Growth
    37. 2. Re-Categorization (Pakistan Flood: no categorization)
    38. 2. Re-Categorization (Pakistan Flood: after categorization)
    39. 3. Collection Synopsis
    40. 3. Collection Synopsis
    41. 3. Collection Synopsis
    42. 3. Collection Synopsis
    43. 3. Collection Synopsis
    44. 4. Theme Tracking
    45. 4. Theme Tracking
    46. 4. Theme Tracking
    47. 4. Theme Tracking
    48. Informal User Evaluation. Alex Thurman, Columbia University Libraries. Feedback on: ease of browsing and obtaining information; user-friendliness of the interface; whether they prefer textual or graphical interface; most effective visualization; effectiveness of the rule-based categorization in exploring archives
    49. Feedback. Effective visualizations: Treemap (color coding useful for identifying newer additions); Image plot (screenshots with mouse-over wordles allow for good navigation); Timeline (useful for visualizing development of groups in collection). Suggestions: Broader timescale for treemaps; Include stop words from other languages
    50. FUTURE WORK AND CONCLUSION
    51. Future Work: N-Gram wordles; Term expansion; Krovetz stemmer (dictionary-based stemmer); Integration with Archive-It; Detailed user evaluation; Implementation for other archives
    52. Conclusion: Identified metrics for collections
    53. Conclusion: Identified metrics for collections; Visualizations (Treemap)
    54. Conclusion: Identified metrics for collections; Visualizations (Treemap, Time cloud)
    55. Conclusion: Identified metrics for collections; Visualizations (Treemap, Time cloud, Bubble chart)
    56. Conclusion: Identified metrics for collections; Visualizations (Treemap, Time cloud, Bubble chart, Image plot)
    57. Conclusion: Identified metrics for collections; Visualizations (Treemap, Time cloud, Bubble chart, Image plot, Wordle)
    58. Conclusion: Identified metrics for collections; Visualizations (Treemap, Time cloud, Bubble chart, Image plot, Wordle, Timeline)
    59. Conclusion: Identified metrics for collections; Visualizations (Treemap, Time cloud, Bubble chart, Image plot, Wordle, Timeline); Rule-based categorization
    60. BACKUP
    61. Time Span: Small, 1 Day - 2 Weeks; Medium, 2 Weeks - 4 Months; Large, > 4 Months. http://wayback.archive-it.org/1068/*/http://amigosdemujeres.org/
    62. Groups: Small, 1; Medium, 2-5; Large, > 5. http://www.archive-it.org/collections/1068
    63. URI Domains: Small, 1 - 10; Medium, 11 - 20; Large, > 20. http://www.archive-it.org/collections/2836
    64. Number of Web Pages: Small, 1 - 10; Medium, 11 - 99; Large, > 99. http://www.archive-it.org/collections/2836
    65. Jigsaw: Stasko et al., IEEE VAST 2007
    66. Themeriver: Wei et al., in SIGKDD, 2010
    67. Time Cloud
    68. Bubble Chart: http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
    69. Image Plot with Wordle: http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
    70. Timeline: http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
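
The size classes on the backup slides (Time Span, Groups, URI Domains, Number of Web Pages) can be expressed as simple threshold checks. The thresholds in this sketch are taken from those slides. The function names, the use of 14 and 120 days to stand for "2 Weeks" and "4 Months", and the treatment of values that fall exactly on a boundary are assumptions.

```python
def size_class(value, small_max, medium_max):
    """Map a metric value to Small, Medium or Large.

    Values up to small_max are Small, values up to medium_max are
    Medium, and anything larger is Large. Boundary handling is an
    assumption; the slides only give the ranges.
    """
    if value <= small_max:
        return "Small"
    if value <= medium_max:
        return "Medium"
    return "Large"


def classify_collection(time_span_days, groups, uri_domains, web_pages):
    """Classify a collection on the four metrics from the backup slides."""
    return {
        # 1 Day - 2 Weeks / 2 Weeks - 4 Months / > 4 Months
        "time_span": size_class(time_span_days, 14, 120),
        # 1 / 2-5 / > 5
        "groups": size_class(groups, 1, 5),
        # 1 - 10 / 11 - 20 / > 20
        "uri_domains": size_class(uri_domains, 10, 20),
        # 1 - 10 / 11 - 99 / > 99
        "web_pages": size_class(web_pages, 10, 99),
    }
```

For example, a hypothetical collection spanning 200 days with 3 groups, 25 URI domains and 50 web pages would be Large, Medium, Large and Medium on the four metrics respectively.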
