DocuBurst: Visualizing Document Content Using Language Structure

EuroVis 2009   Christopher...

EuroVis DocuBurst Presentation 2009

  1. DocuBurst: Visualizing Document Content Using Language Structure
       EuroVis 2009
       Christopher Collins, Sheelagh Carpendale, and Gerald Penn
  2. (no slide text)
  3. (no slide text)
  4. Document Content Visualization
       • Navigation in collections of digital text
       • Content analysis (digital humanities)
       • Plagiarism detection
       • Authorship attribution
  5. ...Using Language Structure
       • Traditional glyph techniques use unstructured word counts (e.g. tag clouds)
       • DocuBurst structure is based on a carefully designed ontology called WordNet
  6. WordNet Background
       • Basic data unit is a set of synonyms called a synset: {lawyer, attorney}, {jump, hop, skip}
       • Words can occur in multiple synsets: {bank, financial institution}, {bank, slope, riverside}
       • Free resource from Princeton University
  7. Hyponymy Relation
       • X is a Y or X is a kind of Y
       • transitive, asymmetric relationship
       • example
           • {robin, redbreast} IS-A {bird}
           • robin and redbreast are hyponyms of bird
       • forms the basic structure of the noun network
       {robin, redbreast} IS-A {bird} IS-A {animal, animate_being} IS-A {organism, life_form, living_thing} IS-A {entity}
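
The transitive IS-A chain on the Hyponymy Relation slide can be illustrated with a minimal sketch. The table below is hand-built from the slide's example; it is an assumption for illustration and does not query WordNet, and the names `HYPERNYM`, `trace_to_root`, and `is_a` are hypothetical.

```python
# Toy IS-A table: each synset label maps to its direct hypernym (parent).
# Labels are copied from the slide; a real system would look them up in WordNet.
HYPERNYM = {
    "robin, redbreast": "bird",
    "bird": "animal, animate_being",
    "animal, animate_being": "organism, life_form, living_thing",
    "organism, life_form, living_thing": "entity",
}


def trace_to_root(synset):
    """Follow IS-A links upward until a synset with no parent is reached."""
    path = [synset]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def is_a(x, y):
    """True if y is a proper ancestor of x (hyponymy is transitive)."""
    return y in trace_to_root(x)[1:]


chain = trace_to_root("robin, redbreast")
print(" IS-A ".join("{" + s + "}" for s in chain))
```

Because the relation is asymmetric, `is_a("robin, redbreast", "entity")` holds while `is_a("entity", "bird")` does not.
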
  8. Creating DocuBurst
       games → game
       taken → take

       absolute,noun,10
       chair,noun,2
       moment,noun,11
       game,noun,30
       reality,noun,3
       take,verb,13
       represent,verb,17
       ...

       game IS activity
       chair IS furniture
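
The Creating DocuBurst slide suggests that words are reduced to base forms and then counted per word and part of speech. The sketch below shows one way that step might look. The `LEMMA` lookup is a hypothetical stand-in for a real lemmatizer and part-of-speech tagger, and the token list is invented for the example.

```python
from collections import Counter

# Hypothetical lookup standing in for a real lemmatizer and POS tagger.
LEMMA = {
    "games": ("game", "noun"),
    "game": ("game", "noun"),
    "taken": ("take", "verb"),
    "take": ("take", "verb"),
    "chair": ("chair", "noun"),
}


def count_lemmas(tokens):
    """Count occurrences of each (lemma, part-of-speech) pair."""
    counts = Counter()
    for tok in tokens:
        entry = LEMMA.get(tok.lower())
        if entry is not None:  # skip words the lookup does not cover
            counts[entry] += 1
    return counts


def to_rows(counts):
    """Format counts as 'lemma,pos,count' rows, sorted by lemma then POS."""
    return [f"{lemma},{pos},{n}" for (lemma, pos), n in sorted(counts.items())]


tokens = "Games taken game chair take games".split()
rows = to_rows(count_lemmas(tokens))
print("\n".join(rows))
```

Each output row has the same `word,part-of-speech,count` shape as the rows on the slide.
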
  9. Hyponymy Structure
  10. Word Sense Ambiguity
       • Man = {mankind, world}, {male human}, ...
       • Water = {H2O}, {water supply}, {body of water}, ...
       • Word senses are roughly ordered by frequency in WordNet
  11. Alternative Scoring Models
       • Count for all senses
           • undue prominence to ambiguous words
       • Count first sense only
           • loses too much information
       • Divide by sense count (same for all senses)
           • high penalty on polysemous words
       • Divide by sense index
           • decreased prominence for uncommon senses
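
The four scoring models on the Alternative Scoring Models slide can be compared side by side. This is a sketch under stated assumptions: each function returns one score per sense, in WordNet sense order, the sense index is taken to be 1-based, and a word is assumed to have at least one sense. The function names are hypothetical; the slides do not give exact formulas.

```python
def score_all_senses(count, n_senses):
    """Every sense receives the full word count."""
    return [float(count)] * n_senses


def score_first_sense(count, n_senses):
    """Only the first (most frequent) sense receives the count."""
    return [float(count)] + [0.0] * (n_senses - 1)


def score_divide_by_sense_count(count, n_senses):
    """The count is split evenly, so every sense gets the same share."""
    return [count / n_senses] * n_senses


def score_divide_by_sense_index(count, n_senses):
    """Sense i (1-based, an assumption) receives count / i."""
    return [count / i for i in range(1, n_senses + 1)]


# A word seen 12 times that has 3 senses.
for model in (score_all_senses, score_first_sense,
              score_divide_by_sense_count, score_divide_by_sense_index):
    print(model.__name__, model(12, 3))
```

With 12 occurrences and 3 senses, dividing by sense index keeps the first sense at full weight while later, less common senses fade, which matches the "decreased prominence for uncommon senses" note.
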
  12. Visual Encoding
       • Node Size: # of leaves in subtree
           • Stability across documents
       • Node Position: IS-A relation
           • Multi-level linguistic abstraction
           • Additive (2 ducks + 3 geese = 5 birds)
       • Node Hue: sense index
           • Differentiates subtrees
       • Node Saturation: word count
           • Ordering & approximate scale is perceived
       • Node Label: First word in synset
           • Words are ordered by commonality in the language, reveals well-known words
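
The additive counts and leaf-based node size from the Visual Encoding slide can be sketched on a toy tree. The tree and the `saturation` normalization (a node's cumulative count divided by the root's) are assumptions made for illustration; the slides do not specify the exact colour mapping.

```python
# Toy hyponymy tree: parent -> children, plus each node's own word count.
CHILDREN = {"bird": ["duck", "goose"], "duck": [], "goose": []}
OWN_COUNT = {"bird": 0, "duck": 2, "goose": 3}


def cumulative_count(node):
    """A node's own count plus the counts of everything below it."""
    below = sum(cumulative_count(c) for c in CHILDREN.get(node, []))
    return OWN_COUNT.get(node, 0) + below


def leaf_count(node):
    """Number of leaves in the subtree; independent of the document."""
    kids = CHILDREN.get(node, [])
    return 1 if not kids else sum(leaf_count(c) for c in kids)


def saturation(node, root):
    """Assumed mapping: cumulative count as a fraction of the root's."""
    total = cumulative_count(root)
    return cumulative_count(node) / total if total else 0.0


print(cumulative_count("bird"))  # 2 ducks + 3 geese
print(leaf_count("bird"))
print(saturation("duck", "bird"))
```

Since `leaf_count` depends only on the tree and not on the word counts, node sizes stay the same from one document to the next, which is the stability the slide mentions.
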
  13. Node Colouring Alternatives
       • Cumulative Counts: Supports Visual Summaries
       • Single Node Counts: Supports Precision and Selection
  14. Interaction
  15. Trace-to-Root
       Cattle IS-A bovine IS-A bovid IS-A ... Mammal IS-A vertebrate IS-A chordate IS-A animal
  16. Roll Up
  17. Drill Down
  18. (no slide text)
  19. (no slide text)
  20. Concordance
  21. Level of Detail Filter
       • Nodes > N away from root are hidden
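
The level of detail filter can be read as a depth cut-off on the tree. A minimal sketch, assuming depth is measured in IS-A links from the chosen root; the toy tree reuses labels from the Trace-to-Root slide and the function name `visible_nodes` is hypothetical.

```python
def visible_nodes(children, root, max_depth):
    """Return nodes at most max_depth IS-A links from root, breadth-first."""
    visible, frontier = [root], [root]
    for _ in range(max_depth):
        # Step one level further down the tree.
        frontier = [c for n in frontier for c in children.get(n, [])]
        visible.extend(frontier)
    return visible


# Toy tree using labels from the Trace-to-Root slide.
TREE = {
    "animal": ["chordate"],
    "chordate": ["vertebrate"],
    "vertebrate": ["mammal"],
    "mammal": [],
}

print(visible_nodes(TREE, "animal", 2))
```

With N = 2, `mammal` sits three links below the root and is hidden.
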
  22. Search
  23. Design Trade-Offs
  24. Node Size Mapping
       • Size by # leaves
           + consistent
           – visual artifacts (highly relevant words with few leaves are too small)
       • Size by score
           + redundant encoding
           + important words more prominent
           – disrupts inter-document comparison
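
The trade-off on the Node Size Mapping slide shows up when sibling nodes divide their parent's space in proportion to a weight. In this sketch the scores (30 and 2) are the counts for game and chair from the Creating DocuBurst slide, while the leaf counts are invented for illustration; `child_shares` is a hypothetical helper, not the authors' layout code.

```python
def child_shares(weights):
    """Split a parent's extent among children in proportion to weight."""
    total = sum(weights.values())
    if total == 0:
        # No information: fall back to equal shares.
        return {k: 1 / len(weights) for k in weights}
    return {k: w / total for k, w in weights.items()}


leaves = {"game": 1, "chair": 9}   # hypothetical subtree leaf counts
scores = {"game": 30, "chair": 2}  # word counts from the earlier slide

by_leaves = child_shares(leaves)
by_score = child_shares(scores)
print(by_leaves)
print(by_score)
```

Sizing by leaves gives the frequent word `game` only a tenth of the space, the visual artifact the slide describes. Sizing by score fixes that, but the shares then change with every document, which is why it disrupts inter-document comparison.
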
  25. Font Size Mapping
       • Size to fit cell
           + maximize legibility
           – short words have huge font
       • Font size proportional to cell size
           + short words not more prominent
           – small maximum size to accommodate long words
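
The two font sizing options can be contrasted with a rough sketch. It assumes a rectangular cell and an average glyph width of 0.6 times the font size (`CHAR_ASPECT`); both are simplifications chosen for the example, and the function names are hypothetical.

```python
CHAR_ASPECT = 0.6  # assumed average glyph width as a fraction of font size


def fit_to_cell(word, cell_width, cell_height):
    """Largest font at which this particular word fits its cell."""
    return min(cell_height, cell_width / (CHAR_ASPECT * len(word)))


def proportional(words, cell_width, cell_height):
    """One font size per cell size, chosen so the longest word still fits."""
    longest = max(len(w) for w in words)
    return min(cell_height, cell_width / (CHAR_ASPECT * longest))


words = ["ox", "organization"]
short_fit = fit_to_cell("ox", 120, 40)
long_fit = fit_to_cell("organization", 120, 40)
shared = proportional(words, 120, 40)
print(short_fit, long_fit, shared)
```

Fitting each word to its cell makes `ox` far larger than `organization` in cells of equal size. The proportional option gives both the same size, at the cost of a small maximum set by the longest word.
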
  26. Inclusion of Zero-count Words
       + provides context (what is not in document)
       – more cluttered
  27. Case Studies
  28. (no slide text)
  29. (no slide text)
  30. (no slide text)
  31. (no slide text)
  32. 2008 U.S. Presidential Debate
  33. Unexpected Uses
       • WordNet Visualization
  34. Unexpected Uses
       • WordNet Visualization
  35. Unexpected Uses
       • Language Education
           • “invaluable potential for writing and vocabulary development at the secondary level”
           • “I'm very interested in using the program, I'm an English teacher”
  36. Related Work
  37. Types of Document Visualization
  38. Features of Document Visualization
       • Semantic: indicate meaning
       • Cluster: generalize into concepts
       • Overview: provide quick gist
       • Zoom: support varying level of detail
       • Compare: multi-document comparisons
       • Search: find specific words/phrases
       • Read: drill-down to original text
       • Pattern: reveal patterns of repetition
       • Features: reveal extracted features such as emotion
       • Suggest: automatically select interesting focus words
       • Phrases: can show multi-word phrases
       • All words: can show all parts of speech
  39. Features of Document Visualization
  40. Semantics & Clustering
       • Provides word definitions and relations
       • Clusters of related terms allow variable level of abstraction
  41. Phrases & All Words
       • Cannot visualize multi-word phrases that are not ‘words’ in WordNet
       • Only English nouns, verbs
  42. Future Work
  43. Uneven Tree Cut Models
  44. (no slide text)
  45. DocuBurst Comparative Views
       • Embed small multiples in e-libraries
       • Colour scale based on text difference
           • From each other
           • From corpus average
  46. Simplification
       • Root suggestion
           • How to know where to start exploring?
       • Word sense disambiguation
           • Attempt to select a sense
       • Use a less detailed ontology
  47. Thanks for your Attention!
       Acknowledgements: Ravin Balakrishnan and helpful reviewers.
       Contact: ccollins@cs.utoronto.ca
       EuroVis 2009
       Christopher Collins, Sheelagh Carpendale, and Gerald Penn
