DocuBurst:
               Visualizing Document Content
               Using Language Structure

EuroVis 2009   Christopher Collins, Sheelagh Carpendale, and Gerald Penn
Document Content Visualization


     Navigation in collections of digital text
     Content analysis (digital humanities)

     Plagiarism detection

     Authorship attribution
...Using Language Structure


       Traditional glyph techniques use unstructured
        word counts (e.g. tag clouds)
       DocuBurst structure is based on a carefully
        designed ontology called WordNet
WordNet Background


       Basic data unit is a set of synonyms called a synset:
        {lawyer, attorney}, {jump, hop, skip}


       Words can occur in multiple synsets:
        {bank, financial institution}
        {bank, slope, riverside}


       Free resource from Princeton University
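As a rough illustration of the synset idea, the sketch below models each synset as a set of words and builds an index from word to synsets. It is a toy structure: the synset identifiers are made up for this example and this is not WordNet's actual data format or API.

```python
from collections import defaultdict

# Toy synsets taken from the slide; the keys are hypothetical IDs,
# not real WordNet synset identifiers.
SYNSETS = {
    "lawyer.1": {"lawyer", "attorney"},
    "jump.1": {"jump", "hop", "skip"},
    "bank.1": {"bank", "financial institution"},
    "bank.2": {"bank", "slope", "riverside"},
}

def build_word_index(synsets):
    """Map each word to the sorted list of synset IDs that contain it."""
    index = defaultdict(list)
    for synset_id, words in synsets.items():
        for word in words:
            index[word].append(synset_id)
    return {word: sorted(ids) for word, ids in index.items()}

WORD_INDEX = build_word_index(SYNSETS)
print(WORD_INDEX["bank"])      # both senses of "bank"
print(WORD_INDEX["attorney"])  # a single sense
```

Looking a word up in the index returns every synset it belongs to, which is where word sense ambiguity enters later in the pipeline.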
Hyponymy Relation


       X is a Y or X is a kind of Y
       transitive, asymmetric relationship
       example
         {robin, redbreast} IS-A {bird}
         robin and redbreast are hyponyms of bird

       forms the basic structure of the noun network

        {robin, redbreast} IS-A {bird} IS-A
          {animal, animate_being} IS-A
          {organism, life_form, living_thing} IS-A {entity}
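The chain above can be sketched as a walk over parent links. This is a simplified illustration: each synset is named by its first word and given a single parent, which is an assumption of the example (real WordNet synsets can have more than one hypernym).

```python
# Toy IS-A links copied from the slide; each synset is named by its first word.
# One parent per synset is a simplifying assumption of this sketch.
HYPERNYM = {
    "robin": "bird",
    "bird": "animal",
    "animal": "organism",
    "organism": "entity",
}

def hypernym_chain(synset, hypernym=HYPERNYM):
    """Follow IS-A links upward until a synset with no parent is reached."""
    chain = [synset]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain

def is_a(child, ancestor, hypernym=HYPERNYM):
    """Transitive IS-A test: true if ancestor lies strictly above child."""
    return ancestor in hypernym_chain(child, hypernym)[1:]

print(" IS-A ".join(hypernym_chain("robin")))
```

Because `is_a` only looks upward, it holds for (robin, entity) but not for (entity, robin), matching the transitive, asymmetric description.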
Creating DocuBurst



          gamesgame
          takentake




          absolute,noun,10
          chair,noun,2
          moment,noun,11
          game,noun,30
          reality,noun,3
          take,verb,13
          represent,verb,17
          ...




          game IS-A activity
          chair IS-A furniture
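The three stages on this slide (reduce words to base forms, count by part of speech, attach to IS-A parents) might be sketched as follows. The lookup tables are hypothetical stand-ins for a real lemmatizer, tagger and WordNet, and hold only the slide's examples.

```python
from collections import Counter

# Hypothetical lookup tables standing in for a real lemmatizer,
# part-of-speech tagger and WordNet; only the slide's examples are included.
LEMMA = {"games": ("game", "noun"), "game": ("game", "noun"),
         "taken": ("take", "verb"), "chair": ("chair", "noun")}
HYPERNYM = {"game": "activity", "chair": "furniture"}

def count_lemmas(tokens):
    """Reduce tokens to (lemma, part of speech) pairs and count them."""
    counts = Counter()
    for token in tokens:
        entry = LEMMA.get(token.lower())
        if entry is not None:          # words outside the toy table are skipped
            counts[entry] += 1
    return counts

def attach_to_hypernyms(counts):
    """Group lemma counts under each lemma's IS-A parent, where one is known."""
    grouped = {}
    for (lemma, _pos), n in counts.items():
        parent = HYPERNYM.get(lemma)
        if parent is not None:
            grouped.setdefault(parent, {})[lemma] = n
    return grouped

tokens = "Games taken game chair games".split()
counts = count_lemmas(tokens)
print(counts)
print(attach_to_hypernyms(counts))
```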
Hyponymy Structure
Word Sense Ambiguity


        Man = {mankind,world}, {male human}, ...
        Water = {H2O}, {water supply}, {body of water}, ...
        Word senses are roughly ordered by frequency in
         WordNet
Alternative Scoring Models


        Count for all senses
          undue prominence to ambiguous words
        Count first sense only
          loses too much information
        Divide by sense count (same for all senses)
          high penalty on polysemous words
        Divide by sense index
          decreased prominence for uncommon senses
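One plausible reading of the four options, written as a single function. The function name, the model labels and the exact formulas are assumptions made for this sketch; the slide names the models but does not spell out the arithmetic.

```python
def sense_scores(count, n_senses, model):
    """Score credited to each sense of a word, most frequent sense first.

    count    -- occurrences of the word in the document
    n_senses -- number of senses the word has (assumed >= 1)
    model    -- one of the hypothetical labels handled below
    """
    if model == "all":              # full count for every sense
        return [float(count)] * n_senses
    if model == "first":            # only the first sense is credited
        return [float(count)] + [0.0] * (n_senses - 1)
    if model == "by_sense_count":   # same share for all senses
        return [count / n_senses] * n_senses
    if model == "by_sense_index":   # sense i (1-based) gets count / i
        return [count / index for index in range(1, n_senses + 1)]
    raise ValueError(f"unknown model: {model}")

# A word seen 12 times that has 3 senses:
for model in ("all", "first", "by_sense_count", "by_sense_index"):
    print(model, sense_scores(12, 3, model))
```

Under this reading, dividing by sense index keeps the full count on the first sense and tapers it for later, less common senses.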
Visual Encoding



        Node Size: # of leaves in subtree
            Stability across documents
        Node Position: IS-A relation
            Multi-level linguistic abstraction
            Additive
             (2 ducks + 3 geese = 5 birds)
        Node Hue: sense index
            Differentiates subtrees
        Node Saturation: word count
            Ordering & approximate scale are perceived
        Node Label: First word in synset
            Words in a synset are ordered by commonality in the
             language, so the label shows a well-known word
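A minimal sketch of the size and saturation mappings on a toy tree. The tree, the counts and the linear saturation scale are assumptions of this example; the zero-count "robin" leaf is added only to show that node size does not depend on the document.

```python
# Toy IS-A tree (parent -> children) and word counts; both are made up here.
CHILDREN = {"bird": ["duck", "goose", "robin"]}
OWN_COUNT = {"duck": 2, "goose": 3}   # occurrences of each word itself

def leaf_count(node):
    """Number of leaves under node; drives node size, independent of the text."""
    kids = CHILDREN.get(node, [])
    return 1 if not kids else sum(leaf_count(k) for k in kids)

def cumulative_count(node):
    """Own count plus the counts of all descendants (the additive reading)."""
    return OWN_COUNT.get(node, 0) + sum(
        cumulative_count(k) for k in CHILDREN.get(node, []))

def saturation(node, max_count, cumulative=True):
    """Map a count to a 0..1 saturation; linear scaling is an assumption."""
    count = cumulative_count(node) if cumulative else OWN_COUNT.get(node, 0)
    return count / max_count if max_count else 0.0

print(leaf_count("bird"), cumulative_count("bird"))
```

With `cumulative=False` the "bird" node gets no saturation, since the word itself never occurs; with the default it carries the 2 ducks + 3 geese.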
Node Colouring Alternatives




         Cumulative Counts: supports visual summaries
         Single Node Counts: supports precision and selection
Interaction
Trace-to-Root




     Cattle IS-A bovine IS-A bovid IS-A ... Mammal IS-A vertebrate IS-A chordate IS-A animal
Roll Up
Drill Down
Concordance
Level of Detail Filter


        Nodes more than N levels away from the root are hidden
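The filter can be sketched as a breadth-first walk that stops descending at depth N. The function name and the toy tree are assumptions of this example, not the tool's implementation.

```python
from collections import deque

def visible_nodes(children, root, max_depth):
    """Breadth-first walk that keeps nodes at most max_depth links from root."""
    visible = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        visible.append(node)
        if depth < max_depth:              # deeper nodes are never enqueued
            for child in children.get(node, []):
                queue.append((child, depth + 1))
    return visible

# Toy tree (parent -> children), made up for illustration.
CHILDREN = {"animal": ["bird", "mammal"], "bird": ["robin"],
            "mammal": ["bovid"], "bovid": ["cattle"]}
print(visible_nodes(CHILDREN, "animal", 1))
```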
Search
Design Trade-Offs
Node Size Mapping


        Size by # leaves
         + consistent
         – visual artifacts (highly relevant words with few leaves
           are too small)


        Size by score
         + redundant encoding
         + important words more prominent
         – disrupts inter-document comparison
Font Size Mapping


        Size to fit cell
         + maximize legibility
         – short words have huge font


        Font size proportional to cell size
         + short words not more prominent
         – small maximum size to accommodate long words
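The two font strategies can be compared with a rough sketch. Everything here is an assumption for illustration: cells are treated as plain rectangles, glyph width is estimated as a fixed fraction of font size, and the function names are hypothetical.

```python
CHAR_ASPECT = 0.6  # assumed average glyph width as a fraction of font size

def fit_to_cell(word, cell_width, cell_height):
    """Largest font size at which this (non-empty) word fits its own cell."""
    return min(cell_height, cell_width / (CHAR_ASPECT * len(word)))

def proportional_to_cell(cell_width, longest_word_length):
    """Font size that depends only on cell width, scaled so the longest
    label in the view would still fit a cell of that width."""
    return cell_width / (CHAR_ASPECT * longest_word_length)

# Same 120 x 40 cell, short and long labels, longest label of 12 characters:
print(fit_to_cell("cat", 120, 40))       # short word fills the cell height
print(fit_to_cell("organism", 120, 40))  # long word is limited by cell width
print(proportional_to_cell(120, 12))     # one smaller size for every label
```

The first strategy gives the short word a much larger font than the long one; the second gives both the same size, at the cost of a lower maximum.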
Inclusion of Zero-count Words


      + provides context (what is not in document)
      – more cluttered
Case Studies
2008 U.S. Presidential Debate
Unexpected Uses


        WordNet Visualization


        Language Education
          “invaluable potential for writing and vocabulary
           development at the secondary level”
          “I'm very interested in using the program, I'm an English
           teacher”
Related Work
Types of Document Visualization
Features of Document Visualization
        Semantic:    indicate meaning
        Cluster:     generalize into concepts
        Overview:    provide quick gist
        Zoom:        support varying level of detail
        Compare:     multi-document comparisons
        Search:      find specific words/phrases
        Read:        drill-down to original text
        Pattern:     reveal patterns of repetition
        Features:    reveal extracted features such as emotion
        Suggest:     automatically select interesting focus words
        Phrases:     can show multi-word phrases
        All words:   can show all parts of speech
Semantics & Clustering


        Provides word
         definitions and
         relations
        Clusters of
         related terms
         allow variable
         level of
         abstraction
Phrases & All Words


        Cannot visualize multi-word phrases that are not
         ‘words’ in WordNet
        Only English nouns, verbs
Future Work
Uneven Tree Cut Models
DocuBurst Comparative Views


        Embed small multiples in e-libraries
        Colour scale based on text difference
          From each other
          From corpus average
Simplification


        Root suggestion
          How to know where to start exploring?
        Word sense disambiguation
          Attempt to select a sense
          Use a less detailed ontology
Thanks for your Attention!


    Acknowledgements:
    Ravin Balakrishnan and helpful reviewers.
    Contact: ccollins@cs.utoronto.ca




