Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Visualising a text with a tree cloud

10,199 views

Published on

A new visualization tool to display the words of a text (newspaper article, blog content, political speech) is presented, the tree cloud, a kind of improved tag cloud: http://www.treecloud.org.

  • Hey guys! Who wants to chat with me? More photos with me here 👉 http://www.bit.ly/katekoxx
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Visualising a text with a tree cloud

  1. 1. IFCS 2009 Dresden – 17/03/2009 Visualising a text with a tree cloud Philippe Gambette, Jean Véronis
  2. 2. Outline • Tag and word clouds • Enhanced tag clouds • Tree clouds • Construction steps • Quality control
  3. 3. Tag clouds • Built from a set of tags • Font size related to frequency What is considered the first tag cloud, from D. Coupland: Microserfs, HarperCollins, Toronto, 1995
  4. 4. Tag clouds • Built from a set of tags • Font size related to frequency • Gained popularity with Flickr Flickr's all time most popular tags
  5. 5. Word clouds • Built from a set of words from a text • Font size related to frequency • Gained popularity with Wordle
  6. 6. Word clouds • Built from a set of words from a text • Font size related to frequency • Gained popularity with Wordle GoogleImage(obama inaugural address wordle)
  7. 7. Word clouds • Built from a set of words from a text • Font size related to frequency • Gained popularity with Wordle GoogleImage(obama inaugural address wordle)
  8. 8. Word clouds • Built from a set of words from a text • Font size related to frequency • Gained popularity with Wordle GoogleImage(obama inaugural address wordle)
  9. 9. Enhanced tag / word clouds Add information from the text: • color intensity to express recency in Amazon • shared tags in red on del.icio.us • group together cooccurring tags on the same line Hassan-Montero & Herrero-Solana, InScit'06 • optimize blank space and semantic proximity Kaser & Lemire, WWW'07 • “topigraphy”: 2D placement according to cooccurrence Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
  10. 10. Enhanced tag / word clouds Add information from the text: • color intensity to express recency in Amazon • shared tags in red on del.icio.us • group together cooccurring tags on the same line Hassan-Montero & Herrero-Solana, InScit'06 • optimize blank space and semantic proximity Kaser & Lemire, WWW'07 • “topigraphy”: 2D placement according to cooccurrence Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
  11. 11. Enhanced tag / word clouds Add information from the text: • color intensity to express recency in Amazon • shared tags in red on del.icio.us • group together cooccurring tags on the same line Hassan-Montero & Herrero-Solana, InScit'06 • optimize blank space and semantic proximity Kaser & Lemire, WWW'07 • “topigraphy”: 2D placement according to cooccurrence Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
  12. 12. Enhanced tag / word clouds Add information from the text: • color intensity to express recency in Amazon • shared tags in red on del.icio.us • group together cooccurring tags on the same line Hassan-Montero & Herrero-Solana, InScit'06 • optimize blank space and semantic proximity Kaser & Lemire, WWW'07 • “topigraphy”: 2D placement according to cooccurrence Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
  13. 13. Enhanced tag / word clouds Add information from the text: • color intensity to express recency in Amazon • shared tags in red on del.icio.us • group together cooccurring tags on the same line Hassan-Montero & Herrero-Solana, InScit'06 • optimize blank space and semantic proximity Kaser & Lemire, WWW'07 • “topigraphy”: 2D placement according to cooccurrence Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
  14. 14. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: sense desamibiguation Véronis (Hyperlex)
  15. 15. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: sense desamibiguation Véronis (Hyperlex) Mayaffre, Quand travail, famille, et patrie cooccurrent dans le discours de Nicolas Sarkozy, JADT'08
  16. 16. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: sense desamibiguation Véronis (Hyperlex) Brunet, Les séquences (suite), JADT'08
  17. 17. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: word sense disamibiguation Véronis (Hyperlex) Barry, Viprey, Approche comparative des résultats d'exploration textuelle des discours de deux leaders africains Keita et Touré, JADT'08
  18. 18. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: word sense disamibiguation Véronis (Hyperlex) Peyrat-Guillard, Analyse du discours syndical sur l’entreprise, JADT'08
  19. 19. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: word sense disambiguation Véronis (Hyperlex) Disambiguation of word “barrage”: dam, play-off, roadblock, police cordon. Véronis, HyperLex: Lexical Cartography for Information Retrieval, 2004
  20. 20. Tag cloud + tree = tree cloud SplitsTree: Huson 1998, Huson & Bryant 2006 Built with TreeCloud and GPL-licensed Treecloud in Python, available at http://www.treecloud.org
  21. 21. The first tree cloud Tree cloud of the blog posts containing “Laurence Ferrari” from 25/11/2007 to 10/12/2007, by Jean Véronis http://aixtal.blogspot.com/2007/12/actu-une-ferrari-dans-un-arbre.html
  22. 22. Building a tree cloud – extracting the words Extract words with frequency: • stoplist? without stoplist with stoplist
  23. 23. Building a tree cloud – extracting the words Extract words with frequency: • lemmatization? • groups words with similar meaning? no lemmatization... sometimes interesting 70 most frequent words in Obama's campaign speeches, winsize=30, distance=dice, NJ-tree.
  24. 24. Building a tree cloud – dissimilarity matrix Many semantic distance formulas based on cooccurrence
  25. 25. Building a tree cloud – dissimilarity matrix Many semantic distance formulas based on cooccurrence Text sliding window S sliding step s width w cooccurrence matrices semantic dissimilarity O11, O12, O21, O22 matrix chi squared, mutual information, liddel, dice, jaccard, gmean, hyperlex, minimum sensitivity, odds ratio, zscore, log likelihood, poisson-stirling... Evert, Statistics of words cooccurrences, PhD Thesis, 2005
  26. 26. Building a tree cloud – dissimilarity matrix Transformations needed on the dissimilarity: • transform similarity into dissimilarity • linear normalization for positive matrices to get distances in [0,1] • affine normalization for matrices with positive or negative numbers, to get distances in [α,1] (for example α=0.1)
  27. 27. Building a tree cloud – tree reconstruction Many existing methods: • Neighbor-Joining Saitou & Nei, 1987 • Addtree variants Barthelemy & Luong, 1987 • Quartet heuristic Cilibrasi & Vitanyi, 2007
  28. 28. Building a tree cloud – tree decoration Choice of word sizes: • computed directly from frequency (apply a log!) or • computed from frequency ranking (exponential distribution) or • statistical significance with respect to a reference corpus
  29. 29. Building a tree cloud – tree decoration Colors: chronology old recent 150 most frequent words in Built with Obama's campaign speeches, TreeCloud and winsize=30, distance=oddsratio, color=chronology, NJ-tree.
  30. 30. Building a tree cloud – tree decoration Colors: dispersion sparse dense standard deviation of the position too blue? 150 most frequent words in Obama's campaign speeches, winsize=30, distance=oddsratio, color=dispersion, NJ-tree.
  31. 31. Building a tree cloud – tree decoration Colors: dispersion sparse dense standard deviation of the position word frequency too red? 150 most frequent words in Obama's campaign speeches, winsize=30, distance=oddsratio, color=norm-dispersion, NJ-tree.
  32. 32. Building a tree cloud – tree decoration Edge color or thickness: quality of the induced cluster. 150 most frequent words in Obama's campaign speeches, winsize=30, distance=oddsratio, color=chronology, NJ-tree.
  33. 33. Quality control Is there an objective quality measure of tree clouds?
  34. 34. Quality control Is there an objective quality measure of tree clouds? What is the best method to build a tree cloud from my data?
  35. 35. Quality control Is there an objective quality measure of tree clouds? What is the best method to build a tree cloud from my data? Tree cloud variations if small changes? bootstrap to evaluate: - stability of the result - robustness of the method
  36. 36. Quality control Is there an objective quality measure of tree clouds? What is the best method to build a tree cloud from my data? Tree cloud variations if small changes? bootstrap to evaluate: - stability of the result - robustness of the method Still, is there a more direct method? arboricity to show whether the distance matrix fits with a tree, which should imply stability? Guénoche & Garreta, 2001 Guénoche & Darlu, 2009
  37. 37. Quality control – bootstrap • Randomly delete words with probability 50%. • Built tree cloud of original text, and altered text. • Compute similarity of both trees (1-normalized RobinsonFoulds) similarity 0,95 0,9 0,85 0,8 0,75 0,7 0,65 0,6 0,55 0,5 4 altered versions of 10 mi dice gmean ms zscore poissonstirling Obama's speeches, chisquared liddell jaccard hyperlex oddsratio loglikelihood 3000 words in average, width=30, NJ-tree.
  38. 38. Quality control – arboricity Relationship between “bootstrap quality” and arboricity: 80 arboricity (%) 70 60 50 40 30 20 10 0 0,52 0,54 0,56 0,58 0,6 0,62 0,64 0,66 0,68 0,7 bootstrap quality
  39. 39. Quality control – arboricity Relationship between “bootstrap quality” and arboricity: 80 arboricity (%) 70 correlation 60 coefficient: 0.64 50 arboricity 40 below 50%: 30 danger! 20 10 0 0,52 0,54 0,56 0,58 0,6 0,62 0,64 0,66 0,68 0,7 bootstrap quality
  40. 40. Perspectives • Make the tool available on a web interface http://www.treecloud.org • Evaluate tree clouds for discourse analysis • Build the daily tree cloud of people popular on blogs, with http://labs.wikio.net
  41. 41. Thank you for your attention! Tree cloud of the words appearing twice or more in the IFCS 2009 call for paper lemmatization, width=20, distance=dice, NJ-tree. http://www.treecloud.org - http://www.splitstree.org
  42. 42. Tree clouds focused on one word Tree cloud of the neighborhood of “McCain” in Obama's campaign speeches http://www.treecloud.org - http://www.splitstree.org
  43. 43. Tree clouds focused on one word Tree cloud of the neighborhood of “Bush” in Obama's campaign speeches http://www.treecloud.org - http://www.splitstree.org
  44. 44. Tree clouds focused on one word Tree cloud of the neighborhood of “world” in Obama's campaign speeches http://www.treecloud.org - http://www.splitstree.org

×