Visualising a text with a tree cloud

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    7 Favorites & 1 Group

    Visualising a text with a tree cloud - Presentation Transcript

    1. IFCS 2009 Dresden – 17/03/2009 Visualising a text with a tree cloud Philippe Gambette, Jean Véronis
    2. Outline • Tag and word clouds • Enhanced tag clouds • Tree clouds • Construction steps • Quality control
    3. Tag clouds • Built from a set of tags • Font size related to frequency What is considered the first tag cloud, from D. Coupland: Microserfs, HarperCollins, Toronto, 1995
    4. Tag clouds • Built from a set of tags • Font size related to frequency • Gained popularity with Flickr Flickr's all time most popular tags
    5. Word clouds • Built from a set of words from a text • Font size related to frequency • Gained popularity with Wordle
    6. Word clouds • Built from a set of words from a text • Font size related to frequency • Gained popularity with Wordle GoogleImage(obama inaugural address wordle)
    7. Word clouds • Built from a set of words from a text • Font size related to frequency • Gained popularity with Wordle GoogleImage(obama inaugural address wordle)
    8. Word clouds • Built from a set of words from a text • Font size related to frequency • Gained popularity with Wordle GoogleImage(obama inaugural address wordle)
    9. Enhanced tag / word clouds Add information from the text: • color intensity to express recency in Amazon • shared tags in red on del.icio.us • group together cooccurring tags on the same line Hassan-Montero & Herrero-Solana, InScit'06 • optimize blank space and semantic proximity Kaser & Lemire, WWW'07 • “topigraphy”: 2D placement according to cooccurrence Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
    10. Enhanced tag / word clouds Add information from the text: • color intensity to express recency in Amazon • shared tags in red on del.icio.us • group together cooccurring tags on the same line Hassan-Montero & Herrero-Solana, InScit'06 • optimize blank space and semantic proximity Kaser & Lemire, WWW'07 • “topigraphy”: 2D placement according to cooccurrence Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
    11. Enhanced tag / word clouds Add information from the text: • color intensity to express recency in Amazon • shared tags in red on del.icio.us • group together cooccurring tags on the same line Hassan-Montero & Herrero-Solana, InScit'06 • optimize blank space and semantic proximity Kaser & Lemire, WWW'07 • “topigraphy”: 2D placement according to cooccurrence Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
    12. Enhanced tag / word clouds Add information from the text: • color intensity to express recency in Amazon • shared tags in red on del.icio.us • group together cooccurring tags on the same line Hassan-Montero & Herrero-Solana, InScit'06 • optimize blank space and semantic proximity Kaser & Lemire, WWW'07 • “topigraphy”: 2D placement according to cooccurrence Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
    13. Enhanced tag / word clouds Add information from the text: • color intensity to express recency in Amazon • shared tags in red on del.icio.us • group together cooccurring tags on the same line Hassan-Montero & Herrero-Solana, InScit'06 • optimize blank space and semantic proximity Kaser & Lemire, WWW'07 • “topigraphy”: 2D placement according to cooccurrence Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
    14. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: sense desamibiguation Véronis (Hyperlex)
    15. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: sense desamibiguation Véronis (Hyperlex) Mayaffre, Quand travail, famille, et patrie cooccurrent dans le discours de Nicolas Sarkozy, JADT'08
    16. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: sense desamibiguation Véronis (Hyperlex) Brunet, Les séquences (suite), JADT'08
    17. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: word sense disamibiguation Véronis (Hyperlex) Barry, Viprey, Approche comparative des résultats d'exploration textuelle des discours de deux leaders africains Keita et Touré, JADT'08
    18. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: word sense disamibiguation Véronis (Hyperlex) Peyrat-Guillard, Analyse du discours syndical sur l’entreprise, JADT'08
    19. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: word sense disambiguation Véronis (Hyperlex) Disambiguation of word “barrage”: dam, play-off, roadblock, police cordon. Véronis, HyperLex: Lexical Cartography for Information Retrieval, 2004
    20. Tag cloud + tree = tree cloud SplitsTree: Huson 1998, Huson & Bryant 2006 Built with TreeCloud and GPL-licensed Treecloud in Python, available at http://www.treecloud.org
    21. The first tree cloud Tree cloud of the blog posts containing “Laurence Ferrari” from 25/11/2007 to 10/12/2007, by Jean Véronis http://aixtal.blogspot.com/2007/12/actu-une-ferrari-dans-un-arbre.html
    22. Building a tree cloud – extracting the words Extract words with frequency: • stoplist? without stoplist with stoplist
    23. Building a tree cloud – extracting the words Extract words with frequency: • lemmatization? • groups words with similar meaning? no lemmatization... sometimes interesting 70 most frequent words in Obama's campaign speeches, winsize=30, distance=dice, NJ-tree.
    24. Building a tree cloud – dissimilarity matrix Many semantic distance formulas based on cooccurrence
    25. Building a tree cloud – dissimilarity matrix Many semantic distance formulas based on cooccurrence Text sliding window S sliding step s width w cooccurrence matrices semantic dissimilarity O11, O12, O21, O22 matrix chi squared, mutual information, liddel, dice, jaccard, gmean, hyperlex, minimum sensitivity, odds ratio, zscore, log likelihood, poisson-stirling... Evert, Statistics of words cooccurrences, PhD Thesis, 2005
    26. Building a tree cloud – dissimilarity matrix Transformations needed on the dissimilarity: • transform similarity into dissimilarity • linear normalization for positive matrices to get distances in [0,1] • affine normalization for matrices with positive or negative numbers, to get distances in [α,1] (for example α=0.1)
    27. Building a tree cloud – tree reconstruction Many existing methods: • Neighbor-Joining Saitou & Nei, 1987 • Addtree variants Barthelemy & Luong, 1987 • Quartet heuristic Cilibrasi & Vitanyi, 2007
    28. Building a tree cloud – tree decoration Choice of word sizes: • computed directly from frequency (apply a log!) or • computed from frequency ranking (exponential distribution) or • statistical significance with respect to a reference corpus
    29. Building a tree cloud – tree decoration Colors: chronology old recent 150 most frequent words in Built with Obama's campaign speeches, TreeCloud and winsize=30, distance=oddsratio, color=chronology, NJ-tree.
    30. Building a tree cloud – tree decoration Colors: dispersion sparse dense standard deviation of the position too blue? 150 most frequent words in Obama's campaign speeches, winsize=30, distance=oddsratio, color=dispersion, NJ-tree.
    31. Building a tree cloud – tree decoration Colors: dispersion sparse dense standard deviation of the position word frequency too red? 150 most frequent words in Obama's campaign speeches, winsize=30, distance=oddsratio, color=norm-dispersion, NJ-tree.
    32. Building a tree cloud – tree decoration Edge color or thickness: quality of the induced cluster. 150 most frequent words in Obama's campaign speeches, winsize=30, distance=oddsratio, color=chronology, NJ-tree.
    33. Quality control Is there an objective quality measure of tree clouds?
    34. Quality control Is there an objective quality measure of tree clouds? What is the best method to build a tree cloud from my data?
    35. Quality control Is there an objective quality measure of tree clouds? What is the best method to build a tree cloud from my data? Tree cloud variations if small changes? bootstrap to evaluate: - stability of the result - robustness of the method
    36. Quality control Is there an objective quality measure of tree clouds? What is the best method to build a tree cloud from my data? Tree cloud variations if small changes? bootstrap to evaluate: - stability of the result - robustness of the method Still, is there a more direct method? arboricity to show whether the distance matrix fits with a tree, which should imply stability? Guénoche & Garreta, 2001 Guénoche & Darlu, 2009
    37. Quality control – bootstrap • Randomly delete words with probability 50%. • Built tree cloud of original text, and altered text. • Compute similarity of both trees (1-normalized RobinsonFoulds) similarity 0,95 0,9 0,85 0,8 0,75 0,7 0,65 0,6 0,55 0,5 4 altered versions of 10 mi dice gmean ms zscore poissonstirling Obama's speeches, chisquared liddell jaccard hyperlex oddsratio loglikelihood 3000 words in average, width=30, NJ-tree.
    38. Quality control – arboricity Relationship between “bootstrap quality” and arboricity: 80 arboricity (%) 70 60 50 40 30 20 10 0 0,52 0,54 0,56 0,58 0,6 0,62 0,64 0,66 0,68 0,7 bootstrap quality
    39. Quality control – arboricity Relationship between “bootstrap quality” and arboricity: 80 arboricity (%) 70 correlation 60 coefficient: 0.64 50 arboricity 40 below 50%: 30 danger! 20 10 0 0,52 0,54 0,56 0,58 0,6 0,62 0,64 0,66 0,68 0,7 bootstrap quality
    40. Perspectives • Make the tool available on a web interface http://www.treecloud.org • Evaluate tree clouds for discourse analysis • Build the daily tree cloud of people popular on blogs, with http://labs.wikio.net
    41. Thank you for your attention! Tree cloud of the words appearing twice or more in the IFCS 2009 call for paper lemmatization, width=20, distance=dice, NJ-tree. http://www.treecloud.org - http://www.splitstree.org
    42. Tree clouds focused on one word Tree cloud of the neighborhood of “McCain” in Obama's campaign speeches http://www.treecloud.org - http://www.splitstree.org
    43. Tree clouds focused on one word Tree cloud of the neighborhood of “Bush” in Obama's campaign speeches http://www.treecloud.org - http://www.splitstree.org
    44. Tree clouds focused on one word Tree cloud of the neighborhood of “world” in Obama's campaign speeches http://www.treecloud.org - http://www.splitstree.org

    + Philippe GambettePhilippe Gambette, 8 months ago

    custom

    2955 views, 7 favs, 12 embeds more stats

    A new visualization tool to display the words of a more

    More info about this document

    © All Rights Reserved

    Go to text version

    • Total Views 2955
      • 2603 on SlideShare
      • 352 from embeds
    • Comments 0
    • Favorites 7
    • Downloads 38
    Most viewed embeds
    • 154 views on http://www.daniel-lemire.com
    • 154 views on http://aixtal.blogspot.com
    • 16 views on http://www.lirmm.fr
    • 8 views on http://www.netvibes.com
    • 5 views on http://apprentieweb.blogspot.com

    more

    All embeds
    • 154 views on http://www.daniel-lemire.com
    • 154 views on http://aixtal.blogspot.com
    • 16 views on http://www.lirmm.fr
    • 8 views on http://www.netvibes.com
    • 5 views on http://apprentieweb.blogspot.com
    • 5 views on http://gridknowledge.blogspot.com
    • 5 views on http://www.cirip.ro
    • 1 views on http://127.0.0.1:8795
    • 1 views on http://feeds2.feedburner.com
    • 1 views on http://static.slidesharecdn.com
    • 1 views on http://74.125.153.132
    • 1 views on http://lernys.wordpress.com

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories