Algorithms for the Thematic Analysis of Twitter Datasets
#comtech2011 Twitter Workshop
Presented by: Aneesha Bakharia
Twitter: aneesha
Email: aneesha.bakharia@gmail.com

Background
- PhD Candidate at Faculty of Science and Technology, QUT
- Research: Algorithms for Interactive Content Analysis
- Corpus: Surveys, Workshops, Interviews, Large Doc Collections, Twitter, Blog Comments

Types of Qualitative Content Analysis (Hsieh and Shannon, 2006)
Concentrate on Summative and Conventional (Inductive)

Coding Approach: Summative
- Study Begins With: Keywords
- Derivation of Codes: Keywords identified before and during analysis

Coding Approach: Conventional (Inductive)
- Study Begins With: Observation
- Derivation of Codes: Categories developed during analysis

Algorithms (Summative and Conventional): Unsupervised and semi-supervised algorithms: NMF, NTF, LDA and traditional clustering algorithms.

Coding Approach: Directed (Deductive)
- Study Begins With: Theory
- Derivation of Codes: Categories derived from pre-existing theory prior to analysis
- Algorithms: Supervised classification algorithms: Support Vector Machines

Algorithms for Summative and Conventional Content Analysis
Non-negative Matrix Factorisation (Lee & Seung, 1999)
- Simultaneous document (tweet) and word clustering
- Parts-based representation
- Positive matrix decompositions
Non-negative Tensor Factorisation
- Additional dimension (time)
- Ideal for seeing how themes change over time

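For reference, the two decompositions named above can be written out as follows. This is a sketch in generic notation, not taken from the slides: the symbols A, W, H, X, a, b, c and the choice of a squared-error objective are assumptions (other objectives, such as divergence-based ones, are also used).

```latex
% NMF: approximate a non-negative matrix A (tweets x words) with k themes
A \approx WH, \qquad W \ge 0,\; H \ge 0,
\qquad \min_{W,H \ge 0} \; \lVert A - WH \rVert_F^2

% NTF (PARAFAC-style): add a third index t for time
X_{ijt} \approx \sum_{r=1}^{k} a_{ir}\, b_{jr}\, c_{tr},
\qquad a_{ir},\, b_{jr},\, c_{tr} \ge 0
```

Here k is the number of themes; in the tensor model each theme r gets an extra factor c_tr describing how strongly it is present in time slice t.
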
Related Research
- Non-negative Matrix and Tensor Factorisation for Discussion Tracking (Bader, Berry and Langville, 2009)
- Discussion tracking in Enron email using PARAFAC (Bader and Berry, 2008)
- FutureLens: Software for Text Visualization and Tracking (Shutt, Puretskiy and Berry, 2009)

Non-negative Matrix Factorisation: A ~ WH
- Input: Term-Tweet Matrix
- Specify the number of themes (k)
- Output: Features Matrix and Weights Matrix

Term-Tweet Matrix
          Word 1  Word 2  Word n
Tweet 1   1       0       2
Tweet 2   0       1       0
Tweet 3   0       1       1

Features Matrix
          Word 1  Word 2  Word n
Theme 1   0.5     0       1
Theme 2   0       0.5     0

Weights Matrix
          Theme 1  Theme 2
Tweet 1   1        0
Tweet 2   0        1
Tweet 3   0        1

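The factorisation A ~ WH can be tried on the small example matrix from this slide. The following is a minimal pure-Python sketch, not the presenter's implementation: it uses the widely known multiplicative update rules as one common way to fit NMF (the slides do not say which solver was used), and it assumes W is the tweet-by-theme weights matrix and H is the theme-by-word features matrix. The function names `nmf`, `matmul`, `transpose` and `frobenius_error` are made up for illustration.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def nmf(A, k, iterations=200, seed=0, eps=1e-9):
    """Factorise non-negative A (tweets x words) into W (tweets x k) and H (k x words)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(A), len(A[0])
    # Start from small positive random values so every entry can be updated.
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n_rows)]
    H = [[rng.random() + eps for _ in range(n_cols)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num = matmul(Wt, A)
        den = matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(k)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num = matmul(A, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n_rows)]
    return W, H


def frobenius_error(A, W, H):
    """Square root of the summed squared differences between A and WH."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5


# Example matrix from the slide: rows are tweets, columns are words.
A = [[1, 0, 2],
     [0, 1, 0],
     [0, 1, 1]]
W, H = nmf(A, k=2)
print("weights (tweets x themes):", [[round(v, 2) for v in row] for row in W])
print("features (themes x words):", [[round(v, 2) for v in row] for row in H])
print("reconstruction error:", round(frobenius_error(A, W, H), 3))
```

The result depends on the random starting point, and the factors are only determined up to scaling and theme order, so the numbers will not necessarily match the illustrative values on the slide.
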
Non-negative Matrix Factorisation
[Diagram: Theme 1 and Theme 2, each shown with its associated words and tweets]

Features Matrix
          Word 1  Word 2  Word 3
Theme 1   0.5     0       1
Theme 2   0       0.5     0

Weights Matrix
          Theme 1  Theme 2
Tweet 1   1        0
Tweet 2   0        1
Tweet 3   0        1

Applying NMF and LDA as Content Analysis aids
Non-negative Matrix Factorisation

Tweet-Word Matrix
          Word 1  Word 2  Word n
Tweet 1   1       0       2
Tweet 2   0       1       0
Tweet 3   0       1       1

Tweet-Author Matrix
                 Word 1  Word 2  Word n
Tweet Author 1   1       0       2
Tweet Author 2   0       1       0
Tweet Author 3   0       1       1

Algorithms for the Thematic Analysis of Tweets
Thematic Analysis with Non-negative Matrix Factorisation
- Convert text to term-document matrix
- NMF produces:
  - word-theme matrix
  - theme-document matrix
- Allows theme overlap
- Need to specify number of themes (k)
- Allows for interactivity

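The first and last steps of this pipeline (turning text into a count matrix, and reading a theme back as a list of words) can be sketched as below. This is an illustrative assumption, not the presenter's code: the tweets, the stop-word list and the helper names `tokenize`, `build_matrix` and `top_words` are all hypothetical, and tweets are stored as rows to match the matrix layout shown on the earlier slides.

```python
from collections import Counter

# Hypothetical tweets, invented for illustration only.
tweets = [
    "keynote on hci and dance",
    "vote for the student posters",
    "get the conference iphone app",
]

# A tiny hypothetical stop-word list; real analyses use a much larger one.
STOPWORDS = {"the", "on", "and", "for"}


def tokenize(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def build_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[w] for w in vocab] for c in counts]
    return vocab, matrix


def top_words(theme_row, vocab, n=3):
    """Words with the largest positive weights in one row of the features matrix."""
    ranked = sorted(zip(theme_row, vocab), key=lambda p: (-p[0], p[1]))
    return [word for weight, word in ranked[:n] if weight > 0]


vocab, matrix = build_matrix(tweets)
print(vocab)
for row in matrix:
    print(row)

# Reading Theme 1 from the illustrative features matrix (0.5, 0, 1):
print(top_words([0.5, 0, 1], ["word1", "word2", "word3"], n=2))
```

Listing only the highest-weighted words per theme is what makes the output readable for an analyst, and because a word or tweet can carry weight in more than one theme, the themes are allowed to overlap.
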
#OzChi Analysis: OzChi 2010 Conference
- Theme 1: Elizabeth Churchill Keynote (@xeeliz, dance, hci, yahoo, double rainbow, keynote)
- Theme 2: John Seely Brown Keynote (@jseelybrown, world, extreme, learning)
- Theme 3: 24 hr Student Challenge (@bjkraal, vote, support, posters)
- Theme 4: Get the conf iphone app (@parisba, conference, iphone app)

TreeCloud Analysis of #OzChi
- Create Treeclouds: http://www.lirmm.fr/~gambette/treecloud/

OzChi Abstracts (2006 – 2010)
- http://www.randomsyntax.com/2010/11/24/uncovering-research-themes-from-5-years-of-ozchi-conferences-2006-2010/

Non-negative Tensor Factorisation
Tweet-Word-Time Matrix: one tweet-word slice per month (Jan, Feb, March, April). Each monthly slice in the illustration holds the same example values:

          Word 1  Word 2  Word n
Tweet 1   1       0       2
Tweet 2   0       1       0
Tweet 3   0       1       1

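One way to assemble such a three-way array from timestamped tweets is sketched below. The records and the function name `build_tensor` are hypothetical. Because an individual tweet belongs to a single month, this sketch assumes rows are aggregated by author (as in the Tweet-Author matrix on the earlier slide) so that the same row index is meaningful in every time slice; the slides do not state how the rows were defined.

```python
from collections import Counter

# Hypothetical (month, author, text) records, invented for illustration.
records = [
    ("2011-01", "alice", "keynote hci dance"),
    ("2011-01", "bob", "student posters vote"),
    ("2011-02", "alice", "keynote learning"),
    ("2011-02", "bob", "conference iphone app"),
    ("2011-02", "bob", "iphone app"),
]


def build_tensor(records):
    """Return (months, authors, vocab, tensor) with tensor[t][i][j] = word count.

    t indexes months, i indexes authors, j indexes words.
    """
    months = sorted({month for month, _, _ in records})
    authors = sorted({author for _, author, _ in records})
    vocab = sorted({w for _, _, text in records for w in text.lower().split()})
    counts = Counter()
    for month, author, text in records:
        for word in text.lower().split():
            counts[(month, author, word)] += 1
    # Missing (month, author, word) combinations count as zero.
    tensor = [[[counts[(m, a, w)] for w in vocab] for a in authors] for m in months]
    return months, authors, vocab, tensor


months, authors, vocab, tensor = build_tensor(records)
for month, time_slice in zip(months, tensor):
    print(month)
    for author, row in zip(authors, time_slice):
        print(" ", author, row)
```

Each `tensor[t]` is an ordinary author-word count matrix for one month; a non-negative tensor factorisation would then be fitted to the whole stack rather than to each slice separately.
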
Non-negative Tensor Factorisation
- Nonnegative Tensor Factorization for Knowledge Discovery, Michael W. Berry, CISML Seminar Series, Fall 2010: http://cisml.utk.edu/Seminars/2010/Berry.pdf

Algorithms for the Thematic Analysis of Tweets
- Interactive Theme Explorer developed as part of research

Algorithms for the Thematic Analysis of Tweets
- Interactive Theme Explorer developed as part of research
- Plan to integrate with yourTwapperKeeper (Business Intelligence)
- Share datasets and analysis

Python & Java Algorithms
NMF
- NMF: http://www.csie.ntu.edu.tw/~cjlin/nmf/
- NMF-LIB: http://code.google.com/p/nmflib/
Latent Dirichlet Allocation (LDA)
- Apache Mahout: http://mahout.apache.org/
- WEKA Toolkit: http://www.cs.waikato.ac.nz/ml/weka/

Looking for Collaborators
Twitter: aneesha
Email: aneesha.bakharia@gmail.com

Twitter Graphics from Webdesigner Depot: http://www.webdesignerdepot.com
Graphics converted to wmf format by Elizabeth Hall
