Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Presentation at WAPOR Buenos Aires, June 2015

Published in: Education

  1. Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis
     Damian Trilling & Jeroen Jonkman
     d.c.trilling@uva.nl, @damian0604, www.damiantrilling.net
     Afdeling Communicatiewetenschap, Universiteit van Amsterdam
     WAPOR, Buenos Aires, 16–19 June 2015
  2. Automated framing analysis
     Outline of the talk: Overview; Problems; Sample implementation: INFRA; Empirical example; Conclusions
     Running footer: Packing and Unpacking the Bag of Words, Trilling & Jonkman
  4. Automated framing analysis
     Deductive
     • simple: word lists and search strings
     • advanced: supervised machine learning
     Inductive
     • word frequencies and co-occurrences
     • visualizations
     • principal component analysis
     • cluster analysis
     • latent Dirichlet allocation
     • ...
  5. Automated framing analysis
     The inductive approaches (word frequencies and co-occurrences, visualizations, principal component analysis, cluster analysis, latent Dirichlet allocation, ...) are the focus of our study.
  9. Methodological issues
     What constitutes a frame, and how does this translate to an operationalization?
     • Is a frame fundamentally different from a (sub-)topic? (⇒ topic modeling)
     • Do we expect each element to occur in one and only one frame? (⇒ PCA)
     • Do we need to distinguish between actors, actions, ..., or are all words taken into consideration equally?
     • ...
  10. Practical issues
     • no standard software (but: more and more R packages and Python modules)
     • reliance on inaccessible, self-written, or proprietary software
     • lack of knowledge in the field
     • size of the datasets
  11. A catalogue of criteria
     A toolkit for automated framing analysis should...
     1 not depend on commercial software
     2 run on all major operating systems
     3 be scalable: usable on a laptop, but also on powerful servers to analyze millions of documents
     4 be flexible and open: adaptable to one's own needs
     5 have a powerful database engine in the background
  12. Design. Sample implementation: INFRA
     To meet these criteria, we wrote INFRA in Python, using the NoSQL database MongoDB. The toolkit will be made freely available, both as source code and via a web interface.
  13. Workflow diagram
     Data management phase: data (e.g., Lexis Nexis articles) ⇒ import filter ⇒ NoSQL database ⇒ cleaning and preprocessing filters ⇒ cleaned NoSQL database
     Analysis phase:
     • word frequencies and co-occurrences
     • log likelihood
     • visualizations
     • define details for analysis (e.g., important actors)
     • dictionary filter/named entity recognition
     • latent Dirichlet allocation
     • principal component analysis
     • cluster analysis
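The "log likelihood" step in the analysis phase is not spelled out on the slide. A minimal sketch, assuming it refers to the log-likelihood (G2) keyness statistic often used to compare word frequencies between two corpora, could look like this. The function names and the toy tokens are hypothetical and not taken from INFRA.

```python
import math
from collections import Counter


def log_likelihood(freq_a, total_a, freq_b, total_b):
    """G2 for a word seen freq_a times in total_a tokens vs. freq_b in total_b."""
    combined = freq_a + freq_b
    # Expected counts if the word were equally likely in both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def keyness(tokens_a, tokens_b):
    """Rank all words by how strongly their frequency differs between corpora."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocabulary = set(counts_a) | set(counts_b)
    scores = {
        word: log_likelihood(counts_a[word], len(tokens_a),
                             counts_b[word], len(tokens_b))
        for word in vocabulary
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpora: "bank" is equally frequent in both, so it scores 0.
corpus_a = ["bank"] * 5 + ["profit"] * 5
corpus_b = ["bank"] * 5 + ["loss"] * 5
ranking = keyness(corpus_a, corpus_b)
```

Words at the top of the ranking are the ones that most distinguish one corpus from the other, which makes them candidates for closer inspection.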
  14. Design: Central storage
     The data management phase is handled on the server; analyses can be handled either on the server (SSH) or locally (INFRA).
     Diagram: External data; MongoDB server; Computer1, Computer2, Computer3, Computer4
     Server: Linux VM with MongoDB server. Clients: Python, INFRA, mongo client.
  17. Design: Enjoying the advantages of the bag of words (BOW) and overcoming its shortcomings
     In the preprocessing phase
     • all information is still available
     • we can use custom regexp-based rules and filters, e.g.: if a text contains [list of synonyms of A] and [list of synonyms of B], replace [synonyms of A] with C
     • extremely useful for unifying actors that are referred to in several ways
     In the analysis phase
     • work with a dataset that is much faster to process because it contains only the necessary information
     • no need to deal with misspellings and variations any more
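The conditional rule described above (replace synonyms of A with C only if the text also contains a synonym of B) can be sketched with Python's `re` module. This is an illustration under assumptions, not INFRA's actual rule format: the rule structure, the patterns, and the actor code `royal_dutch_shell` are all hypothetical.

```python
import re

# Hypothetical rule: recode the ambiguous name "Shell" to one actor code,
# but only in texts that also mention an energy-related term.
RULES = [
    {
        "replace": re.compile(r"\b(?:Royal Dutch Shell|Shell)\b"),       # synonyms of A
        "only_if": re.compile(r"\b(?:oil|gas|petrol)\b", re.IGNORECASE),  # synonyms of B
        "with": "royal_dutch_shell",                                      # code C
    },
]


def apply_rules(text, rules=RULES):
    """Apply each substitution only when its condition pattern also matches."""
    for rule in rules:
        if rule["replace"].search(text) and rule["only_if"].search(text):
            text = rule["replace"].sub(rule["with"], text)
    return text


recoded = apply_rules("Shell reports higher oil prices")
untouched = apply_rules("She found a Shell on the beach")
```

In the first text the condition is met and the actor is recoded; in the second, no energy-related term occurs, so the text is left as it is.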
  18. Design: Towards a “best practice” of inductive framing analysis
     In the data management phase
     • spend much time on re-coding relevant multi-word entities to avoid noise (of course, “Barack” and “Obama” occur together) and recode synonyms (how would you otherwise reliably estimate frequencies?) ⇒ especially important for questions like “how is actor X framed?”
     • regular expressions instead of simple word lists!
     • make an informed decision on how to harmonize the dataset (stopword removal, stemming (?), POS tagging (?))
     And: share these procedures!
  19. Design: Towards a “best practice” of inductive framing analysis
     In the analysis phase
     • background knowledge is necessary (face validity)
     • robustness: do slightly different parameters deliver similar results?
     • a dataset that is too small ⇒ sensitivity to atypical events (scandals etc.) ⇒ discovering a topic rather than a frame
     • difference between statistical predictive power and meaningfulness
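The slides do not say how robustness was checked. One simple possibility, offered here only as a sketch, is to rerun a model with slightly different parameters and compare the top words of each topic or component across the two runs using Jaccard similarity. The function names and the toy word lists below are hypothetical.

```python
def jaccard(words_a, words_b):
    """Share of words that two word lists have in common (0.0 to 1.0)."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def best_match_overlap(run_a, run_b):
    """For each topic in run_a, the highest overlap with any topic in run_b."""
    return [max(jaccard(topic, other) for other in run_b) for topic in run_a]


# Hypothetical top words from two runs with slightly different settings.
run_a = [["bank", "loan", "interest"], ["oil", "gas", "price"]]
run_b = [["oil", "gas", "profit"], ["bank", "loan", "rate"]]
overlaps = best_match_overlap(run_a, run_b)
```

Values close to 1 suggest that a topic reappears almost unchanged in the second run; low values point to a solution that depends heavily on the chosen parameters.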
  20. Empirical example: Dutch business news
  21. Steps: Preprocessing steps
     1 Ingest and parse all possibly relevant articles (≈ 500 000)
     2 Compose a list of ≈ 1 500 regular expressions to substitute synonyms and combinations to correctly code actors, allowing for conditional substitutions
     3 Remove stopwords, punctuation, etc.
     4 Determine part of speech; keep only nouns, adjectives, and adverbs
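Step 3 can be illustrated with a few lines of standard-library Python. This is a minimal sketch, not the authors' code: the stopword list is a tiny hypothetical sample, dropping pure numbers is an assumption about what "etc." covers, and step 4 (part-of-speech filtering) is left out because it needs an external tagger.

```python
import re

# Tiny hypothetical Dutch stopword list; a real one would be much longer.
STOPWORDS = {"de", "het", "een", "en", "van", "in", "op"}

# \w+ keeps letters, digits, and underscores together, so actor codes such as
# "royal_dutch_shell" survive as one token while punctuation is dropped.
TOKEN = re.compile(r"\w+")


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, tokenize, and drop stopwords and pure numbers."""
    tokens = TOKEN.findall(text.lower())
    return [t for t in tokens if t not in stopwords and not t.isdigit()]


cleaned = preprocess("De winst van royal_dutch_shell steeg in 2014.")
```

Running the actor substitutions before this step matters: once multi-word names are recoded to a single token, they pass through tokenization intact.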
  22. Steps: Analysis steps
     1 Determine relevant actors with frequency counts, filtering out all non-Dutch words (alternative: named entity recognition)
     2 Conduct PCA, cluster analysis, and LDA; additionally, count the frequency of actor mentions
     3 Fine-tune, repeat, and choose the final model
  23. Output. Example: Attention over time
     Figure: Overview of news attention: attention to 100 firms in company news and entropy (red line) from 2007 to 2013.
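The slide does not define the entropy measure. Assuming it is Shannon entropy over each firm's share of mentions in a given period (and assuming base 2, which is a choice made here), it could be computed as in the following sketch. The function name and the firm labels are hypothetical.

```python
import math
from collections import Counter


def attention_entropy(mentions):
    """Shannon entropy (in bits) of how attention is spread over firms."""
    counts = Counter(mentions)
    total = sum(counts.values())
    # Each firm's share of all mentions is treated as a probability.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical periods: attention spread evenly over four firms vs. one firm.
spread = attention_entropy(["firm_a", "firm_b", "firm_c", "firm_d"])
concentrated = attention_entropy(["firm_a"] * 4)
```

Higher values mean attention is spread evenly across many firms; values near zero mean coverage concentrates on a few.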
  24. Output. Example: Topics
     Figure: Results of a topic model
  25. Output. Example: Components
     Figure: Results of a principal component analysis
  26. Output. Example: Co-occurrences
     Figure: Results of a network visualization of co-occurrences
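The edge weights behind such a network can be obtained by counting how often two actors appear in the same document. The following is a minimal sketch under that assumption; the slides do not state the unit of co-occurrence, and the function name and actor codes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(documents, actors):
    """Count, per pair of actors, the documents in which both are mentioned."""
    edges = Counter()
    for tokens in documents:
        # Sorting makes each pair appear in one fixed order.
        present = sorted(set(tokens) & actors)
        edges.update(combinations(present, 2))
    return edges


# Hypothetical preprocessed documents and actor codes.
docs = [
    ["ing", "abn_amro", "winst"],
    ["ing", "abn_amro"],
    ["ing", "philips"],
]
edges = cooccurrences(docs, {"ing", "abn_amro", "philips"})
```

Each key of `edges` is a pair of actors and each value is the number of documents they share, which can be passed to a graph library as an edge weight.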
  27. Conclusions
     • We developed a toolkit that integrates all recent methods used for automated inductive framing analysis
     • It is free
     • It works with large-scale datasets
     • It can be used by a whole group together
  28. Next steps
     • Regarding the tool: graphical interface
     • Regarding the method: systematic validation study; comparing different approaches and settings
  29. Questions?
     d.c.trilling@uva.nl, @damian0604, www.damiantrilling.net
