15. Political Discourse in the News (KB)


KB symposium "Historische kranten als big data" (Historical newspapers as big data),
The Hague, 24 March 2015

Published in: Government & Nonprofit

  1. 1. Political Discourse in the News 
 (and other studies) Antske Fokkens, VU University Amsterdam
 Political discourse in the News is joint work with: Ellis Aizenberg, Wouter van Atteveldt, Carlotta Cassimassima, Franz-Xaver Geiger, Laura Hollink, Annick van der Peet, Chantal van Son
  2. 2. Overview • Introduction • Interdisciplinary research & research questions • Text analysis • From basic to complex: possibilities and challenges • Methodological issues • Conclusion
  3. 3. Introduction • Interdisciplinary research: • Social Science: manual annotation, research questions • Humanities: research questions • Computer Science: modeling, visualization • Computational Linguistics: text analysis
  4. 4. Introduction • Research questions: • Has personalization increased in political news? • What trends do we see in reported political conflicts? • How does news reporting relate to the parliamentary debates? • What perspectives are expressed by news (explicitly and implicitly)?
  5. 5. Approaches • Manual annotations: • Expert (communication science researchers and Master students) • Crowd (crowdsourcing) • Automatic annotation: • Basic as well as advanced NLP approaches
  6. 6. Text analysis • AmCAT (Wouter van Atteveldt): • Open source infrastructure that facilitates large-scale analysis and manual content analysis of text • BiographyNet/NewsReader pipeline (Piek Vossen’s cltl group): • NLP modules for event (and event relation) extraction & named entity recognition and disambiguation • OpeNER tools (Piek Vossen’s cltl group): • Sentiment analysis and opinion mining
  7. 7. Basic methods • Counting: • occurrences of names in text • identifying words from word lists (e.g. sentiment words) • Topic modeling (e.g. LDA)
  8. 8. Basic methods • Can easily be run on large datasets • Can address research questions (e.g. Aizenberg (2014) shows an increase in personalization) • Limited to overall trends and tendencies • For some tasks, high risk of unreliable results: • e.g. the Dutch word erg is listed with ‘negative sentiment’, although besides ‘bad’ it is also a common intensifier (‘very’)
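To make the counting approach and its main risk concrete, here is a minimal sketch in Python. The name list, the sentiment list, and the example sentence are all invented for illustration; they are not taken from AmCAT or from the lexicons used in the study. The sentence reads "Jansen thinks the plan is very good, but Bakker calls it bad", so erg is used as an intensifier here, yet a plain word-list lookup still counts it as negative.

```python
import re
from collections import Counter

# Hypothetical word lists for illustration only; real lexicons are far larger.
NAMES = {"jansen", "bakker"}
NEGATIVE = {"slecht", "erg"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def count_terms(text, terms):
    """Count how often each term from a word list occurs in the text."""
    counts = Counter(tokenize(text))
    return {term: counts[term] for term in sorted(terms) if counts[term]}


# Invented example sentence: "erg goed" means "very good".
article = "Jansen vindt het plan erg goed, maar Bakker noemt het slecht."

names = count_terms(article, NAMES)
negative = count_terms(article, NEGATIVE)

print(names)
print(negative)
```

The lookup reports two negative hits for a sentence that contains one negative and one positive judgement, which is the kind of unreliable result the slide warns about.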
  9. 9. More advanced analyses
  10. 10. More advanced analyses • Can provide more detailed insight into the content of the text • Scalability becomes an issue (several complex language models) • To illustrate: • approx. 5 minutes per article (regular university cluster) • 11 days for 1.3 million articles on the Hadoop cluster at SURFsara • Accuracy can be low for difficult tasks and because errors from successive processing steps ‘pile up’
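The scalability figures can be put side by side with a short calculation. This is a back-of-the-envelope sketch only: it assumes that the 5 minutes per article is the time for processing articles one after another, and that the Hadoop run covered the same pipeline, neither of which the slide states explicitly.

```python
MINUTES_PER_ARTICLE = 5      # from the slide (approximate)
ARTICLES = 1_300_000         # from the slide
HADOOP_DAYS = 11             # from the slide

# Time needed if articles were processed strictly one after another.
sequential_minutes = MINUTES_PER_ARTICLE * ARTICLES
sequential_days = sequential_minutes / (60 * 24)
sequential_years = sequential_days / 365

# Speedup implied by the 11-day Hadoop run, under the assumptions above.
implied_speedup = sequential_days / HADOOP_DAYS

print(f"sequential: {sequential_days:.0f} days (about {sequential_years:.1f} years)")
print(f"implied speedup on the cluster: about {implied_speedup:.0f}x")
```

Under these assumptions, sequential processing would take roughly 4,500 days (over 12 years), so the 11-day run corresponds to a speedup of roughly 400 times.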
  11. 11. Methodological issues • Data interpretation • Biases • Example: OCR
  12. 12. Data interpretation • Basic methods: • results from counts are clear, but what do they say? • More advanced methods: • attempt to provide semantic interpretations, but what is the accuracy of the tools?
  13. 13. Biases • One way to deal with errors is to assume that they are just noise in a large pile of data • This assumption works if errors are equally distributed across the classes/information that matter for the research question • For instance, when counting sentiment-related terms: • are the lists for negative and positive terms of comparable quality? • does one of the lists contain more ambiguous terms than the other?
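A small worked example shows why unequal list quality matters. All numbers below are hypothetical: the hit counts and the precision values (the share of hits that really express the sentiment) are chosen only to illustrate the effect.

```python
def estimated_true_hits(observed_hits, precision):
    """Estimate how many observed hits are genuine, given the list's precision."""
    return observed_hits * precision


# Hypothetical observations: both lists fire equally often.
observed_positive = 1000
observed_negative = 1000

# Hypothetical list quality: the negative list contains more ambiguous terms.
precision_positive = 0.9
precision_negative = 0.6

observed_ratio = observed_positive / observed_negative
corrected_ratio = (
    estimated_true_hits(observed_positive, precision_positive)
    / estimated_true_hits(observed_negative, precision_negative)
)

print(f"observed positive/negative ratio:  {observed_ratio:.2f}")
print(f"corrected positive/negative ratio: {corrected_ratio:.2f}")
```

The raw counts suggest a balanced tone, while the corrected estimate leans positive. The errors are not harmless noise here because they fall unevenly on the two classes being compared.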
  14. 14. Bias example: OCR • Data from the KB still have some issues with OCR • There tend to be more issues with older data • Imagine we investigate whether emotional expressions in text increased over time:
 Does worse OCR lead to a lower percentage of emotional expressions being identified in older text?
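The question on this slide can be explored with a simple model. The sketch below assumes that a lexicon word is only matched when every character was recognised correctly, and that character errors are independent. The error rates, the word length, and the constant true share of emotion words are hypothetical values, not measurements on KB data.

```python
def detection_rate(char_error_rate, word_length):
    """Probability that all characters of a word survive OCR intact,
    assuming independent per-character errors."""
    return (1 - char_error_rate) ** word_length


TRUE_SHARE = 0.02   # hypothetical: 2% emotion words in every period
WORD_LENGTH = 6     # hypothetical typical length of a lexicon word

# Hypothetical OCR character error rates; older material is assumed worse.
OCR_ERROR = {"1850s": 0.10, "1900s": 0.05, "1950s": 0.01}

# Share of emotion words a lexicon lookup would still find per period.
measured = {
    period: TRUE_SHARE * detection_rate(error, WORD_LENGTH)
    for period, error in OCR_ERROR.items()
}

for period, share in measured.items():
    print(f"{period}: measured share {share:.4f} (true share {TRUE_SHARE:.4f})")
```

In this model the measured share rises from period to period although the true share is constant, so an apparent increase in emotional language could be produced by OCR quality alone.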
  15. 15. Dealing with biases • We cannot exclude the risk of biases completely • We can: • try to make sure researchers using output are aware of the details of the method (raise awareness of possible biases) • carry out both intrinsic and extrinsic evaluation, i.e. explicitly investigate the influence of a bias on overall results
  16. 16. Conclusion • Several research directions where technology (including NLP, linked data, visualizations) is used to support research in Humanities and Social Sciences • NLP approaches vary from basic methods to complex pipelines carrying out several steps • Basic approaches can easily be applied to large datasets and are transparent, but do not say much • More advanced approaches provide detailed information, but cannot easily be applied to large sets and are less transparent • Insight into how data was processed and both intrinsic and extrinsic evaluation are needed to raise awareness about (or even avoid?) biases
  17. 17. Thank you!
  18. 18. References • AmCAT: • BiographyNet/NewsReader pipeline: • Rodrigo Agerri et al. (2015). Event Detection version 2.2. NewsReader Deliverable 4.2.2. • Methodological issues: • Antske Fokkens, Serge ter Braake, Niels Ockeloen, Piek Vossen, Susan Legêne and Guus Schreiber. 2014. BiographyNet: Methodological issues when NLP supports historical research. Proceedings of LREC 2014. • Niels Ockeloen, Antske Fokkens, Serge ter Braake, Piek Vossen, Victor de Boer, Guus Schreiber, and Susan Legêne. 2013. BiographyNet: Managing Provenance at multiple levels and from different perspectives. Proceedings of Linked Science 2013.