
Computational journalism projects

Presentation to Duke University computer science students, February 2012, by Sarah Cohen, Knight Professor of the Practice

Published in: Technology, Education

  1. Reporterslab.org: Presentation for computational journalism students, February 2012
  2. STRUCTURED DATA... And most reporters’ inability to deal with it
  3. New York Times reporters used Word searches and annotations to analyze Wikileaks documents in 2010 and 2011.
  4. PANDA project trying to help gather data inside newsrooms
  5. Barriers to structured data analysis in the newsroom
     • Expensive
     • Too hard to collect
     • It takes practice
     • It takes patience
     • Once collected, data has a short shelf life – its value inside the newsroom effectively ends once a story is published.
  6. Web-scraping software: ephemeral or too expensive for a task not viewed as mission-critical.
  7. Solutions
     • User-friendly tool for scraping websites for structured data
     • Packages of algorithms from fraud and other forensic fields for use with public records datasets online
     • Packages of queries and statistical tests for money, dates, geographical identifiers, names and codes, presented in standard English
     • Tools for fuzzy matching of datasets: include scoring, best match likelihood, interactive machine learning for different datasets
  8. TOO MUCH MATERIAL: With too little information
  9. Too many sources with too little news
     • Twitter, Facebook, LinkedIn and other social media
     • RSS feeds from other news organizations and blogs
     • Press releases from government agencies or beat subjects
     Lack of archiving is just as troubling as the lack of structure. Reporters can’t hold the powerful accountable without information from the past.
  10. Solutions
     • Archiving users’ feeds locally or in the cloud
     • Mash-up social media, RSS feeds into an app that reveals more insight into the sources
     • Formalize each reporter’s definition of “news” through machine learning
     • Alerts for important source material. Example: changing time of a press conference.
  11. The buried treasure: UNUSABLE RECORDS
  12. Solutions
     • Visual extractor of data from scanned forms
     • Separate scanned boxes of documents into their pieces for further analysis
     • Use speech recognition tools on government audio and video
     • OCR video to find the speaker at a hearing
  13. For unstructured data: ANTIQUATED METHODS
  14. Our way
     • Hand-enter individual items into spreadsheets
     • Transcribe interviews, hearings and other audio and video content for searching
     • Read each document
     A newer way
     • Leverage web scraping and paid crowdsourcing for data entry (MT)
     • Use speech recognition for the first pass on searchable audio and video
     • Use clustering, information extraction and other methods for overview of documents
  15. Reporterslab.org working to tame audio and video
  16. Associated Press project to bring order to unstructured data
  17. Wordseer for historical text
  18. Jigsaw
  19. REPORTERSLAB.ORG: Creating sample data and documents for researchers based on real stories
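
The fuzzy-matching idea on the first Solutions slide (scoring plus a best match between datasets) could look roughly like the sketch below. It is only an illustration, not a tool from the presentation: the function names, the sample names and the 0.8 threshold are all assumptions, and it uses Python's standard `difflib` ratio as a stand-in for whatever scoring method a real tool would use.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and sort the words
    so that 'Smith, John' and 'John Smith' compare as equal."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def score(a: str, b: str) -> float:
    """Similarity between two names, from 0.0 (nothing shared) to 1.0 (identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_matches(names, candidates, threshold=0.8):
    """For each name, return (name, best candidate or None, score).

    The threshold is a hypothetical cutoff; a reporter would tune it
    by reviewing borderline pairs by hand.
    """
    results = []
    for name in names:
        scored = [(score(name, c), c) for c in candidates]
        best_score, best = max(scored) if scored else (0.0, None)
        if best_score < threshold:
            best = None
        results.append((name, best, round(best_score, 2)))
    return results


if __name__ == "__main__":
    # Made-up sample names, for illustration only.
    payroll = ["Smith, John", "Zzzz Qqqq"]
    donors = ["John Smith", "Jane Doe"]
    for row in best_matches(payroll, donors):
        print(row)
```

Returning the score alongside each match, rather than a bare yes or no, leaves room for the human review that the slide's "interactive machine learning" bullet implies.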

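The alert example on the second Solutions slide (a press conference whose time changes) depends on keeping an archived copy of the source and comparing it with the current version. A minimal sketch of that comparison follows, assuming the archive is plain text; the function name and the sample announcement are hypothetical, and fetching and storing the pages is left out.

```python
import difflib


def changed_lines(old_text: str, new_text: str):
    """Compare an archived snapshot with the current text, line by line.

    Returns one dict per changed region, listing the lines that were
    removed from the old version and the lines added in the new one.
    """
    old = old_text.splitlines()
    new = new_text.splitlines()
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged lines need no alert
        changes.append({"removed": old[i1:i2], "added": new[j1:j2]})
    return changes


if __name__ == "__main__":
    # Made-up announcement text, for illustration only.
    archived = "Press conference\nTime: 2 p.m.\nRoom 101"
    current = "Press conference\nTime: 4 p.m.\nRoom 101"
    for change in changed_lines(archived, current):
        print("ALERT:", change)
```

An empty result means the source has not changed since it was archived, so no alert is needed.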