Reporterslab.org
Presentation for computational journalism students, February 2012
STRUCTURED DATA ... and most reporters’ inability to deal with it
New York Times reporters used Word searches and annotations to analyze Wikileaks documents in 2010 and 2011.
PANDA project trying to help gather data inside newsrooms
Barriers to structured data analysis in the newsroom
• Expensive
• Too hard to collect
• It takes practice
• It takes patience
• Once collected, data has a short shelf life – its value inside the newsroom effectively ends once a story is published
Web-scraping software: ephemeral or too expensive for a task not viewed as mission-critical.
Solutions
• User-friendly tool for scraping websites for structured data
• Packages of algorithms from fraud and other forensic fields for use with public records datasets online
• Packages of queries and statistical tests for money, dates, geographical identifiers, names and codes, presented in standard English
• Tools for fuzzy matching of datasets: include scoring, best match likelihood, interactive machine learning for different datasets
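The fuzzy-matching bullet can be illustrated with a minimal sketch using only Python's standard library. It is not the tool proposed in the slides: the function names, the 0.8 threshold and the sample names in the usage note are all assumptions made for illustration. It shows the two ideas the bullet names, a similarity score for each pair and a best match per record.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop punctuation and collapse whitespace before comparing."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Score two names from 0.0 (nothing shared) to 1.0 (identical once normalized)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_matches(left, right, threshold=0.8):
    """For each name in `left`, return (name, best candidate or None, score).

    `threshold` is an assumed cutoff; a real project would tune it by
    hand-checking a sample of matches.
    """
    results = []
    for name in left:
        best_score, best_candidate = 0.0, None
        for candidate in right:
            score = similarity(name, candidate)
            if score > best_score:
                best_score, best_candidate = score, candidate
        if best_score < threshold:
            best_candidate = None  # too weak to call a match
        results.append((name, best_candidate, round(best_score, 2)))
    return results
```

Calling `best_matches(["Smith, John A."], ["Smith John A", "Jones, Mary"])` pairs the first name with "Smith John A", because the two strings are identical once punctuation and case are removed. Pairs that score just under the threshold are the ones worth sending to a reporter for manual review.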
Too many sources with too little news
• Twitter, Facebook, LinkedIn and other social media
• RSS feeds from other news organizations and blogs
• Press releases from government agencies or beat subjects
Lack of archiving is just as troubling as the lack of structure. Reporters can’t hold the powerful accountable without information from the past.
Solutions
• Archiving users’ feeds locally or in the cloud
• Mash up social media and RSS feeds into an app that reveals more insight into the sources
• Formalize each reporter’s definition of “news” through machine learning
• Alerts for important source material. Example: changing time of a press conference.
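The archiving and alert bullets can be sketched together: keep every version of an item, and flag it when the text changes. This is a minimal in-memory illustration, not a design from the slides. The `FeedArchive` class, its method names and the item ID in the usage note are hypothetical, and a real tool would persist the history to disk or the cloud.

```python
class FeedArchive:
    """Keeps every version of each feed item and reports changes."""

    def __init__(self):
        # item_id -> list of text versions, oldest first (in memory only)
        self.versions = {}

    def record(self, item_id: str, text: str):
        """Archive `text`; return an alert string if it changed, else None."""
        history = self.versions.setdefault(item_id, [])
        if history and history[-1] == text:
            return None  # nothing new since the last check
        history.append(text)
        if len(history) == 1:
            return None  # first sighting: archive it, but no alert
        previous = history[-2]
        return f"CHANGED {item_id}: {previous!r} -> {text!r}"


if __name__ == "__main__":
    archive = FeedArchive()
    archive.record("mayor-presser", "Press conference at 2 p.m.")
    print(archive.record("mayor-presser", "Press conference at 4 p.m."))
```

Because old versions are never overwritten, the same structure also addresses the archiving problem: the 2 p.m. announcement is still on file after the source has replaced it.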
Solutions
• Visual extractor of data from scanned forms
• Separate scanned boxes of documents into their pieces for further analysis
• Use speech recognition tools on government audio and video
• OCR video to find the speaker at a hearing
Our way
• Hand-enter individual items into spreadsheets
• Transcribe interviews, hearings and other audio and video content for searching
• Read each document
A newer way
• Leverage web scraping and paid crowdsourcing for data entry (MT)
• Use speech recognition for the first pass on searchable audio and video
• Use clustering, information extraction and other methods for overview of documents
Reporterslab.org working to tame audio and video
Associated Press project to bring order to unstructured data