Big, Ugly Datasets for Thumb-Fingered Journalists
  @nclarkjudd, thumb-fingered journalist
We’re swimming in data
  Open Graph Social Media Data Mining Government Data
It’s not getting easier to use …
  With exceptions, like TimeFlow
This is where we come in
  There’s an increasing need for journalists at all levels to be equipped to acquire and analyze big, ugly datasets.
  Without the resources of a New York Times or Washington Post, how do you do that?
What are you doing with data?
  Exploring: Looking for patterns, following hunches, finding context and background; looking to be surprised
  Deducing: Proving a hypothesis, pulling specific records; looking for something in particular
Know the right questions to ask
  When you’re picking a dataset to use, understand its:
    Provenance
    Sampling Method
    Quality
    Completeness
Data Workflow
  Understand your needs
  Acquire your data (Download, FOIL, Sources)
  Clean your data
  Load it into a Relational Database Management System (RDBMS)
  Analyze what you’ve got
  Output relevant segments for visualization
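A minimal sketch of the load, analyze, and output steps above, using Python's standard-library csv and sqlite3 modules. SQLite stands in here for whatever RDBMS you choose. The table name spending, its columns, the helper function names, and the sample figures are all invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical, already-cleaned data. In practice this would be a file
# you downloaded or received through a records request.
RAW_CSV = """agency,year,amount
Parks,2009,1200
Parks,2010,1500
Transit,2009,9800
Transit,2010,10100
"""


def load_rows(conn, csv_text):
    """Create the (hypothetical) spending table and load every CSV row."""
    conn.execute(
        "CREATE TABLE spending (agency TEXT, year INTEGER, amount INTEGER)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    # Each row is a dict, so named placeholders pick out the columns.
    conn.executemany(
        "INSERT INTO spending VALUES (:agency, :year, :amount)", reader
    )
    conn.commit()


def totals_by_agency(conn):
    """Analyze: total amount per agency."""
    query = (
        "SELECT agency, SUM(amount) FROM spending "
        "GROUP BY agency ORDER BY agency"
    )
    return conn.execute(query).fetchall()


def export_segment(rows):
    """Output: a small CSV segment ready for a visualization tool."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["agency", "total"])
    writer.writerows(rows)
    return out.getvalue()


conn = sqlite3.connect(":memory:")
load_rows(conn, RAW_CSV)
totals = totals_by_agency(conn)
segment_csv = export_segment(totals)
print(segment_csv)
```

The exported segment is deliberately small: only the summary rows leave the database, which keeps the visualization step manageable.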
Cleaning Your Data
  Use a script or a robust text editor like vi
  It’s difficult. It takes a while. It gets done.
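One way a cleaning script might look, sketched in Python. The messy sample, the column names, the list of "missing" markers, and the rule that a row without an agency gets rejected are all assumptions made for illustration; a real dataset will have its own quirks.

```python
import csv
import io

# Hypothetical messy input: stray spaces, inconsistent case,
# currency formatting, a missing-value marker, and a blank agency.
MESSY_CSV = """Agency , Year,Amount
 parks ,2009,"$1,200"
TRANSIT,2010,N/A
,2010,300
"""

# Assumed spellings of "no value"; extend this for your own data.
MISSING = {"", "n/a", "na", "null", "-"}


def clean_amount(text):
    """Turn '$1,200' into 1200; return None for missing-value markers."""
    value = text.strip()
    if value.lower() in MISSING:
        return None
    return int(value.replace("$", "").replace(",", ""))


def clean_rows(csv_text):
    """Return (cleaned records, rejected rows with their row numbers)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [name.strip().lower() for name in next(reader)]
    cleaned, rejected = [], []
    # Row 1 is the header, so data rows start at 2.
    for row_no, row in enumerate(reader, start=2):
        record = dict(zip(header, (cell.strip() for cell in row)))
        if not record["agency"]:
            # Keep rejects for later review rather than dropping silently.
            rejected.append((row_no, row))
            continue
        record["agency"] = record["agency"].title()
        record["year"] = int(record["year"])
        record["amount"] = clean_amount(record["amount"])
        cleaned.append(record)
    return cleaned, rejected


cleaned, rejected = clean_rows(MESSY_CSV)
for record in cleaned:
    print(record)
print("rejected:", rejected)
```

Keeping the rejected rows, with their row numbers, makes it possible to go back to the source and ask why they were malformed.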
Fail and Iterate
  Again: It probably won’t work the first time.
  It’s difficult. It takes a while. It gets done.
Analyze
  Check your script: Did I write my query correctly?
  Write queries multiple ways: Do the numbers add up the same when the RDBMS makes sums and when I do them?
  Use checksums: Can I compare results from a segment of this data with previously published and vetted results? Are they the same?
  Consult experts: Ask, "Does this mean what I think it means? Do these results make sense?"
  Output smaller segments of your data to another tool, such as Socrata or ManyEyes, to generate graphs, tables, and visualizations
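The first three checks above can be sketched in Python with the standard-library sqlite3 module. This is an illustration under assumptions: the spending table, its figures, and the "published" 2009 total are all invented, and SQLite stands in for your RDBMS.

```python
import sqlite3

# Hypothetical records: (agency, year, amount).
ROWS = [
    ("Parks", 2009, 1200),
    ("Parks", 2010, 1500),
    ("Transit", 2009, 9800),
    ("Transit", 2010, 10100),
]

# Hypothetical figure standing in for a previously published, vetted total.
PUBLISHED_TOTAL_2009 = 11000

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spending (agency TEXT, year INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO spending VALUES (?, ?, ?)", ROWS)

# Way 1: let the database sum everything at once.
(db_total,) = conn.execute("SELECT SUM(amount) FROM spending").fetchone()

# Way 2: sum per agency in the database, then add the groups up.
group_totals = conn.execute(
    "SELECT agency, SUM(amount) FROM spending GROUP BY agency"
).fetchall()
regrouped_total = sum(total for _agency, total in group_totals)

# Way 3: do the sum yourself, outside the database.
manual_total = sum(amount for _agency, _year, amount in ROWS)

# Compare one segment against the published figure.
(total_2009,) = conn.execute(
    "SELECT SUM(amount) FROM spending WHERE year = ?", (2009,)
).fetchone()

checks = {
    "queries agree": db_total == regrouped_total,
    "database matches manual sum": db_total == manual_total,
    "2009 segment matches published total": total_2009 == PUBLISHED_TOTAL_2009,
}
for name, passed in checks.items():
    print(f"{name}: {'OK' if passed else 'MISMATCH'}")
```

A mismatch does not say which number is wrong. It only says that something needs a closer look before the figure is used.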
Assignment
  You are an investigative team that does freelance work around the country, and you are working up a pitch for your next project.
  Pick a subject matter you want to investigate.
  Identify a dataset or datasets that will help you formulate your story. For this exercise, pick only one that is already available on the Web, e.g. through Data.gov.
  Plan:
    What do you need to clean these data?
    What schema will you make to house the dataset(s)?
    What are you doing with these data: are you using them for exploratory or deductive reasoning?
    What will your queries look like?
    Will you join multiple databases together? If so, how are you sure the results will be relevant?
    How will you express the results of your inquiry?
    What questions won’t the data answer that you want to address in your project?
    Who will you turn to as you start looking for these answers?