Big Ugly Datasets For Thumb-Fingered Journalists


Presentation by Nick Judd. Audio is here:

  1. Big, Ugly Datasets for Thumb-Fingered Journalists
     @nclarkjudd, thumb-fingered journalist
  2. We’re swimming in data
     - Open Graph
     - Social Media Data Mining
     - Government Data
  3. It’s not getting easier to use
     … With exceptions, like TimeFlow
  4. This is where we come in
     - There’s an increasing need for journalists at all levels to be equipped to acquire and analyze big, ugly datasets
     - Without the resources of a New York Times or Washington Post, how do you do that?
  5. What are you doing with data?
     - Exploring: Looking for patterns, following hunches, finding context and background; looking to be surprised
     - Deducing: Proving a hypothesis, pulling specific records; looking for something in particular
  6. Know the right questions to ask
     When you’re picking a dataset to use, understand its:
     - Provenance
     - Sampling
     - Method
     - Quality
     - Completeness
  7. Data Workflow
     - Understand your needs
     - Acquire your data (Download, FOIL, Sources)
     - Clean your data
     - Load it into a Relational Database Management System (RDBMS)
     - Analyze what you’ve got
     - Output relevant segments for visualization
  8. Cleaning Your Data
     - Use a script or a robust text editor like vi
     - It’s difficult. It takes a while. It gets done.
  9. Load your data
  10. Fail and Iterate
     - Again: It probably won’t work the first time.
     - It’s difficult. It takes a while. It gets done.
  11. Analyze
     - Check your script. Did I write my query correctly?
     - Write queries multiple ways. Do the numbers add up the same when the RDBMS makes sums and when I do them?
     - Use checksums: Can I compare results from a segment of this data with previously published and vetted results? Are they the same?
     - Consult experts. Ask: Does this mean what I think it means? Do these results make sense?
     - Output smaller segments of your data to another tool such as Socrata or ManyEyes in order to generate graphs, tables, and visualizations
  12. Share
     Photo: Britta Bohllinger / Flickr
  13.
  14. Resources
  15. Assignment
     You are an investigative team that does freelance work around the country and are working up a pitch for your next project.
     - Pick a subject matter you want to investigate
     - Identify a dataset or datasets that will help you formulate your story. For this exercise, only pick one available on the Web already, e.g. through
     - Plan:
       - What do you need to clean these data?
       - The schema you’ll make to house the dataset(s)
       - What are you doing with this data? Are you using it for exploratory or deductive reasoning?
       - What will your queries look like? Will you join multiple databases together? If so, how are you sure the results will be relevant?
       - How will you express the results of your inquiry?
       - What questions won’t the data answer that you want to address in your project? Who will you turn to as you start looking for these answers?
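
The clean, load, analyze, and output steps from the workflow slides can be sketched in a few lines of Python. This is only an illustration, not the presenter's own method: the sample data, the `spending` table, and every function name below are hypothetical, and SQLite (via Python's built-in `sqlite3` module) stands in for whatever RDBMS you actually use. The sketch cleans a small messy CSV, loads it into a table, computes the same totals two ways (once in SQL, once by hand) as the "Analyze" slide suggests, and writes out one small segment for a visualization tool.

```python
import csv
import io
import sqlite3

# Hypothetical raw data: stray whitespace, currency formatting, one blank record.
RAW_CSV = """agency, year ,amount
Parks ,2009,"$1,200.50"
 Police,2009,"$3,000"
Parks,2010,$800
,,
Police,2010,"$2,999.50"
"""


def clean_rows(text):
    """Yield (agency, year, amount_cents) tuples, skipping blank records."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    for row in reader:
        fields = [field.strip() for field in row]
        if not any(fields):
            continue  # every field is empty: drop the record
        agency, year, amount = fields
        # Store money as integer cents to avoid floating-point drift in sums.
        cents = round(float(amount.replace("$", "").replace(",", "")) * 100)
        yield agency, int(year), cents


def load(rows):
    """Create an in-memory SQLite database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE spending ("
        "agency TEXT NOT NULL, year INTEGER NOT NULL, amount_cents INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO spending VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def totals_by_agency_sql(conn):
    """Let the database do the summing."""
    query = "SELECT agency, SUM(amount_cents) FROM spending GROUP BY agency"
    return dict(conn.execute(query).fetchall())


def totals_by_agency_python(conn):
    """Sum the same records by hand, as an independent cross-check."""
    totals = {}
    for agency, cents in conn.execute("SELECT agency, amount_cents FROM spending"):
        totals[agency] = totals.get(agency, 0) + cents
    return totals


def segment_csv(conn, agency):
    """Return one agency's records as CSV text for a visualization tool."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["agency", "year", "amount_cents"])
    writer.writerows(
        conn.execute(
            "SELECT agency, year, amount_cents FROM spending "
            "WHERE agency = ? ORDER BY year",
            (agency,),
        )
    )
    return out.getvalue()


rows = list(clean_rows(RAW_CSV))
conn = load(rows)
sql_totals = totals_by_agency_sql(conn)
py_totals = totals_by_agency_python(conn)

# If the two methods disagree, the query or the cleaning step needs another look.
assert sql_totals == py_totals
print(sql_totals)
print(segment_csv(conn, "Parks"))
```

The same pattern extends to the "checksum" idea from the slides: compare a total from one segment of your table against a previously published, vetted figure before trusting the rest of your results.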