Big Ugly Datasets For Thumb-Fingered Journalists

Presentation by Nick Judd. Audio is here: http://ow.ly/2RMQG


  1. Big, Ugly Datasets for Thumb-Fingered Journalists
     @nclarkjudd, thumb-fingered journalist
  2. We’re swimming in data
     Open Graph
     Social Media Data Mining
     Government Data
  3. It’s not getting easier to use
     … With exceptions, like TimeFlow
  4. This is where we come in
     There’s an increasing need for journalists at all levels to be equipped to acquire and analyze big, ugly datasets
     Without the resources of a New York Times or Washington Post, how do you do that?
  5. What are you doing with data?
     Exploring: Looking for patterns, following hunches, finding context and background. You are looking to be surprised.
     Deducing: Proving a hypothesis, pulling specific records. You are looking for something in particular.
  6. Know the right questions to ask
     When you’re picking a dataset to use, understand its:
     Provenance
     Sampling
     Method
     Quality
     Completeness
  7. Data Workflow
     Understand your needs
     Acquire your data (Download, FOIL, Sources)
     Clean your data
     Load it into a Relational Database Management System (RDBMS)
     Analyze what you’ve got
     Output relevant segments for visualization
  8. Cleaning Your Data
     Use a script or a robust text editor like vi
     It’s difficult. It takes a while. It gets done.
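To illustrate the "use a script" option, here is a minimal Python sketch. It is not from the presentation: the column names `name` and `amount` and the sample rows are hypothetical. It collapses stray whitespace, strips dollar signs and thousands separators, and sets aside rows it cannot parse instead of silently dropping them, so they can be reviewed by hand.

```python
import csv
import io


def clean_rows(raw_text):
    """Split messy CSV text into (clean, rejected) lists of rows.

    Assumes hypothetical columns 'name' and 'amount'.
    """
    clean, rejected = [], []
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Collapse runs of whitespace and trim the ends.
        name = " ".join((row.get("name") or "").split())
        # Remove currency symbols and thousands separators.
        amount_text = (row.get("amount") or "").replace("$", "").replace(",", "").strip()
        try:
            amount = float(amount_text)
        except ValueError:
            rejected.append(row)  # keep for manual review
            continue
        if not name:
            rejected.append(row)  # a payment with no name is not usable
            continue
        clean.append({"name": name, "amount": amount})
    return clean, rejected


# Hypothetical sample data with typical problems.
SAMPLE = '''name,amount
  Acme   Corp ,"$1,200.50"
Widget LLC,n/a
,300
Baker & Sons,75
'''

clean, rejected = clean_rows(SAMPLE)
print(len(clean), "clean rows;", len(rejected), "rows set aside for review")
```

Keeping the rejected rows matters: the count of rows you set aside is part of what you need to know about the dataset's quality and completeness.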
  9. Load your data
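As a rough sketch of what loading can look like, the Python below uses the standard-library `sqlite3` module as a stand-in for whatever RDBMS you choose (the resources for this talk point to MySQL). The `payments` table, its columns, and the sample rows are hypothetical. The last step compares the number of rows in the table with the number of rows you meant to load.

```python
import sqlite3

# Hypothetical cleaned rows: (vendor, paid_on, amount).
rows = [
    ("Acme Corp", "2010-01-15", 1200.50),
    ("Baker & Sons", "2010-02-03", 75.00),
    ("Acme Corp", "2010-03-22", 310.25),
]


def load_rows(conn, rows):
    """Create the table if needed, insert the rows, and return the row count."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS payments (
            id INTEGER PRIMARY KEY,
            vendor TEXT NOT NULL,
            paid_on TEXT NOT NULL,
            amount REAL NOT NULL
        )
        """
    )
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO payments (vendor, paid_on, amount) VALUES (?, ?, ?)",
            rows,
        )
    (count,) = conn.execute("SELECT COUNT(*) FROM payments").fetchone()
    return count


conn = sqlite3.connect(":memory:")  # throwaway in-memory database
loaded = load_rows(conn, rows)
print("loaded", loaded, "of", len(rows), "rows")
```

If the two counts differ, something was lost or duplicated on the way in, and it is time to go back to the cleaning step.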
  10. Fail and Iterate
      Again: It probably won’t work the first time.
      It’s difficult. It takes a while. It gets done.
  11. Analyze
      Check your script. Did I write my query correctly?
      Write queries multiple ways. Do the numbers add up the same when the RDBMS makes sums and when I do them?
      Use checksums: Can I compare results from a segment of this data with previously published and vetted results? Are they the same?
      Consult experts. Ask: Does this mean what I think it means? Do these results make sense?
      Output smaller segments of your data to another tool such as Socrata or ManyEyes in order to generate graphs, tables, and visualizations
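The checks on this slide can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the `payments` table, the sample rows, and the "published" figure are hypothetical, and `sqlite3` stands in for your RDBMS. The sketch totals the same column three ways, compares one segment against a previously published number, and writes a small summary as CSV text that a visualization tool could import.

```python
import csv
import io
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (vendor TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [("Acme Corp", 1200.50), ("Baker & Sons", 75.00), ("Acme Corp", 310.25)],
)

# 1. Let the database make the sum.
(db_total,) = conn.execute("SELECT SUM(amount) FROM payments").fetchone()

# 2. Make the same sum yourself from the raw rows.
hand_total = math.fsum(a for (a,) in conn.execute("SELECT amount FROM payments"))

# 3. Write the query another way: per-vendor subtotals should add up
#    to the same grand total.
by_vendor = dict(
    conn.execute("SELECT vendor, SUM(amount) FROM payments GROUP BY vendor")
)
regrouped_total = math.fsum(by_vendor.values())

totals_agree = math.isclose(db_total, hand_total) and math.isclose(
    db_total, regrouped_total
)

# 4. Compare one segment with a previously published, vetted figure.
PUBLISHED_ACME_TOTAL = 1510.75  # hypothetical number, for illustration only
matches_published = math.isclose(by_vendor["Acme Corp"], PUBLISHED_ACME_TOTAL)

# 5. Output a smaller segment as CSV for a visualization tool.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["vendor", "total"])
writer.writerows(sorted(by_vendor.items()))
segment_csv = out.getvalue()

print("totals agree:", totals_agree)
print("matches published figure:", matches_published)
print(segment_csv)
```

A mismatch at any step is not necessarily an error in the data. It may be a mistake in one of the queries, which is the point of writing them more than one way.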
  12. Share
      Photo: Britta Bohllinger / Flickr
      SPJ.org
      IRE.org
      HacksHackers.com
  13. Resources
      http://dev.mysql.com/doc/refman/5.1/en/
      http://github.com/FlowingMedia/TimeFlow/wiki
      http://www.lagmonster.org/docs/vi.html
      http://www.socrata.com/
      http://www.data.gov
  14. Assignment
      You are an investigative team that does freelance work around the country and are working up a pitch for your next project.
      Pick a subject matter you want to investigate
      Identify a dataset or datasets that will help you formulate your story. For this exercise, only pick one available on the Web already, e.g. through Data.gov.
      Plan:
      What do you need to clean these data?
      The schema you’ll make to house the dataset(s)
      What are you doing with this data: are you using it for exploratory or deductive reasoning?
      What will your queries look like? Will you join multiple databases together? If so, how are you sure the results will be relevant?
      How will you express the results of your inquiry?
      What questions won’t the data answer that you want to address in your project? Who will you turn to as you start looking for these answers?
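For the planning questions about schemas and joins, one possible starting point is sketched below. The two tables, their columns, and the sample rows are hypothetical, and `sqlite3` again stands in for your RDBMS. The second query shows one way to test whether a join is trustworthy: count the records that fail to match before reporting any totals built on the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table per dataset, linked by vendor_id.
conn.executescript(
    """
    CREATE TABLE vendors (
        vendor_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE contracts (
        contract_id INTEGER PRIMARY KEY,
        vendor_id INTEGER,
        amount REAL NOT NULL
    );
    INSERT INTO vendors VALUES (1, 'Acme Corp'), (2, 'Baker & Sons');
    INSERT INTO contracts VALUES (10, 1, 500.0), (11, 2, 125.0), (12, 3, 80.0);
    """
)

# The query you want for the story: contract totals per vendor.
joined = conn.execute(
    """
    SELECT v.name, SUM(c.amount)
    FROM contracts AS c
    JOIN vendors AS v ON v.vendor_id = c.vendor_id
    GROUP BY v.name
    ORDER BY v.name
    """
).fetchall()

# The sanity check: contracts whose vendor_id matches no vendor.
# An inner join drops these rows without warning.
orphans = conn.execute(
    """
    SELECT c.contract_id
    FROM contracts AS c
    LEFT JOIN vendors AS v ON v.vendor_id = c.vendor_id
    WHERE v.vendor_id IS NULL
    """
).fetchall()

print(joined)
print("contracts with no matching vendor:", orphans)
```

In this sample, one contract points at a vendor that is not in the vendors table. That unmatched record is the kind of thing to raise with a source or an expert before publishing totals.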
