Ona 2012
Collect, clean, manipulate data

Published in: Technology, Business

  1. Data: Collect, Clean and Manipulate. ONA 2012, San Francisco. Jennifer LaFleur, ProPublica, jennifer.lafleur@propublica.org, @j_la28
  2. Why data? It takes you beyond the anecdote. It’s easier than counting sheets of paper.
  3. Why data? Contrasts are in the data.
  4. Caution: This slide contains extreme nerdiness
  5. Why Computer-Assisted Reporting? Contrasts are in the data. Your most powerful figures are in the data.
  6. Source: California Health Dept. data, Medicare billing data. Findings: Some hospitals had “alarming rates of a Third World nutritional disorder among its Medicare patients.”
  7. Why data? Contrasts are in the data. Your most powerful figures are in the data. You can make connections you might not be able to make otherwise.
  8. Data: Youth prison workers, criminal convictions and grievance data. Findings: Employees with criminal backgrounds were more likely to be accused of abusing inmates.
  9. Data: Federal bridge inspections and stimulus funding. Findings: Some of the nation’s worst bridges did not get stimulus funds.
  10. Why data? Contrasts are in the data. Your most powerful figures are in the data. You can make connections you might not be able to make otherwise. You can test assumptions.
  11. Source: NHTSA complaint data. Findings: “…unintended acceleration has been a problem across the auto industry.”
  12. HT/Florencia Coelho
  13. Collecting the data
  14. Where’s the data? If something is inspected, licensed, enforced or purchased… there probably is a database.
  15. Where’s the data? If there is a report or a form, there probably is a database.
  16. Where’s the data? Sometimes data is readily available online for download.
  17. Source: Census. Findings: “Fueled by the dismal economy and high unemployment, more Americans…are doubling up”
  18. Source: Medicaid nursing home survey data and finance data, housing data. Findings: “…a shortage of places for the disabled to live outside a nursing home and regulations that critics say make it hard to qualify for home services mean many who want out continue to receive expensive nursing care.”
  19. Where’s the data? Sometimes you have to scrape it. That usually involves programs that automate searching tasks on Web sites.
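As an illustration of the scraping idea on this slide, here is a minimal sketch that pulls the rows out of an HTML table using only Python's standard library. The page content, the facility names and the `TableScraper` class are all made up for the example; a real scraper would first download each results page and should respect the site's terms of use.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, one list per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page content standing in for a real search-results page.
SAMPLE_HTML = """
<table>
  <tr><th>Facility</th><th>Inspection date</th></tr>
  <tr><td>Example Care Center</td><td>2012-03-01</td></tr>
  <tr><td>Sample Nursing Home</td><td>2012-04-15</td></tr>
</table>
"""


def scrape_table(html):
    """Return the table rows found in an HTML string."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

Automating "searching tasks" then amounts to looping this over many result pages and saving the rows to a file.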
  20. Where’s the data? More often you need to go to an agency to get the data. This can be tricky if an agency doesn’t want to release it. (Stay tuned for more on that…)
  21. Source: School district credit card purchases. Findings: District cardholders made questionable purchases with their cards.
  22. Sometimes, there is no data. But it’s okay because there are techniques for sampling and building a database.
  23. ProPublica pulled a random sample of 500 names from a list of individuals who had been granted or denied pardons (around 2,000). We created a database from months of researching individuals: their crime, age, sentence… We found that even after controlling for other factors, whites were more likely to get a pardon.
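Drawing a simple random sample like the one described on this slide can be sketched in a few lines. This is not ProPublica's actual procedure, only an illustration: the list of names is a placeholder, and the `draw_sample` helper and the seed value are assumptions made for the example.

```python
import random


def draw_sample(names, k, seed=None):
    """Return k names drawn at random without replacement.

    Passing a seed makes the draw repeatable, which helps when
    documenting the methodology later.
    """
    rng = random.Random(seed)
    return rng.sample(names, k)


# Placeholder population of roughly the size mentioned on the slide.
population = [f"person_{i:04d}" for i in range(2000)]
sample = draw_sample(population, 500, seed=2012)

print(len(sample), "names drawn; first three:", sample[:3])
```

Recording the seed in a notes file lets someone else reproduce exactly the same sample.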
  24. Source: Loan details, foreclosure information and bankruptcy filings. Findings: Loans leading to foreclosure didn’t always follow conventional wisdom.
  25. When you have to ask for the data. Before filing a request: Ask for it. If they require a formal request, find out who it should go to and what you should ask for. The letter should: describe what you’re asking for; note that you’re willing to negotiate; ask for a cost estimate.
  26. Dear Records Administrator:
      I’m writing to request under the Texas Public Information Act an electronic copy of the current health-related services registry database for the state of Texas. I also am requesting electronic copies or a database of all complaints filed against health-related service registry members since Jan. 1, 2000.
      I frequently deal with large raw databases, so I would be able to accept information in several formats including ASCII, dbf, xls, etc… and can accept the data on a variety of media (computer tape, CD-ROM, FTP, email attachment, etc...). Please include record layouts, code sheets or any other documentation necessary to interpret the data.
      I am requesting all data fields. If there are any fields that you must withhold by law, please let me know what those fields are, so I can amend my request.
      In the interest of expediency, and to minimize the research and/or duplication burden on your staff, I would be happy to speak with your database administrator to figure out a method that is easiest for you.
      If you have questions or need more information, please contact me by telephone or email. My telephone number is: 214-977-8509. My email address is jlafleur@dallasnews.com.
      If you will be charging processing fees, please send me an itemized estimate explaining how the costs were calculated.
  27. Getting electronic information: Know the law. Know how your state treats (or doesn’t) the records you need. Know what information you want. Do your homework. Know what the appropriate cost should be. Know who does the data entry. Get to know Leon. When something may not clearly be public, use your sourcing.
  28. Just another way of saying no: huge costs; delay tactics; “Oh you silly little journalist”; sending you the wrong thing; “Your request was unclear”; HIPAA; privacy; privatization.
  29. Negotiating: Some examples
  30. Our database is on a mainframe and it’s very complicated, Missy
  31. We don’t have the authority to do that
  32. That will cost $25,000.
  33. We have processed your request. The labor cost for the request is as follows.
      Item             # of hours
      RESEARCH         20
      CREATING FILES   6
      CODING           24
      TESTING          4
      Total (54 X $72) = $3,888.00
  34. From Texas Public Information Act: 111.67. Estimates and Waivers of Public Information Charges (a) A governmental body is required to provide a requestor with an itemized statement of estimated charges if charges for copies of public information will exceed $40, or if a charge in accordance with §111.65 of this title (relating to Access to Information Where Copies Are Not Requested) will exceed $40 for making public information available for inspection. A governmental body that fails to provide the required statement may not collect more than $40. The itemized statement must be provided free of charge and must contain the following information:
  35. We only keep the information for 7 days
  36. Check retention schedules
  37. That uses proprietary software.
  38. We don’t keep that on computer
  39. Okay, we do, but it’s a lot of files
  40. That information is protected by law
  41. Cleaning data
  42. Remember that data are not perfect
  43. It doesn’t mean you can’t use it… Do integrity checks to find the flaws. Add caveats where necessary. Do your own analysis rather than relying on an agency’s analysis of bad data.
  44. Integrity checks for every data set: Read the documentation. Understand the contents of every field. Know how many records you should have. Check counts and totals against reports. Are all possibilities included? All states, all counties, correct ranges?
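The count and coverage checks on this slide can be sketched as a small function. The field name, the list of expected values and the expected record count below are assumptions for illustration; in practice they would come from the agency's documentation or a published summary report.

```python
def coverage_report(records, field, expected_values, expected_count):
    """Compare a data set against what the documentation says it should hold."""
    seen = {record[field] for record in records}
    expected = set(expected_values)
    return {
        # Does the number of records match the documented total?
        "record_count_ok": len(records) == expected_count,
        # Values that should appear but do not (e.g. a missing state).
        "missing": sorted(expected - seen),
        # Values that appear but should not (e.g. a typo or bad code).
        "unexpected": sorted(seen - expected),
    }


# Made-up records: the documentation promised 4 rows covering CA, NY and TX.
records = [{"state": "TX"}, {"state": "CA"}, {"state": "ZZ"}]
report = coverage_report(records, "state", ["CA", "NY", "TX"], expected_count=4)
print(report)
```

A short record count, a missing state or an unknown code is a question to take back to the agency before any analysis starts.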
  45. Integrity checks for every data set. Internal data checks: Is there more money going to sub-contractors than went to the prime contractor? Are there more teachers than students? Do people have birth dates in the future or so long ago they would be long gone?
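The internal checks on this slide translate directly into small tests that can be run against every record. The sketch below is one possible way to write them; the field names and the 115-year age cutoff are assumptions chosen for the example, not a standard.

```python
from datetime import date

MAX_PLAUSIBLE_AGE = 115  # assumed cutoff for "so long ago they would be long gone"


def age_in_years(birth_date, today):
    """Whole years between birth_date and today."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


def birth_date_problem(birth_date, today):
    """Return a description of the problem, or None if the date looks plausible."""
    if birth_date > today:
        return "in the future"
    if age_in_years(birth_date, today) > MAX_PLAUSIBLE_AGE:
        return "implausibly old"
    return None


def more_teachers_than_students(school):
    """True when a school record reports more teachers than students."""
    return school["teachers"] > school["students"]


def subs_exceed_prime(prime_amount, sub_amounts):
    """True when sub-contractors received more in total than the prime contractor."""
    return sum(sub_amounts) > prime_amount


today = date(2012, 9, 21)
print(birth_date_problem(date(2050, 1, 1), today))
print(birth_date_problem(date(1850, 1, 1), today))
print(more_teachers_than_students({"teachers": 40, "students": 12}))
print(subs_exceed_prime(100_000, [70_000, 45_000]))
```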
  46. If your data is in Excel, use the filter function to see what the values are in individual fields.
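Outside Excel, the same "what values are in this field?" check can be done by counting distinct values. The sketch below assumes the data is available as CSV text; the column names and rows are made up.

```python
import csv
import io
from collections import Counter

# Hypothetical data with the kinds of inconsistencies a filter would reveal.
CSV_TEXT = "name,city\nA,Dallas\nB,DALLAS\nC,Dallas \nD,\nE,Dallas\n"


def value_counts(csv_text, field):
    """Count how often each distinct value appears in one column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[field] for row in reader)


counts = value_counts(CSV_TEXT, "city")
for value, n in counts.most_common():
    # repr() makes stray spaces and blank values visible.
    print(repr(value), n)
```

Variant spellings, trailing spaces and blanks show up as separate values, which is exactly what scrolling through an Excel filter list surfaces.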
  47. Integrity checks for every data set: Check for missing data, misplaced data or blank fields. Use a standard naming convention for files and tables (I wouldn’t recommend “final”). Check for duplicates. Take margins of error into account if necessary (important if you’re using Census data).
  48. 2010 Census ACS: Median HH Income by Metro Area
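One way to take margins of error into account when comparing two ACS estimates is sketched below. It assumes both margins are the published 90 percent margins of error, in which case two estimates differ significantly at that level when the gap between them is larger than the square root of the sum of the squared margins. The income figures are invented for the example.

```python
import math


def differ_significantly(est_a, moe_a, est_b, moe_b):
    """True if two estimates differ at the confidence level of their margins.

    Assumes moe_a and moe_b are margins of error at the same confidence
    level (ACS tables publish 90 percent margins).
    """
    combined_moe = math.sqrt(moe_a ** 2 + moe_b ** 2)
    return abs(est_a - est_b) > combined_moe


# Invented median household incomes for two metro areas.
print(differ_significantly(52_000, 1_500, 50_500, 1_200))  # gap of 1,500
print(differ_significantly(55_000, 1_500, 50_500, 1_200))  # gap of 4,500
```

In the first comparison the gap is smaller than the combined margin, so ranking one metro area above the other would not be supported by the data.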
  49. Be creative when you look for duplicates
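Being creative about duplicates usually means looking past exact matches. A minimal sketch, assuming names are the field being checked: reduce each name to a normalized key (lowercase, punctuation dropped, words sorted) and group records that share a key. The `name_key` rule here is only one possible choice, and the names are made up.

```python
import re
from collections import defaultdict


def name_key(name):
    """Normalize a name so that case, punctuation and word order do not matter."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))


def possible_duplicates(names):
    """Group names that collapse to the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


names = ["John Smith", "SMITH, JOHN", "John  Smith.", "Jane Doe"]
dupes = possible_duplicates(names)
print(dupes)
```

Groups found this way are candidates only; each still needs to be checked against other fields before records are merged or dropped.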
  50. Beyond the basics: Keep a notes file. Don’t work off your original database. Know the source. Check against summary reports. Use the right tool. Check for outliers when it comes to ups and downs.
  51. Truck accidents by year and agency
  52. Beyond the basics: Check with experts. Are there standards? (ex: a drop by more than 10 percentage points is a red flag) Find out what others have done. Gut check. Go physically see a record or spot check against documents.
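The red-flag rule mentioned on this slide can be applied mechanically to a series of yearly percentages. The sketch below flags any year-to-year change larger than a threshold in percentage points; the 10-point default comes from the slide's example, while the yearly figures are invented.

```python
def flag_swings(series, threshold=10.0):
    """Flag year-to-year changes larger than `threshold` percentage points.

    `series` is a list of (year, percent) pairs sorted by year.
    Returns (from_year, to_year, change) for each flagged step.
    """
    flags = []
    for (year_a, pct_a), (year_b, pct_b) in zip(series, series[1:]):
        change = pct_b - pct_a
        if abs(change) > threshold:
            flags.append((year_a, year_b, change))
    return flags


# Invented pass rates for one school, in percent.
pass_rates = [(2008, 71.0), (2009, 72.5), (2010, 58.0), (2011, 60.0)]
flags = flag_swings(pass_rates)
print(flags)
```

A flagged swing is a prompt to ask what changed, whether in the underlying reality or in how the data was collected.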
  53. Voter Fraud: Dozens of St. Louis voters are being wrongly accused of casting ballots from fraudulent addresses in last year’s Nov. 7 election. They are among thousands of registered voters who, based on city property records, appear to live on vacant lots.
  54. Texas test score data, official results versus district. Duncanville district reported 4th grade writing. Official report for Duncanville 4th grade writing. Courtesy Holly Hacker, The Dallas Morning News
  55. Three rounds of analysis after bouncing off subjects and experts: demographically based; voir dire; socioeconomics.
  56. Checks when you’re matching data: A name is not enough. Lots of people have the same name. Get dates of birth and other information to make sure you have the correct person.
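A sketch of the matching advice on this slide: treat two records as the same person only when both name and date of birth agree, and set aside name-only hits for manual review. The record layout (`name` and `dob` fields) and the sample people are assumptions for the example.

```python
def normalize(name):
    """Lowercase and trim a name so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def match_people(list_a, list_b):
    """Match records on name plus date of birth.

    Returns (matches, name_only): pairs that agree on both fields, and
    records from list_a whose name appears in list_b with a different
    date of birth. The second group needs more reporting, not a headline.
    """
    by_name_dob = {}
    names_b = set()
    for person in list_b:
        key = (normalize(person["name"]), person["dob"])
        by_name_dob.setdefault(key, []).append(person)
        names_b.add(key[0])

    matches, name_only = [], []
    for person in list_a:
        key = (normalize(person["name"]), person["dob"])
        if key in by_name_dob:
            matches.extend((person, other) for other in by_name_dob[key])
        elif key[0] in names_b:
            name_only.append(person)
    return matches, name_only


# Invented records: two employees share a name, only one shares the birth date.
employees = [
    {"name": "Pat Jones", "dob": "1970-01-02"},
    {"name": "Pat Jones", "dob": "1985-06-30"},
]
convictions = [{"name": "PAT JONES", "dob": "1970-01-02"}]

matches, name_only = match_people(employees, convictions)
print(len(matches), "confirmed match;", len(name_only), "name-only hit to check")
```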
  57. Source: Illinois health data, police data. Findings: Dangerous systemic failed to protect elderly patients in Illinois nursing homes that also house mentally ill younger residents, including murderers, sex offenders, and armed robbers.
  58. Even people with seemingly unique names aren’t so unique
  59. Evaluating outside studies: Get the questionnaire and methodology. Beware of nonscientific methods: Web surveys, man on the street. Know the sample size and sampling error. Account for margin of error and non-response when drawing conclusions. Run statistical tests on the data if possible.
  60. Reporting data: Consider reporting rates, not raw numbers. Avoid false precision: 53.14 percent said … in a poll with a 5 percentage point margin of error. Avoid number overload: “about half” is usually just as useful as 51 percent in most cases. Adjust money for inflation. When analyzing income, use median rather than average (Bill Gates factor).
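Several of the reporting rules on this slide are simple arithmetic, sketched below with invented numbers. The function names are illustrative, and the price-index values passed to `adjust_for_inflation` are placeholders, not real CPI figures.

```python
from statistics import mean, median


def rate_per_100k(count, population):
    """Express a raw count as a rate per 100,000 people."""
    return count * 100_000 / population


def adjust_for_inflation(amount, index_then, index_now):
    """Convert an old dollar amount to current dollars using a price index."""
    return amount * index_now / index_then


# Rates, not raw numbers: the smaller town has fewer incidents but a higher rate.
print(rate_per_100k(50, 200_000))    # 25.0
print(rate_per_100k(60, 1_000_000))  # 6.0

# The "Bill Gates factor": one huge income drags the average, not the median.
incomes = [30_000, 40_000, 50_000, 60_000, 1_000_000_000]
print(mean(incomes))    # 200036000
print(median(incomes))  # 50000

# Placeholder index values: prices doubled between the two dates.
print(adjust_for_inflation(100, index_then=50, index_now=100))  # 200.0
```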
  61. When the data is the problem, you might still have a story. Erroneous government databases can often be a story themselves.
  62. Manipulating data for stories and apps
  63. Know which tool to use: reporting individual records; counting/summing; mapping; statistics.
  64. Source: Medicaid outcomes data for dialysis facilities. Findings: A CMS online tool did not tell the whole story about facilities. In some counties the gaps in measures, such as survival rate, were vast.
  65. Source: Washington Health Department data. Findings: “MRSA has been quietly killing in hospitals for decades.” But no one had tracked it until this story.
  66. Source: Dept. of Ed data and surveys of campus crisis clinics. Findings: Many campuses had lax enforcement, and reporting loopholes mean problems go unchecked.
  67. Source: EPA and state data on hazardous chemical locations. Findings: Dallas County has 900+ sites that store hazardous chemicals.
  68. Source: Dam inspection data from Texas and federal government. Findings: Dam records had not been updated to account for population growth.
  69. Source: 311 calls for downed trees. Findings: After a tornado swept across New York City, 311 calls for downed trees help trace its path.
  70. Source: City Budget. Findings: Some neighborhoods suffer more than others as mayor cuts budgets.
  71. Disparities in water usage: “Water use highest in poor areas of the city.” Mapping and statistical analysis.
  72. Presenting the data: Include a methodology explaining what you did and what you don’t know. For really complicated analyses, consider a super nerdy white paper explaining all of your findings. If you make data downloadable, include field descriptions and anything users should watch for.
  73. For more information: www.ire.org, www.propublica.org, jennifer.lafleur@propublica.org
