Ona 2012
 

Collect, clean, manipulate data



Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

    Ona 2012 Presentation Transcript

    • Data: Collect, Clean and Manipulate. ONA 2012, San Francisco. Jennifer LaFleur, ProPublica. jennifer.lafleur@propublica.org, @j_la28
    • Why data?
      • It takes you beyond the anecdote
      • It’s easier than counting sheets of paper
    • Why data?
      • Contrasts are in the data
    • Caution: This slide contains extreme nerdiness
    • Why Computer-Assisted Reporting?
      • Contrasts are in the data
      • Your most powerful figures are in the data
    • Source: California Health Dept. data, Medicare billing data. Findings: Some hospitals had “alarming rates of a Third World nutritional disorder among its Medicare patients.”
    • Why data?
      • Contrasts are in the data
      • Your most powerful figures are in the data
      • You can make connections you might not be able to make otherwise
    • Data: Youth prison workers, criminal convictions and grievance data. Findings: Employees with criminal backgrounds were more likely to be accused of abusing inmates.
    • Data: Federal bridge inspections and stimulus funding. Findings: Some of the nation’s worst bridges did not get stimulus funds.
    • Why data?
      • Contrasts are in the data
      • Your most powerful figures are in the data
      • You can make connections you might not be able to make otherwise
      • You can test assumptions
    • Source: NHTSA complaint data. Findings: “…unintended acceleration has been a problem across the auto industry.”
    • HT/Florencia Coelho
    • Collecting the data
    • Where’s the data? If something is inspected, licensed, enforced or purchased… there probably is a database.
    • Where’s the data? If there is a report or a form, there probably is a database.
    • Where’s the data? Sometimes data is readily available online for download.
    • Source: Census. Findings: “Fueled by the dismal economy and high unemployment, more Americans… are doubling up”
    • Source: Medicaid nursing home survey data and finance data, housing data. Findings: “…a shortage of places for the disabled to live outside a nursing home and regulations that critics say make it hard to qualify for home services mean many who want out continue to receive expensive nursing care.”
    • Where’s the data? Sometimes you have to scrape it. That usually involves programs that automate searching tasks on Web sites.
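As a rough illustration of what "scraping" means in practice, here is a minimal Python sketch that pulls the rows out of an HTML table using only the standard library. The page content, the facility names and the `TableScraper` class are hypothetical and not from the talk; a real scraper would first download each page (for example with `urllib.request.urlopen`) and would have to cope with messier markup.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page content standing in for a downloaded search result.
SAMPLE_PAGE = """
<table>
  <tr><th>Facility</th><th>Inspections</th></tr>
  <tr><td>Hypothetical Clinic A</td><td>12</td></tr>
  <tr><td>Hypothetical Clinic B</td><td>7</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_PAGE)
header, *records = scraper.rows
```

Running the same parser over every results page, one search at a time, is the "automated searching" the slide describes.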
    • Where’s the data? More often you need to go to an agency to get the data. This can be tricky if an agency doesn’t want to release it. (Stay tuned for more on that…)
    • Source: School district credit card purchases. Findings: District cardholders made questionable purchases with their cards.
    • Sometimes, there is no data. But it’s okay because there are techniques for sampling and building a database.
    • ProPublica pulled a random sample of 500 names from a list of individuals who had been granted or denied pardons (around 2,000). We created a database from months of researching individuals: their crime, age, sentence… We found that even after controlling for other factors, whites were more likely to get a pardon.
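Drawing a random sample like the one described can be sketched in a few lines of Python. This is an illustration only, not ProPublica's actual procedure: the list of names is a placeholder, the population is assumed to be exactly 2,000, and the seed value is arbitrary.

```python
import random

# Placeholder population standing in for the list of roughly 2,000 names.
population = [f"person_{i:04d}" for i in range(2000)]

# A fixed seed lets a colleague reproduce exactly the same sample later.
rng = random.Random(2012)

# random.sample draws without replacement, so no name is picked twice.
sample = rng.sample(population, k=500)
```

Recording the seed and the source list in a notes file makes the sample auditable if the methodology is questioned.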
    • Source: Loan details, foreclosure information and bankruptcy filings. Findings: Loans leading to foreclosure didn’t always follow conventional wisdom.
    • When you have to ask for the data
      • Before filing a request: Ask for it
      • If they require a formal request, find out who it should go to and what you should ask for
      • Letter should describe what you’re asking for
        • Note that you’re willing to negotiate
        • Ask for a cost estimate
    • Dear Records Administrator:
      I’m writing to request under the Texas Public Information Act an electronic copy of the current health-related services registry database for the state of Texas. I also am requesting electronic copies or a database of all complaints filed against health-related service registry members since Jan. 1, 2000.
      I frequently deal with large raw databases, so I would be able to accept information in several formats including ASCII, dbf, xls, etc… and can accept the data on a variety of media (computer tape, CD-ROM, FTP, email attachment, etc...). Please include record layouts, code sheets or any other documentation necessary to interpret the data.
      I am requesting all data fields. If there are any fields that you must withhold by law, please let me know what those fields are, so I can amend my request.
      In the interest of expediency, and to minimize the research and/or duplication burden on your staff, I would be happy to speak with your database administrator to figure out a method that is easiest for you.
      If you have questions or need more information, please contact me by telephone or email. My telephone number is: 214-977-8509. My email address is jlafleur@dallasnews.com.
      If you will be charging processing fees, please send me an itemized estimate explaining how the costs were calculated.
    • Getting electronic information
      • Know the law.
      • Know how your state treats (or doesn’t) the records you need.
      • Know what information you want.
      • Do your homework
      • Know what the appropriate cost should be.
      • Know who does the data entry.
      • Get to know Leon
      • When something may not clearly be public use your sourcing
    • Just another way of saying no
      • Huge costs
      • Delay tactics
      • “Oh you silly little journalist”
      • Sending you the wrong thing
      • “Your request was unclear”
      • HIPAA
      • Privacy
      • Privatization
    • Negotiating: Some examples
    • Our database is on a mainframe and it’s very complicated, Missy
    • We don’t have the authority to do that
    • That will cost $25,000.
    • We have processed your request. The labor cost for the request is as follows.
      Item             # of hours
      RESEARCH         20
      CREATING FILES   6
      CODING           24
      TESTING          4
      Total (54 X $72) = $3,888.00
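An itemized estimate like this one is worth re-adding before negotiating. A small Python sketch, using only the figures on the slide (the variable names are illustrative):

```python
# Hours per line item, copied from the agency's estimate.
hours = {"RESEARCH": 20, "CREATING FILES": 6, "CODING": 24, "TESTING": 4}
HOURLY_RATE = 72  # dollars per hour, as quoted

total_hours = sum(hours.values())
total_cost = total_hours * HOURLY_RATE

# Share of the bill attributed to each item, useful when negotiating.
share_by_item = {item: h / total_hours for item, h in hours.items()}
```

Here the arithmetic holds (54 hours at $72 is $3,888), so any pushback has to target the hours claimed per item rather than the sum.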
    • From Texas Public Information Act: 111.67. Estimates and Waivers of Public Information Charges. (a) A governmental body is required to provide a requestor with an itemized statement of estimated charges if charges for copies of public information will exceed $40, or if a charge in accordance with §111.65 of this title (relating to Access to Information Where Copies Are Not Requested) will exceed $40 for making public information available for inspection. A governmental body that fails to provide the required statement may not collect more than $40. The itemized statement must be provided free of charge and must contain the following information:
    • We only keep the information for 7 days
    • Check retention schedules
    • That uses proprietary software.
    • We don’t keep that on computer
    • Okay, we do, but it’s a lot of files
    • That information is protected by law
    • Cleaning data
    • Remember that data are not perfect
    • It doesn’t mean you can’t use it…
      • Do integrity checks to find the flaws
      • Add caveats where necessary
      • Do your own analysis rather than relying on an agency’s analysis of bad data
    • Integrity checks for every data set
      • Read the documentation.
      • Understand the contents of every field.
      • Know how many records you should have.
      • Check counts and totals against reports.
      • Are all possibilities included? All states, all counties, correct ranges?
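The count, coverage and range checks above can be sketched in Python as follows. Everything here is hypothetical: the rows, the expected record count, the list of state codes and the allowed range would in practice come from the agency's documentation and summary reports.

```python
# Values that would come from the documentation or a published report.
EXPECTED_RECORDS = 4
EXPECTED_STATES = {"TX", "CA", "NY"}   # stand-in for the full list of codes
AMOUNT_RANGE = (0, 1_000_000)          # assumed plausible range

rows = [
    {"state": "TX", "amount": 100},
    {"state": "CA", "amount": 250},
    {"state": "CA", "amount": -5},     # out of range
    {"state": "ZZ", "amount": 40},     # not a valid code
]

# Do we have as many records as the agency says we should?
count_matches = len(rows) == EXPECTED_RECORDS

# Are all possibilities included, and nothing unexpected?
seen_states = {row["state"] for row in rows}
missing_states = EXPECTED_STATES - seen_states
unexpected_states = seen_states - EXPECTED_STATES

# Are the values within the documented range?
low, high = AMOUNT_RANGE
out_of_range = [row for row in rows if not low <= row["amount"] <= high]
```

Each non-empty result is a question to take back to the agency, not necessarily an error.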
    • Integrity checks for every data set
      • Internal data checks: Is there more money going to sub-contractors than went to the prime contractor?
      • Are there more teachers than students?
      • Do people have birth dates in the future or so long ago they would be long gone?
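Internal checks like these translate directly into small test functions. A minimal Python sketch, where the field names and the 115-year age cutoff are assumptions chosen for illustration:

```python
from datetime import date

MAX_PLAUSIBLE_AGE = 115  # assumed cutoff; pick one that suits the data


def check_school(row):
    """Return a warning if a school reports more teachers than students."""
    if row["teachers"] > row["students"]:
        return "more teachers than students"
    return None


def check_birth_date(born, today):
    """Return a warning for birth dates in the future or implausibly old."""
    if born > today:
        return "birth date in the future"
    if today.year - born.year > MAX_PLAUSIBLE_AGE:
        return "implausibly old"
    return None


as_of = date(2012, 9, 20)  # date the data was pulled (hypothetical)
warnings = [
    check_school({"teachers": 40, "students": 12}),
    check_birth_date(date(2050, 1, 1), as_of),
    check_birth_date(date(1850, 6, 1), as_of),
    check_birth_date(date(1980, 6, 1), as_of),
]
```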
    • If your data is in Excel, use the filter function to see what the values are in individual fields.
    • Integrity checks for every data set
      • Check for missing data, misplaced data or blank fields
      • Use a standard naming convention for files and tables (I wouldn’t recommend “final”)
      • Check for duplicates
      • Take margins of error into account if necessary (important if you’re using Census data).
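For data outside Excel, the blank-field and duplicate checks can be sketched in Python with the standard library. The rows and field names below are invented for illustration.

```python
from collections import Counter

rows = [
    {"id": "1", "name": "Ann Lee", "dob": "1970-01-05", "county": "Dallas"},
    {"id": "2", "name": "Bob Ray", "dob": "", "county": "Tarrant"},
    {"id": "3", "name": "Ann Lee", "dob": "1970-01-05", "county": "Dallas"},
    {"id": "4", "name": "Cy Fox", "dob": "1988-11-30", "county": " "},
]


def is_blank(value):
    """Treat None, empty strings and whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


# How many blanks does each field have?
blank_counts = Counter(
    field for row in rows for field, value in row.items() if is_blank(value)
)

# Records that repeat on everything except the id are duplicate candidates.
key_counts = Counter((row["name"], row["dob"], row["county"]) for row in rows)
duplicate_keys = [key for key, n in key_counts.items() if n > 1]
```

Which fields make up the duplicate key is a judgment call that depends on the data set.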
    • 2010 Census ACS: Median HH Income by Metro Area
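When comparing two ACS estimates such as metro-area median incomes, a common check is whether the difference exceeds what the margins of error allow. The sketch below assumes the published margins are at the 90 percent confidence level (the Census Bureau's default for ACS tables) and uses made-up income figures.

```python
import math

Z_90 = 1.645  # z-score for a 90 percent confidence level


def differs_significantly(est_a, moe_a, est_b, moe_b):
    """True if two ACS estimates differ at the 90 percent level."""
    # Convert each margin of error back to a standard error.
    se_a = moe_a / Z_90
    se_b = moe_b / Z_90
    z = abs(est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    return z > Z_90


# Hypothetical median household incomes and margins of error.
close_call = differs_significantly(52_000, 1_500, 51_000, 1_800)
clear_gap = differs_significantly(52_000, 1_500, 47_000, 1_800)
```

In the first comparison the $1,000 gap is within the noise, so ranking one metro area above the other would be misleading.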
    • Be creative when you look for duplicates
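One way to get creative is to normalize names before comparing them, so that differences in case, punctuation and word order do not hide a duplicate. The rules below (dropping a few business suffixes, sorting the words) are assumptions for illustration, and the names are invented. Matches are candidates to review by hand, not confirmed duplicates.

```python
import re
from collections import defaultdict

# Assumed list of suffixes to ignore; extend it for your own data.
IGNORED_TOKENS = {"inc", "llc", "co", "corp"}


def normalize(name):
    """Lowercase, strip punctuation, drop suffixes and sort the words."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in IGNORED_TOKENS]
    return " ".join(sorted(tokens))


names = [
    "Smith, John",
    "John  SMITH",
    "Acme Trucking, Inc.",
    "ACME TRUCKING",
    "Jane Doe",
]

groups = defaultdict(list)
for name in names:
    groups[normalize(name)].append(name)

# Keep only the keys that more than one original spelling maps to.
candidates = {key: found for key, found in groups.items() if len(found) > 1}
```

This simple version discards accented letters, so names with non-ASCII characters would need a gentler cleanup step.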
    • Beyond the basics
      • Keep a notes file
      • Don’t work off your original database
      • Know the source
      • Check against summary reports
      • Use the right tool
      • Check for outliers when it comes to ups and downs
    • Truck accidents by year and agency
    • Beyond the basics
      • Check with experts
      • Are there standards? (ex: a drop by more than 10 percentage points is a red flag)
      • Find out what others have done
      • Gut check
      • Go physically see a record or spot check against documents
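A standard such as "a drop of more than 10 percentage points is a red flag" is easy to apply across a whole series. A minimal Python sketch with invented yearly rates; the threshold is a parameter because the right standard depends on the field.

```python
def flag_drops(rates_by_year, threshold=10.0):
    """Return (previous_year, year, change) for every drop beyond threshold.

    rates_by_year maps a year to a rate in percent; change is in
    percentage points.
    """
    years = sorted(rates_by_year)
    flags = []
    for prev, curr in zip(years, years[1:]):
        change = rates_by_year[curr] - rates_by_year[prev]
        if change < -threshold:
            flags.append((prev, curr, change))
    return flags


# Hypothetical pass rates, in percent.
pass_rates = {2008: 82.0, 2009: 80.5, 2010: 65.0, 2011: 66.0}
red_flags = flag_drops(pass_rates)
```

A flagged year may be a real story or a data problem, such as a changed definition or a partial year of records.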
    • Voter Fraud: Dozens of St. Louis voters are being wrongly accused of casting ballots from fraudulent addresses in last year’s Nov. 7 election. They are among thousands of registered voters who, based on city property records, appear to live on vacant lots.
    • Texas test score data: official results versus district. Duncanville district reported 4th grade writing. Official report for Duncanville 4th grade writing. Courtesy Holly Hacker, The Dallas Morning News
    • Three rounds of analysis after bouncing off subjects and experts
      • Demographically based
      • Voir dire
      • Socioeconomics
    • Checks when you’re matching data
      • A name is not enough. Lots of people have the same name
      • Get dates of birth and other information to make sure you have the correct person.
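The difference between a name-only match and a name-plus-birth-date match can be shown with a short Python sketch. The two lists and their field names are invented for illustration.

```python
residents = [
    {"name": "John Smith", "dob": "1948-02-11"},
    {"name": "Mary Jones", "dob": "1931-07-04"},
]
offenders = [
    {"name": "JOHN SMITH", "dob": "1975-09-23"},   # same name, other person
    {"name": "Mary Jones ", "dob": "1931-07-04"},
]


def name_key(person):
    """Lowercase the name and collapse stray whitespace."""
    return " ".join(person["name"].lower().split())


def strict_key(person):
    """Name plus date of birth."""
    return (name_key(person), person["dob"])


offender_names = {name_key(p) for p in offenders}
offender_strict = {strict_key(p) for p in offenders}

name_only_hits = [p["name"] for p in residents if name_key(p) in offender_names]
confirmed_hits = [p["name"] for p in residents if strict_key(p) in offender_strict]
```

Even a name and birth-date match is a lead to verify against documents, not proof of identity.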
    • Source: Illinois health data, police data. Findings: Dangerous systemic failures left elderly patients unprotected in Illinois nursing homes that also house mentally ill younger residents, including murderers, sex offenders, and armed robbers.
    • Even people with seemingly unique names aren’t so unique
    • Evaluating outside studies
      • Get the questionnaire and methodology
      • Beware of nonscientific methods: Web surveys, man on the street
      • Know the sample size and sampling error
      • Account for margin of error and non-response when drawing conclusions
      • Run statistical tests on the data if possible
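If a study reports its sample size but not its sampling error, a rough margin of error can be estimated. The sketch below uses the textbook formula for a proportion at the 95 percent confidence level and assumes a simple random sample; real surveys with weighting or clustering will have larger margins.

```python
import math

Z_95 = 1.96  # z-score for a 95 percent confidence level


def margin_of_error(proportion, sample_size, z=Z_95):
    """Margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# A poll of 400 people with a 50/50 split (the worst case for error).
moe = margin_of_error(0.5, 400)
moe_points = round(moe * 100, 1)  # in percentage points
```

A poll of 400 therefore carries a margin of about 5 percentage points, so a 52 to 48 split is within the noise.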
    • Reporting data
      • Consider reporting rates not raw numbers
      • Avoid false precision: 53.14 percent said … in a poll with a 5 percentage point margin of error
      • Avoid number overload. About half is usually just as useful as 51 percent in most cases
      • Adjust money for inflation
      • When analyzing income, use median rather than average (Bill Gates factor)
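Three of these reporting rules are simple calculations. The Python sketch below uses invented numbers throughout, including the price index values; a real inflation adjustment would use a published index such as the CPI.

```python
from statistics import mean, median


def rate_per(count, population, per=100_000):
    """Convert a raw count into a rate per `per` people."""
    return count * per / population


def adjust_for_inflation(amount, index_then, index_now):
    """Restate an old dollar amount in current dollars."""
    return amount * index_now / index_then


# Rates, not raw numbers: the smaller town has the higher rate.
big_city_rate = rate_per(500, 1_000_000)
small_town_rate = rate_per(50, 10_000)

# Hypothetical index values for the two years being compared.
todays_dollars = adjust_for_inflation(1_000, index_then=150.0, index_now=225.0)

# The "Bill Gates factor": one extreme income drags the average upward.
incomes = [30_000, 40_000, 50_000, 60_000, 1_000_000_000]
average_income = mean(incomes)
median_income = median(incomes)
```

With one billionaire among five earners, the average is about $200 million while the median stays at $50,000.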
    • When the data is the problem, you might still have a story. Erroneous government databases can often be a story themselves.
    • Manipulating data for stories and apps
    • Know which tool to use
      • Reporting individual records
      • Counting/summing
      • Mapping
      • Statistics
    • Source: Medicaid outcomes data for dialysis facilities. Findings: A CMS online tool did not tell the whole story about facilities. In some counties the gaps in measures, such as survival rate, were vast.
    • Source: Washington Health Department data. Findings: “MRSA has been quietly killing in hospitals for decades.” But no one had tracked it until this story.
    • Source: Dept. of Ed data and surveys of campus crisis clinics. Findings: Many campuses had lax enforcement, and reporting loopholes mean problems go unchecked.
    • Source: EPA and state data on hazardous chemical locations. Findings: Dallas County has 900+ sites that store hazardous chemicals.
    • Source: Dam inspection data from Texas and federal government. Findings: Dam records had not been updated to account for population growth.
    • Source: 311 calls for downed trees. Findings: After a tornado swept across New York City, 311 calls for downed trees help trace its path.
    • Source: City Budget. Findings: Some neighborhoods suffer more than others as mayor cuts budgets.
    • Disparities in water usage: “Water use highest in poor areas of the city.” Mapping and statistical analysis
    • Presenting the data
      • Include a methodology explaining what you did and what you don’t know.
      • For really complicated analyses, consider a super nerdy white paper explaining all of your findings
      • If you make data downloadable, include field descriptions and anything users should watch for
    • For more information: www.ire.org, www.propublica.org, jennifer.lafleur@propublica.org