Source: Medicaid nursing home survey data and finance data, housing data
Findings: “…a shortage of places for the disabled to live outside a nursing home and regulations that critics say make it hard to qualify for home services mean many who want out continue to receive expensive nursing care.”
Where’s the data? Sometimes you have to scrape it. That usually involves programs that automate searching tasks on Web sites.
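As a rough illustration of what a scraper does once it has a page, here is a minimal sketch that pulls the rows out of an HTML table using only Python's standard library. The page content, facility names and numbers are made up for the example; a real scraper would first fetch the page (for instance with `urllib.request.urlopen`) and would need to respect the site's terms of use.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


# Hypothetical page content standing in for a downloaded search result.
SAMPLE_PAGE = """<table>
<tr><th>Facility</th><th>Beds</th></tr>
<tr><td>Example Home A</td><td>120</td></tr>
<tr><td>Example Home B</td><td>85</td></tr>
</table>"""

parser = TableRows()
parser.feed(SAMPLE_PAGE)
rows = parser.rows
for row in rows:
    print(row)
```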
Where’s the data? More often you need to go to an agency to get the data. This can be tricky if an agency doesn’t want to release it. (Stay tuned for more on that…)
Source: School district credit card purchases
Findings: District cardholders made questionable purchases with their cards.
Sometimes, there is no data. But it’s okay because there are techniques for sampling and building a database.
ProPublica pulled a random sample of 500 names from a list of individuals who had been granted or denied pardons (around 2,000). We created a database from months of researching individuals: their crime, age, sentence… We found that even after controlling for other factors, whites were more likely to get a pardon.
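Drawing a sample like this is easy to reproduce. The sketch below is only an illustration of simple random sampling without replacement, not ProPublica's actual procedure; the placeholder names, the seed and the function name `draw_sample` are assumptions made for the example.

```python
import random


def draw_sample(names, k, seed):
    """Return k distinct names chosen uniformly at random, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the draw can be re-run and audited
    return rng.sample(names, k)


# Hypothetical stand-in for the roughly 2,000 names on the list.
population = [f"applicant_{i:04d}" for i in range(2000)]
sample = draw_sample(population, k=500, seed=42)
print(len(sample), "names drawn from", len(population))
```

Recording the seed alongside the sample lets someone else regenerate the same 500 names later.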
Source: Loan details, foreclosure information and bankruptcy filings
Findings: Loans leading to foreclosure didn’t always follow conventional wisdom.
When you have to ask for the data
- Before filing a request: ask for it.
- If they require a formal request, find out who it should go to and what you should ask for.
- The letter should describe what you’re asking for.
- Note that you’re willing to negotiate.
- Ask for a cost estimate.
Dear Records Administrator:
I’m writing to request under the Texas Public Information Act an electronic copy of the current health-related services registry database for the state of Texas. I also am requesting electronic copies or a database of all complaints filed against health-related service registry members since Jan. 1, 2000.
I frequently deal with large raw databases, so I would be able to accept information in several formats including ASCII, dbf, xls, etc… and can accept the data on a variety of media (computer tape, CD-ROM, FTP, email attachment, etc...). Please include record layouts, code sheets or any other documentation necessary to interpret the data.
I am requesting all data fields. If there are any fields that you must withhold by law, please let me know what those fields are, so I can amend my request.
In the interest of expediency, and to minimize the research and/or duplication burden on your staff, I would be happy to speak with your database administrator to figure out a method that is easiest for you.
If you have questions or need more information, please contact me by telephone or email. My telephone number is: 214-977-8509. My email address is email@example.com.
If you will be charging processing fees, please send me an itemized estimate explaining how the costs were calculated.
Getting electronic information
- Know the law.
- Know how your state treats (or doesn’t) the records you need.
- Know what information you want.
- Do your homework.
- Know what the appropriate cost should be.
- Know who does the data entry.
- Get to know Leon.
- When something may not clearly be public, use your sourcing.
Just another way of saying no
- Huge costs
- Delay tactics
- “Oh you silly little journalist”
- Sending you the wrong thing
- “Your request was unclear”
- HIPAA
- Privacy
- Privatization
We have processed your request. The labor cost for the request is as follows.
Item              # of hours
RESEARCH          20
CREATING FILES    6
CODING            24
TESTING           4
Total (54 X $72) = $3,888.00
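It is worth re-adding an estimate like this before negotiating. A minimal sketch using the hours and the $72 hourly rate quoted in the agency's reply (the variable names are just for illustration):

```python
# Hours per line item, as itemized in the agency's estimate.
hours = {"RESEARCH": 20, "CREATING FILES": 6, "CODING": 24, "TESTING": 4}
HOURLY_RATE = 72  # dollars per hour, from the estimate

total_hours = sum(hours.values())
total_cost = total_hours * HOURLY_RATE

# Show what each line item contributes, largest first, to see where to push back.
for item, h in sorted(hours.items(), key=lambda pair: pair[1], reverse=True):
    print(f"{item:<15} {h:>3} hours  ${h * HOURLY_RATE:>8,.2f}")
print(f"Total: {total_hours} hours x ${HOURLY_RATE} = ${total_cost:,.2f}")
```

Breaking the total down by item makes it easier to ask why, say, research and coding account for most of the bill.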
From Texas Public Information Act:
111.67. Estimates and Waivers of Public Information Charges
(a) A governmental body is required to provide a requestor with an itemized statement of estimated charges if charges for copies of public information will exceed $40, or if a charge in accordance with §111.65 of this title (relating to Access to Information Where Copies Are Not Requested) will exceed $40 for making public information available for inspection. A governmental body that fails to provide the required statement may not collect more than $40. The itemized statement must be provided free of charge and must contain the following information:
It doesn’t mean you can’t use it…
- Do integrity checks to find the flaws.
- Add caveats where necessary.
- Do your own analysis rather than relying on an agency’s analysis of bad data.
Integrity checks for every data set
- Read the documentation.
- Understand the contents of every field.
- Know how many records you should have.
- Check counts and totals against reports.
- Are all possibilities included? All states, all counties, correct ranges?
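The count and coverage checks above can be scripted so they run every time a new file arrives. This is a minimal sketch on made-up records; the field name `county`, the expected values and the expected record count are all hypothetical and would come from the agency's documentation or published reports.

```python
def coverage_report(records, field, expected_values, expected_count):
    """Compare a data set against what the documentation says it should hold."""
    seen = {row[field] for row in records}
    expected = set(expected_values)
    return {
        # Negative means records are missing, positive means there are extras.
        "record_count_diff": len(records) - expected_count,
        "missing_values": sorted(expected - seen),
        "unexpected_values": sorted(seen - expected),
    }


# Made-up example rows; note the misspelled county in the third row.
records = [
    {"county": "Dallas", "amount": 100},
    {"county": "Tarrant", "amount": 250},
    {"county": "Dalas", "amount": 75},
]
report = coverage_report(
    records, "county", ["Dallas", "Tarrant", "Collin"], expected_count=4
)
print(report)
```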
Integrity checks for every data set
Internal data checks:
- Is there more money going to sub-contractors than went to the prime contractor?
- Are there more teachers than students?
- Do people have birth dates in the future, or so long ago they would be long gone?
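Internal checks like these are simple comparisons between fields. The sketch below shows two of them on invented rows; the field names, the 110-year age cutoff and the reference date are assumptions chosen for illustration.

```python
from datetime import date


def flag_birth_dates(people, today, max_age_years=110):
    """Return (id, reason) pairs for birth dates that cannot be right."""
    flags = []
    for person in people:
        dob = person["dob"]
        if dob > today:
            flags.append((person["id"], "born in the future"))
        elif today.year - dob.year > max_age_years:
            flags.append((person["id"], "implausibly old"))
    return flags


def flag_sub_over_prime(contracts):
    """Return ids of contracts where sub-contractors got more than the prime."""
    return [c["id"] for c in contracts if c["sub_total"] > c["prime_total"]]


# Invented rows for illustration only.
people = [
    {"id": 1, "dob": date(1950, 5, 1)},
    {"id": 2, "dob": date(2030, 1, 1)},
    {"id": 3, "dob": date(1880, 3, 3)},
]
contracts = [
    {"id": "A", "prime_total": 1000, "sub_total": 400},
    {"id": "B", "prime_total": 500, "sub_total": 900},
]

date_flags = flag_birth_dates(people, today=date(2012, 1, 1))
contract_flags = flag_sub_over_prime(contracts)
print(date_flags)
print(contract_flags)
```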
If your data is in Excel, use the filter function to see what the values are in individual fields.
Integrity checks for every data set
- Check for missing data, misplaced data or blank fields.
- Use a standard naming convention for files and tables (I wouldn’t recommend “final”).
- Check for duplicates.
- Take margins of error into account if necessary (important if you’re using Census data).
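Duplicate and blank-field checks are quick to automate. A minimal sketch on made-up rows, assuming `case_id` is the field that should be unique (that choice, like the data, is hypothetical):

```python
from collections import Counter


def find_duplicates(records, key_fields):
    """Return the key values that appear on more than one record."""
    keys = Counter(tuple(r[f] for f in key_fields) for r in records)
    return sorted(k for k, n in keys.items() if n > 1)


def count_blanks(records):
    """Count, per field, how many records have a missing or empty value."""
    blanks = Counter()
    for r in records:
        for field, value in r.items():
            if value is None or str(value).strip() == "":
                blanks[field] += 1
    return dict(blanks)


# Made-up rows: one repeated case_id, one blank name, one missing zip.
rows = [
    {"case_id": "001", "name": "A. Smith", "zip": "75201"},
    {"case_id": "002", "name": "", "zip": "75202"},
    {"case_id": "001", "name": "A. Smith", "zip": None},
]
duplicates = find_duplicates(rows, ["case_id"])
blanks = count_blanks(rows)
print(duplicates)
print(blanks)
```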
2010 Census ACS: Median HH Income by Metro Area
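ACS estimates such as these medians are published with margins of error at the 90 percent confidence level, so two metro areas that look different may not be. A common approximation treats the margin of error of a difference as the square root of the sum of the squared margins. The sketch below applies that rule; the income figures and margins are invented for illustration and are not real ACS values.

```python
import math


def difference_is_significant(est_a, moe_a, est_b, moe_b):
    """True if two estimates differ by more than the combined margin of error.

    Uses the approximation MOE(a - b) = sqrt(MOE_a^2 + MOE_b^2), which assumes
    the two estimates are independent.
    """
    moe_diff = math.sqrt(moe_a ** 2 + moe_b ** 2)
    return abs(est_a - est_b) > moe_diff


# Invented numbers: a 1,500 gap is inside the noise, a 5,500 gap is not.
close_call = difference_is_significant(52_000, 1_500, 50_500, 1_800)
clear_gap = difference_is_significant(56_000, 1_500, 50_500, 1_800)
print(close_call, clear_gap)
```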
Beyond the basics
- Check with experts.
- Are there standards? (ex: a drop of more than 10 percentage points is a red flag)
- Find out what others have done.
- Gut check.
- Go physically see a record or spot check against documents.
Voter Fraud
Dozens of St. Louis voters are being wrongly accused of casting ballots from fraudulent addresses in last year’s Nov. 7 election.
They are among thousands of registered voters who, based on city property records, appear to live on vacant lots.
Texas test score data: official results versus district
[Figure panels: Duncanville district-reported 4th grade writing; official report for Duncanville 4th grade writing]
Courtesy Holly Hacker, The Dallas Morning News
Three rounds of analysis after bouncing results off subjects and experts
- Demographically based
- Voir dire
- Socioeconomics
Checks when you’re matching data
- A name is not enough. Lots of people have the same name.
- Get dates of birth and other information to make sure you have the correct person.
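One way to put this into practice is to separate matches confirmed by a second identifier from matches on name alone, and treat the second group as leads to verify rather than findings. A minimal sketch with invented records; the field names and the light name normalization are assumptions, and real matching usually needs more identifiers than these.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivial differences don't block a match."""
    return " ".join(name.lower().split())


def match_records(left, right):
    """Split name matches into those confirmed by date of birth and those not."""
    by_name = {}
    for rec in right:
        by_name.setdefault(normalize(rec["name"]), []).append(rec)

    confirmed, name_only = [], []
    for a in left:
        for b in by_name.get(normalize(a["name"]), []):
            pair = (a["id"], b["id"])
            if a["dob"] == b["dob"]:
                confirmed.append(pair)
            else:
                name_only.append(pair)  # same name, different person until proven otherwise
    return confirmed, name_only


# Invented records for illustration.
residents = [
    {"id": "R1", "name": "John Smith", "dob": "1940-02-11"},
    {"id": "R2", "name": "Mary Jones", "dob": "1975-07-30"},
]
offenders = [
    {"id": "P1", "name": "JOHN  SMITH", "dob": "1981-09-03"},
    {"id": "P2", "name": "Mary Jones", "dob": "1975-07-30"},
]
confirmed, name_only = match_records(residents, offenders)
print("confirmed:", confirmed)
print("name only:", name_only)
```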
Source: Illinois health data, police data
Findings: A dangerous system failed to protect elderly patients in Illinois nursing homes that also house mentally ill younger residents, including murderers, sex offenders, and armed robbers.
Even people with seemingly unique names aren’t so unique
Evaluating outside studies
- Get the questionnaire and methodology.
- Beware of nonscientific methods: Web surveys, man on the street.
- Know the sample size and the sampling error.
- Account for margin of error and non-response when drawing conclusions.
- Run statistical tests on the data if possible.
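Sample size and sampling error are directly linked. As a rough guide, the sketch below uses the standard 95 percent margin-of-error formula for a proportion, z * sqrt(p(1 - p) / n), which assumes a simple random sample; real polls with weighting or clustering will have larger margins. The sample sizes are arbitrary examples.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    proportion=0.5 is the worst case; z=1.96 corresponds to 95 percent confidence.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


for n in (100, 400, 1000):
    points = 100 * margin_of_error(n)
    print(f"n = {n:>4}: about +/- {points:.1f} percentage points")
```

A sample of 400 gives a margin of roughly 5 percentage points, and quadrupling the sample only halves the margin.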
Reporting data
- Consider reporting rates, not raw numbers.
- Avoid false precision: 53.14 percent said … in a poll with a 5 percentage point margin of error.
- Avoid number overload. “About half” is usually just as useful as “51 percent.”
- Adjust money for inflation.
- When analyzing income, use median rather than average (Bill Gates factor).
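Three of these rules come down to short calculations. The sketch below shows a rate per 1,000 residents, an inflation adjustment using a price index ratio, and the gap between median and average when one huge income is in the mix. All figures, including the index values, are invented for illustration; real adjustments should use published CPI values.

```python
from statistics import mean, median


def rate_per(count, population, per=1_000):
    """Events per `per` people, so places of different sizes can be compared."""
    return count * per / population


def adjust_for_inflation(amount, index_then, index_now):
    """Convert an old dollar amount into current dollars using a price index."""
    return amount * index_now / index_then


# Rates: the smaller town has fewer incidents but a higher rate.
small_town = rate_per(50, 25_000)
big_city = rate_per(200, 400_000)

# Inflation: hypothetical index values of 100 then and 125 now.
in_todays_dollars = adjust_for_inflation(1_000, 100.0, 125.0)

# Bill Gates factor: one billionaire drags the average far from the typical person.
incomes = [30_000, 40_000, 50_000, 60_000, 1_000_000_000]
typical = median(incomes)
average = mean(incomes)

print(small_town, big_city, in_todays_dollars)
print(f"median ${typical:,.0f} vs. average ${average:,.0f}")
```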
When the data is the problem – you might still have a story
Erroneous government databases – can often be a story themselves
Source: 311 calls for downed trees
Findings: After a tornado swept across New York City, 311 calls for downed trees help trace its path.
Source: City Budget
Findings: Some neighborhoods suffer more than others as mayor cuts budgets.
Disparities in water usage
- “Water use highest in poor areas of the city”
- Mapping and statistical analysis
Presenting the data
- Include a methodology explaining what you did and what you don’t know.
- For really complicated analyses, consider a super nerdy white paper explaining all of your findings.
- If you make data downloadable, include field descriptions and anything users should watch for.
For more firstname.lastname@example.org