Data: Collect, Clean and Manipulate
               ONA 2012
              San Francisco

       Jennifer LaFleur, ProPublica
        jennifer.lafleur@propublica.org
                     @j_la28
Why data?

 It takes you beyond the anecdote
 It’s easier than counting sheets of paper
Why data?

 Contrasts are in the data
Caution: This slide contains extreme nerdiness
Why Computer-Assisted Reporting?

 Contrasts are in the data
 Your most powerful figures are in the data
Source: California
Health Dept.
data, Medicare billing
data

Findings: Some
hospitals had
“alarming rates of a
Third World nutritional
disorder among its
Medicare patients.”
Why data?

 Contrasts are in the data
 Your most powerful figures are in the data
 You can make connections you might not be
 able to make otherwise
Data: Youth
prison workers,
criminal
convictions and
grievance data

Findings:
Employees with
criminal
backgrounds
were more likely
to be accused of
abusing inmates.
Data: Federal
bridge
inspections and
stimulus funding.

Findings: Some
of the nation’s
worst bridges did
not get stimulus
funds.
Why data?

 Contrasts are in the data
 Your most powerful figures are in the data
 You can make connections you might not be
 able to make otherwise
 You can test assumptions
Source: NHTSA
complaint data

Findings:
“…unintended
acceleration has
been a problem
across the auto
industry.”
HT/Florencia
  Coelho
Collecting the data
Where’s the data?

  If something is inspected
  Licensed
  Enforced or
  Purchased
…There probably is a database
Where’s the data?

    If there is a report
    Or a form
    There probably is a database
Where’s the data?

Sometimes data is readily available
online for download
Source: Census

Findings: “Fueled by
the dismal economy
and high
unemployment, more
Americans…are
doubling up”
Source: Medicaid nursing home
survey data and finance
data, housing data

Findings: “…a shortage of places
for the disabled to live outside a
nursing home and regulations
that critics say make it hard to
qualify for home services mean
many who want out continue to
receive expensive nursing care.”
Where’s the data?

Sometimes you have to scrape it.

That usually involves programs
that automate searching tasks on
Web sites.
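
A minimal sketch of what a scraper does, assuming the agency page shows its records in a plain HTML table. The page text, facility names and the TableScraper class are all made up for illustration; a real scraper would first download the page (for example with urllib.request) and would need to respect the site's terms.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row being built, or None outside <tr>
        self._in_cell = False   # True while inside <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


# Hypothetical page content standing in for a downloaded agency page.
SAMPLE_PAGE = """
<table>
  <tr><th>Facility</th><th>Inspection date</th><th>Violations</th></tr>
  <tr><td>Hypothetical Cafe</td><td>2012-03-01</td><td>4</td></tr>
  <tr><td>Example Diner</td><td>2012-03-02</td><td>0</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_PAGE)
header, *records = scraper.rows

for record in records:
    print(dict(zip(header, record)))
```

The same loop, pointed at each results page in turn, is what turns a search form into a spreadsheet.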
Where’s the data?

    More often you need to go to an
    agency to get the data
    This can be tricky if an agency
    doesn’t want to release it. (Stay
    tuned for more on that…)
Source: School district
credit card purchases

Findings: District card
holders made
questionable
purchases with their
cards.
Sometimes, there is no data.
But it’s okay because there are
techniques for sampling and building
a database.
ProPublica pulled a random
sample of 500 names from a
list of individuals who had
been granted or denied
pardons (around 2,000). We
created a database from
months of researching
individuals: their crime, age,
sentence…

We found that even after
controlling for other factors,
whites were more likely to get
a pardon.
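
Drawing a random sample like the one described above can be sketched in a few lines. This is only an illustration: the list of names is generated, and the seed value is an arbitrary choice so the same draw can be repeated and documented. It does not show the later step of controlling for other factors.

```python
import random

# Hypothetical sampling frame: about 2,000 names on a list.
population = [f"Person {i:04d}" for i in range(1, 2001)]

# A fixed seed makes the draw reproducible for the methodology notes.
rng = random.Random(2012)

# Sample 500 names without replacement, so nobody is picked twice.
sample = rng.sample(population, 500)

print(len(sample), "names drawn, for example:", sample[:3])
```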
Source: Loan details,
foreclosure information and
bankruptcy filings

Findings: Loans leading to
foreclosure didn’t always
follow conventional wisdom
When you have to ask for the data

   Before filing a request: Ask for it
   If they require a formal request, find
   out who it should go to and what you
   should ask for
   Letter should describe what you’re
   asking for
        Note that you’re willing to
          negotiate
        Ask for a cost estimate
Dear Records Administrator:

I’m writing to request under the Texas Public Information Act an electronic copy of the current health-
related services registry database for the state of Texas. I also am requesting electronic copies or a
database of all complaints filed against health-related service registry members since Jan. 1, 2000.

I frequently deal with large raw databases, so I would be able to accept information in several formats
including ASCII, dbf, xls, etc… and can accept the data on a variety of media (computer tape, CD-
ROM, FTP, email attachment, etc...). Please include record layouts, code sheets or any other
documentation necessary to interpret the data.

I am requesting all data fields. If there are any fields that you must withhold by law, please let me
know what those fields are, so I can amend my request.

In the interest of expediency, and to minimize the research and/or duplication burden on your staff, I
would be happy to speak with your database administrator to figure out a method that is easiest for
you.

If you have questions or need more information, please contact me by telephone or email. My
telephone number is: 214-977-8509. My email address is jlafleur@dallasnews.com.

If you will be charging processing fees, please send me an itemized estimate explaining how the
costs were calculated.
Getting electronic information
   Know the law. Know how your state treats (or
   doesn’t) the records you need.
   Know what information you want.
   Do your homework
   Know what the appropriate cost should be.
   Know who does the data entry.
   Get to know Leon
   When something may not clearly be public, use
   your sourcing
Just another way of saying no


Huge costs
Delay tactics
“Oh you silly little journalist”
Sending you the wrong thing
“Your request was unclear”
HIPAA
Privacy
Privatization
Negotiating: Some examples
Our database is on a
mainframe and it’s
very complicated,
Missy
We don’t have
the authority
to do that
That will cost
$25,000.
We have processed your request. The
labor cost for the request is as
follows.

Item                 # of hours
RESEARCH                   20
CREATING FILES              6
CODING                     24
TESTING                     4
Total (54 X$72) =    $3,888.00
From Texas Public Information Act:

111.67. Estimates and Waivers of Public Information Charges
 (a) A governmental body is required to provide a requestor
with an itemized statement of estimated charges if charges for
copies of public information will exceed $40, or if a charge in
accordance with §111.65 of this title (relating to Access to
Information Where Copies Are Not Requested) will exceed
$40 for making public information available for inspection. A
governmental body that fails to provide the required
statement may not collect more than $40. The itemized
statement must be provided free of charge and must contain the
following information:
We only keep
the
information
for 7 days
Check retention schedules
That uses
proprietary
software.
We don’t keep
that on
computer
Okay, we do,
but it’s a lot
of files
That
information is
protected by
law
Cleaning data
Remember that data are not perfect
It doesn’t mean you can’t use it…

Do integrity checks to find the flaws
Add caveats where necessary
Do your own analysis rather than relying on an
agency’s analysis of bad data
Integrity checks for every data set

    Read the documentation. Understand the
    contents of every field.
    Know how many records you should have.
    Check counts and totals against reports.
    Are all possibilities included? All states, all
    counties, correct ranges?
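
A rough sketch of the count, total and coverage checks above. The data, the expected record count, the expected total and the list of states are all invented; in practice they would come from the agency's documentation or published report.

```python
import csv
import io

# Hypothetical extract standing in for the file an agency sent.
RAW = """state,county,grants
TX,Dallas,12
TX,Travis,7
CA,Alameda,9
"""

# Hypothetical figures taken from the agency's summary report.
EXPECTED_RECORDS = 4
EXPECTED_TOTAL = 30
EXPECTED_STATES = {"TX", "CA", "NY"}

rows = list(csv.DictReader(io.StringIO(RAW)))

# How many records are missing compared with the report?
record_gap = EXPECTED_RECORDS - len(rows)

# Does the column total match the published total?
total_gap = EXPECTED_TOTAL - sum(int(r["grants"]) for r in rows)

# Are all the states we expect actually present?
missing_states = EXPECTED_STATES - {r["state"] for r in rows}

print("Records missing:", record_gap)
print("Total off by:", total_gap)
print("States missing:", sorted(missing_states))
```

Any nonzero gap is a question for the agency before it is a finding.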
Integrity checks for every data set

 Internal data checks:
 Is there more money going to sub-contractors than went to
 the prime contractor?
 Are there more teachers than students?
 Do people have birth dates in the future or so long ago they
 would be long gone?
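
The internal checks above could look something like this. The records are invented, and both the cutoff of 110 years and the fixed "today" date are assumptions chosen for the example.

```python
from datetime import date

TODAY = date(2012, 9, 21)   # fixed so the example is reproducible
MAX_AGE_YEARS = 110         # assumed cutoff for "long gone"

# Hypothetical records.
schools = [
    {"school": "A", "teachers": 40, "students": 600},
    {"school": "B", "teachers": 300, "students": 30},
]
people = [
    {"name": "Pat Doe", "born": date(1975, 5, 1)},
    {"name": "Lee Roe", "born": date(2031, 1, 1)},
    {"name": "Sam Poe", "born": date(1875, 1, 1)},
]


def suspicious_birth_date(born, today=TODAY):
    """True if the date is in the future or implies an implausible age."""
    age_years = (today - born).days / 365.25
    return born > today or age_years > MAX_AGE_YEARS


# Schools reporting more teachers than students.
more_teachers = [s["school"] for s in schools if s["teachers"] > s["students"]]

# People with birth dates that cannot be right.
bad_dates = [p["name"] for p in people if suspicious_birth_date(p["born"])]

print("More teachers than students:", more_teachers)
print("Suspicious birth dates:", bad_dates)
```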
If your data is in
Excel, use the filter
function to see what
the values are in
individual fields.
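
Outside Excel, the same look at a field's values can be sketched with a frequency count. The city values are made up to show the kinds of problems this exposes: inconsistent case, stray spaces, abbreviations and blanks.

```python
from collections import Counter

# Hypothetical values from a "city" field.
cities = ["Dallas", "DALLAS", "Dallas ", "Ft. Worth", "Fort Worth", "", "Dallas"]

counts = Counter(cities)

# repr() makes trailing spaces and empty strings visible.
for value, n in sorted(counts.items()):
    print(repr(value), n)
```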
Integrity checks for every data set

    Check for missing data, misplaced data or blank
    fields
    Use a standard naming convention for files and
    tables (I wouldn’t recommend “final”)
    Check for duplicates
    Take margins of error into account if necessary
    (important if you’re using Census data).
2010 Census ACS: Median HH Income by Metro Area
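
One way to take margins of error into account when comparing two ACS estimates, sketched under the assumption that the published margins are at the 90 percent confidence level (the Census Bureau's standard for ACS tables). The income figures and the function name are invented for illustration.

```python
import math

Z_90 = 1.645  # z-score for a 90 percent confidence level


def differs_significantly(est_a, moe_a, est_b, moe_b):
    """True if two estimates differ by more than sampling error explains."""
    se_a = moe_a / Z_90          # convert each margin to a standard error
    se_b = moe_b / Z_90
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(z) > Z_90


# Hypothetical median household incomes with margins of error.
print(differs_significantly(52_000, 3_000, 50_000, 2_500))  # gap within the noise
print(differs_significantly(60_000, 2_000, 50_000, 2_500))  # gap larger than the noise
```

If the first comparison were a ranking of metro areas, the two places would be a statistical tie.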
Be creative when you look for duplicates
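
A small sketch of one creative approach: standardize the names before comparing them, so that differences in case, punctuation and spacing do not hide a duplicate. The names and the normalize helper are hypothetical, and real data will usually need more rules than these.

```python
import re
from collections import defaultdict


def normalize(name):
    """Lowercase, turn punctuation into spaces, collapse repeated spaces."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    return " ".join(name.split())


# Hypothetical records that an exact-match check would treat as different.
records = ["Smith, John", "SMITH  JOHN", "smith john.", "Jones, Mary"]

groups = defaultdict(list)
for record in records:
    groups[normalize(record)].append(record)

# Keep only the groups with more than one original spelling.
likely_dupes = {key: names for key, names in groups.items() if len(names) > 1}

print(likely_dupes)
```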
Beyond the basics

 Keep a notes file
 Don’t work off your original database
 Know the source
 Check against summary reports
 Use the right tool
 Check for outliers when it comes to ups and
 downs
Truck accidents by year and agency
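
Checking for suspicious ups and downs can be sketched as a year-over-year comparison. The accident counts are invented, and the 50 percent threshold is an arbitrary assumption; a sudden drop may mean a reporting change rather than safer roads.

```python
# Hypothetical accident counts by year for one agency.
counts_by_year = {2005: 410, 2006: 425, 2007: 398, 2008: 120, 2009: 415}

THRESHOLD = 0.5  # assumed: flag any change larger than 50 percent


def flag_jumps(series, threshold=THRESHOLD):
    """Return (year, percent change) for each year with an unusual swing."""
    years = sorted(series)
    flagged = []
    for prev, cur in zip(years, years[1:]):
        if series[prev] == 0:
            continue  # percent change is undefined from a zero base
        change = (series[cur] - series[prev]) / series[prev]
        if abs(change) > threshold:
            flagged.append((cur, round(change * 100, 1)))
    return flagged


flagged = flag_jumps(counts_by_year)
print(flagged)
```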
Beyond the basics

 Check with experts
 Are there standards? (e.g., a drop of more than
 10 percentage points is a red flag)
 Find out what others have done
 Gut check
 Go physically see a record or spot check
 against documents
Voter Fraud

Dozens of St. Louis voters are being wrongly accused
of casting ballots from fraudulent addresses in last
year's Nov. 7 election.

They are among thousands of registered voters who,
based on city property records, appear to live on
vacant lots.
Texas test score data: official
results versus district
     [Figure: Duncanville district reported 4th grade writing]
     [Figure: Official report for Duncanville 4th grade writing]
Courtesy Holly Hacker, The Dallas Morning News
Three rounds of analysis
  after bouncing off subjects
  and experts
  Demographically based
  Voir dire
  Socioeconomics
Checks when you’re matching data

A name is not enough. Lots of people have the same name



                                     Get dates of birth and
                                     other information to
                                     make sure you have
                                     the correct person.
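
A sketch of why the second identifier matters when matching two lists. All names, dates and offenses are invented, and the helper functions are hypothetical; even a name plus date of birth match still deserves a check against the documents.

```python
from datetime import date

# Hypothetical employee and conviction records.
employees = [
    {"name": "John Smith", "dob": date(1970, 1, 15)},
    {"name": "Maria Garcia", "dob": date(1982, 6, 3)},
]
convictions = [
    {"name": "JOHN SMITH", "dob": date(1988, 11, 2), "offense": "theft"},
    {"name": "Maria Garcia", "dob": date(1982, 6, 3), "offense": "assault"},
]


def norm(name):
    """Lowercase and collapse spaces so formatting does not block a match."""
    return " ".join(name.lower().split())


def key(record):
    """Match key: standardized name plus date of birth."""
    return (norm(record["name"]), record["dob"])


conviction_names = {norm(c["name"]) for c in convictions}
conviction_keys = {key(c) for c in convictions}

# Matching on name alone sweeps in a different John Smith.
name_only = [e["name"] for e in employees if norm(e["name"]) in conviction_names]

# Requiring the date of birth as well drops the false match.
name_and_dob = [e["name"] for e in employees if key(e) in conviction_keys]

print("Name only:", name_only)
print("Name and date of birth:", name_and_dob)
```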
Source: Illinois health data, police data

Findings: Dangerous systemic failures left elderly patients unprotected in
Illinois nursing homes that also house mentally ill younger residents,
including murderers, sex offenders, and armed robbers.
Even people with seemingly unique names aren’t so unique
Evaluating outside studies

 Get the questionnaire and methodology
 Beware of nonscientific methods: Web surveys,
 man on the street
 Know the sample size and sampling error
 Account for margin of error and non-response
 when drawing conclusions
 Run statistical tests on the data if possible
Reporting data
 Consider reporting rates not raw numbers
 Avoid false precision: 53.14 percent said … in a poll
 with a 5 percentage point margin of error
 Avoid number overload. About half is usually just as
 useful as 51 percent in most cases
 Adjust money for inflation
 When analyzing income, use median rather than
 average (Bill Gates factor)
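
Three of the points above, sketched with invented numbers: a rate per 100,000 residents, an inflation adjustment, and median versus average income. The city names, counts, incomes and index values are all hypothetical; a real adjustment would use published Consumer Price Index figures.

```python
from statistics import mean, median

# Rates, not raw numbers: (incidents, population) for two made-up cities.
crimes = {"Smallville": (50, 10_000), "Metropolis": (400, 1_000_000)}
rates = {city: n * 100_000 / pop for city, (n, pop) in crimes.items()}


def adjust_for_inflation(amount, index_then, index_now):
    """Restate an old dollar amount in current dollars using a price index."""
    return amount * index_now / index_then


# Hypothetical index values, not real CPI figures.
adjusted = adjust_for_inflation(100, 150.0, 225.0)

# The Bill Gates factor: one very large income drags the average up.
incomes = [30_000, 35_000, 40_000, 45_000, 50_000_000]

print("Rates per 100,000:", rates)
print("Adjusted amount:", adjusted)
print("Average income:", mean(incomes))
print("Median income:", median(incomes))
```

Metropolis has eight times as many incidents, but Smallville's rate is far higher; the median income describes the typical person while the average does not.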
When the data is the problem – you might still
have a story

Erroneous government databases – can often
be a story themselves
Manipulating data for stories
         and apps
Know which tool to use

•   Reporting individual records
•   Counting/summing
•   Mapping
•   Statistics
Source: Medicaid
outcomes data for
dialysis facilities

Findings: A CMS
online tool did not
tell the whole story
about facilities. In
some counties, the
gaps in measures
such as survival
rate were vast.
Source: Washington Health Department data
Findings: “MRSA has been quietly killing in hospitals for decades.” But no
one had tracked it until this story.
Source: Dept. of Ed data and surveys of campus crisis clinics

Findings: Many campuses had lax enforcement, and reporting loopholes
mean problems go unchecked.
Source: EPA and state data on hazardous chemical locations
Findings: Dallas County has 900+ sites that store hazardous chemicals
Source: Dam
inspection data
from Texas and
federal government

Findings: Dam
records had not
been updated to
account for
population growth
Source: 311 calls for downed trees

Findings: After a tornado swept across New York City, 311
calls for downed trees help trace its path
Source: City Budget

Findings: Some neighborhoods suffer
more than others as mayor cuts budgets
Disparities in water
usage

  “Water use highest in
  poor areas of the city”
  Mapping and statistical
  analysis
Presenting the data
  Include a methodology explaining what you did and
  what you don’t know.
  For really complicated analyses – consider a super
  nerdy white paper explaining all of your findings
  If you make data downloadable – include field
  descriptions and anything users should watch for
For more information

www.ire.org
www.propublica.org

jennifer.lafleur@propublica.org

Ona 2012

  • 1. Data: Collect, Clean and Manipulate ONA 2012 San Francisco Jennifer LaFleur, ProPublica jennifer.lafleur@propublica.org @j_la28
  • 2. Why data? It takes you beyond the anecdote It’s easier than counting sheets of paper
  • 3.
  • 4. Why data? Contrasts are in the data
  • 5. Caution: This slide contains extreme nerdiness
  • 6.
  • 7.
  • 8. Why Computer-Assisted Reporting? Contrasts are in the data Your most powerful figures are in the data
  • 9. Source: California Health Dept. data, Medicare billing data Findings: Some hospitals had “alarming rates of a Third World nutritional disorder among its Medicare patients.”
  • 10.
  • 11. Why data? Contrasts are in the data Your most powerful figures are in the data You can make connections you might not be able to make otherwise
  • 12. Data: Youth prison workers, criminal convictions and grievance data Findings: Employees with criminal backgrounds were more likely to be accused of abusing inmates.
  • 13. Data: Federal bridge inspections and stimulus funding. Findings: Some of the nation’s worst bridges did not get stimulus funds.
  • 14. Why data? Contrasts are in the data Your most powerful figures are in the data You can make connections you might not be able to make otherwise You can test assumptions
  • 15. Source: NHTSA complaint data Findings: “…unintended acceleration has been a problem across the auto industry.”
  • 18. Where’s the data? If something is inspected Licensed Enforced or Purchased …There probably is a database
  • 19. Where’s the data? If there is a report Or a form There probably is a database
  • 20.
  • 21. Where’s the data? Sometimes data is readily available online for download
  • 22. Source: Census Findings: “Fueled by the dismal economy and high unemployment, more Americans…are doubling up”
  • 23. Source: Medicaid nursing home survey data and finance data, housing data Findings: “…a shortage of places for the disabled to live outside a nursing home and regulations that critics say make it hard to qualify for home services mean many who want out continue to receive expensive nursing care.”
  • 24. Where’s the data? Sometimes you have to scrape it. That usually involves programs that automate searching tasks on Web sites.
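
A minimal sketch of what such a scraping program might look like, using only Python's standard library. The HTML fragment, its table layout, and the class name `TableScraper` are assumptions made up for illustration. A real scraper would fetch pages from the site itself and would have to match that site's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a search-results page.
SAMPLE_HTML = """
<table>
  <tr><th>Facility</th><th>Inspections</th></tr>
  <tr><td>Facility A</td><td>12</td></tr>
  <tr><td>Facility B</td><td>7</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every table cell, one list per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()


scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
header, *records = scraper.rows
```

Each entry in `records` is one table row, ready to be written to a spreadsheet or database.
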
  • 25.
  • 26. Where’s the data? More often you need to go to an agency to get the data This can be tricky if an agency doesn’t want to release it. (Stay tuned for more on that…)
  • 27. Source: School district credit card purchases Findings: District card holders made questionable purchases with their cards.
  • 28.
  • 29.
  • 30. Sometimes, there is no data. But it’s okay because there are techniques for sampling and building a database.
  • 31. ProPublica pulled a random sample of 500 names from a list of individuals who had been granted or denied pardons (around 2,000). We created a database from months of researching individuals: their crime, age, sentence… We found that even after controlling for other factors, whites were more likely to get a pardon.
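
A sketch of how a random sample like this could be drawn. The slide does not say how the sample was actually pulled, so this assumes a simple random sample without replacement. The applicant list and the seed are placeholders.

```python
import random

# Hypothetical stand-in for the list of roughly 2,000 pardon applicants.
applicants = [f"applicant_{i:04d}" for i in range(2000)]

# A fixed seed (an arbitrary choice here) lets the same draw be reproduced later.
rng = random.Random(2012)

# Simple random sample of 500 names, drawn without replacement.
sample = rng.sample(applicants, 500)
```

Recording the seed in a notes file makes it possible to show later exactly which names were selected.
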
  • 32.
  • 33. Source: Loan details, foreclosure information and bankruptcy filings Findings: Loans leading to foreclosure didn’t always follow conventional wisdom
  • 34. When you have to ask for the data Before filing a request: Ask for it If they require a formal request, find out who it should go to and what you should ask for Letter should describe what you’re asking for Note that you’re willing to negotiate Ask for a cost estimate
  • 35. Dear Records Administrator: I’m writing to request under the Texas Public Information Act an electronic copy of the current health-related services registry database for the state of Texas. I also am requesting electronic copies or a database of all complaints filed against health-related service registry members since Jan. 1, 2000. I frequently deal with large raw databases, so I would be able to accept information in several formats including ASCII, dbf, xls, etc… and can accept the data on a variety of media (computer tape, CD-ROM, FTP, email attachment, etc...). Please include record layouts, code sheets or any other documentation necessary to interpret the data. I am requesting all data fields. If there are any fields that you must withhold by law, please let me know what those fields are, so I can amend my request. In the interest of expediency, and to minimize the research and/or duplication burden on your staff, I would be happy to speak with your database administrator to figure out a method that is easiest for you. If you have questions or need more information, please contact me by telephone or email. My telephone number is: 214-977-8509. My email address is jlafleur@dallasnews.com. If you will be charging processing fees, please send me an itemized estimate explaining how the costs were calculated.
  • 36. Getting electronic information Know the law. Know how your state treats (or doesn’t) the records you need. Know what information you want. Do your homework Know what the appropriate cost should be. Know who does the data entry. Get to know Leon When something may not clearly be public use your sourcing
  • 37. Just another way of saying no Huge costs Delay tactics “Oh you silly little journalist” Sending you the wrong thing “Your request was unclear” HIPAA Privacy Privatization
  • 39. Our database is on a mainframe and it’s very complicated, Missy
  • 40.
  • 41. We don’t have the authority to do that
  • 42.
  • 44.
  • 45. We have processed your request. The labor cost for the request is as follows. Item: # of hours. RESEARCH: 20; CREATING FILES: 6; CODING: 24; TESTING: 4. Total (54 × $72) = $3,888.00
  • 46. From Texas Public Information Act: 111.67. Estimates and Waivers of Public Information Charges (a) A governmental body is required to provide a requestor with an itemized statement of estimated charges if charges for copies of public information will exceed $40, or if a charge in accordance with §111.65 of this title (relating to Access to Information Where Copies Are Not Requested) will exceed $40 for making public information available for inspection. A governmental body that fails to provide the required statement may not collect more than $40. The itemized statement must be provided free of charge and must contain the following information:
  • 50.
  • 51. We don’t keep that on computer
  • 52. Okay, we do, but it’s a lot of files
  • 54.
  • 55.
  • 56.
  • 57.
  • 58.
  • 60. Remember that data are not perfect
  • 61. It doesn’t mean you can’t use it… Do integrity checks to find the flaws Add caveats where necessary Do your own analysis rather than relying on an agency’s analysis of bad data
  • 62. Integrity checks for every data set Read the documentation. Understand the contents of every field. Know how many records you should have. Check counts and totals against reports. Are all possibilities included? All states, all counties, correct ranges?
  • 63. Integrity checks for every data set Internal data checks: Is there more money going to sub-contractors than went to the prime contractor? Are there more teachers than students? Do people have birth dates in the future or so long ago they would be long gone?
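
The internal checks above can be sketched in a few lines of Python. All records, field names, the "as of" date, and the 115-year age cutoff are assumptions chosen for illustration.

```python
from datetime import date

# Hypothetical records; the field names are made up for this example.
schools = [
    {"name": "School A", "teachers": 40, "students": 600},
    {"name": "School B", "teachers": 350, "students": 35},  # likely swapped or mistyped
]
people = [
    {"name": "Person 1", "birth_date": date(1975, 3, 2)},
    {"name": "Person 2", "birth_date": date(2090, 1, 1)},   # in the future
    {"name": "Person 3", "birth_date": date(1850, 6, 9)},   # implausibly long ago
]


def implausible_birth_date(birth_date, today, max_age=115):
    """True if the date is in the future or implies an age above max_age years."""
    age_years = (today - birth_date).days / 365.25
    return birth_date > today or age_years > max_age


# Fixed "as of" date so the check gives the same answer every time it runs.
today = date(2012, 9, 22)

more_teachers_than_students = [
    s["name"] for s in schools if s["teachers"] > s["students"]
]
bad_birth_dates = [
    p["name"] for p in people if implausible_birth_date(p["birth_date"], today)
]
```

Records flagged this way are leads to check against documents or with the agency, not proof of an error.
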
  • 64.
  • 65.
  • 66. If your data is in Excel, use the filter function to see what the values are in individual fields.
  • 67. Integrity checks for every data set Check for missing data, misplaced data or blank fields Use a standard naming convention for files and tables (I wouldn’t recommend “final”) Check for duplicates Take margins of error into account if necessary (important if you’re using Census data).
  • 68. 2010 Census ACS: Median HH Income by Metro Area
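
One way to take margins of error into account when comparing two ACS estimates, sketched below. The Census Bureau publishes ACS margins of error at the 90 percent confidence level, so each standard error is the margin of error divided by 1.645. The income figures are hypothetical, not values from the slide.

```python
import math

# ACS margins of error are published at the 90 percent confidence level.
Z_90 = 1.645


def significantly_different(est_a, moe_a, est_b, moe_b, z=Z_90):
    """True if two estimates differ by more than sampling error would explain."""
    se_a = moe_a / z  # convert each margin of error to a standard error
    se_b = moe_b / z
    z_score = abs(est_a - est_b) / math.sqrt(se_a**2 + se_b**2)
    return z_score > z


# Hypothetical median household incomes and margins of error for metro areas.
close_call = significantly_different(52_000, 1_500, 53_000, 1_800)
clear_gap = significantly_different(52_000, 1_500, 60_000, 1_800)
```

In this sketch the $1,000 gap is within sampling error, so ranking one metro area above the other would not be justified. The $8,000 gap is not.
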
  • 69. Be creative when you look for duplicates
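
A sketch of one "creative" duplicate check: normalize names before comparing them, so that differences in case, punctuation, and spacing do not hide repeats. The vendor names and the `normalize` helper are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical vendor names as they might appear in a purchasing database.
names = ["ACME Supply, Inc.", "Acme Supply Inc", "acme  supply inc.", "Baker & Sons"]


def normalize(name):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


# Group the original spellings under their normalized form.
groups = defaultdict(list)
for name in names:
    groups[normalize(name)].append(name)

possible_duplicates = {key: vals for key, vals in groups.items() if len(vals) > 1}
```

Grouping keeps the original spellings, so each possible duplicate can be reviewed by hand before records are merged.
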
  • 70. Beyond the basics Keep a notes file Don’t work off your original database Know the source Check against summary reports Use the right tool Check for outliers when it comes to ups and downs
  • 71. Truck accidents by year and agency
  • 72. Beyond the basics Check with experts Are there standards? (ex: a drop by more than 10 percentage points is a red flag) Find out what others have done Gut check Go physically see a record or spot check against documents
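
A sketch of a red-flag check for year-to-year swings. It applies the 10 percentage point example from the slide as the threshold and flags rises as well as drops, which is a choice made for this example. The rates are hypothetical.

```python
# Hypothetical passing rates (percent) by year for one school.
rates = {2008: 71.0, 2009: 73.5, 2010: 58.0, 2011: 74.0}

# Threshold in percentage points; the right standard depends on the data.
THRESHOLD = 10.0

years = sorted(rates)

# Compare each year with the one before it and keep the large swings.
flags = [
    (prev, curr, round(rates[curr] - rates[prev], 1))
    for prev, curr in zip(years, years[1:])
    if abs(rates[curr] - rates[prev]) > THRESHOLD
]
```

Each flagged pair is a prompt to ask an expert or check documents, since a large swing can be real or can be a data error.
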
  • 73. Voter Fraud Dozens of St. Louis voters are being wrongly accused of casting ballots from fraudulent addresses in last year's Nov. 7 election. They are among thousands of registered voters who, based on city property records, appear to live on vacant lots.
  • 74.
  • 75. Texas test score data: official results versus district. Duncanville district reported 4th grade writing. Official report for Duncanville 4th grade writing. Courtesy Holly Hacker, The Dallas Morning News
  • 76. Three rounds of analysis after bouncing off subjects and experts Demographically based Voir dire Socioeconomics
  • 77. Checks when you’re matching data A name is not enough. Lots of people have the same name Get dates of birth and other information to make sure you have the correct person.
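
A sketch of why a name alone is not enough, comparing a name-only match with a match on name plus date of birth. All people, field names, and the `match_key` helper are invented for illustration.

```python
from datetime import date

# Hypothetical records from two separate sources.
employees = [
    {"name": "John Smith", "dob": date(1970, 5, 1)},
    {"name": "Maria Lopez", "dob": date(1982, 11, 23)},
]
convictions = [
    {"name": "JOHN SMITH", "dob": date(1991, 2, 14)},   # same name, different person
    {"name": "Maria Lopez", "dob": date(1982, 11, 23)},
]


def match_key(record):
    """Normalized name plus date of birth, used as the matching key."""
    name = " ".join(record["name"].lower().split())
    return (name, record["dob"])


conviction_names = {match_key(c)[0] for c in convictions}
conviction_keys = {match_key(c) for c in convictions}

# Matching on name alone pulls in the wrong John Smith.
name_only = [e["name"] for e in employees if match_key(e)[0] in conviction_names]

# Requiring the date of birth to agree removes the false match.
name_and_dob = [e["name"] for e in employees if match_key(e) in conviction_keys]
```

Even a match on name and date of birth is a lead to confirm with other information, not a final identification.
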
  • 78. Source: Illinois health data, police data Findings: Dangerous systemic failures left elderly patients unprotected in Illinois nursing homes that also house mentally ill younger residents, including murderers, sex offenders, and armed robbers.
  • 79. Even people with seemingly unique names aren’t so unique
  • 80. Evaluating outside studies Get the questionnaire and methodology Beware of nonscientific methods: Web surveys, man on the street Know the sample size and sampling error Account for margin of error and non-response when drawing conclusions Run statistical tests on the data if possible
  • 81.
  • 82. Reporting data Consider reporting rates not raw numbers Avoid false precision: 53.14 percent said … in a poll with a 5 percentage point margin of error Avoid number overload. About half is usually just as useful as 51 percent in most cases Adjust money for inflation When analyzing income, use median rather than average (Bill Gates factor)
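
Three of these points can be sketched with small calculations: rates versus raw numbers, median versus average, and inflation adjustment. Every figure below is hypothetical, including the price index values.

```python
from statistics import mean, median


def rate_per_100k(count, population):
    """Express a raw count as a rate per 100,000 people."""
    return count * 100_000 / population


# City A has five times as many incidents, but City B has the higher rate.
city_a_rate = rate_per_100k(500, 1_000_000)
city_b_rate = rate_per_100k(100, 50_000)

# One extreme income (the "Bill Gates factor") drags the average far upward
# while the median stays at the middle value.
incomes = [30_000, 40_000, 50_000, 60_000, 1_000_000_000]
average_income = mean(incomes)
median_income = median(incomes)


def adjust_for_inflation(amount, index_then, index_now):
    """Restate an older dollar amount in current dollars using a price index."""
    return amount * index_now / index_then


# Hypothetical index values; real work would use a published price index.
adjusted = adjust_for_inflation(1_000, 100.0, 125.0)
```

Here the average income is roughly $200 million while the median is $50,000, which is why the median is the safer figure to report.
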
  • 83. When the data is the problem – you might still have a story Erroneous government databases – can often be a story themselves
  • 84.
  • 85.
  • 86. Manipulating data for stories and apps
  • 87. Know which tool to use • Reporting individual records • Counting/summing • Mapping • Statistics
  • 88. Source: Medicaid outcomes data for dialysis facilities Findings: A CMS online tool did not tell the whole story about facilities. In some counties, the gaps in measures such as survival rate were vast.
  • 89.
  • 90. Source: Washington Health Department data Findings: “MRSA has been quietly killing in hospitals for decades.” But no one had tracked it until this story.
  • 91. Source: Dept. of Ed data and surveys of campus crisis clinics Findings: Many campuses had lax enforcement, and reporting loopholes mean problems go unchecked.
  • 92.
  • 93. Source: EPA and state data on hazardous chemical locations Findings: Dallas County has 900+ sites that store hazardous chemicals
  • 94. Source: Dam inspection data from Texas and federal government Findings: Dam records had not been updated to account for population growth
  • 95. Source: 311 calls for downed trees Findings: After a tornado swept across New York City, 311 calls for downed trees helped trace its path
  • 96.
  • 97. Source: City Budget Findings: Some neighborhoods suffer more than others as mayor cuts budgets
  • 98.
  • 99.
  • 100. Disparities in water usage “Water use highest in poor areas of the city” Mapping and statistical analysis
  • 101. Presenting the data Include a methodology explaining what you did and what you don’t know. For really complicated analyses – consider a super nerdy white paper explaining all of your findings If you make data downloadable – include field descriptions and anything users should watch for