Becca Aaronson's presentation from "Visualizing Health Care Data," a ReportingOnHealth.org webinar, 7.23.15
http://www.reportingonhealth.org/content/visualizing-health-care-data
4. A good source
The source of your data — the entity, and
individuals who collected, analyzed and
published the data — must be reliable. And as
a reporter, you must consider any bias that
occurred during the collection, analysis or
interpretation of the data, just as you would
when considering a human source.
5. Census Data
Here’s how the U.S. Census categorizes “White” people:
“White. A person having origins in any of the original
peoples of Europe, the Middle East, or North Africa. It
includes people who indicate their race as ‘White’ or report
entries such as Irish, German, Italian, Lebanese, Arab,
Moroccan, or Caucasian.”
6. The “Lie Factor”
“Lie factor” = Size of the effect shown on the graphic / Size
of the effect in the data
Every graphic’s “lie factor” should have a value between
0.95 and 1.05.
9. Design variation
Changing the design of a graphic can confuse or mislead
readers, especially if graphics with varying designs are
side-by-side or not clearly labelled.
11. Tufte’s 6 principles:
1. The representation of numbers, as physically measured on the surface of the
graphic itself, should be directly proportional to the numerical quantities
represented.
2. Clear, detailed, and thorough labeling should be used to defeat graphical distortion
and ambiguity.
3. Show data variation, not design variation.
4. In time-series displays of money, deflated and standardized units of monetary
measurement are nearly always better than nominal units.
5. The number of information-carrying dimensions depicted should not exceed the
number of dimensions in the data.
6. Graphics must not quote data out of context.
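Principle 4 is easy to put into practice. Here is a minimal Python sketch of converting nominal dollars into constant dollars before charting a time series. The spending figures and price-index values are hypothetical placeholders, not real data, and the function name to_real_dollars is made up for this example.

```python
# Hypothetical nominal spending (in dollars) and hypothetical price-index values.
# Neither series is real data; substitute figures from your own sources.
nominal_spending = {2000: 100.0, 2005: 120.0, 2010: 150.0}
price_index = {2000: 80.0, 2005: 90.0, 2010: 100.0}


def to_real_dollars(nominal, index, base_year):
    """Express each year's amount in base-year dollars."""
    base = index[base_year]
    return {year: amount * base / index[year] for year, amount in nominal.items()}


real_spending = to_real_dollars(nominal_spending, price_index, base_year=2010)
for year in sorted(real_spending):
    print(year, round(real_spending[year], 2))
```

With these made-up numbers, nominal spending grows 50 percent from 2000 to 2010, but spending in 2010 dollars grows only 20 percent. A chart of the nominal series would overstate the change.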
12. The population map
WARNING: Don’t accidentally make a
population map.
Here’s a good article on when maps shouldn’t
be maps. And check out Darla Cameron’s 2015
NICAR lightning talk on alternative solutions.
13. Map: 2011 American Community Survey, poverty levels
By raw number...
By percent...
14. Politics of Prevention
Let’s take a look at my fellowship project
together, and the data visualizations created to
support the series.
Map: Find Texas Remaining Abortion Clinics
15. How much is a limb worth?
ProPublica’s “How Much is a Limb Worth?” is
amazing for many reasons, particularly how
they visualized the data. Let’s watch this short
clip of Scott Klein and Lena Groeger explaining
how they built it.
18. Los Angeles children
Let’s say you’re working on a story about how poverty impacts
children with disabilities living in the Los Angeles area.
According to 2013 American Community Survey data, a
greater proportion of Los Angeles children with disabilities
had incomes below the federal poverty level in the past
12 months than children without disabilities.
Let’s visualize it!
19. 1. Go to Chartbuilder. Click “Chart grid” in Step
1, then delete the default data in Step 2.
2. Open this spreadsheet.
3. Copy the data on the worksheet “Under 18 -
Poverty Level”
4. Paste your data in Step 2.
21. 5. Under Step 3, select 2 rows and 1 column.
6. You still need to label the data. On Step 4, add
“%” or “ percent” as a suffix
7. Add a title and source information on Step 5.
8. Download your image, and you’re ready to go.
23. Let’s take a step back...
The data you just copy/pasted was cleaned up from
American Community Survey data. Let’s go through how
we got the data ready for presentation.
24. Get the data
● Download the raw data or view the original data in our spreadsheet on
the worksheet labelled “ACS_13_1YR_B18130_with_ann.” The
fields we want to use are highlighted in blue.
● Copy all of the columns from the first blue column to the last blue
column — HD01_VD02 to HD01_VD15. In your own
spreadsheet, create a new worksheet titled “Data fields.” Right
click on cell A1 and select “Paste special > Paste transpose.”
● Put your cursor on B1, click the arrow on the top right corner,
and select “Sort Sheet A-Z”
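If you would rather script the transpose-and-sort step than do it in a spreadsheet, it might look like this in Python. This is only a sketch: the field IDs follow the HD01_VDxx pattern from the worksheet, but the descriptions are placeholders, not the real ACS labels.

```python
# Hypothetical excerpt: one row of field IDs and one row of descriptions.
# The descriptions are placeholders, not the actual ACS column labels.
rows = [
    ["HD01_VD03", "HD01_VD02", "HD01_VD04"],
    ["description C", "description A", "description B"],
]

# Transpose: each original column becomes a row of [field ID, description].
transposed = [list(column) for column in zip(*rows)]

# Sort by the second column (column B in the spreadsheet), A to Z.
sorted_rows = sorted(transposed, key=lambda row: row[1])

for field_id, description in sorted_rows:
    print(field_id, description)
```

Sorting on the description column groups related fields together, which makes the next steps easier to double-check.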
25. Here’s a screenshot of the worksheet “Data fields,” which
includes the information we want to analyze to see how many Los
Angeles children with disabilities are also below poverty level,
compared to children without disabilities.
26. Next, we want to add the children “Under 5 Years” and “5-17 Years”
by disability and income status to get an estimate for all children in
each of the sub-groups. Create a new worksheet titled
“Calculations” and set up the following structure:
27. Now, copy the data you’ll be working with and paste it underneath. For ease of
reference, I’ve just brought over the 4 data fields we need to add together, and
organized them by age group. You can bring over all of your data, but make sure to
widen the header on the data description column, so that you can double-check you’re
referencing the correct fields.
28. Sum the two age groups with corresponding disability and income status
to fill in your spreadsheet:
29. Next, calculate what percent of children with disabilities fall into each
income range. Then do the same for children with no disabilities.
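The sum-then-percent calculation can also be sketched in Python. The counts below are hypothetical, not the actual ACS estimates, and the category labels are shortened stand-ins for the spreadsheet's data descriptions.

```python
from collections import defaultdict

# Hypothetical counts, not the actual ACS estimates.
# Keys: (age group, disability status, poverty status).
counts = {
    ("Under 5", "With a disability", "Below poverty level"): 100,
    ("Under 5", "With a disability", "At or above poverty level"): 300,
    ("5-17", "With a disability", "Below poverty level"): 500,
    ("5-17", "With a disability", "At or above poverty level"): 1100,
    ("Under 5", "No disability", "Below poverty level"): 2000,
    ("Under 5", "No disability", "At or above poverty level"): 6000,
    ("5-17", "No disability", "Below poverty level"): 6000,
    ("5-17", "No disability", "At or above poverty level"): 18000,
}

# Step 1: add the two age groups together for each disability/poverty combination.
totals = defaultdict(int)
for (age_group, disability, poverty), count in counts.items():
    totals[(disability, poverty)] += count

# Step 2: divide each total by the total for that disability group.
percents = {}
for (disability, poverty), total in totals.items():
    group_total = sum(t for (d, _), t in totals.items() if d == disability)
    percents[(disability, poverty)] = round(100 * total / group_total, 1)

for key, value in percents.items():
    print(key, value)
```

The percentages within each disability group should add up to 100, which is a quick way to confirm you referenced the correct fields.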
30. Now you’re ready to chart!
Let’s try another charting tool, Datawrapper.de.
Open the website and select “+New Chart” in the
upper right corner. You can copy/paste the
estimated totals that we just calculated. If you
have a large dataset, you can also save it as a
.csv and upload the file.
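If your calculations live in a script rather than a spreadsheet, here is one way to produce the .csv with Python's standard csv module. The values and the file name are hypothetical, and to_csv_text is a helper name made up for this sketch.

```python
import csv
import io

# Hypothetical percentages, not the actual ACS results.
rows = [
    ["Group", "Below poverty level", "At or above poverty level"],
    ["With a disability", 30.0, 70.0],
    ["No disability", 25.0, 75.0],
]


def to_csv_text(table):
    """Return the table as CSV text, one line per row."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()


csv_text = to_csv_text(rows)
print(csv_text)

# To save a file for upload, write the same text to disk (file name is an example):
# with open("poverty_by_disability.csv", "w", newline="") as f:
#     f.write(csv_text)
```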
32. Click on a column header to change the format, add prefixes or
suffixes (like ‘$’ or ‘%’) or hide a column of data from the
visualization.
33. Test different layouts
See what the data looks like as a “bar chart” or
“column chart” and other views.
Click on “2 Check & Refine” to go back a step.
On the top left corner of your data table, click
“Transpose,” then “Proceed.” Now look at the
various chart types. Notice a difference?
34. Prepare to publish
Click “Refine” to choose different colors for your
chart. Click “Annotate” to add a title and source
information about your data.
When you’re ready, hit “Publish.” You’ll need an
account to save your graphic.
Editor's Notes
I’m a big fan of the phrase “interviewing the data,” because a good data source shouldn’t be considered that different from a good human source.
When interviewing a person, a journalist thinks, “How reliable is this person? How can I trust that the information they give me is accurate? What bias might this person have that’s reflected in their account or opinions?” We have many ways of answering those questions. You know a person is reliable, because they are an academic and noted expert in the field of interest. You can trust a person’s account of a hospital-horror tale, because that person provided documentation or paperwork that backs up his or her claim about what happened. And you know that if a person works for an insurance company, or a policy firm with a political leaning, their opinion and account must be balanced by evidence from an opposing point-of-view.
Many people rely on U.S. Census Bureau data because it is seemingly unbiased and reliable. While it’s a great source, and neutral compared to many datasets, it also relies on certain assumptions, and is collected and organized in a way that may not wholly reflect the people it’s categorizing or others’ assumptions about how the data should be organized. For example, “Hispanic” (or “Latino”) is considered an ethnicity, not a race, so most people who identify as “Hispanic” or “Latino” choose the race “White.”
The Census also measures whether people identify as “Hispanic” or “Non-Hispanic,” and then attempts to filter out people who aren’t typically considered “White” with the category “Non-Hispanic White.”
If you’re writing a story about minorities’ access to health services in your region, and use Census data on “White” populations versus other racial groups, you’d completely miss people who identify as Hispanic or Latino, which our audience typically thinks of as a minority class. This is why it’s important to interview the data, not only to determine whether it’s a reliable source, but also to ensure that the story you tell with the data is accurate.
Edward Tufte, a famous American statistician and guru of all things data visualization, coined the “lie factor” formula for determining whether a graphic accurately portrays an effect shown by the data.
“Lie factor” = Size of the effect shown on the graphic / Size of the effect in the data
Every graphic’s “lie factor” should have a value between 0.95 and 1.05. In this example, the line representing 27.5 miles per gallon is 5.3 inches long, and the line representing 18 miles per gallon is 0.6 inches.
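To see how far off that fuel-economy graphic is, you can work the formula through in a few lines of Python. This sketch assumes the “size of the effect” is the relative change between the two values, which is my reading of Tufte's definition; the function names are made up for the example.

```python
def effect_size(first: float, second: float) -> float:
    """Relative change from the first value to the second."""
    return abs(second - first) / abs(first)


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Effect shown in the graphic divided by the effect in the data."""
    return effect_size(graphic_first, graphic_second) / effect_size(data_first, data_second)


# Values from the fuel-economy example: line lengths in inches, data in miles per gallon.
factor = lie_factor(
    graphic_first=0.6, graphic_second=5.3,
    data_first=18.0, data_second=27.5,
)
print(round(factor, 1))  # 14.8
```

The data changes by about 53 percent while the lines change by about 783 percent, so the lie factor is roughly 14.8, far outside the 0.95 to 1.05 range.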
This infographic, which Tufte uses in his book, has a “lie factor” of 2.8. The graphic uses the size of the doctors to show a change in the percentage of doctors solely devoted to family practice. It’s one-dimensional data, but the image of the doctor is two-dimensional. The illustrator scaled the height of the doctor to reflect the percentage of doctors, but the width shrinks along with it, so the area of each doctor image isn’t proportional to the data.
Can anyone tell me what this graphic, taken from the Visualizing Health gallery, is showing?
There are two really big faux pas here: design variation and a lack of clear labelling. The graphic is attempting to compare the population of Brooklyn as a part of the U.S. population to the percent of measles cases in Brooklyn and the U.S. overall. But you can’t make a direct comparison, because Brooklyn’s population as a part of the U.S. is drawn as a pie chart while the cases are laid out as a block of 100 circles. On top of all that, the labels “population” and “case” aren’t descriptive enough, so the way I’ve interpreted this graph for you could very well be wrong.
Darla Cameron, at the Washington Post, gave a great lightning talk at NICAR 2015 on how easy it is to accidentally make a population map when trying to illustrate data geographically.
Here’s a map that I made in 2011 that shows poverty levels by Texas county using American Community Survey data. If you look at the estimated number of people at or below poverty level in most of the categories, it looks like a population map. But if you look at any of the categories by percent of population living at or below poverty level, you can start to see some regional differences.
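Here is a small Python sketch of why raw counts turn into a population map. The three counties and all of their figures are hypothetical, not real ACS estimates.

```python
# Hypothetical counties; none of these figures are real ACS estimates.
counties = [
    {"name": "County A", "population": 4_000_000, "below_poverty": 600_000},
    {"name": "County B", "population": 50_000, "below_poverty": 15_000},
    {"name": "County C", "population": 800_000, "below_poverty": 80_000},
]

# Normalize the raw count by each county's population.
for county in counties:
    county["percent_below_poverty"] = 100 * county["below_poverty"] / county["population"]

# Rank the counties both ways, highest first.
by_raw_number = [
    c["name"] for c in sorted(counties, key=lambda c: c["below_poverty"], reverse=True)
]
by_percent = [
    c["name"] for c in sorted(counties, key=lambda c: c["percent_below_poverty"], reverse=True)
]

print("By raw number:", by_raw_number)
print("By percent:", by_percent)
```

Ranked by raw number, the order simply follows population size. Ranked by percent, the smallest county comes out on top, which is the kind of difference a raw-count map hides.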
This graphic is also a good example of the principle that just because the data is organized geographically doesn’t mean the best solution is to make a map. I learned this lesson the hard way, so you don’t have to! Here’s a good article on when maps shouldn’t be maps.
ProPublica has some of the most skilled data journalists and news apps developers in the industry, but don’t let them intimidate you. There are plenty of ways that reporters can enhance their storytelling with simpler data visualizations. Here’s an example of something I built with our new health care reporter to visualize the potential impact of losing Medicaid waiver funding in Texas.