2. Corvelle Drives Concepts to Completion
Yogi Schulz
Biography
Partner in Corvelle Consulting
Information technology related management
consulting
Microsoft Canada columnist & CBC Radio guest
PPDM Association board member
Industry presenter:
– Project World - 6 years
– PMI – SAC - 3 years
– CIPS – many years
– PPDM Association - several years
4. Corvelle Drives Concepts to Completion
Presentation Outline
Introduction
Learning
objectives
Nine hazards
of data
misinterpretation
Recommendations
& actions
• Insufficient Domain
Expertise
• Important Variables
Omitted
• Aggregation Obscures Truth
• Inferences are Off Base
• Sources of Variation
Overlooked
• Statistical Significance
Trumps Critical Thinking
• Numerical Analysis Missing
Something
• Correlation Mistaken for
Causation
• Explanation adds Distortion
5. Corvelle Drives Concepts to Completion
Learning Objectives
Understand how to accurately interpret data
Recognize factors
that lead to
misinterpreting data
Understand actions
to minimize risk of
misinterpreting data
6. Corvelle Drives Concepts to Completion 6
Having all this health data available
is great, but I think I need a degree
in data analytics to sort it all out.
8. Corvelle Drives Concepts to Completion
Insufficient Domain Expertise
Issues:
– Domain experts are not data scientists
– Data scientists are not domain experts
– Imbalance of expertise
Solutions:
– Is sufficient expertise involved?
– Is the result possible or even likely?
– What experience makes you skeptical?
9. Corvelle Drives Concepts to Completion
Do you have sufficient expertise and
experience to interpret the data?
“I know nothing about the subject,
but I’d be happy to give you my
expert opinion!”
10. Corvelle Drives Concepts to Completion
Important Variables Omitted
Issues:
– Too much complexity
– Odd results or strange data
– Too many variables ignored
Solutions:
– Review the procedures
– Revise the research design
– Narrow the research goal
11. Corvelle Drives Concepts to Completion 11
Seriously? No worries!
It’s always
something.
I’ve factored in
lift, thrust, drag
and wind speed . . .
Just not
gravity.
12. Corvelle Drives Concepts to Completion
Aggregation Obscures Truth
Issues:
– Story varies by aggregation level
– Aggregation produces surprising relationships
– Pilot results ambiguous
Solutions:
– Check if low level trends hold up
– Identify potential sources of variation
– Confirm research design
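A minimal sketch of how the story can vary by aggregation level. The counts are hypothetical, invented only for illustration and not taken from this presentation: treatment A has the better recovery rate inside each severity group, yet treatment B looks better once the groups are pooled.

```python
# Hypothetical (recovered, treated) counts, invented for illustration only.
results = {
    "mild":   {"A": (18, 20), "B": (68, 80)},
    "severe": {"A": (48, 80), "B": (10, 20)},
}


def rate(recovered, treated):
    """Recovery rate for one treatment within one group."""
    return recovered / treated


def pooled_rate(treatment):
    """Recovery rate for one treatment after pooling all severity groups."""
    recovered = sum(group[treatment][0] for group in results.values())
    treated = sum(group[treatment][1] for group in results.values())
    return recovered / treated


# Sub-level reports: A is ahead in both groups.
for severity, group in results.items():
    print(severity, {t: round(rate(*counts), 2) for t, counts in group.items()})

# Top-level report: B is ahead, because B was mostly given to mild cases.
print("pooled", {t: round(pooled_rate(t), 2) for t in ("A", "B")})
```

The reversal comes from the uneven mix of mild and severe cases between the two treatments, which is why checking whether low level trends hold up is worth the effort.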
Top-level report
Sub-level 1
report
Sub-level 2
report
13. Corvelle Drives Concepts to Completion
Aggregation to
confirm Trends
Use the
SAP database
to aggregate
our findings.
Then use
the survey
database.
14. Corvelle Drives Concepts to Completion
Inferences are Off Base
Issues:
– Misunderstanding group characteristics
– Rose-coloured thinking
– Ideology-based agenda
Solutions:
– Challenge your research project
– Review the statistical calculations
– Review the work
15. Corvelle Drives Concepts to Completion 15
When you two have finished
arguing your shaky inferences,
I have actual data!
16. Corvelle Drives Concepts to Completion
Sources of Variation Overlooked
Issues:
– Obvious sources overlooked
– Less obvious sources overlooked
– Hidden sources not considered
Solutions:
– Search for sources of variation
– Expand data gathering
– Look for unexpected correlations
17. Corvelle Drives Concepts to Completion
Not so subtle impacts
on research outcomes
“My diabetic research shows that
test subjects are 98% more likely
to take their diabetic pills
when the pills are covered in chocolate!”
18. Corvelle Drives Concepts to Completion
Statistical Significance
Trumps Critical Thinking
Issues:
– Lack of critical thinking
– Over-emphasizing statistical significance
– Assumptions about big data
Solutions:
– Use statistical significance as a screen
– Review hypothesis
– Develop a persuasive story
20. Corvelle Drives Concepts to Completion 20
If you torture the data
long enough, they will confess.
21. Corvelle Drives Concepts to Completion
Numerical Analysis
Missing Something
Issues:
– Superficial numerical analysis
– Insufficient analysis expertise
– Bias in analysis
Solutions:
– Visualize data
– Check for false positives and false negatives
– Verify numerical analysis independently
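A minimal sketch of why checking for false positives matters. The sensitivity, specificity and prevalence figures are hypothetical assumptions chosen for illustration; the calculation is the standard Bayes' rule for a screening test.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Share of positive results that are true positives."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def negative_predictive_value(sensitivity, specificity, prevalence):
    """Share of negative results that are true negatives."""
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


# Hypothetical test: 90% sensitive, 95% specific, condition present in 1% of people.
ppv = positive_predictive_value(0.90, 0.95, 0.01)
npv = negative_predictive_value(0.90, 0.95, 0.01)
print(f"Positive predictive value: {ppv:.1%}")
print(f"Negative predictive value: {npv:.1%}")
```

Under these assumptions most positive results are false positives even though the test looks accurate, because the condition is rare.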
22. Corvelle Drives Concepts to Completion 22
New study reveals that
reading too many studies
may cause heart disease.
23. Corvelle Drives Concepts to Completion
Correlation Mistaken
for Causation
Issues:
– Correlation described as causation
– Delusional or misleading correlations
– Weak correlation stretched to strong correlation
– Random correlation positioned as real correlation
Solutions:
– Create a plausible story
– Use correlation as scientific evidence
– Review the calculations
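A minimal sketch of how a hidden third variable can produce a strong correlation between two measures that have no direct link. All values are hypothetical and constructed for illustration: z stands for a confounder (for example, patient age) that drives both x and y. The partial correlation formula used here is the standard first-order one.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)


def partial_r(x, y, z):
    """Correlation of x and y after controlling for z."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical data: x and y each depend on z plus an unrelated wiggle.
z = [10, 12, 14, 16, 18, 20, 22, 24]
wiggle_x = [1, -1, 1, -1, 1, -1, 1, -1]
wiggle_y = [1, 1, -1, -1, 1, 1, -1, -1]
x = [2 * zi + w for zi, w in zip(z, wiggle_x)]
y = [3 * zi + w for zi, w in zip(z, wiggle_y)]

raw = pearson_r(x, y)
adjusted = partial_r(x, y, z)
print(f"raw correlation of x and y: {raw:.2f}")
print(f"after controlling for z:    {adjusted:.2f}")
```

In this constructed example the raw correlation is very strong, while the correlation that remains after controlling for z is close to zero.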
24. Corvelle Drives Concepts to Completion
Correlation ≠
Causation
I used to think
correlation implied
causation.
Then I took a
statistics class.
Now I don’t.
Sounds like the
class helped.
Well, maybe.
25. Corvelle Drives Concepts to Completion
Explanation adds Distortion
Issues:
– Overly complex explanation
– Too much jargon
– Exaggeration of inferences
Solutions:
– Express results clearly
– Develop simple, attractive charts
– Stick to the supportable inferences
26. Corvelle Drives Concepts to Completion
Dubious Explanation
of Survey Data
Public statement:
A survey of more than 25,000
Albertans reviewing the K-12
school curriculum found
“there exists a strong desire for
the removal of Shakespeare as
a required author.”
27. Corvelle Drives Concepts to Completion
Dubious Explanation
of Survey Data
Reality:
25,000 survey respondents
60 added a comment about Shakespeare
50 called for removal of Shakespeare from
curriculum
Alberta Education is interpreting comments
to mean five out of six Albertans favour
removing Shakespeare from curriculum
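The gap between the two readings is simple arithmetic. This short sketch uses only the three counts shown on this slide; the variable names are illustrative.

```python
# Counts as stated on the slide.
respondents = 25_000
commented_on_shakespeare = 60
called_for_removal = 50

# "Five out of six" is true only among people who chose to comment on Shakespeare.
share_of_commenters = called_for_removal / commented_on_shakespeare

# Among all survey respondents, the share asking for removal is tiny.
share_of_respondents = called_for_removal / respondents

print(f"Share of Shakespeare commenters: {share_of_commenters:.1%}")
print(f"Share of all respondents:        {share_of_respondents:.1%}")
```

The same 50 comments support "five out of six" or "one in five hundred" depending on the denominator, so the choice of denominator needs to be stated.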
29. Corvelle Drives Concepts to Completion
Recommendations
Improve your ability
to interpret data
Watch out for these
nine hazards of data
misinterpretation
Ensure your research
is accurate and
defensible
• Insufficient Domain Expertise
• Important Variables Omitted
• Aggregation Obscures Truth
• Inferences are Off Base
• Sources of Variation
Overlooked
• Statistical Significance Trumps
Critical Thinking
• Numerical Analysis Missing
Something
• Correlation Mistaken for
Causation
• Explanation adds Distortion
30. Corvelle Drives Concepts to Completion
Questions &
Discussion
Can you help us
understand data
better?
Please
fill out
evaluation
form
31. Corvelle Drives Concepts to Completion
Understanding Data:
What do these numbers mean?
Corvelle Consulting
300, 400 - 5 Ave. S. W.
Calgary, Alberta T2P 0L6
Phone: (403) 860-5348
E-mail: YogiSchulz@corvelle.com
Web: www.corvelle.com
Yogi Schulz
Partner of Corvelle Consulting
Information technology related
management consulting
Microsoft Canada columnist
& CBC Radio host
Industry presenter
Former PPDM Association
board member
32. Corvelle Drives Concepts to Completion
Plausible Inferences for
Ranges of p Values
Values of p          Plausible inference
p > 0.10             Not significant
0.05 < p < 0.10      Marginally significant
0.01 < p < 0.05      Fairly significant
0.001 < p < 0.01     Strongly significant
0.0 < p < 0.001      Definitely significant
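A minimal sketch of the ranges on this slide as a lookup function. The slide uses strict inequalities and does not say where the boundary values belong; placing each boundary in the more significant band (so that p = 0.05 reads as "Fairly significant", in line with the speaker's note on this slide) is an assumption, as is the function name.

```python
def plausible_inference(p):
    """Map a p value to the plausible inference listed on the slide."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    # Assumption: boundary values fall into the more significant band.
    if p <= 0.001:
        return "Definitely significant"
    if p <= 0.01:
        return "Strongly significant"
    if p <= 0.05:
        return "Fairly significant"
    if p <= 0.10:
        return "Marginally significant"
    return "Not significant"


for value in (0.2, 0.07, 0.05, 0.005, 0.0005):
    print(value, plausible_inference(value))
```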
34. Corvelle Drives Concepts to Completion
Bibliography
A data analyst needs these 5 skills
– https://www.viva-viva.ca/index.php/news-events/114-a-data-analyst-needs-these-5-skills
Analyzing, Interpreting and Reporting Basic Research Results
– http://managementhelp.org/businessresearch/analysis.htm
Analyzing Data and Communicating Results
– http://strengtheningnonprofits.org/resources/e-learning/online/analyzingdata/
Believe It Or Not, Most Published Research Findings Are Probably False
– Simon Oxenham
– http://bigthink.com/neurobonkers/believe-it-or-not-most-published-research-findings-are-probably-false
Big Data Solutions for Healthcare
– Stanislas Odinot, April 11, 2013
– https://www.slideshare.net/LarryCover/big-data-solutions-for-healthcare
Big Data in Healthcare: Separating The Hype From The Reality
– Jared Crapo, Health Catalyst
– https://www.healthcatalyst.com/healthcare-big-data-realities
35. Corvelle Drives Concepts to Completion
Bibliography
Confirmationist and falsificationist paradigms of science
– Andrew, 5 September 2014
– http://andrewgelman.com/2014/09/05/confirmationist-falsificationist-paradigms-science/
Correlation, causation and coincidence
– 05 November 2015
– http://behindlabdoors.com/correlation-causation-and-coincidence/
Correlation does not imply causation
– https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation
Correlation does not imply causation – except when it does
– The Original Skeptical Raptor, August 9, 2015
– http://www.skepticalraptor.com/skepticalraptorblog.php/correlation-does-not-imply-causation-except-when-it-does/
Data Analysis, Interpretation and Presentation
– http://www.uio.no/studier/emner/matnat/ifi/INF4260/h10/undervisningsmateriale/DataAnalysis.pdf
Data Analysis and Interpretation
– Anne E. Egger, Ph.D., Anthony Carpi, Ph.D.
– http://www.visionlearning.com/en/library/Process-of-Science/49/Data-Analysis-and-Interpretation/154
Data analysis and presentation
– http://www.statcan.gc.ca/pub/12-539-x/2009001/analysis-analyse-eng.htm
36. Corvelle Drives Concepts to Completion
Bibliography
Data Collection and Interpretation
– The Gale Group Inc.
– http://www.encyclopedia.com/education/news-wires-white-papers-and-books/data-collection-and-interpretation
Dataviz: Making Smarter, More Persuasive Data Visualizations
– Scott Berinato, March 30, 2016
– https://hbr.org/webinar/2016/05/dataviz-making-smarter-more-persuasive-data-visualizations
Data-driven decision-making process
– http://www.txprofdev.org/apps/datadecisions/node/52.html
Describing and Interpreting Data
– www.uh.edu/~tech132/sln12.doc
Descriptive Statistics and Interpreting Statistics
– http://www.statisticssolutions.com/descriptive-statistics-and-interpreting-statistics/
Economics methods in Cochrane systematic reviews of health promotion and public health related interventions
– Ian Shemilt, 15 November 2006
– https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-6-55
37. Corvelle Drives Concepts to Completion
Bibliography
Excel errors and science papers
– Spreadsheets are playing havoc with scientists
– Economist, September 7, 2016
– http://www.economist.com/blogs/graphicdetail/2016/09/daily-chart-3
How science goes wrong
– Scientific research has changed the world. Now it needs to change itself
– Economist, October 21 2013
– http://www.economist.com/news/leaders/21588069-scientific-research-has-changed-world-now-it-needs-change-itself-how-science-goes-wrong
HealthToon: Taking down chronic diseases with advanced analytics
– Oliver Clark, April 17, 2015
– http://www.ibmbigdatahub.com/blog/healthtoon-taking-down-chronic-diseases-advanced-analytics
Hilarious Graphs Prove That Correlation Isn’t Causation
– https://www.fastcodesign.com/3030529/hilarious-graphs-prove-that-correlation-isnt-causation
– http://www.tylervigen.com/spurious-correlations
Interpretation of Data: The Basics
– Tania, May 30, 2014
– https://blog.udemy.com/interpretation-of-data/
38. Corvelle Drives Concepts to Completion
Bibliography
Interpreting and Presenting Data
– http://www.deq.state.or.us/lab/wqm/docs/InterpretingandPresentingData.pdf
Misconduct, not error, is the source of most retracted papers
– Ashutosh Jogalekar, October 2, 2012
– https://blogs.scientificamerican.com/the-curious-wavefunction/misconduct-and-not-error-is-the-source-of-most-retracted-papers/
9 Causes Of Data Misinterpretation
– Lisa Morgan, 17 July 2015
– http://www.informationweek.com/big-data/big-data-analytics/9-causes-of-data-misinterpretation/d/d-id/1321338
Presentation, Analysis and Interpretation of Data
– https://www.slideshare.net/31mikaella/presentation-analysis-and-interpretation-of-data
Retraction Watch
– http://retractionwatch.com/
Sample Data Interpretation Questions
– http://www.psychometric-success.com/faq/faq-sample-data-interpretation-questions.htm
Science and Engineering Practice of Analyzing and Interpreting Data
– http://ngss.nsta.org/Practices.aspx?id=4
39. Corvelle Drives Concepts to Completion
Bibliography
Statistical significance and its part in science downfalls
– Imagine if there were a simple single statistical measure everybody could use with any set of data and it would reliably separate true from false
– Hilda Bastian, November 11, 2013
– https://blogs.scientificamerican.com/absolutely-maybe/statistical-significance-and-its-part-in-science-downfalls/
Statistical Significance for CRO: 6 Things You Need to Know
– Tom Capper, May 08, 2014
– https://www.distilled.net/resources/statistical-significance-for-cro-6-things-you-need-to-know/
Statistics Done Wrong
– The woefully complete guide
– Alex Reinhart
– https://www.statisticsdonewrong.com/
Student Competencies & Requirements in Health Economics
– Cumming School of Medicine, 2017
– https://cumming.ucalgary.ca/gse/files/gse/competencies_health_economics_2017.pdf
Summary and discussion of: “Why Most Published Research Findings Are False”
– Statistics Journal Club, 36-825
– Dallas Card and Shashank Srivastava, December 10, 2014
– http://www.stat.cmu.edu/~ryantibs/journalclub/ioannidis.pdf
40. Corvelle Drives Concepts to Completion
Bibliography
Trouble at the lab
– Scientists like to think of science as self-correcting. To an alarming degree, it is not
– Economist, October 18 2013
– http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble
Understanding the Growth, Value of Healthcare Big Data Analytics
– HealthITAnalytics, July 30, 2015
– http://healthitanalytics.com/news/understanding-the-growth-value-of-healthcare-big-data-analytics
Watson Analytics
– https://www.ibm.com/analytics/watson-analytics/us-en/
What’s Significant?
– Patrick Barlow, University of Tennessee
– https://www.slideshare.net/pbbarlow1/whats-significant-hypothesis-testing-effect-size-confidence-intervals-the-pvalue-fallacy
Why Most Published Research Findings Are False
– John P. A. Ioannidis, PLoS Med 2(8): e124
– http://faculty.dbmi.pitt.edu/day/Bioinf2118/Bioinf-2118-2013/Ioannidis-journal.pmed.0020124.pdf
Editor's Notes
Understanding Data: What do these numbers mean?
My name is Yogi Schulz
Thank you to xxxx for inviting me to speak today
We all see a lot of data every day
Some of it useful, some of it suspicious, some is obviously wrong and some of it is outright bad and misleading
We’re going to spend our time together today learning how to better understand data
Presentation created by Yogi Schulz in April 2017 - YogiSchulz@corvelle.com
Data Volumes are Growing every Year
The growth is shown for many categories of health care data and for imaging in particular
Perhaps more alarmingly, the rate of growth appears to be accelerating
Perhaps this cartoon of a data flood is a better way to describe the trend that most health care organizations are choking on
It likely doesn’t matter if the cartoon characters represent clinicians, researchers, bureaucrats or care givers
In any case, this growing volume undermines our effort to understand the data
What examples of data overload have you observed or experienced?
Presentation Outline
Introduction
I’ll start with a few introductory remarks
Learning objectives
Then we’ll talk about the learning objectives for this presentation
We’ll spend most of the time today on these Nine hazards of data misinterpretation
Insufficient Domain Expertise
Important Variables Omitted
Aggregation Obscures Truth
Inferences are Off Base
Sources of Variation Overlooked
Statistical Significance Trumps Critical Thinking
Numerical Analysis Missing Something
Correlation Mistaken for Causation
Explanation adds Distortion
Recommendations & actions
We’ll wrap up with some recommendations
Nine Hazards of Data Misinterpretation
Data is misinterpreted more often than you might expect
Even with the best intentions, important variables may be omitted or a problem may be oversimplified or overcomplicated
Sometimes organizations act on trends that are not what they seem
Not surprisingly, when two people view the same analytical result, they may interpret it differently
Statistics can tell you 'this versus that'
The real questions are:
Is the difference worth worrying about?
Have we collected enough data to allow us to confidently make a decision?
Are there any arithmetic errors in the data analysis?
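The first two questions can be made concrete with a small sketch. The readmission counts below are hypothetical, invented for illustration; the calculation is a standard pooled two-proportion z-test using a normal approximation, which is one reasonable choice rather than the only one.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p value for a difference between two proportions."""
    rate_a = events_a / n_a
    rate_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical readmission rates of 10.0% versus 10.3% at two sample sizes.
small_sample = two_proportion_p_value(200, 2_000, 206, 2_000)
large_sample = two_proportion_p_value(20_000, 200_000, 20_600, 200_000)

print(f"2,000 patients per group:   p = {small_sample:.3f}")
print(f"200,000 patients per group: p = {large_sample:.4f}")
```

The same 0.3 percentage point difference is nowhere near significant in the small sample and clearly significant in the large one. Whether a difference of that size is worth worrying about is a separate, clinical question that the p value does not answer.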
It is entirely possible for health care leaders to obsess about something that is statistically insignificant, or for data scientists to omit important variables, simply because they do not understand the entire context of the problem they are trying to solve
The path to valuable insights can include a number of obstacles, some of which may not become apparent until well after the fact
Some individuals and groups take a top-down approach to data analysis, meaning that they focus on the medical problem they are trying to solve and make a point of identifying variables that have been relevant in the past in the same or a similar context
Others take a bottom-up approach, meaning that they attempt to correlate variables associated with what they are trying to improve such as readmission rates for specific conditions
The danger of the latter approach is a high probability that some correlations are statistically significant but are an artifact of the way the data has been analyzed, versus being an accurate indicator of underlying relationships
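A minimal sketch of why the bottom-up approach carries this risk. It assumes the tests are independent and that none of the variables has a real relationship with the outcome; both are simplifying assumptions, and the function names are illustrative.

```python
def chance_of_any_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests looks significant by chance."""
    return 1 - (1 - alpha) ** num_tests


def expected_false_positives(num_tests, alpha=0.05):
    """Average number of chance 'findings' among num_tests."""
    return num_tests * alpha


for num_tests in (1, 20, 100):
    print(
        f"{num_tests:>3} variables tested: "
        f"chance of a spurious hit = {chance_of_any_false_positive(num_tests):.0%}, "
        f"expected spurious hits = {expected_false_positives(num_tests):.1f}"
    )
```

Screening 20 unrelated variables at the 0.05 level gives roughly a two-in-three chance of at least one "significant" correlation that is purely an artifact of how many comparisons were made.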
Because there are a lot of ways data can be misinterpreted, we need to understand how and why it can happen
We’ll spend most of our time here today on the nine hazards of data misinterpretation
Insufficient Domain Expertise
Domain or subject matter expertise and data expertise are both necessary for the analysis and accurate interpretations of data
As our available domain expertise decreases, the reliability of our research results also decreases
Issues:
Domain experts are not data scientists
Domain experts tend not to be conversant with the techniques for analyzing data and related statistical concepts
For example: Expert clinical practitioners often have limited exposure to statistical techniques
Data scientists are not domain experts
Data scientists do not have the same level of subject matter expertise that other experts in the organization have accumulated
For example: Data scientists often don’t know much about health care or research design
Imbalance of data expertise and domain expertise
can easily lead to misinterpretation of data
For example, the data scientist often doesn't understand the medical context of the variables they're looking at
That happens a lot in large organizations where people work in silos
Solutions or risk reduction:
Before you use data to examine a situation or to make a recommendation, question:
Is sufficient expertise involved?
Would adding a business analyst add value? Business analysts operate between domain expert and data scientist, and as such, can help ensure sufficient expertise is being applied
Is the result possible or even likely in the real world?
What real-world experience makes you skeptical about the data?
What real-world experience makes you think the data makes sense?
Do you have sufficient education and experience to interpret the data?
Clearly this person has not received enough training on how to set up the gurney
“I know nothing about the subject, but I’d be happy to give you my expert opinion!”
Do you know your expert collaborator or consultant well enough to be confident of his or her relevant expertise?
It’s really easy to omit variables
Sometimes the consequences can be disastrous
Seriously?
No worries!
I’ve factored in lift, thrust, drag and wind speed . . .
Just not gravity.
It’s always something.
Think carefully about your design and procedures to minimize the risk of nasty surprises later
Aggregation to confirm Trends
Use the SAP database to aggregate our findings.
That data is wrong.
Then use the survey database.
That data is also wrong.
Can you average our findings?
Sure. I can multiply them too.
This comic is funny because using fancy arithmetic or statistics will not improve accuracy, completeness or defensibility of research findings
Fancy math will also not compensate for poorly designed research
A better approach is to pursue continuous improvement of research design, execution and reporting processes
Inferences are Off Base
What inferences can we make about the impact of treatment on the mother and her boy? The mother seems more upset than the little boy receiving the vaccination
What inferences can we make about the wisdom of skateboarding with crutches and a cast? Skateboarding is risky at the best of times; skateboarding with crutches and a cast enables us to make inferences about teenagers being oblivious to risk
Defensible inferences are essential to the process of assessing research data to form conclusions and then recommendations
Issues:
Misunderstanding group characteristics
All inferences from data are conditional, so it's wise to understand as many characteristics as possible of the group about which inferences are being made. If not, you run the risk of inferring the wrong properties about a population or a sample
For example: Is the treatment response affected by age, dosage, pre-existing conditions, sex or ethnicity?
Rose-coloured thinking
We all want to save the world; that leads us to interpret our data to be more indicative of the accuracy of our hypothesis than is supportable by the cold, hard facts
Are you stretching the data?
For example: If the treatment is effective for older adults, can I reasonably recommend the treatment for teenagers or children?
Ideology-based agenda
Sometimes conclusions are crafted to advance an ideology-based agenda
While I obviously believe it’s dishonest and unethical, we have all observed such action, particularly for political issues and sometimes to advance drug or device approvals
Solutions or risk reduction:
Challenge your research project
Are you making leaps in logic from what the data really says to your inferences?
Is there a bias in your design or data collection?
For example: do you inadvertently have only high-income trial participants due to travel requirements for tests?
Are my recommendations clearly and reasonably based on the data I have?
Review the statistical calculations with an experienced statistician
If you are not trained in statistical thinking, you will tend to misinterpret the data or the results more positively than is defensible
Review the work with an independent researcher
That review may be ego-deflating but it will save you from more public embarrassment later
“My diabetic research shows that test subjects are 98% more likely to take their diabetic pills when the pills are covered in chocolate!”
Does your research design include a bias that is causing you to overlook a source of variation?
Statistical Significance
The previous table is often charted as this ubiquitous bell curve
The term "null hypothesis" is a general statement or default position that there is no relationship between two measured phenomena, or no association among groups
Rejecting or disproving the null hypothesis — and thus concluding that there are grounds for believing that there is a relationship between two phenomena, such as that a potential treatment has a measurable effect — is a central task in the modern practice of science
The field of statistics gives precise criteria for rejecting a null hypothesis
https://en.wikipedia.org/wiki/Null_hypothesis
95% Non-significant effect
Variation is likely caused by a combination of errors and noise
2.5% Significant effect
The two 2.5% yellow areas together equal the p = 0.05 value I showed on the table on the previous slide
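A minimal sketch of where the two 2.5% tails sit on the bell curve. It assumes the test statistic follows a standard normal distribution, which is an assumption about the chart rather than something stated in these notes.

```python
from statistics import NormalDist

alpha = 0.05               # total area in the two tails
tail = alpha / 2           # 2.5% in each tail

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Cut-off beyond which a result counts as a significant effect.
critical_z = standard_normal.inv_cdf(1 - tail)

# Area between the two cut-offs: the non-significant middle of the curve.
middle_area = standard_normal.cdf(critical_z) - standard_normal.cdf(-critical_z)

print(f"critical z value: +/-{critical_z:.2f}")
print(f"middle area:      {middle_area:.0%}")
```

Under this assumption a result has to lie about 1.96 standard deviations from the centre, in either direction, before it falls into one of the tails.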
Explain more
P Value, Statistical Significance and Clinical Significance
The Journal of Clinical and Preventive Cardiology, Volume 2, Oct 2013
Padam Singh, PhD Gurgaon, India
J Clin Prev Cardiol. 2013;2(4):202-4
http://www.jcpcarchives.org/full/p-value-statistical-significance-and-clinical-significance-121.php
File: Close_to_home_2017_04_19.gif
I think the general public is confused by the seemingly contradictory findings of various studies of the same medical issue
Often these contradictory findings can be legitimately explained by the differences in the various studies that have been conducted
However, such subtleties aren’t easily explained in a short newspaper article or a TV sound bite
Number of article retractions by cause of retraction – first chart
The authors focus on papers in the life sciences
They find that about 67% of the 2047 retracted papers owed their retraction to plain old misconduct; only 21% or so can be traced back to error and honest mistakes
The misconduct can come in three forms - outright fraud, plagiarism and duplication
The other piece of bad news is that among these three, fraud contributes the most to retractions, with plagiarism and duplication trailing behind
Clearly there’s a material number of published studies that can mislead us
We need to be vigilant
I wonder what is causing the increase in the number of retractions?
More researchers creating more articles?
More stringent peer review?
More readers drawing problems to the attention of editors?
Number of article retractions by cause of retraction – second chart
Unfortunately, these retracted studies continue to create problems in research even after they have been retracted, because they continue to be cited in subsequent research articles
This data suggests harm to reputations and perhaps harm to patients can be reduced
Misconduct, not error, is the source of most retracted papers
Ashutosh Jogalekar, October 2, 2012
https://blogs.scientificamerican.com/the-curious-wavefunction/misconduct-and-not-error-is-the-source-of-most-retracted-papers/
Why researchers keep citing retracted papers
https://qz.com/583497/researchers-keep-citing-these-retracted-papers/
5/19/2017
Questions & Discussion
“Can you help us understand data better?”
Understanding Data: What do these numbers mean?
Plausible Inferences for Ranges of p Values
We’re often delighted when we achieve a p value of 0.05 in our data
This table is to remind us that this level of statistical significance is only fairly significant
I appreciate that the effort associated with driving p to a lower value is often horrendous, stratospheric or ridiculously expensive
So my point is that we should be cautious about inferring too much from a result at p = 0.05