This document discusses types of data torture and how to avoid forcing confessions from data. It defines two types of data torturing: opportunistic, where associations are found in the data and hypotheses formed to fit them; and Procrustean, where hypotheses are decided on in advance and the data made to fit. Clues to detecting data torture include whether findings came from primary or secondary hypotheses and whether all data groups were analyzed. The document advises asking data questions respectfully rather than torturing it to avoid forced confessions.
Don't Torture Data, Ask Nicely!
1. Digital You Can Trust |
TRUSTED CONF.
CONTACT: Roberta Cardoso
DATE: July 2019
DATA TORTURE
2.
BETA
ROBERTA CARDOSO
TECHNICAL DATA ANALYST
HI, I’M BETA…
● Brazilian
● Data Analyst
● Mother of a sweet 10yo girl
● Caretaker of 3 dogs and 6 cats
● Balancing nature and technology in everyday life
3. DIGITAL MARKETING & ANALYTICS |
If you want reliable confessions, don’t torture your data, ask nicely.
AGENDA
Data Torture
Types of Data Torture
Clues to Data Torture
Are You Forcing
Confessions?
4.
If you torture the data long enough, it will confess to anything.
- Darrell Huff
Source: How to Lie With Statistics, 1954, ISBN 0393310728
6.
Types of Data Torturing: Opportunistic and Procrustean
Source: Mills, 1993, pp. 1196–1199
7.
Opportunistic
Pores over the data until a "significant" association is found, then devises a plausible hypothesis to fit the association. This makes it very hard for readers to tell that the positive association didn’t spring from an a priori hypothesis. When many independent tests are performed, the probability of a correct conclusion drops drastically.
Procrustean
Performed by deciding on the hypothesis to be proved and making the data fit it. Its results are often more believable if one starts with a popular hypothesis. It is also more destructive, because it may produce results that are seen as definitive proof of the hypothesis.
Source: Mills, 1993, pp. 1196–1199
8.
Procrustean data torturing is more difficult to carry out than opportunistic data torturing, because it requires selective reporting, but its results are often more believable.
Source: Mills, 1993, pp. 1196–1199
9.
There is a chance of doing this unintentionally.
10.
Comparing a current value to an average or target value.
Source: Marcey L. Abate - DATA TORTURING AND THE MISUSE OF STATISTICAL TOOLS
11.
Performing trend analysis.
Source: Marcey L. Abate - DATA TORTURING AND THE MISUSE OF STATISTICAL TOOLS
12.
Clues to Data Torture
Source: Marcey L. Abate - DATA TORTURING AND THE MISUSE OF STATISTICAL TOOLS
● Did the reported findings result from testing a primary hypothesis or an a posteriori
hypothesis?
● Does the hypothesis have good supporting data from previous studies?
● Have data been reported for all groups in the study, or were certain groups excluded from the analysis, and if so, why?
● Was the effect of multiple comparisons discussed and statistically managed?
● How many significant results were reported relative to the number of comparisons made?
● Was the research outcome defined before collecting the data?
16. DELIVERED BY EXPERTS
Our global team of expert consultants and practitioners has been hand-selected from thousands of applicants.
17. We’re a global online marketing agency managed from one of the finest beaches on the planet.
GET IN TOUCH
Editor's Notes
Proud resident of the Chapada Diamantina National Park in Brazil
Data Analyst with about 10 years of experience in Digital Marketing Analysis
I’m passionate about Project Management, process design and optimisation for Big Data
The first time I read something similar to this quote was on the website of a Data Science Institute. They were using it as their motto: "We torture data until it confesses".
In the beginning, it made total sense to me, because I'm a Data Analyst, and my job is to get answers from the data.
I Googled: "Data Torture", and found the root of this quote in a book from 1954, where the author picks apart how marketers manipulate statistics and data visualization to trick the public. The book is named "How to Lie With Statistics".
At that point, I was relieved that I hadn't changed my LinkedIn headline from Data Analyst to Data Torturer.
It became evident to me that data torturing is less about answering questions and more about forcing confessions of whatever the torturer wants to prove.
Like other forms of torture, if it’s done skillfully, data torturing won’t leave incriminating evidence.
So, the unfortunate result of torturing data is getting anything but the truth.
In short, data torturing is ethically problematic because neither the reported data nor the explanations or hypotheses the data torturer offers are all that trustworthy.
In 1993, Doctor James Mills published an article in The New England Journal of Medicine, where he refers to two types of Data Torturing:
1) Opportunistic
2) Procrustean
We’re going to briefly cover both now:
Opportunistic torture is performed by running many independent tests, which decreases the probability of a correct conclusion.
For instance:
If the CvR for a current ad and its creative variation differed by 5% or 10%, how would we know whether the difference was due to chance?
By a reasonably arbitrary convention, a result is declared not due to chance if the probability value (p-value) is less than 0.05. This means that when the two ads do not actually differ, there is a 5% chance of wrongly concluding that they do, and a 95% probability of correctly inferring that there is no difference between them.
The problem is that when many independent tests are performed, this probability drops drastically: if we run 20 such tests, the probability that all the conclusions are correct is only about 36% (0.95^20 ≈ 0.36).
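The arithmetic behind that 36% figure can be sketched in a few lines of Python (the ad-testing framing and the test counts are just illustrations, not real data):

```python
# Family-wise error for repeated independent tests at alpha = 0.05.
# The 36% figure from the text is 0.95 ** 20: the probability that
# none of 20 tests on truly identical ads comes out "significant".
alpha = 0.05

for n_tests in (1, 5, 10, 20):
    p_all_correct = (1 - alpha) ** n_tests   # no false positives at all
    p_false_alarm = 1 - p_all_correct        # at least one false positive
    print(f"{n_tests:2d} tests: P(all correct) = {p_all_correct:.2f}, "
          f"P(at least one false positive) = {p_false_alarm:.2f}")
```

One common remedy, in the spirit of the "multiple comparisons" clue later in the deck, is to tighten the per-test threshold, e.g. Bonferroni's alpha / n_tests.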
Procrustean data torturing is about manipulating the data so that they prove the desired hypothesis.
It’s more difficult to carry out than opportunistic data torturing, because it requires selective reporting, but its results are often more believable.
It’s also more destructive, because it may produce results that are seen as definitive proof of the hypothesis.
It can take several forms:
Exposure may be redefined in a way that strengthens the association. For example, a study of a website’s organic traffic attributed a notable uplift in CTR to an SEO improvement while defining the exposure as starting 30 days before the intervention; choosing an inappropriately extended period produced a positive result by including unknown interventions unrelated to the tested optimization.
Study pages whose results don’t support the hypothesis may be intentionally dropped.
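As a toy illustration of that selective-reporting move (the uplift numbers below are invented for this sketch):

```python
# Hypothetical CTR uplifts (percentage points) measured on 10 pages
# after some optimization. Averaged honestly, there is no effect.
uplifts = [0.4, -0.3, 0.1, -0.5, 0.2, -0.1, 0.3, -0.4, 0.0, 0.2]

honest_avg = sum(uplifts) / len(uplifts)

# Procrustean move: silently drop the pages that don't support the
# hypothesis, then report the average of what's left.
supportive = [u for u in uplifts if u > 0]
tortured_avg = sum(supportive) / len(supportive)

print(f"honest average uplift:   {honest_avg:+.2f}")
print(f"tortured average uplift: {tortured_avg:+.2f}")
```

The "tortured" figure looks like a solid win even though the full data show essentially nothing, which is why a reader should ask whether all study groups were reported.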
Can you see how easy it is to slip from one impression to something quite different by taking different approaches to interpreting the data?
Data torturing simply reflects that if you keep coming at the data from different angles, you can get a whole range of answers; there is also a chance you are doing this unintentionally.
A common method for analyzing data is to compare a current value to an average or target value.
This form of data torturing may lead to acting on a perceived difference when none really exists.
Comparisons to averages, specifications, and targets ignore common variability and treat every fluctuation as something special.
Another example of the dangers associated with this type of analysis: the CTR for a particular web page is plotted over time, one monthly value falls well below the overall average, and a “red flag” is raised, ignoring the time progression that could explain the fluctuation.
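The difference between comparing to a bare average and allowing for common-cause variability can be sketched like this (the monthly CTR values are made up):

```python
import statistics

# Hypothetical monthly CTR values (%) for one page over a year.
ctr = [2.1, 2.4, 1.9, 2.2, 2.6, 1.8, 2.3, 2.0, 2.5, 2.2, 1.7, 2.4]

mean = statistics.mean(ctr)

# Naive rule: flag every month below the average. Roughly half the
# months get flagged even though the process is stable.
flagged_vs_mean = [m for m, v in enumerate(ctr, 1) if v < mean]

# Control-chart thinking: flag only points outside mean +/- 3 sigma,
# i.e. beyond ordinary month-to-month variability.
sigma = statistics.stdev(ctr)
lo, hi = mean - 3 * sigma, mean + 3 * sigma
flagged_vs_limits = [m for m, v in enumerate(ctr, 1) if not lo <= v <= hi]

print(f"months flagged vs the average:    {flagged_vs_mean}")
print(f"months flagged vs 3-sigma limits: {flagged_vs_limits}")
```

With these particular numbers, the naive rule flags five ordinary months while the control limits flag none; only a point outside the limits would signal something genuinely special.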
Trend analysis is often misused to make decisions with either limited data points or inadequate knowledge about the process that creates the data.
This practice may result in data torturing by wrongly identifying the type of trend or by leading one to conclude that a trend exists when in fact it does not.
Consider the data points shown in Figure #1. It is difficult, if not impossible, to formulate a meaningful interpretation from only three data points without a broader contextual basis, yet it is all too common for such data to be labeled an “upward trend”.
In Figures #2 and #3, the last three data points in each run chart are the same points as those given in Figure #1.
Making decisions and taking action on perceived trends from limited data points will almost surely result in data torturing by either underreacting or overreacting.
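A minimal sketch of why three points make a poor trend; the "stable process" here is synthetic noise around a constant level, so any apparent trend is illusory:

```python
import random

random.seed(7)
# A hypothetical stable process: 12 readings around 100 with only
# common-cause noise, i.e. no real trend at all.
series = [100 + random.gauss(0, 5) for _ in range(12)]

def slope(ys):
    """Least-squares slope of ys against x = 0, 1, ..., len(ys)-1."""
    n = len(ys)
    mx = (n - 1) / 2
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in enumerate(ys))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den

# A window of only three points can show a steep "trend" that the
# fuller picture does not support.
print(f"slope over the last 3 points: {slope(series[-3:]):+.2f}")
print(f"slope over all 12 points:     {slope(series):+.2f}")
```

Whether the last three points happen to slope up or down depends entirely on the noise; acting on that short-window slope would be exactly the under- or overreaction described above.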
Data torturing can rarely be proved. There are, however, clues that should arouse the reader's suspicion.
In conclusion, here are some of Mills’ recommendations for assessing allegedly statistically significant findings.
These shortcomings make evident the importance of applying statistical thinking even when using basic statistical tools.
As repeatedly shown, failure to consider the processes, variation, and data within the mindset of statistical thinking can result in faulty decisions and actions.
In summary, because statistical thinking requires a focus on the process, the application of the associated concepts will increase the effectiveness of statistical tools and help to prevent data torturing.