Common Analytic Mistakes
Jim Parnitzke
Webinar Series
September, 2014
Introduction
Jim Parnitzke
Business Intelligence and Enterprise Architecture
Advisor, Expert, Trusted Partner, and Publisher
Common Organizational Blunders
Analytic Fundamentals
Common Measurement Errors
Statistical Mistakes
Visualization Faults
Summary
Not Asking the Hard Questions
Focusing on Yesterday's Metrics
Understanding Metrics and Their Methodology
Bottlenecking the Value to the Organization
Overvaluing Data Visualization
The Conference Table Compromise
Compromising Data Ownership
Confusing Insight With Action
Common Analytic Mistakes
Analytics projects driven by nothing more than management
pronouncements that "we need metrics, get us some"
don’t ever end well.
• Analysis for analysis's sake is ridiculous.
• Ask the right questions to learn what data and metrics are
important and can make a difference.
• Know which quantitative measures don't matter and can be
disregarded.
Not Asking the Hard Questions
• Analytics projects can be driven by answering yesterday's questions, not
tomorrow's. It's an easy trap to fall into.
People are comfortable with what they know.
• A backwards-optimized view provides comfort... at a cost of NOT tracking metrics
that help drive business forward, numbers the organization doesn't yet know, or
what's unpredictable and uncomfortable but exactly where focus is needed.
Focusing on Yesterday's Metrics
Historical Reporting: "What happened"
Real Time Reporting: "What are we doing right now"
Modelling: "What if we did this or that"
Predictive Analytics: "What can we expect"
Prescriptive Analytics: "What is the optimal solution"
-- Descriptive --  -- Predictive --  -- Prescriptive --
No easy answer to this. Try to understand how various systems count, resolve that
with your own needs, then accept the compromises you have no control over.
• Example:
Online media and marketing are immature. Systems that measure them are
immature. Don’t assume systems and their outputs are fully baked and results
can be taken at face value.
Reality check: they aren't, and they can't.
A page view in one system isn't a page view in another. Definitions and methods
that define metrics such as page views, visits, unique visitors, and ad deliveries
are different in every system. Without a standard definition for an ad impression
you can be sure there aren't standards for all the peripheral metrics.
Counts can differ between products from the same vendor by as much as 30%.
Misunderstanding Metrics
Many well-intentioned project owners kill the project value because they
bottleneck their own success. Well-meaning professionals, usually the systems'
power users, create processes and procedures that place themselves at the center
of everything regarding the systems: report creation, running ad hoc queries,
report distribution, and troubleshooting.
This often prevents others from taking real advantage of the systems' intelligence.
The only solution is to train others, then step back and let them at it.
Mistakes and misinterpretations will happen, but the benefits of widespread
adoption across the enterprise will always outweigh them.
Bottlenecking Value to the Organization
When it comes to analytics, there's nothing we love more than lots of pretty
pictures and dazzling graphics.
There is nothing wrong with pleasing visualizations in data presentation.
Too many worry about pictures first, data and analysis second. Pictures cloud their
minds and vision, which is exactly why vendors put them there. Graphics are great
at grabbing attention, but not always great at putting data into action.
What looks good in reports should be a means to the end, not the end in itself.
Overvaluing Data Visualization
http://www.wheels.org/monkeywrench/?p=413
One reason analytics projects lose focus is they begin compromised.
Too many follow the conference table consensus approach.
The Conference Table Compromise
Every department gets a seat at the table.
Everyone contributes suggestions.
The final product is a compendium of all requests.
Although this method tends to create lots of good
feelings, it rarely results in the best set of metrics with
which to run the business. You will find yourself tracking
silly pet metrics that someone at the table thought were
important but that are completely irrelevant.
To keep projects focused, decide which metrics are important and stay,
and which are distracting and go.
When budgets are tight and everyone is clamoring for better
analytics, it's understandable that not everyone reads or fully
comprehends the fine print associated with vendor "partnerships."
The nuances of data ownership may seem innocuous, but
there are consequences.
Many vendors will use analytics services to build databases of
anonymous consumer profiles and their behavior for use in ad
targeting when those consumers visit other sites, with no
compensation to the publishers whose sites were harvested.
Be careful with this one…
Compromising Data Ownership
What good are analysis and insight if you can't act on them?
• Almost all analytics systems bill themselves as actionable.
Many claim they're real time. Learn what they really mean.
• Few systems can enable an enterprise to take immediate, tactical steps to
leverage data for value. For most, "actionable" means the system can generate
reports, such as user navigation patterns publishers can mull over in meetings,
then plan the changes needed to improve.
• While that may meet the definition of actionable, it doesn't necessarily mean
real-time or even right-time action. Bottom line: Understand, don't assume.
Confusing Insight With Action
Data Quality and Context
Never Compare Apples to Oranges
Don't Overstate (Alarm) Unnecessarily
Calibrate Your Time Series
Always Make Your Point Clearly (and Colors Matter)
Statistical Significance
Correlation vs. Causation
Improper Use of Averages
There is Such a Thing as Too Little Data!
Common Organizational Blunders
Analytic Fundamentals
Common Measurement Errors
Statistical Mistakes
Visualization Faults
Summary
Common Analytic Mistakes
While a poor font choice can ruin a meeting, a poor interpretation of statistics or
data can kill you (or someone else, for that matter). Proper use of data is a viciously
complicated topic. See the section on Statistical Mistakes for examples.
If your findings would lead to the wrong conclusions, not presenting the data at all
would be a better choice. Here is a simple rule for designers:
Data Quality and Context
Your project isn’t ready if you have spent more time choosing a font or
color scheme than choosing your data.
• Real data is ugly – but there’s no substitute for real data
• Provenance - Critical Questions to Ask
• Who collected it?
• Why was it collected?
• What is its context in a broader subject?
• What is its context in the field of research that created it?
• How was it collected?
• What are its limitations?
• Which mathematical transformations are appropriate?
• Which methods of display are appropriate?
Data Quality and Context
Cleaning and formatting a single data set is hard. What if you're:
• Building a live visualization that will run with many different data sets?
• Short on time to manually clean each data set?
Even then, there is no substitute for real data. Mock or idealized data doesn't help you
plan for data discrepancies, null values, outliers, or other real-world problems.
• Use several random samples of real data if you cannot access an entire data set.
• Invalid and missing data are a guarantee. If your data won't be cleaned before
being graphed, do not clean your sample data either.
• Real data may be so large it overwhelms the visualization you are generating.
Be sure that if you use a sample of data you correctly scale the sample up or
reduce it appropriately before creating a final visualization; a small sketch follows below.
There’s no substitute for real data
Four different segments are being compared,
but they are calibrated incorrectly. On the surface
this is hard to detect.
• The clean part is that there is very
little overlap between Search Traffic and
Referral Traffic. But Mobile is a platform, not a channel.
Never Compare Apples to Oranges
Traffic (conversions in this case) is most likely in both Referrals and Search. It is unclear what to
make of that orange stacked bar. The graph is showing conversions already included in Search and
Referral (double counting) and because you have no idea what it is, it is impossible to know what
action to take.
Would you recommend a higher investment in Mobile based on this graph?
• The same goes for Social Media: it is likely that the Social Media conversions are already included
in Referrals and in Mobile, so the green bar in the graph is useless.
Recommending a massive increase in investment in Social Media would be an equally imprecise conclusion.
What do you think is wrong with this graph?
It artificially inflates the importance of a
change in the metric that might not be all that
important. In this case the data is not
statistically significant, but there is no way we
can know that just from the data. Yet the scale
used for the y-axis implies that something
huge has happened.
Don't Overstate or Alarm Unnecessarily
Try to avoid being so dramatic in your presentation; it causes people to read things into
the performance that most likely are not there. Setting the y-axis at zero may not be
necessary every time, but dramatizing this 1.5-point difference is a waste of everyone's time.
One more thing: label your x-axis. Please. (A small sketch follows below.)
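A minimal matplotlib sketch of the difference; the conversion-rate numbers are invented for illustration, not taken from the slide:

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
rate = [41.0, 41.5, 42.0, 42.5]          # a 1.5-point change, invented for illustration

fig, (ax_alarm, ax_honest) = plt.subplots(1, 2, figsize=(8, 3))

# Truncated axis: the change fills the chart and looks dramatic.
ax_alarm.bar(months, rate)
ax_alarm.set_ylim(40.5, 43)
ax_alarm.set_title("Truncated y-axis (alarming)")
ax_alarm.set_xlabel("Month")

# Zero-based axis: the same data, shown in proportion.
ax_honest.bar(months, rate)
ax_honest.set_ylim(0, 100)
ax_honest.set_title("Zero-based y-axis (honest)")
ax_honest.set_xlabel("Month")

plt.tight_layout()
plt.show()
```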
This chart shows nine months of
performance… by day! The "trend" is
completely useless.
• Looking at individual days over such a
long time period can hide insights and
important changes. It can be nearly
impossible to find anything of value.
• Try switching to the exact same time
period, but by week. Now you can see
some kind of trend, especially towards
the end of the graph (even this simple
insight was hidden before).
Calibrate Your Time Series
See: http://square.github.io/crossfilter for another good example.
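A minimal pandas sketch of the daily-to-weekly switch; daily_visits.csv and its date and visits columns are hypothetical:

```python
import pandas as pd

# Daily data over nine months buries the trend in noise.
daily = (pd.read_csv("daily_visits.csv", parse_dates=["date"])
           .set_index("date")["visits"])

# Re-aggregate the exact same period by week to let the trend surface.
weekly = daily.resample("W").sum()
print(weekly.tail())
```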
What do you think the two colors in this graph
represent? How can only 29 percent of the
organizations have more than one person?
Problem one is that "red" denotes "good" in
this case and "green" represents "bad."
Here's something very, very simple you should
understand: red is bad and green is good.
Always. Period. People instinctively think this
way. So show "good" in green and "bad" in red.
It will communicate your point more clearly and
faster.
Always Make Your Point Clearly
Problem two, much worse, is that it is harder than it should be to understand this data. Take the first
stacked bar above: 71 percent of organizations answered "Yes, more than one person." So what is the
29 percent? If the question is how many people are directly responsible for improving conversion rates,
and 71 percent have more than one person, does the 29 percent mean organizations with exactly one
person, or with no one at all? Unclear (and frustrating).
We all make this mistake. We create a table like the one below, with a "heat map"
highlighting where conversion rates appear good. We declare Organic the winner, with Direct
close behind, then the other two, and we recommend doing more SEO.
Statistical Significance
None of this data may be statistically significant – the fact that the numbers seem so different might
not mean anything. It is entirely possible that it is completely immaterial that Direct is 34% and Email
is 10%, or that Referral is 7%.
We should evaluate the raw numbers to see if the percentages are meaningful at all.
• The Direct row could represent conversions out of 10 visits while all the Referral data
could represent conversions from 1,000,000 visits. Compute statistical significance to
identify which comparison sets we can be confident are different, and in which cases we simply
don't have enough confidence (a small sketch follows below).
Do you see the problem?
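A minimal sketch of that significance check: a pooled two-proportion z-test. The visit and conversion counts below are invented for illustration, not the slide's data:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(conv_a, visits_a, conv_b, visits_b):
    """Pooled two-proportion z-test: are two conversion rates really different?"""
    p_a, p_b = conv_a / visits_a, conv_b / visits_b
    p_pool = (conv_a + conv_b) / (visits_a + visits_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visits_a + 1 / visits_b))
    z = (p_a - p_b) / se
    return z, 2 * norm.sf(abs(z))          # two-sided p-value

# Invented example: Direct converted 3 of 10 visits; Email converted 100 of 1,000 visits.
z, p = two_proportion_ztest(3, 10, 100, 1_000)
print(f"z = {z:.2f}, p = {p:.3f}")   # decide from p, not from how different 30% and 10% look
# For very small counts like 10 visits, prefer an exact test (e.g., Fisher's exact test).
```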
Confusing correlation and causation is
one of the most overlooked problems.
In the Cheese and Employment Status
percentage graph, it is clear that
retired Redditors prefer cheddar
cheese and freelance Redditors prefer
brie.
• This does not mean that once an
individual retires he or she will
develop a sudden affinity for
cheddar.
• Nor does becoming a freelancer
cause one to suddenly prefer brie.
Correlation vs. Causation
Improper use of averages
Averages can be a great way to get a quick
overview of some key areas, but use them wisely.
For example, average order value is a useful
metric. Looking at average order value by month
alone seems enlightening because it shows an
increase over time, which indicates a move in the
right direction.
However, it's more useful to look at average order
value by department by month, because this
shows us where the increase in average order
value is coming from: the women's shoes
department.
If we looked only at average order value by
month, we might spread marketing across all
departments, which is not the most efficient
allocation of resources. (A small sketch follows below.)
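A minimal pandas sketch of that breakdown; orders.csv and the order_date, department, and order_value columns are hypothetical:

```python
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])
orders["month"] = orders["order_date"].dt.to_period("M")

# Overall average order value by month: shows *that* it is rising.
overall = orders.groupby("month")["order_value"].mean()

# Average order value by department by month: shows *where* the rise comes from.
by_dept = (orders.groupby(["month", "department"])["order_value"]
                 .mean()
                 .unstack("department"))

print(overall.tail())
print(by_dept.tail())
```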
• Do other useful things. Look at your search keyword reports.
Do you see people arriving on the keywords you optimized the site for?
Look at the keywords your site shows up for in Google search results.
Are they the ones you were expecting?
• Even better… spend time with competitive intelligence tools like Insights for Search, Ad Planner, and
others to seek clues from your competitors and your industry ecosystem. At this stage you can learn
a lot more from their data than your data…
There is Such a Thing as Too Little Data!
Another "simple" mistake. We get excited about having data, especially if new at
this. We get our tables and charts together and we reporting data and having a
lot of fun.
This is very dangerous. You see there is such a thing as too little data.
You don't want to wait until you've collected millions of rows of data to make
any decision, but the table here is nearly useless. So can you do anything with
data like this?
Drawing Conclusions from Incomplete Information
Assuming a Lab is a Reasonable Substitute
Forgetting the Real Experts are Your Customers
Gaming the System
Sampling Problems
Intelligence is Not Binary
Keeping it Simple - Too Simple
Beware the Long Tail
Simpson's Paradox
Common Organizational Blunders
Analytic Fundamentals
Common Measurement Errors
Statistical Mistakes
Visualization Faults
Summary
Common Analytic Mistakes
Data may not tell the full story. For example, your analytics show visitors spend a relatively high
amount of time on a particular page.
• Is that page great – or is it problematic? Maybe visitors simply love the content.
• Or, maybe they are getting stuck due to a problem with the page.
Your call center statistics show average call time has decreased.
• Is a decrease in average call time good news or bad news?
• When calls end more quickly, costs go down, but have you actually satisfied callers or left them
disgruntled, dissatisfied, and on their way to your competition?
Never draw conclusions from any analysis that does not tell the whole story.
Use bar charts to visualize relative sizes. If you see one bar that is twice as long as the other bar, you
expect the underlying quantity to be twice as big. This relative sizing fails if you do not start your bar
chart axis at 0.
Rule of thumb: if you want to illustrate a small change, use a line chart rather than starting your
bar chart's y-axis anywhere other than 0.
Conclusions from Incomplete Information
The chart on the left compares Redditors
who like dogs vs. Redditors who prefer cats.
With the y-axis starting at 9,000, it looks like
dog lovers outnumber cat lovers by three
times. However, the graph on the right is a
more accurate representation of the data.
There are not quite twice as many Redditors
who prefer dogs to cats.
A limitation of bar charts that start at 0 is
they do not show small percent
differences. If you need to change the start
of your axis in order to highlight small
changes, switch to a line chart.
Conclusions from Incomplete Information
Usability groups and panels are certainly useful and have their place; the problem
is the sample sizes are small and the testing takes place in a controlled
environment. You bring people into a lab and tell them what you want them to do.
• Does that small group of eight participants represent your broader audience?
• Does measuring and observing them when they do what we tell them to do
provide the same results as real users who do what they want to do?
Observation is helpful, but applying science to the voice of the customer and
measuring the customer experience through the lens of customer satisfaction is a
better way to achieve successful results.
Assuming a Lab is a Reasonable Substitute
Experts, like usability groups, have their place.
But who knows customer intentions, needs, and attitudes better than actual
customers?
When you really want to know, go to the source.
It takes more time and work, but the results are much more valuable.
Experts and consultants certainly have their place, but their advice and
recommendations must be driven by customer needs as much as, if not more than, by
organizational needs.
The Real Experts are Your Customers
Many feedback and measurement systems create bias and inaccuracy. How? Ask the wrong people,
bias their decisions, or give them incentives for participation. Measuring correctly means creating as
little measurement bias as possible while generating as little measurement noise as possible.
• Avoid incenting people to complete surveys, especially when there is no need.
Never ask for personal data; some will decline to participate if only for privacy concerns.
• Never measure with the intent to prove a point. We may, unintentionally, create customer
measurements to prove our opinions are correct or support our theories, but to what end?
• Customer measurements must measure from the customers’ perspective and through the
customers’ eyes, not through a lens of preconceived views.
Gaming the System
Sampling works well when sampling is done correctly. Sample selection and
sample size are critical to creating a:
• credible,
• reliable,
• accurate,
• precise, and predictive methodology.
Sampling is a science in and of itself. You need samples representative of the larger
population that are randomly selected.
See the section on Statistical Mistakes for more on this.
Sampling Problems
Taking a binary approach to measuring satisfaction – in effect, asking whether a
customer is or is not satisfied – leads to simplistic and inaccurate measurement.
Intelligence is not binary.
• People are not just smart or stupid.
• People are not just tall or short.
• Customers are not just satisfied or dissatisfied.
“Yes” and “no” do not accurately explain or define levels or nuances of customer
satisfaction. The degree of satisfaction with the experience is what determines the
customer’s level of loyalty and positive word of mouth.
Claiming 97% of your customers are satisfied certainly makes for a catchy marketing
slogan but is far from a metric you can use to manage your business forward.
If you cannot trust and use the results, why do the research?
Intelligence is not binary
The “keep it simple” approach does not work for measuring customer
satisfaction (or for measuring anything regarding customer attitudes
and behaviors.)
• Customers are complex; they make decisions based on a number of
criteria, most rational, some less so. Asking three or four questions
does not create a usable metric or help to develop actionable
intelligence.
Measuring customer satisfaction by itself will not provide the best
view; use a complete satisfaction measurement system, including
future behaviors and predictive metrics.
Many will take this simple approach and make major strategic decisions
based on a limited and therefore flawed approach to measurement.
Great managers do not make decisions based on hunches or limited
data; “directionally accurate” is simply not good enough.
Keeping it Simple - Too Simple
http://experiencematters.wordpress.com/tag/lowes/page/2/
In statistics, a long tail of some
distributions of numbers is the portion
of the distribution having a large
number of occurrences far from the
"head" or central part of the
distribution.
A probability distribution is said to
have a long tail if a larger share of the
population rests within its tail than
would under a normal distribution.
Beware the long tail
A long-tail distribution will arise when many values are unusually far from the
mean, which increases the magnitude of the skewness of the distribution.
[Chart: Top 10,000 Popular Keywords]
The term long tail has gained popularity in describing the retailing strategy of
selling a large number of unique items with relatively small quantities sold of each
in addition to selling fewer popular items in large quantities.
The distribution and inventory costs of businesses successfully applying this
strategy allow them to realize significant profit out of selling small volumes of hard-
to-find items to many customers instead of only selling large volumes of a reduced
number of popular items.
The total sales of this large number of "non-hit items" is called "the long tail".
• See also:
• Black swan theory
• Kolmogorov's zero–one law, which concerns tail events
• Mass customization
• Micropublishing
• Swarm intelligence
Source: http://en.wikipedia.org/wiki/Long_tail
Beware the long tail (cont’d)
Simpson's paradox, or the Yule–Simpson effect, is a paradox where a trend that
appears in different groups of data disappears when these groups are combined,
and the reverse trend appears for the aggregate data.
Encountered in social-science and medical-science statistics, this effect is
confounding when frequency data are unduly given causal interpretations.
Using professional baseball as an example, it is possible for one player to hit for a
higher batting average than another player during a given year, and to do so again
during the next year, but to have a lower batting average when the two years are
combined.
Simpson's paradox
Player           1995 Avg.          1996 Avg.           Combined Avg.
Derek Jeter      12/48    .250      183/582   .314      195/630   .310
David Justice    104/411  .253      45/140    .321      149/551   .270
This phenomenon occurs when there are large differences in the number of at-bats
between the years. The same situation applies when calculating batting averages for
the first half of the baseball season and for the second half, and then combining
all of the data for the season's batting average.
Simpson's paradox (cont’d)
Player           1995 Avg.          1996 Avg.           Combined Avg.
Derek Jeter      12/48    .250      183/582   .314      195/630   .310
David Justice    104/411  .253      45/140    .321      149/551   .270
If weighting is used this phenomenon disappears. The table below has been
normalized for the largest totals so the same things are compared.
Player           1995 Avg.                1996 Avg.                Combined Avg.
Derek Jeter      12/48 × 411    .250      183/582 × 582   .314     285.75/993   .288
David Justice    104/411 × 411  .253      45/140 × 582    .321     291/993      .293
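A minimal Python sketch that reproduces the reversal from the table above; the hit and at-bat figures are the ones quoted on the slide:

```python
def avg(hits, at_bats):
    return hits / at_bats

players = {
    "Derek Jeter":   {"1995": (12, 48),   "1996": (183, 582)},
    "David Justice": {"1995": (104, 411), "1996": (45, 140)},
}

for name, years in players.items():
    yearly = {y: round(avg(h, ab), 3) for y, (h, ab) in years.items()}
    hits = sum(h for h, _ in years.values())
    at_bats = sum(ab for _, ab in years.values())
    print(name, yearly, "combined:", round(avg(hits, at_bats), 3))

# Justice has the higher average in both individual years, yet Jeter has the higher
# combined average, because the at-bat counts (the weights) differ so much by year.
```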
Expecting Too Much Certainty
Misunderstanding Probability
Mistakes in Thinking About Causation
Problematical Choice of Measure
Errors in Sampling
Over-interpretation
Mistakes Involving Limitations of Hypothesis Tests/Confidence Intervals
Using an Inappropriate Model or Research Design
Common Organizational Blunders
Analytic Fundamentals
Common Measurement Errors
Statistical Mistakes
Visualization Faults
Summary
Common Analytic Mistakes
One consequence of not taking uncertainty seriously enough is that we often write
results in terms that misleadingly suggest certainty.
For example, some might conclude from a study that a hypothesis is true or has
been proved, when it would be more correct to say that the evidence supports the
hypothesis or is consistent with the hypothesis.
Another mistake is misinterpreting results of statistical analyses in a deterministic
rather than probabilistic (also called stochastic) manner.
Expecting too much certainty
" ... as far as the propositions of mathematics refer to reality, they are not certain; and as far as they are
certain, they do not refer to reality."
Albert Einstein , Geometry and Experience,
Lecture before the Prussian Academy of Sciences, January 27, 1921
Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
There are four perspectives on probability that are commonly used:
• Classical,
• Empirical (or Frequentist),
• Subjective, and
• Axiomatic.
Using one perspective when another is intended can lead to serious errors.
Common misunderstanding: If there are only two possible outcomes, and you don't
know which is true, the probability of each of these outcomes is 1/2.
In fact, probabilities in such "binary outcome" situations could be anything from 0
to 1. For example, if the outcomes of interest are "has cancer" and "does not have
cancer," the probabilities of having cancer are (in most cases) much less than 1/2.
The empirical (frequentist) perspective allows us to estimate such probabilities.
Misunderstanding probability
Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
Confusing correlation with causation.
Example:
Students' shoe sizes and scores on a standard reading exam are correlated, but saying that larger
shoe size causes higher reading scores is as absurd as saying that high reading scores cause larger shoe
size. There is a clear lurking variable, namely age: as a child gets older, both shoe size and reading
ability increase. Do not interpret causality deterministically when the evidence is statistical.
(A small simulation sketch follows below.)
Mistakes in thinking about causation
Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
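A minimal simulation sketch of that lurking-variable effect, with invented coefficients: age drives both shoe size and reading score, so the two correlate even though neither causes the other:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

age = rng.uniform(6, 12, n)                       # the lurking variable
shoe_size = 0.8 * age + rng.normal(0, 0.5, n)     # driven by age (invented coefficients)
reading = 10.0 * age + rng.normal(0, 5.0, n)      # also driven by age

print("corr(shoe size, reading):",
      round(np.corrcoef(shoe_size, reading)[0, 1], 2))       # strongly positive

# Hold age roughly fixed and the apparent relationship largely disappears.
band = (age > 8.9) & (age < 9.1)
print("corr within a narrow age band:",
      round(np.corrcoef(shoe_size[band], reading[band])[0, 1], 2))
```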
In most research, one or more outcome variables are measured. Statistical analysis
is done on the outcome measures, and conclusions are drawn from the statistical
analysis. The analysis itself involves a choice of measure, called a summary statistic.
Misleading results occur when inadequate attention is paid to the choice of either
outcome variables or summary statistics.
• Example: What is a good outcome variable for deciding whether cancer
treatment in a country has been improving?
A first thought might be "number of deaths in the country from cancer in
one year." But number of deaths might increase simply because the
population is increasing. Or it might go down if cancer incidence is
decreasing. "Percent of the population that dies of cancer in one year"
would take care of the first problem, but not the second.
In this case rate is a better measure than a count.
Problematical choice of measure
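A minimal arithmetic sketch of why the rate is the better measure here; the population and death counts are invented:

```python
# Invented numbers: the population grows while the cancer mortality risk actually falls.
years = {
    2000: {"population": 10_000_000, "cancer_deaths": 20_000},
    2010: {"population": 13_000_000, "cancer_deaths": 22_100},
}

for year, d in years.items():
    rate = d["cancer_deaths"] / d["population"] * 100_000     # deaths per 100,000 people
    print(f"{year}: {d['cancer_deaths']:,} deaths, {rate:.0f} per 100,000")

# The raw count rises (20,000 -> 22,100), which looks like things are getting worse,
# but the rate falls (200 -> 170 per 100,000): outcomes actually improved.
```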
Believing that a "random sample will be representative of the population."
In fact, this statement is false – a random sample might, by chance, turn out to be
anything but representative. For example, it is possible that if you toss a coin ten
times, all the tosses will come up heads. (A small simulation sketch follows below.)
A slightly better, but only partly true, explanation: "Random sampling eliminates bias
by giving all individuals an equal chance to be chosen."
The really important reason why random sampling matters is this:
the mathematical theorems which justify most statistical procedures apply only to
random samples.
Errors in sampling
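A minimal simulation sketch of the coin-toss point: a perfectly random sample can still be wildly unrepresentative; it just doesn't happen often:

```python
import numpy as np

rng = np.random.default_rng(7)
n_experiments, n_tosses = 100_000, 10

# Each row is one random "sample" of 10 fair coin tosses (1 = heads).
tosses = rng.integers(0, 2, size=(n_experiments, n_tosses))
all_heads = (tosses.sum(axis=1) == n_tosses).mean()

print(f"fraction of samples that came up all heads: {all_heads:.4f}")
print(f"theoretical probability: {0.5 ** n_tosses:.4f}")   # 1/1024, about 0.001
```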
• Extrapolation to a larger population than the one studied
Example: running a marketing experiment with
undergraduates enrolled in marketing classes and
drawing a conclusion about people in general.
• Extrapolation beyond the range of data
Similar to extrapolating to a larger population, but
concerns the values of the variables rather than the
individuals.
• Ignoring Ecological Validity
Involves the setting (i.e., the "ecology") rather than the
individuals studied, or it may involve extrapolation to a
population having characteristics very different from the
population that is relevant for application.
Over-interpretation
• Using overly strong language in stating results
Statistical procedures do not prove results. They only give us information on
whether or not the data support or are consistent with a particular conclusion.
There is always uncertainty involved. Acknowledge this uncertainty.
• Considering statistical significance but not practical significance
Example: Suppose that a well-designed, well-carried out, and carefully analyzed
study shows that there is a statistically significant difference in life span between
people engaging in a certain exercise regime at least five hours a week for at least
two years and those not following the exercise regime.
If the difference in average life span between the two groups is three days...
Over-interpretation (cont’d)
Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
Type I Error
• Rejecting the null hypothesis when it is in fact true is known as a Type I error.
Many people decide, before doing a hypothesis test, on a maximum p-value for
which they will reject the null hypothesis. This value is often denoted α (alpha)
and is the significance level.
Type II Error
• Not rejecting the null hypothesis when in fact the alternate hypothesis is true is
called a Type II error.
An analogy helpful in understanding the two types of error is to consider a
defendant in a trial. The null hypothesis is "defendant is not guilty;" the alternate is
"defendant is guilty."
• Type I error would correspond to convicting an innocent person;
• Type II error would correspond to setting a guilty person free.
Test/Hypothesis confidence intervals
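A minimal simulation sketch of the α idea: when the null hypothesis is true and we reject whenever p < 0.05, we commit a Type I error about 5% of the time. The data are simulated, not from the deck:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha, n_trials, false_rejections = 0.05, 10_000, 0

for _ in range(n_trials):
    # Both samples come from the SAME distribution, so the null hypothesis is true.
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p_value = ttest_ind(a, b)
    if p_value < alpha:
        false_rejections += 1          # a Type I error: rejecting a true null

print(f"Type I error rate ≈ {false_rejections / n_trials:.3f} (should be close to {alpha})")
```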
Two drugs are known to be equally effective for a certain condition.
• Drug 1 has been used for decades with no reports of serious side effects.
• Drug 2 may cause serious side effects.
The null hypothesis is "the incidence of the side effect in both drugs is the same";
the alternate hypothesis is "the incidence of the side effect in Drug 2 is greater
than that in Drug 1."
Falsely rejecting the null hypothesis when it is in fact true (a Type I error) would have
no great consequences for the consumer. A Type II error (failing to reject the null
hypothesis when in fact the alternate is true), however, would mean deciding that
Drug 2 is no more harmful than Drug 1 when it is in fact more harmful, and could have
serious consequences.
Setting a large significance level is appropriate in this case.
Test/Hypothesis confidence intervals
Each inference technique (hypothesis test or confidence interval) you select has
model assumptions. Different techniques have very different model assumptions.
The validity of the technique depends on whether the model assumptions fit the
context of the data being analyzed.
• Common Mistakes Involving Model Assumptions
• Using a two-sample test comparing means when cases are paired
• Comparisons of treatments applied to people, animals, etc.
(Intent to Treat; Comparisons involving Drop-outs)
• Fixed vs Random Factors
• Analyzing Data without Regard to How the Data was Collected
• Dividing a Continuous Variable into Categories ("Chopped Data", Cohorts)
• Pseudo-replication
• Mistakes in Regression
• Dealing with Missing Data
Inappropriate model or research design
Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
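A minimal sketch of the first item in that list: a two-sample test wrongly applied to paired (before/after) measurements versus the paired test that matches the design. All numbers are simulated:

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(1)
n = 40

# Paired design: the same subjects measured before and after a treatment.
before = rng.normal(loc=100, scale=15, size=n)
after = before + rng.normal(loc=2, scale=3, size=n)    # small, consistent improvement

_, p_unpaired = ttest_ind(before, after)   # wrong model: ignores the pairing
_, p_paired = ttest_rel(before, after)     # right model: tests per-subject differences

print(f"two-sample (unpaired) p = {p_unpaired:.3f}")
print(f"paired test           p = {p_paired:.3g}")
# The unpaired test drowns the small effect in between-subject spread; the paired test
# sees it clearly. The model assumption changes the conclusion.
```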
Bad Math, Bad Geography
Misrepresenting Data
Serving the Presentation Without the Data
Pie Charts – Why?
Using Non-Solid Lines in a Line Chart
Bar Charts with an Erroneous Scale, Arranging Data Non-Intuitively
Obscuring Your Data, Making the Reader Do More Work
Misrepresenting Data Using Different Colors on a Heat Map
Making it Hard to Compare Data, Showing Too Much Detail
Not Explaining the Interactivity
Keep It Simple
Common Organizational Blunders
Analytic Fundamentals
Common Measurement Errors
Statistical Mistakes
Visualization Faults
Summary
Common Analytic Mistakes
Visualization is a tool to aid analysis, not a substitute for analytical skill.
It is not a substitute for statistics:
Really understanding your data generally requires a combination of
• analytical skills,
• domain expertise, and
• effort.
Strategy:
• Be careful about promising real insight.
Work with a statistician or a domain expert if you need to offer reliable conclusions.
• Small design decisions – the color palette you use, or how you represent a particular variable –
can skew the conclusions a visualization suggests.
• If using visualizations for analysis, try a variety of options rather than relying on a single view.
Visualization is not analysis
The infographic seems informative enough.
The self-perceptions of "Baby Boomers" are an
interesting data set to visualize. The problem,
however, is that this graphic represents 243% of
responses.
This is not necessarily indicative of faulty data;
it is a poor representation of the data.
Display the phrases individually, with size
determined by their percentages, so the phrases
can be compared to each other more easily. Removing
the percentages from a single body also clarifies that
the percentages aren't mutually exclusive.
Bad Math
Credibility? Gone in an instant.
Bad Geography - Just Plain Dumb
Presented without comment… speechless. How does this happen?
And don't do something like this… ever.
Make sure all representations are accurate. For
example, bubbles should be scaled according to
area, not diameter. (A small sketch follows below.)
Misrepresenting Data
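A minimal matplotlib sketch of area-true bubble scaling; the three values are invented:

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.array([10, 20, 40])                  # invented quantities to encode as bubbles
x, y = np.arange(len(values)), np.zeros(len(values))

fig, (ax_wrong, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Wrong: radius proportional to the value, so the area grows with the square of the
# value and a 4x quantity looks 16x bigger.
ax_wrong.scatter(x, y, s=values ** 2)            # matplotlib's s is marker area (pt^2)
ax_wrong.set_title("Scaled by radius (misleading)")

# Right: area proportional to the value, so visual size matches the quantity.
ax_right.scatter(x, y, s=values * 40)
ax_right.set_title("Scaled by area (accurate)")

plt.show()
```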
The graph is clearly trying to make the point that
Obamacare enrollment numbers for March 27 are well
below the March 31 goal. Take a closer look.
How come the first bar is less than half the size of the
second one? This is a common problem.
Depending on the scale (there isn't one in this
example), the comparison among data points can look
very deceiving.
Misrepresenting Data (Why?)
• If you’re working with data points that are really far or really close together, be sure to
pay attention to the scale, so that data is accurately portrayed.
• If there is no scale, compare different slices, tiles, bubbles, or bars against each other.
Do the numbers match up to these?
Which comes first: the presentation or the data? In an effort to make a more
“interesting” or “cool” design, don’t allow the presentation layer of a visualization
to become more important than the data itself.
In this example a considerable amount of work went into it and there are parts that
are informative, like the summary counters at the top left. However, without a scale
or axis, the time series on the bottom right is meaningless and the 3D chart in the
center is even more opaque. Tooltips (pop ups) would help, if they were there.
Serving the Presentation Without the Data
They are useful on rare occasions.
But most of the time they actually do not communicate anything of value.
Pie Charts – Why?
Beyond the obvious point
made by the line graph in
the background (we are
storing more data now than
we used to), this graph
seems to tell us “…don’t
know if any of this matters,
so we’re going to print
everything.”
Sometimes it's just better to
use a table... really.
A common mistake with pie charts is to divide them into
percentages that simply do not add up.
The basic rule of a pie chart is that the sum of all percentages
included should be 100%.
• In this example, the percentages fall short of 100%, and the
segment sizes do not match their values. This happens due to
rounding errors, or when non-mutually-exclusive categories
are plotted on the same chart. Unless the included categories are
mutually exclusive, their percentages cannot be plotted as
separate segments of the same chart. (A small validation sketch follows below.)
Pie Charts - Error in Chart Percentages
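A minimal validation sketch for that rule, using invented segment values, run before any chart is drawn:

```python
segments = {"Product A": 38.0, "Product B": 27.0, "Product C": 22.0}   # invented shares

total = sum(segments.values())
if abs(total - 100.0) > 0.5:        # allow a small rounding tolerance
    raise ValueError(
        f"Pie segments sum to {total:.1f}%, not 100% - the categories overlap or "
        "something is missing, so this data should not be drawn as a single pie."
    )
```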
Here is another gem…
Always ask yourself when considering a potential design:
Why is this better than a bar chart? If you’re visualizing a single quantitative
measure over a single categorical dimension, there is rarely a better option.
Pie Charts – Why?
• Line charts are preferred when using time-based data
• Scatter plots are best for exploring correlations between two linear measures.
• Bubble charts support more data points with a wider range of values
• Tree maps support hierarchical categories
If you have to use a pie chart (a small sketch follows after this list):
• Don’t include more than five segments
• Place the largest section at 12 o’clock, going clockwise.
• Place the second largest section at 12 o’clock, going counterclockwise.
• The remaining sections can be placed below, continuing counterclockwise.
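A minimal matplotlib sketch of that ordering rule (largest segment clockwise from 12 o'clock, the rest counterclockwise); the five values are invented:

```python
import matplotlib.pyplot as plt

data = {"A": 45, "B": 25, "C": 15, "D": 10, "E": 5}    # invented, five segments at most

# Sort descending, then move the largest segment to the end of the draw order.
items = sorted(data.items(), key=lambda kv: kv[1], reverse=True)
ordered = items[1:] + items[:1]
labels, values = zip(*ordered)

# With counterclock=True and startangle=90, wedges are drawn counterclockwise from
# 12 o'clock: the second-largest starts at 12 going counterclockwise, and the largest
# ends up occupying the arc clockwise from 12 o'clock.
plt.pie(values, labels=labels, startangle=90, counterclock=True, autopct="%1.0f%%")
plt.show()
```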
Comparison is a valuable way to
showcase differences, but it's useless
if your viewer can’t easily compare.
Make sure all data is presented in a
way that allows the reader to
compare data side-by-side.
Making it Hard to Compare Data
Is this clear to you?
• Using Non-Solid Lines in a Line Chart
Dashed and dotted lines can be distracting. Instead, use a solid line and colors that are easy to
distinguish from each other.
• Making the Reader Do More Work
Make it as easy as possible to understand data by aiding the reader with graphic elements. For
example, add a trend line to a scatter plot to highlight trends.
• Obscuring Your Data
Make sure no data is lost or obstructed by design. For example, use transparency in a standard
area chart to make sure the viewer can see all of the data.
• Using Different Colors on a Heat Map
Some colors stand out more than others, giving unnecessary weight to that data. Instead, use a
single color with varying shades or a spectrum between two analogous colors to show intensity.
Other Common Mistakes
• Arranging the Data Non-Intuitively
Content should be presented in a logical and intuitive way to guide
readers through the data. Order categories alphabetically, sequentially,
or by value.
• Not Explaining Interactivity
Enabling users to interact with a visualization makes it more
engaging, but if you don't tell them how to use that interactivity you risk
limiting them to the initial view. How you label the interactivity is just
as important as providing it in the first place.
• Informing the user at the top of the visualization is good practice.
• Call out the interaction on or near the tools that use it.
• Use a common design convention, such as underlining words to indicate a hyperlink.
See http://www.visualizing.org/full-screen/39118
Other Common Mistakes (cont’d)
Hard to resist… the temptation
with a dataset that has numerous
usable categorical and numerical
fields is to show everything at
once, and allow users to drill
down to the finest level of detail.
Such a visualization is superfluous;
the user could simply look at the
dataset itself if they wanted to
see the finest level of detail.
Show enough detail to tell a story,
but not so much that the story is
convoluted and hidden.
Showing Too Much Detail
We all want specific, relevant answers.
The closer you can get to providing
exactly what is wanted, the less effort
we expend looking for answers.
Irrelevant data makes finding the
relevant information more difficult;
irrelevant data is just noise.
• Showing several closely related graphs
can be a nice compromise between
showing too much in one graph and
not showing enough overall.
• A few clean, clear graphs are better
than a single complicated view.
• Try to represent your data in the
simplest way possible to avoid this.
Showing Too Much Detail (cont’d)
Data visualization is about simplicity. Embellished or artistic representations may look appealing, but
they usually distract from the actual data rather than adding clarity. Look at the example below.
Keep it Simple
• Why is the first image blue and the rest are red?
• The number in the second image is against the
paintbrush and not against the head while in all
other columns it is against the head. Is this
meaningful?
• We might just appreciate different figures and
think about the real-life characters represented by
them and move on without understanding the
data. Really…
• It’s important that visual representation of data is
free of the pitfalls that make data representation
ambiguous and irrelevant.
Summary
Common Analytic Mistakes
Common Organizational Blunders
Analytic Fundamentals
Common Measurement Errors
Statistical Mistakes
Visualization Faults
Thank You!
Thank You…
Mr. Parnitzke is a hands-on technology executive, trusted partner, advisor, software publisher, and widely
recognized database management and enterprise architecture thought leader. Over his career he has served
in executive, technical, publisher (commercial software), and practice management roles across a wide range
of industries. Now a highly sought-after technology management advisor and hands-on practitioner, his
customers include many of the Fortune 500 as well as emerging businesses, where he is known for taking
complex challenges and solving them across all levels of the customer's organization, delivering distinctive
value and lasting relationships.
Contact:
j.parnitzke@comcast.net
Blogs:
Applied Enterprise Architecture (pragmaticarchitect.wordpress.com)
Essential Analytics (essentialanalytics.wordpress.com)
The Corner Office (cornerofficeguy.wordpress.com)
Data management professional (jparnitzke.wordpress.com)
The program office (theprogramoffice.wordpress.com)
Data Science Page (http://www.datasciencecentral.com/profile/JamesParnitzke)
More Related Content

What's hot

Be Data Informed Without Being a Data Scientist
Be Data Informed Without Being a Data ScientistBe Data Informed Without Being a Data Scientist
Be Data Informed Without Being a Data ScientistPamela Pavliscak
 
What People Analytics Can’t Capture
What People Analytics Can’t CaptureWhat People Analytics Can’t Capture
What People Analytics Can’t CaptureSoumyadeep Sengupta
 
Acceptance, Accessible, Actionable and Auditable
Acceptance, Accessible, Actionable and AuditableAcceptance, Accessible, Actionable and Auditable
Acceptance, Accessible, Actionable and AuditableAlban Gérôme
 
Competing on analytics
Competing on analyticsCompeting on analytics
Competing on analyticsGreg Seltzer
 
What people analytics can't capture
What people analytics can't captureWhat people analytics can't capture
What people analytics can't captureSushant Kumar
 
Business Intelligence Insights: How to Present Visual Data your Team Understands
Business Intelligence Insights: How to Present Visual Data your Team UnderstandsBusiness Intelligence Insights: How to Present Visual Data your Team Understands
Business Intelligence Insights: How to Present Visual Data your Team UnderstandsSanderson Group
 
Setting up Data Science for Success: The Data Layer
Setting up Data Science for Success: The Data LayerSetting up Data Science for Success: The Data Layer
Setting up Data Science for Success: The Data LayerCarl Anderson
 
Data Analytics in Azure Cloud
Data Analytics in Azure CloudData Analytics in Azure Cloud
Data Analytics in Azure CloudMicrosoft Canada
 
How To Get Into Data Science & Analytics - feliperego.com.au
How To Get Into Data Science & Analytics - feliperego.com.auHow To Get Into Data Science & Analytics - feliperego.com.au
How To Get Into Data Science & Analytics - feliperego.com.auFelipe Rego
 
Simplify your analytics strategy
Simplify your analytics strategySimplify your analytics strategy
Simplify your analytics strategyAayushi Shanker
 
Storytelling for analytics | Naveen Gattu | CDAO Apex 2020
Storytelling for analytics | Naveen Gattu | CDAO Apex 2020Storytelling for analytics | Naveen Gattu | CDAO Apex 2020
Storytelling for analytics | Naveen Gattu | CDAO Apex 2020Gramener
 
Telling A Story With Data
Telling A Story With DataTelling A Story With Data
Telling A Story With DataMashMetrics
 
Data monetization
Data monetizationData monetization
Data monetizationGramener
 
A Leader’s Guide to Data Analytics
A Leader’s Guide to Data AnalyticsA Leader’s Guide to Data Analytics
A Leader’s Guide to Data AnalyticsHarshit Sahni
 
KETL Quick guide to data analytics
KETL Quick guide to data analytics KETL Quick guide to data analytics
KETL Quick guide to data analytics KETL Limited
 

What's hot (20)

Be Data Informed Without Being a Data Scientist
Be Data Informed Without Being a Data ScientistBe Data Informed Without Being a Data Scientist
Be Data Informed Without Being a Data Scientist
 
What People Analytics Can’t Capture
What People Analytics Can’t CaptureWhat People Analytics Can’t Capture
What People Analytics Can’t Capture
 
Acceptance, Accessible, Actionable and Auditable
Acceptance, Accessible, Actionable and AuditableAcceptance, Accessible, Actionable and Auditable
Acceptance, Accessible, Actionable and Auditable
 
Competing on analytics
Competing on analyticsCompeting on analytics
Competing on analytics
 
What people analytics can't capture
What people analytics can't captureWhat people analytics can't capture
What people analytics can't capture
 
What is business analytics
What is business analyticsWhat is business analytics
What is business analytics
 
Business Intelligence Insights: How to Present Visual Data your Team Understands
Business Intelligence Insights: How to Present Visual Data your Team UnderstandsBusiness Intelligence Insights: How to Present Visual Data your Team Understands
Business Intelligence Insights: How to Present Visual Data your Team Understands
 
1305 track 3 siegel
1305 track 3 siegel1305 track 3 siegel
1305 track 3 siegel
 
Agile Analytics
Agile AnalyticsAgile Analytics
Agile Analytics
 
Setting up Data Science for Success: The Data Layer
Setting up Data Science for Success: The Data LayerSetting up Data Science for Success: The Data Layer
Setting up Data Science for Success: The Data Layer
 
Data Analytics in Azure Cloud
Data Analytics in Azure CloudData Analytics in Azure Cloud
Data Analytics in Azure Cloud
 
How To Get Into Data Science & Analytics - feliperego.com.au
How To Get Into Data Science & Analytics - feliperego.com.auHow To Get Into Data Science & Analytics - feliperego.com.au
How To Get Into Data Science & Analytics - feliperego.com.au
 
Simplify your analytics strategy
Simplify your analytics strategySimplify your analytics strategy
Simplify your analytics strategy
 
Data Strategy
Data StrategyData Strategy
Data Strategy
 
Storytelling for analytics | Naveen Gattu | CDAO Apex 2020
Storytelling for analytics | Naveen Gattu | CDAO Apex 2020Storytelling for analytics | Naveen Gattu | CDAO Apex 2020
Storytelling for analytics | Naveen Gattu | CDAO Apex 2020
 
Telling A Story With Data
Telling A Story With DataTelling A Story With Data
Telling A Story With Data
 
1115 track2 siegel
1115 track2 siegel1115 track2 siegel
1115 track2 siegel
 
Data monetization
Data monetizationData monetization
Data monetization
 
A Leader’s Guide to Data Analytics
A Leader’s Guide to Data AnalyticsA Leader’s Guide to Data Analytics
A Leader’s Guide to Data Analytics
 
KETL Quick guide to data analytics
KETL Quick guide to data analytics KETL Quick guide to data analytics
KETL Quick guide to data analytics
 

Viewers also liked

ΤΑ ΤΡΙΤΑΚΙΑ ΤΗΣ ΑΣΣΗΡΟΥ ΠΑΡΟΥΣΙΑΖΟΥΝ ΤΙΣ ΣΥΛΛΟΓΕΣ ΤΟΥΣ
ΤΑ ΤΡΙΤΑΚΙΑ ΤΗΣ ΑΣΣΗΡΟΥ ΠΑΡΟΥΣΙΑΖΟΥΝ ΤΙΣ ΣΥΛΛΟΓΕΣ ΤΟΥΣΤΑ ΤΡΙΤΑΚΙΑ ΤΗΣ ΑΣΣΗΡΟΥ ΠΑΡΟΥΣΙΑΖΟΥΝ ΤΙΣ ΣΥΛΛΟΓΕΣ ΤΟΥΣ
ΤΑ ΤΡΙΤΑΚΙΑ ΤΗΣ ΑΣΣΗΡΟΥ ΠΑΡΟΥΣΙΑΖΟΥΝ ΤΙΣ ΣΥΛΛΟΓΕΣ ΤΟΥΣΓΙΑΝΝΗΣ ΚΑΡΑΚΩΣΤΑΣ
 
Como mejorar el gasto energético del instituto
Como mejorar el gasto energético del institutoComo mejorar el gasto energético del instituto
Como mejorar el gasto energético del institutosheld
 
Sierra Club Petition to Force Dept. of Energy to Review Cove Point LNG Export...
Sierra Club Petition to Force Dept. of Energy to Review Cove Point LNG Export...Sierra Club Petition to Force Dept. of Energy to Review Cove Point LNG Export...
Sierra Club Petition to Force Dept. of Energy to Review Cove Point LNG Export...Marcellus Drilling News
 
Internship report water analysis
Internship report water analysisInternship report water analysis
Internship report water analysisArslan Arif
 
UN DIA DIOS HABLO CONMIGO
UN DIA DIOS HABLO CONMIGOUN DIA DIOS HABLO CONMIGO
UN DIA DIOS HABLO CONMIGOMariaam Salazar
 
Lean Manufacturing: Improve Productivity, Quality, and Lead-Time
Lean Manufacturing: Improve Productivity, Quality, and Lead-TimeLean Manufacturing: Improve Productivity, Quality, and Lead-Time
Lean Manufacturing: Improve Productivity, Quality, and Lead-TimeDarren Dolcemascolo
 

Viewers also liked (15)

FELI LE PARIEU1 -extrait
FELI LE PARIEU1 -extraitFELI LE PARIEU1 -extrait
FELI LE PARIEU1 -extrait
 
amongus_70x100
amongus_70x100amongus_70x100
amongus_70x100
 
Gaurang 42 resume
Gaurang 42 resumeGaurang 42 resume
Gaurang 42 resume
 
Act2 cahr
Act2 cahrAct2 cahr
Act2 cahr
 
GM530
GM530GM530
GM530
 
ChristopherCrutchfield2015
ChristopherCrutchfield2015ChristopherCrutchfield2015
ChristopherCrutchfield2015
 
ΤΑ ΤΡΙΤΑΚΙΑ ΤΗΣ ΑΣΣΗΡΟΥ ΠΑΡΟΥΣΙΑΖΟΥΝ ΤΙΣ ΣΥΛΛΟΓΕΣ ΤΟΥΣ
ΤΑ ΤΡΙΤΑΚΙΑ ΤΗΣ ΑΣΣΗΡΟΥ ΠΑΡΟΥΣΙΑΖΟΥΝ ΤΙΣ ΣΥΛΛΟΓΕΣ ΤΟΥΣΤΑ ΤΡΙΤΑΚΙΑ ΤΗΣ ΑΣΣΗΡΟΥ ΠΑΡΟΥΣΙΑΖΟΥΝ ΤΙΣ ΣΥΛΛΟΓΕΣ ΤΟΥΣ
ΤΑ ΤΡΙΤΑΚΙΑ ΤΗΣ ΑΣΣΗΡΟΥ ΠΑΡΟΥΣΙΑΖΟΥΝ ΤΙΣ ΣΥΛΛΟΓΕΣ ΤΟΥΣ
 
Copy of OPEC-oil-prices[2]
Copy of OPEC-oil-prices[2]Copy of OPEC-oil-prices[2]
Copy of OPEC-oil-prices[2]
 
Sinergia
SinergiaSinergia
Sinergia
 
Como mejorar el gasto energético del instituto
Como mejorar el gasto energético del institutoComo mejorar el gasto energético del instituto
Como mejorar el gasto energético del instituto
 
Sierra Club Petition to Force Dept. of Energy to Review Cove Point LNG Export...
Sierra Club Petition to Force Dept. of Energy to Review Cove Point LNG Export...Sierra Club Petition to Force Dept. of Energy to Review Cove Point LNG Export...
Sierra Club Petition to Force Dept. of Energy to Review Cove Point LNG Export...
 
Internship report water analysis
Internship report water analysisInternship report water analysis
Internship report water analysis
 
How Do you Define a Qualified Lead?
How Do you Define a Qualified Lead?How Do you Define a Qualified Lead?
How Do you Define a Qualified Lead?
 
UN DIA DIOS HABLO CONMIGO
UN DIA DIOS HABLO CONMIGOUN DIA DIOS HABLO CONMIGO
UN DIA DIOS HABLO CONMIGO
 
Lean Manufacturing: Improve Productivity, Quality, and Lead-Time
Lean Manufacturing: Improve Productivity, Quality, and Lead-TimeLean Manufacturing: Improve Productivity, Quality, and Lead-Time
Lean Manufacturing: Improve Productivity, Quality, and Lead-Time
 

Similar to CommonAnalyticMistakes_v1.17_Unbranded

Aftros
Aftros Aftros
Aftros Sezzar
 
Avoid organizationalmistakes by innovative thinking
Avoid organizationalmistakes by innovative thinkingAvoid organizationalmistakes by innovative thinking
Avoid organizationalmistakes by innovative thinkingSelf-employed
 
How to Start Being a Data Driven Business
How to Start Being a Data Driven BusinessHow to Start Being a Data Driven Business
How to Start Being a Data Driven BusinessShawna Tregunna
 
Tasks of a data analyst Microsoft Learning Path - PL 300 .pdf
Tasks of a data analyst Microsoft Learning Path - PL 300 .pdfTasks of a data analyst Microsoft Learning Path - PL 300 .pdf
Tasks of a data analyst Microsoft Learning Path - PL 300 .pdfTung415774
 
Analytics Tune Up! Google Analytics workshop for beginners, intermediates
Analytics Tune Up! Google Analytics workshop for beginners, intermediatesAnalytics Tune Up! Google Analytics workshop for beginners, intermediates
Analytics Tune Up! Google Analytics workshop for beginners, intermediatesBrian Alpert
 
what is ..how to process types and methods involved in data analysis
what is ..how to process types and methods involved in data analysiswhat is ..how to process types and methods involved in data analysis
what is ..how to process types and methods involved in data analysisData analysis ireland
 
7 Key Benefits of Data Visualization Tools_BacklinkContent.pptx
7 Key Benefits of Data Visualization Tools_BacklinkContent.pptx7 Key Benefits of Data Visualization Tools_BacklinkContent.pptx
7 Key Benefits of Data Visualization Tools_BacklinkContent.pptxAlok Mishra
 
Advanced Analysis Presentation
Advanced Analysis PresentationAdvanced Analysis Presentation
Advanced Analysis PresentationSemphonic
 
BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015Fiona Lew
 
Techniques of Data Visualization for Data & Business Analytics
Techniques of Data Visualization for Data & Business AnalyticsTechniques of Data Visualization for Data & Business Analytics
Techniques of Data Visualization for Data & Business AnalyticsMercy Akinseinde
 
Machine Learning for Business - Eight Best Practices for Getting Started
Machine Learning for Business - Eight Best Practices for Getting StartedMachine Learning for Business - Eight Best Practices for Getting Started
Machine Learning for Business - Eight Best Practices for Getting StartedBhupesh Chaurasia
 
Dashboards Too Much Information
Dashboards Too Much InformationDashboards Too Much Information
Dashboards Too Much InformationSpectrum
 
Implementing business intelligence
Implementing business intelligenceImplementing business intelligence
Implementing business intelligenceAlistair Sergeant
 
Dashboards- Take a closer look at your data
Dashboards- Take a closer look at your dataDashboards- Take a closer look at your data
Dashboards- Take a closer look at your dataNathan Watson
 
How tech startups can leverage data analytics and visualization
How tech startups can leverage data analytics and visualizationHow tech startups can leverage data analytics and visualization
How tech startups can leverage data analytics and visualizationVishanth Bala
 
Applied_Data_Science_Presented_by_Yhat
Applied_Data_Science_Presented_by_YhatApplied_Data_Science_Presented_by_Yhat
Applied_Data_Science_Presented_by_YhatCharlie Hecht
 

Similar to CommonAnalyticMistakes_v1.17_Unbranded (20)

5thingsyourspreadsheetcantdo eng
5thingsyourspreadsheetcantdo eng5thingsyourspreadsheetcantdo eng
5thingsyourspreadsheetcantdo eng
 
Aftros
Aftros Aftros
Aftros
 
Avoid organizationalmistakes by innovative thinking
Avoid organizationalmistakes by innovative thinkingAvoid organizationalmistakes by innovative thinking
Avoid organizationalmistakes by innovative thinking
 
How to Start Being a Data Driven Business
How to Start Being a Data Driven BusinessHow to Start Being a Data Driven Business
How to Start Being a Data Driven Business
 
Tasks of a data analyst Microsoft Learning Path - PL 300 .pdf
Tasks of a data analyst Microsoft Learning Path - PL 300 .pdfTasks of a data analyst Microsoft Learning Path - PL 300 .pdf
Tasks of a data analyst Microsoft Learning Path - PL 300 .pdf
 
Analytics Tune Up! Google Analytics workshop for beginners, intermediates
Analytics Tune Up! Google Analytics workshop for beginners, intermediatesAnalytics Tune Up! Google Analytics workshop for beginners, intermediates
Analytics Tune Up! Google Analytics workshop for beginners, intermediates
 
what is ..how to process types and methods involved in data analysis
what is ..how to process types and methods involved in data analysiswhat is ..how to process types and methods involved in data analysis
what is ..how to process types and methods involved in data analysis
 
7 Key Benefits of Data Visualization Tools_BacklinkContent.pptx
7 Key Benefits of Data Visualization Tools_BacklinkContent.pptx7 Key Benefits of Data Visualization Tools_BacklinkContent.pptx
7 Key Benefits of Data Visualization Tools_BacklinkContent.pptx
 
Advanced Analysis Presentation
Advanced Analysis PresentationAdvanced Analysis Presentation
Advanced Analysis Presentation
 
BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015
 
Techniques of Data Visualization for Data & Business Analytics
Techniques of Data Visualization for Data & Business AnalyticsTechniques of Data Visualization for Data & Business Analytics
Techniques of Data Visualization for Data & Business Analytics
 
Machine Learning for Business - Eight Best Practices for Getting Started
Machine Learning for Business - Eight Best Practices for Getting StartedMachine Learning for Business - Eight Best Practices for Getting Started
Machine Learning for Business - Eight Best Practices for Getting Started
 
Dashboards Too Much Information
Dashboards Too Much InformationDashboards Too Much Information
Dashboards Too Much Information
 
Unit2
Unit2Unit2
Unit2
 
Implementing business intelligence
Implementing business intelligenceImplementing business intelligence
Implementing business intelligence
 
Dashboards- Take a closer look at your data
Dashboards- Take a closer look at your dataDashboards- Take a closer look at your data
Dashboards- Take a closer look at your data
 
How tech startups can leverage data analytics and visualization
How tech startups can leverage data analytics and visualizationHow tech startups can leverage data analytics and visualization
How tech startups can leverage data analytics and visualization
 
Applied_Data_Science_Presented_by_Yhat
Applied_Data_Science_Presented_by_YhatApplied_Data_Science_Presented_by_Yhat
Applied_Data_Science_Presented_by_Yhat
 
Startup analytics
Startup analyticsStartup analytics
Startup analytics
 
Grow your analytics maturity
Grow your analytics maturityGrow your analytics maturity
Grow your analytics maturity
 

CommonAnalyticMistakes_v1.17_Unbranded

  • 9. One reason analytics projects lose focus is they begin compromised. Too many follow the conference table consensus approach. The Conference Table Compromise Every department gets a seat at the table. Everyone contributes suggestions. The final product is a compendium of all requests. Although this method tends to create lots of good feelings, it rarely results in the best set of metrics with which to run the business. You will find yourself tracking silly pet metrics that someone at the table thought were important but that are completely irrelevant. To keep projects focused, decide which metrics are important and stay, and which are distracting and go.
  • 10. When budgets are tight and all are clamoring for better analytics, it's understandable that not everyone reads or fully comprehends the fine print associated with vendor "partnerships." The nuances of data ownership may seem innocuous, but there are consequences. Many vendors will use analytics services to build databases of anonymous consumer profiles and their behavior, then use them for ad targeting when those consumers visit other sites, without compensation to the publishers whose sites were harvested. Be careful with this one… Compromising Data Ownership
  • 11. What good are analysis and insight if you can't act on them? • Almost all analytics systems bill themselves as actionable. Many claim they're real time. Learn what they really mean. • Few systems can enable an enterprise to take immediate, tactical steps to leverage data for value. For most, "actionable" means the system can generate reports, such as user navigation patterns publishers can mull over in meetings, then plan the changes needed to improve. • While that may meet the definition of actionable, it doesn't necessarily mean real-time or even right-time action. Bottom line: Understand, don't assume. Confusing Insight With Action
  • 12. Data Quality and Context Never Compare Apples to Oranges Don't Overstate (alarm) Unnecessarily Calibrate Your Time Series Always Make Your Point Clearly (and Colors Matter.) Statistical Significance Correlation vs. Causation Improper use of averages There is Such a Thing as Too Little Data! Common Organizational Blunders Analytic Fundamentals Common Measurement Errors Statistical Mistakes Visualization Faults Summary Common Analytic Mistakes
  • 13. While a poor font choice can ruin a meeting, a poor interpretation of statistics or data can kill you (or someone else, for that matter). Proper use of data is a viciously complicated topic. See the section on Statistical Mistakes for examples. If your findings would lead to the wrong conclusions, not presenting the data at all would be a better choice. Here is a simple rule for designers: Data Quality and Context Your project isn’t ready if you have spent more time choosing a font or color scheme than choosing your data.
  • 14. • Real data is ugly – but there’s no substitute for real data • Provenance - Critical Questions to Ask • Who collected it? • Why was it collected? • What is its context in a broader subject? • What is its context in the field of research that created it? • How was it collected? • What are its limitations? • Which mathematical transformations are appropriate? • Which methods of display are appropriate? Data Quality and Context
  • 15. Cleaning and formatting a single data set is hard. What if you’re building a live visualization that will run with many different data sets, or don’t have time to manually clean each one? There is no substitute for real data: demo or sample data doesn’t help you plan for data discrepancies, null values, outliers, or real-world problems. • Use several random samples of real data if you cannot access an entire data set • Invalid and missing data is a guarantee. If your data won’t be cleaned before being graphed, do not clean your sample data. • Real data may be so large it overwhelms the process generating your visualization. Be sure that if you use a sample of data you correctly scale up the sample size or reduce it appropriately before creating a final visualization. There’s no substitute for real data
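A minimal sketch of this sampling advice, assuming pandas and a hypothetical orders.csv (the file name and sample sizes are illustrative, not from the deck):

    import pandas as pd

    # Hypothetical source file -- substitute your own real data set.
    df = pd.read_csv("orders.csv")

    # Pull several independent random samples rather than one convenient slice.
    samples = [df.sample(n=min(1000, len(df)), random_state=seed) for seed in (1, 2, 3)]

    # Deliberately leave nulls and outliers in place: if the production data will
    # not be cleaned before it is graphed, the sample data should not be cleaned either.
    for s in samples:
        print(s.isna().sum().sum(), "missing values in this sample")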
  • 16. Four different segments are being compared, but they are not calibrated against one another. On the surface this is hard to detect. • The clean part is that there is very little overlap between Search Traffic and Referral Traffic. But Mobile is a platform. Never Compare Apples to Oranges Traffic (conversions in this case) is most likely counted in both Referrals and Search. It is unclear what to make of that orange stacked bar. The graph is showing conversions already included in Search and Referral (double counting), and because you have no idea what it represents, it is impossible to know what action to take. Would you recommend a higher investment in Mobile based on this graph? • The same goes for Social Media. It is likely that the Social Media conversions are already included in Referrals and in Mobile. The green bar in the graph is useless. Would recommending a massive increase in investment in Social Media be any better founded?
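To see why overlapping segments double-count, here is a tiny illustration with invented conversion IDs (plain Python, not data from the slide):

    # Each set holds the IDs of converting visits attributed to a channel.
    search   = {101, 102, 103, 104}
    referral = {103, 104, 105}
    mobile   = {102, 104, 105, 106}   # a platform, so it overlaps the channels above

    naive_total  = len(search) + len(referral) + len(mobile)   # 11 -- double counts
    unique_total = len(search | referral | mobile)             # 6  -- actual conversions

    print(naive_total, unique_total)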
  • 17. What do you think is wrong with this graph? It artificially inflates the importance of a change in the metric that might not be important at all. In this case the data is not statistically significant, but there is no way to know that just from the graph. Yet the scale used for the y-axis implies that something huge has happened. Don't Overstate or Alarm Unnecessarily Try to avoid being so dramatic in your presentation. It causes people to read things into the performance that they most likely should not. Setting the y-axis at zero may not be necessary every time, but dramatizing this 1.5-point difference is a waste of everyone's time. One more important thing: label your x-axis. Please.
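A sketch of the same effect, assuming matplotlib and made-up numbers, drawing the identical 1.5-point change with a truncated axis and with an axis anchored at zero:

    import matplotlib.pyplot as plt

    values = [52.0, 53.5]          # a 1.5-point difference, invented for illustration
    labels = ["Last month", "This month"]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.bar(labels, values)
    ax1.set_ylim(51.5, 54)         # truncated axis: the change looks dramatic
    ax1.set_title("Alarming")
    ax2.bar(labels, values)
    ax2.set_ylim(0, 60)            # axis anchored at zero: the change looks like what it is
    ax2.set_title("In proportion")
    ax2.set_xlabel("Month")        # and yes, label your x-axis
    plt.show()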
  • 18. This chart shows nine months of performance… by day! The "trend" is completely useless. • Looking at individual days over such a long time period can hide insights and important changes. It can be nearly impossible to find anything of value. • Try switching to the exact same time period viewed by week. Now you can see some kind of trend, especially towards the end of the graph (even this simple insight was hidden before). Calibrate Your Time Series See: http://square.github.io/crossfilter for another good example.
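A minimal pandas sketch of this recalibration, assuming a daily series with a DatetimeIndex (the data is synthetic):

    import numpy as np
    import pandas as pd

    # Nine months of noisy daily values, invented for illustration.
    idx = pd.date_range("2014-01-01", periods=270, freq="D")
    daily = pd.Series(np.random.poisson(200, size=270), index=idx)

    # Day-level points bury the trend; weekly totals make it visible.
    weekly = daily.resample("W").sum()
    print(weekly.tail())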
  • 19. What do you think the two colors in this graph represent? How come only 29 percent of the organizations have more than one person! Problem one is that "red" denotes "good" in this case and "green" represents "bad." Here's something very, very simple you should understand: Red is bad and Green is good. Always. Period. People instinctively think this way. So show "good" in green and "bad" in red. It will communicate your point clearly and faster. Always Make Your Point Clearly Problem two, much worse, is that it is harder than it should be to understand this data. In the first stacked bar, 71 percent of organizations answered "Yes, more than one person." So what is the 29 percent? If the question is how many people are directly responsible for improving conversion rates, and 71 percent have more than one person, are the remaining 29 percent those with exactly one person, or with no one at all? Unclear (and frustrating).
  • 20. We all make this mistake. We create a table like the one below. We create a "heat map" in the table highlighting where conversion rates appear good. We declare Organic to be the winner, Direct is close behind. Then the other two. And we recommend doing more SEO. Statistical Significance None of this data may be significant – that the numbers seem so different might not mean anything. It is entirely possible that it is completely immaterial that Direct is 34% and Email is 10%, or that Referral is 7%. We should evaluate the raw numbers to see if the percentages are meaningful at all. • The data in the Direct row could represent conversions out of 10 visits and all the Referral data could represent conversions from 1,000,000 visits. Compute statistical significance to identify which comparison sets we can be confident are different, and in which cases we simply don't have enough confidence. Do you see the problem?
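One way to run that check is a chi-square test on the raw counts. The sketch below uses scipy and invented counts; substitute the real visit and conversion numbers behind each percentage in the table:

    from scipy.stats import chi2_contingency

    # Invented raw counts -- replace with the real numbers behind each percentage.
    table = [[34, 66],     # Direct: 34 conversions, 66 non-conversions
             [10, 90]]     # Email:  10 conversions, 90 non-conversions

    chi2, p, dof, expected = chi2_contingency(table)
    print(f"p-value = {p:.4f}")
    # Small p: the difference is unlikely to be chance. Large p: the percentages
    # may look different, but the data do not support treating them as different.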
  • 21. Confusing correlation and causation is one of the most overlooked problems. In the Cheese and Employment Status percentage graph, it is clear that retired Redditors prefer cheddar cheese and freelance Redditors prefer brie. • This does not mean that once an individual retires he or she will develop a sudden affinity for cheddar. • Nor does becoming a freelancer cause one to suddenly prefer brie. Correlation vs. Causation
  • 22. Improper use of averages Averages can be a great way to get a quick overview of some key areas, but use them wisely. For example, average order value is a useful metric. Looking at average order value by month is enlightening because it shows an increase over time, which indicates a move in the right direction. However, it’s more useful to look at average order value by department by month, because this shows us where the increase in average order value is coming from: the women’s shoes department. If we looked only at average order value by month, we might spread marketing across all departments, which is not the most efficient allocation of resources.
  • 23. • Do other useful things. Look at your search keyword reports. Do you see a few people coming in on keywords you SEOed the site for? Look at the keywords your site is showing up for in Google search results. Are they the ones you were expecting? • Even better… spend time with competitive intelligence tools like Insights for Search, Ad Planner, and others to seek clues from your competitors and your industry ecosystem. At this stage you can learn a lot more from their data than your data… There is Such a Thing as Too Little Data! Another "simple" mistake. We get excited about having data, especially if new at this. We get our tables and charts together and we start reporting data and having a lot of fun. This is very dangerous. You see, there is such a thing as too little data. You don't want to wait until you've collected millions of rows of data to make any decision, but the table here is nearly useless. So can you do anything with data like this?
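A small pandas sketch of the two views, using a hypothetical order table (column names and values are illustrative):

    import pandas as pd

    # Hypothetical order-level data: one row per order.
    orders = pd.DataFrame({
        "month":       ["Jan", "Jan", "Jan", "Feb", "Feb", "Feb"],
        "department":  ["Womens Shoes", "Menswear", "Womens Shoes",
                        "Womens Shoes", "Menswear", "Womens Shoes"],
        "order_value": [80, 60, 95, 120, 58, 140],
    })

    # Overall average order value by month -- shows the rise but not its source.
    print(orders.groupby("month")["order_value"].mean())

    # Average order value by department by month -- shows where the rise comes from.
    print(orders.groupby(["month", "department"])["order_value"].mean())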
  • 24. Drawing Conclusions from Incomplete Information Assuming a Lab is a Reasonable Substitute Forgetting the Real Experts are Your Customers Gaming the System Sampling Problems Intelligence is not binary Keeping it Simple - Too Simple Beware the long tail Simpson's paradox Common Organizational Blunders Analytic Fundamentals Common Measurement Errors Statistical Mistakes Visualization Faults Summary Common Analytic Mistakes
  • 25. Data may not tell the full story. For example, your analytics show visitors spend a relatively high amount of time on a particular page. • Is that page great – or is it problematic? Maybe visitors simply love the content. • Or maybe they are getting stuck due to a problem with the page. Your call center statistics show average call time has decreased. • Is a decrease in average call time good news or bad news? • When calls end more quickly, costs go down, but have you actually satisfied callers or left them disgruntled, dissatisfied, and on their way to your competition? Never draw conclusions from any analysis that does not tell the whole story. Use bar charts to visualize relative sizes. If you see one bar that is twice as long as another, you expect the underlying quantity to be twice as big. This relative sizing fails if you do not start your bar chart axis at 0. Rule of thumb: if you want to illustrate a small change and must start your y-axis anywhere other than 0, use a line chart instead. Conclusions from Incomplete Information
  • 26. The chart on the left compares Redditors who like dogs vs. Redditors who prefer cats. With the y-axis starting at 9,000, it looks like dog lovers outnumber cat lovers by three times. However, the graph on the right is a more accurate representation of the data. There are not quite twice as many Redditors who prefer dogs to cats. A limitation of bar charts that start at 0 is they do not show small percent differences. If you need to change the start of your axis in order to highlight small changes, switch to a line chart. Conclusions from Incomplete Information
  • 27. Usability groups and panels are certainly useful and have their place; the problem is the sample sizes are small and the testing takes place in a controlled environment. You bring people into a lab and tell them what you want them to do. • Does that small group of eight participants represent your broader audience? • Does measuring and observing them when they do what we tell them to do provide the same results as real users who do what they want to do? Observation is helpful, but applying science to the voice of customer and measuring the customer experience through the lens of customer satisfaction is a better way to achieve successful results. Assuming a Lab is a Reasonable Substitute
  • 28. Experts, like usability groups, have their place. But who knows customer intentions, needs, and attitudes better than actual customers? When you really want to know, go to the source. It takes more time and work, but the results are much more valuable. Experts and consultants certainly have their place, but their advice and recommendations must be driven by customer needs as much if not more than by organizational needs. The Real Experts are Your Customers
  • 29. Many feedback and measurement systems create bias and inaccuracy. How? Ask the wrong people, bias their decisions, or give them incentives for participation. Measuring correctly means creating as little measurement bias as possible while generating as little measurement noise as possible. • Avoid incenting people to complete surveys, especially when there is no need. Never ask for personal data; some will decline to participate if only for privacy concerns. • Never measure with the intent to prove a point. We may, unintentionally, create customer measurements to prove our opinions are correct or support our theories, but to what end? • Customer measurements must measure from the customers’ perspective and through the customers’ eyes, not through a lens of preconceived views. Gaming the System
  • 30. Sampling works well when sampling is done correctly. Sample selection and sample size are critical to creating a: • credible, • reliable, • accurate, • precise, and predictive methodology. Sampling is a science in and of itself. You need samples representative of the larger population that are randomly selected. See the section on Statistical Mistakes for more on this. Sampling Problems
  • 31. Taking a binary approach to measuring satisfaction – in effect, asking whether a customer is or is not satisfied – leads to simplistic and inaccurate measurement. Intelligence is not binary. • People are not just smart or stupid. • People are not just tall or short. • Customers are not just satisfied or dissatisfied. “Yes” and “no” do not accurately explain or define levels or nuances of customer satisfaction. The degree of satisfaction with the experience is what determines the customer’s level of loyalty and positive word of mouth. Claiming 97% of your customers are satisfied certainly makes for a catchy marketing slogan but is far from a metric you can use to manage your business forward. If you cannot trust and use the results, why do the research? Intelligence is not binary
  • 32. The “keep it simple” approach does not work for measuring customer satisfaction (or for measuring anything regarding customer attitudes and behaviors). • Customers are complex; they make decisions based on a number of criteria, most rational, some less so. Asking three or four questions does not create a usable metric or help to develop actionable intelligence. Measuring customer satisfaction by itself will not provide the best view; use a complete satisfaction measurement system, including future behaviors and predictive metrics, to generate leading indicators that complement the lagging ones. Many will take the simple approach and make major strategic decisions based on a limited and therefore flawed approach to measurement. Great managers do not make decisions based on hunches or limited data; “directionally accurate” is simply not good enough. Keeping it Simple - Too Simple http://experiencematters.wordpress.com/tag/lowes/page/2/
  • 33. In statistics, a long tail of some distributions of numbers is the portion of the distribution having a large number of occurrences far from the "head" or central part of the distribution. A probability distribution is said to have a long tail if a larger share of the population rests within its tail than would under a normal distribution. Beware the long tail A long-tail distribution will arise when many values are unusually far from the mean, which increases the magnitude of the skewness of the distribution. Top 10,000 Popular Keywords
  • 34. The term long tail has gained popularity in describing the retailing strategy of selling a large number of unique items with relatively small quantities sold of each, in addition to selling fewer popular items in large quantities. The distribution and inventory costs of businesses successfully applying this strategy allow them to realize significant profit out of selling small volumes of hard-to-find items to many customers instead of only selling large volumes of a reduced number of popular items. The total sales of this large number of "non-hit items" is called "the long tail". • See also: • Black swan theory • Kolmogorov's zero–one law, which concerns what is also known as a tail event • Mass customization • Micropublishing • Swarm intelligence Source: http://en.wikipedia.org/wiki/Long_tail Beware the long tail (cont’d)
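A quick numeric illustration of that definition, assuming numpy and scipy and using a synthetic lognormal sample as the long-tailed distribution:

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(0)
    normal_sample    = rng.normal(loc=100, scale=15, size=100_000)
    long_tail_sample = rng.lognormal(mean=3.0, sigma=1.0, size=100_000)   # heavy right tail

    for name, sample in [("normal", normal_sample), ("long tail", long_tail_sample)]:
        tail_share = (sample > sample.mean() + 3 * sample.std()).mean()
        print(f"{name:10s} skewness={skew(sample):5.2f}  share beyond 3 sd={tail_share:.4f}")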
  • 35. Simpson's paradox, or the Yule–Simpson effect, is a paradox where a trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data. Encountered in social-science and medical-science statistics, this effect is confounding when frequency data are unduly given causal interpretations. Using professional baseball as an example it is possible for one player to hit for a higher batting average than another player during a given year, and to do so again during the next year, but to have a lower batting average when the two years are combined. Simpson's paradox Player 1995 Avg. 1996 Avg. Combined Avg. Derek Jeter 12/48 0.250 183/582 0.314 195/630 0.310 David Justice 104/411 0.253 45/140 0.321 149/551 0.270
  • 36. This phenomenon occurs where there are large differences in the number of at-bats between the years. The same situation applies to calculating batting averages for the first half of the baseball season and the second half, and then combining all of the data for the season's batting average. Simpson's paradox (cont’d) Player 1995 Avg. 1996 Avg. Combined Avg. Derek Jeter 12/48 0.250 183/582 0.314 195/630 0.310 David Justice 104/411 0.253 45/140 0.321 149/551 0.270 If weighting is used this phenomenon disappears. The table below has been normalized to the largest totals so the same things are compared. Player 1995 Avg. 1996 Avg. Combined Avg. Derek Jeter 12/48*411 0.250 183/582*582 0.314 285.75/993 0.288 David Justice 104/411*411 0.253 45/140*582 0.321 291/993 0.293
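The paradox can be reproduced directly from the batting lines on the slide (plain Python):

    # Batting lines from the slide: (hits, at-bats) per year.
    jeter   = {"1995": (12, 48),   "1996": (183, 582)}
    justice = {"1995": (104, 411), "1996": (45, 140)}

    def avg(hits, ab):
        return hits / ab

    for year in ("1995", "1996"):
        print(year, "Jeter", round(avg(*jeter[year]), 3), "Justice", round(avg(*justice[year]), 3))

    # Justice leads in each year, yet Jeter leads when the years are pooled,
    # because the at-bats are distributed very differently across the years.
    j_hits, j_ab = map(sum, zip(*jeter.values()))
    d_hits, d_ab = map(sum, zip(*justice.values()))
    print("Combined", "Jeter", round(avg(j_hits, j_ab), 3), "Justice", round(avg(d_hits, d_ab), 3))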
  • 37. Expecting too much certainty Misunderstanding probability Mistakes in thinking about causation Problematical choice of measure Errors in sampling Over- interpretation Mistakes involving limitations of hypothesis tests/confidence intervals Using an inappropriate model or research design Common Organizational Blunders Analytic Fundamentals Common Measurement Errors Statistical Mistakes Visualization Faults Summary Common Analytic Mistakes
  • 38. One consequence of not taking uncertainty seriously enough is that we often write results in terms that misleadingly suggest certainty. For example, some might conclude from a study that a hypothesis is true or has been proved, when it would be more correct to say that the evidence supports the hypothesis or is consistent with the hypothesis. Another mistake is misinterpreting results of statistical analyses in a deterministic rather than probabilistic (also called stochastic) manner. Expecting too much certainty " ... as far as the propositions of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality." Albert Einstein , Geometry and Experience, Lecture before the Prussian Academy of Sciences, January 27, 1921 Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
  • 39. There are four perspectives on probability that are commonly used: • Classical, • Empirical (or Frequentist), • Subjective, and • Axiomatic. Using one perspective when another is intended can lead to serious errors. Common misunderstanding: If there are only two possible outcomes, and you don't know which is true, the probability of each of these outcomes is 1/2. In fact, probabilities in such "binary outcome" situations could be anything from 0 to 1. For example, if the outcomes of interest are "has cancer" and "does not have cancer," the probabilities of having cancer are (in most cases) much less than 1/2. The empirical (frequentist) perspective allows us to estimate such probabilities. Misunderstanding probability Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
  • 40. Confusing correlation with causation. Example: students' shoe sizes and scores on a standard reading exam. They are correlated, but saying that larger shoe size causes higher reading scores is as absurd as saying that high reading scores cause larger shoe size. There is a clear lurking variable, namely age: as a child gets older, both shoe size and reading ability increase. Do not interpret causality deterministically when the evidence is statistical. Mistakes in thinking about causation Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
  • 41. In most research, one or more outcome variables are measured. Statistical analysis is done on the outcome measures, and conclusions are drawn from the statistical analysis. The analysis itself involves a choice of measure, called a summary statistic. Misleading results occur when inadequate attention is paid to the choice of either outcome variables or summary statistics. • Example: What is a good outcome variable for deciding whether cancer treatment in a country has been improving? A first thought might be "number of deaths in the country from cancer in one year." But the number of deaths might increase simply because the population is increasing. Or it might go down if cancer incidence is decreasing. "Percent of the population that dies of cancer in one year" would take care of the first problem, but not the second. In this case a rate is a better measure than a count. Problematical choice of measure
  • 42. Believing that a "random sample will be representative of the population". In fact, this statement is false -- a random sample might, by chance, turn out to be anything but representative. For example, it is possible that if you toss a coin ten times, all the tosses will come up heads. A slightly better explanation that is partly true: "Random sampling eliminates bias by giving all individuals an equal chance to be chosen." Here is the real reason random sampling is important: the mathematical theorems which justify most statistical procedures apply only to random samples. Errors in sampling
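A sketch of the lurking-variable effect, simulating synthetic ages that drive both shoe size and reading score (numpy and scipy; the coefficients are invented):

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(1)
    age = rng.uniform(6, 12, size=500)                        # the lurking variable
    shoe_size     = 0.8 * age + rng.normal(0, 0.5, size=500)  # driven by age
    reading_score = 9.0 * age + rng.normal(0, 6.0, size=500)  # also driven by age

    r, p = pearsonr(shoe_size, reading_score)
    print(f"r = {r:.2f}, p = {p:.1e}")
    # Strong correlation, but neither variable causes the other: age drives both.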
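The coin-toss caveat is easy to check by simulation (numpy):

    import numpy as np

    rng = np.random.default_rng(2)
    # 100,000 runs of "toss a fair coin ten times".
    runs = rng.integers(0, 2, size=(100_000, 10))
    all_heads = (runs.sum(axis=1) == 10).mean()
    print(f"share of ten-toss samples that were all heads: {all_heads:.5f}")   # about 1/1024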
  • 43. • Extrapolation to a larger population than the one studied Example: running a marketing experiment with undergraduates enrolled in marketing classes and drawing a conclusion about people in general. • Extrapolation beyond the range of data Similar to extrapolating to a larger population, but concerns the values of the variables rather than the individuals. • Ignoring Ecological Validity Involves the setting (i.e., the "ecology") rather than the individuals studied, or it may involve extrapolation to a population having characteristics very different from the population that is relevant for application. Over-interpretation
  • 44. • Using overly strong language in stating results Statistical procedures do not prove results. They only give us information on whether or not the data support or are consistent with a particular conclusion. There is always uncertainty involved. Acknowledge this uncertainty. • Considering statistical significance but not practical significance Example: Suppose that a well-designed, well-carried-out, and carefully analyzed study shows that there is a statistically significant difference in life span between people engaging in a certain exercise regime at least five hours a week for at least two years and those not following the exercise regime. If the difference in average life span between the two groups is three days... Over-interpretation (cont’d) Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
  • 45. Type I Error • Rejecting the null hypothesis when it is in fact true is known as a Type I error. Many people decide, before doing a hypothesis test, on a maximum p-value for which they will reject the null hypothesis. This value is often denoted α (alpha) and is the significance level. Type II Error • Not rejecting the null hypothesis when in fact the alternate hypothesis is true is called a Type II error. An analogy helpful in understanding the two types of error is to consider a defendant in a trial. The null hypothesis is "defendant is not guilty;" the alternate is "defendant is guilty." • Type I error would correspond to convicting an innocent person; • Type II error would correspond to setting a guilty person free. Test/Hypothesis confidence intervals
  • 46. Two drugs are known to be equally effective for a certain condition. • Drug 1 has been used for decades with no reports of side effects • Drug 2 may cause serious side effects The null hypothesis is "the incidence of the side effect in both drugs is the same"; the alternate hypothesis is "the incidence of the side effect in Drug 2 is greater than that in Drug 1." Falsely rejecting the null hypothesis when it is in fact true (a Type I error) would have no great consequences for the consumer, but a Type II error (failing to reject the null hypothesis when in fact the alternate is true, and so deciding that Drug 2 is no more harmful than Drug 1 when it is in fact more harmful) could have serious consequences. Setting a large significance level is appropriate in this case. Test/Hypothesis confidence intervals
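A simulation sketch of this trade-off, assuming scipy and invented effect sizes: raising the significance level increases the Type I error rate but reduces the Type II error rate.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(3)

    def rejection_rate(true_diff, alpha, trials=2000, n=30):
        rejections = 0
        for _ in range(trials):
            drug1 = rng.normal(0.0, 1.0, n)             # side-effect score, Drug 1
            drug2 = rng.normal(true_diff, 1.0, n)       # side-effect score, Drug 2
            if ttest_ind(drug1, drug2).pvalue < alpha:
                rejections += 1
        return rejections / trials

    for alpha in (0.05, 0.20):
        type1 = rejection_rate(true_diff=0.0, alpha=alpha)   # null true: rejections are Type I errors
        power = rejection_rate(true_diff=0.5, alpha=alpha)   # alternate true: misses are Type II errors
        print(f"alpha={alpha:.2f}  Type I rate={type1:.3f}  Type II rate={1 - power:.3f}")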
  • 47. Each inference technique (hypothesis test or confidence interval) you select has model assumptions. Different techniques have very different model assumptions. The validity of the technique depends on whether the model assumptions fit the context of the data being analyzed. • Common Mistakes Involving Model Assumptions • Using a two-sample test comparing means when cases are paired • Comparisons of treatments applied to people, animals, etc. (Intent to Treat; Comparisons involving Drop-outs) • Fixed vs Random Factors • Analyzing Data without Regard to How the Data was Collected • Dividing a Continuous Variable into Categories ("Chopped Data", Cohorts) • Pseudo-replication • Mistakes in Regression • Dealing with Missing Data Inappropriate model or research design Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
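A sketch of the first mistake on this list, comparing an unpaired and a paired t-test on the same invented before/after data (scipy):

    import numpy as np
    from scipy.stats import ttest_ind, ttest_rel

    rng = np.random.default_rng(4)
    # Before/after measurements on the same 20 subjects (invented, strongly paired).
    baseline = rng.normal(100, 15, size=20)
    after    = baseline - 2.0 + rng.normal(0, 3, size=20)   # small, consistent improvement

    print("unpaired p =", round(ttest_ind(baseline, after).pvalue, 3))   # ignores the pairing
    print("paired   p =", round(ttest_rel(baseline, after).pvalue, 3))   # respects it
    # The unpaired test treats between-subject variation as noise and can miss
    # an effect the paired design detects.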
  • 48. Bad Math, Bad Geography Misrepresent Data Serving the Presentation Without the Data Pie Charts – Why? Using Non-Solid Lines in a Line Chart Bar Charts with Erroneous Scale, Arranging Data Non-Intuitively Obscuring Your Data, Making the Reader Do More Work Misrepresent Data Using Different Colors on a Heat Map Making it Hard to Compare Data, Showing Too Much Detail Not Explaining the Interactivity Keep It Simple Common Organizational Blunders Analytic Fundamentals Common Measurement Errors Statistical Mistakes Visualization Faults Summary Common Analytic Mistakes
  • 49. Visualization is a tool to aid analysis, not a substitute for analytical skill. It is not a substitute for statistics: Really understanding your data generally requires a combination of • analytical skills, • domain expertise, and • effort. Strategy: • Be careful about promising real insight. Work with a statistician or a domain expert if you need to offer reliable conclusions. • Small design decisions - the color palette you use, or how you represent a particular variable can skew the conclusions a visualization suggests. • If using visualizations for analysis, try a variety of options, rather than relying on a single view Visualization is not analysis
  • 50. The infographic seems informative enough. The self-perceptions of “Baby Boomers” are an interesting data set to visualize. The problem, however, is that this graphic represents 243% of responses. This is not necessarily indicative of faulty data. It is a poor representation of the data. Displaying the phrases individually, with size determined by percentage, makes the phrases easier to compare to each other. Removing the percentages from a single body also clarifies that the percentages aren’t mutually exclusive. Bad Math
  • 51. Credibility? Gone in an instant. Bad Geography - Just Plain Dumb Presented without comment… speechless, how does this happen?
  • 52. And don't do something like this… ever. Make sure all representations are accurate. For example, bubbles should be scaled according to area, not diameter. Misrepresenting Data
  • 53. The graph is clearly trying to make the point that Obamacare enrollment numbers for March 27 are well below the March 31 goal. Take a closer look. How come the first bar is less than half the size of the second one? This is a common problem. Depending on the scale (there isn’t one in this example), the comparison among data points can look very deceiving. Misrepresenting Data (Why?) • If you’re working with data points that are really far apart or really close together, be sure to pay attention to the scale, so that the data is accurately portrayed. • If there is no scale, compare the different slices, tiles, bubbles, or bars against each other. Do the numbers match what is drawn?
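A sketch of area-correct versus diameter-scaled bubbles, assuming matplotlib (the values are invented); matplotlib's scatter size parameter s is already an area in points squared:

    import matplotlib.pyplot as plt

    values = [10, 40, 90]                 # invented quantities to encode as bubbles
    x, y = [1, 2, 3], [1, 1, 1]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    # Scaling s (an area) by the value gives correct, area-proportional bubbles.
    ax1.scatter(x, y, s=[v * 20 for v in values])
    ax1.set_title("Scaled by area (correct)")
    # Scaling the *diameter* by the value squares the visual difference.
    ax2.scatter(x, y, s=[(v * 0.9) ** 2 for v in values])
    ax2.set_title("Scaled by diameter (misleading)")
    plt.show()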
  • 54. Which comes first: the presentation or the data? In an effort to make a more “interesting” or “cool” design, don’t allow the presentation layer of a visualization to become more important than the data itself. In this example a considerable amount of work went into it and there are parts that are informative, like the summary counters at the top left. However, without a scale or axis, the time series on the bottom right is meaningless and the 3D chart in the center is even more opaque. Tooltips (pop ups) would help, if they were there. Serving the Presentation Without the Data
  • 55. They are useful on rare occasions. But most of the time they actually do not communicate anything of value. Pie Charts – Why? Beyond the obvious point made by the line graph in the background (we are storing more data now than we used to), this graph seems to tell us “…we don’t know if any of this matters, so we’re going to print everything.” Sometimes it's just better to use a table... really.
  • 56. A common mistake with pie charts is to divide them into percentages that simply do not add up. The basic rule of a pie chart is that the sum of all percentages included should be 100%. • In this example, the percentages fall short of 100%, and the segment sizes do not match their values. This happens due to rounding errors, or when non-mutually-exclusive categories are plotted on the same chart. Unless the included categories are mutually exclusive, their percentages cannot be plotted separately on the same chart. Pie Charts - Error in Chart Percentages Here is another gem…
  • 57. Always ask yourself when considering a potential design: why is this better than a bar chart? If you’re visualizing a single quantitative measure over a single categorical dimension, there is rarely a better option than a bar chart. Pie Charts – Why? • Line charts are preferred when using time-based data • Scatter plots are best for exploring correlations between two linear measures • Bubble charts support more data points with a wider range of values • Tree maps support hierarchical categories If you have to use a pie chart: • Don’t include more than five segments • Place the largest section at 12 o’clock, going clockwise • Place the second largest section at 12 o’clock, going counterclockwise • The remaining sections can be placed below, continuing counterclockwise
  • 58. Comparison is a valuable way to showcase differences, but it's useless if your viewer can’t easily compare. Make sure all data is presented in a way that allows the reader to compare data side-by-side. Making it Hard to Compare Data Is this clear to you?
  • 59. • Using Non-Solid Lines in a Line Chart Dashed and dotted lines can be distracting. Instead, use a solid line and colors that are easy to distinguish from each other. • Making the Reader Do More Work Make it as easy as possible to understand data by aiding the reader with graphic elements. For example, add a trend line to a scatter plot to highlight trends. • Obscuring Your Data Make sure no data is lost or obstructed by design. For example, use transparency in a standard area chart to make sure the viewer can see all data • Using Different Colors on a Heat Map Some colors stand out more than others, giving unnecessary weight to that data. Instead, use a single color with varying shades or a spectrum between two analogous colors to show intensity. Other Common Mistakes
  • 60. • Arranging the Data Non-Intuitively Content should be presented in a logical and intuitive way to guide readers through the data. Order categories alphabetically, sequentially, or by value. • Not Explaining Interactivity Enabling users to interact with a visualization makes it more engaging, but if you don’t tell them how to use that interactivity you risk limiting them to the initial view. How you label the interactivity is just as important as providing it in the first place. • Informing the user at the top of the visualization is good practice. • Call out the interaction on or near the tools that use it. • Using a common design concept, such as underlining words to signal a hyperlink, also helps. See http://www.visualizing.org/full-screen/39118 Other Common Mistakes (cont’d)
  • 61. Hard to resist… the temptation with a dataset that has numerous usable categorical and numerical fields is to show everything at once and allow users to drill down to the finest level of detail. The visualization then becomes superfluous; the user could simply look at the dataset itself if they wanted to see the finest level of detail. Show enough detail to tell a story, but not so much that the story is convoluted and hidden. Showing Too Much Detail
  • 62. We all want specific, relevant answers. The closer you can get to providing exactly what is wanted, the less effort we expend looking for answers. Irrelevant data makes finding the relevant information more difficult; irrelevant data is just noise. • Showing several closely related graphs can be a nice compromise between showing too much in one graph and not showing enough overall. • A few clean, clear graphs are better than a single complicated view. • Try to represent your data in the simplest way possible to avoid this. Showing Too Much Detail (cont’d)
  • 63. Data visualization is about simplicity. Using embellished or artistic representations may make a graphic more striking, but it usually distracts from the actual data. Look at the example below. Keep it Simple • Why is the first image blue while the rest are red? • The number in the second image is placed against the paintbrush and not against the head, while in all other columns it is against the head. Is this meaningful? • We might just admire the different figures, think about the real-life characters they represent, and move on without understanding the data. Really… • It’s important that the visual representation of data is free of the pitfalls that make it ambiguous and irrelevant.
  • 65. Thank You… Mr. Parnitzke is a hands-on technology executive, trusted partner, advisor, software publisher, and widely recognized database management and enterprise architecture thought leader. Over his career he has served in executive, technical, publisher (commercial software), and practice management roles across a wide range of industries. Now a highly sought-after technology management advisor and hands-on practitioner, he counts many of the Fortune 500 as well as emerging businesses among his customers, and is known for taking complex challenges and solving them across all levels of the customer’s organization, delivering distinctive value and lasting relationships. Contact: j.parnitzke@comcast.net Blogs: Applied Enterprise Architecture (pragmaticarchitect.wordpress.com) Essential Analytics (essentialanalytics.wordpress.com) The Corner Office (cornerofficeguy.wordpress.com) Data management professional (jparnitzke.wordpress.com) The program office (theprogramoffice.wordpress.com) Data Science Page (http://www.datasciencecentral.com/profile/JamesParnitzke)

Editor's Notes

  1. Mr. Parnitzke is a hands-on technology executive, trusted partner, advisor, software publisher, and widely recognized database management and enterprise architecture thought leader. Over his career he has served in executive, technical, publisher (commercial software), and practice management roles across a wide range of industries.
  2. Retail Example: According to recent research by IDC Retail Insights, the omnichannel shopper is the gold standard consumer. An omnichannel shopper will spend, on average, 15 percent to 30 percent more than someone using just one channel, and will outspend simple multichannel shoppers by over 20 percent. What’s more, these shoppers exhibit strong loyalty and are more likely to influence others to endorse a retailer. So the hard questions are really about generating this kind of lift, and traditional, narrowly focused marketing is not enough…
  3. One reason analytics projects lose focus is they begin compromised. When it's time to decide what metrics a company should track, too many follow the conference table consensus approach. They worry more about consensus than about value and accuracy.
  4. When budgets are tight and all are clamoring for better analytics, it's understandable that not everyone reads or fully comprehends the fine print associated with some vendors' "partnerships." In these models, the vendor may reserve ownership rights of the data, data aggregates, and the metadata derived from providing analytics services. It's not uncommon for companies to use subsidized analytics services to create aggregated research products they sell back to the marketplace (without compensation to publishers whose sites were harvested for the data).
  5. Cleaning and formatting a single data set is hard enough, but what if you’re building a live visualization that will run with many different data sets? Do you have time to manually clean each data set? Your first instinct may be to grab some demo data and use that to build your visualization; your library may even come with standard sample data.
  6. How come only 29 percent of the organizations have more than one person! That is bad. Wait. That did not make sense. Back to read the question. Then the graph. Then the legend. Then back to the question. Then the legend…
  7. Measuring customer satisfaction by itself will not provide the best view forward.  Using a complete satisfaction measurement system – including future behaviors and predictive metrics such as likelihood to return to the site or likelihood to purchase again – generates leading indicators that complement and illuminate lagging indicators.
  8. Simpson's Paradox disappears when causal relations are understood and accounted for in the analysis.
  9. If you agree that increasing age (for school children) causes increasing foot size, and therefore increasing shoe size, then you expect a correlation between age and shoe size. Correlation is symmetric, so shoe size and age are correlated. But it would be absurd to say that shoe size causes age.
  10. It is true that sampling randomly will eliminate systematic bias. This is the best plausible explanation that is acceptable to someone with little mathematical background. This statement could easily be misinterpreted as the myth above.
  11. Example: An experiment designed to study whether an abstract or concrete approach works better for teaching abstract concepts used computer-delivered instruction. This was done to avoid confounding variables such as the effect of the teacher. However, the study then lacked ecological validity for most real-life classroom use in instruction.
  12. Another example: Does an increase in tire pressure cause an increase in tread wear? What is the X? Tire pressure. What is the Y? Tread wear. State the Null (Ho) and Alternative (Ha) Hypothesis. The Null Hypothesis is "r=0" (there is no correlation). Null Hypothesis (Ho) = There is no relationship between pressure and tread wear. Alt Hypothesis (Ha) = There is a relationship between pressure and tread wear. Gather data, run the analysis, and determine the P-Value: run a correlation (r=.554, p-Value = .0228). Determine the Alpha Risk: the confidence interval was 95%, therefore the Alpha Risk is 5% (or 0.05). What does the P-Value tell us? (Reject or Accept the Null) Reject the Null, because the P-Value (.0228) is lower than the Alpha Risk (0.05).
  13. It's a central tenet of the field that data visualization can yield meaningful insight. While there’s a great deal of truth to this, it’s important to remember that visualization is a tool to aid analysis, not a substitute for analytical skill. It’s also not a substitute for statistics: your chart may highlight differences or correlations between data points, but to reliably draw conclusions from these insights often requires a more rigorous statistical approach. (The reverse can also be true - as Anscombe’s Quartet demonstrates, visualizations can reveal differences statistics hide.) Really understanding your data generally requires a combination of analytical skills, domain expertise, and effort. Don’t expect your visualizations to do this work for you, and make sure you manage the expectations of your clients and your CEO when creating or commissioning visualizations. Tools and strategies Unless you’re a data analyst, be very careful about promising real insight. Consider working with a statistician or a domain expert if you need to offer reliable conclusions Small design decisions - the color palette you use, or how you represent a particular variable - can skew the conclusions a visualization suggests. If you’re using visualizations for analysis, try a variety of options, rather than relying on a single view Stephen Few’s Now You See It offers a good practical introduction to using visualization for business analysis, including suggestions for developers on how to design analytically-valid visualization tools
  14. You don’t have to be a mathematics major to see what is wrong with an aggregate response of 243%.