4. Analytics projects driven by nothing more than management
pronouncements that "we need metrics, get us some"
don’t ever end well.
• Analysis for analysis's sake is ridiculous.
• Ask the right questions to learn what data and metrics are
important and can make a difference.
• Know which quantitative measures don't matter and can be
disregarded.
Not Asking the Hard Questions
5. • Analytics projects can be driven by answering yesterday's questions, not
tomorrow's. It's an easy trap to fall into.
People are comfortable with what they know.
• A backwards-optimized view provides comfort... at a cost of NOT tracking metrics
that help drive business forward, numbers the organization doesn't yet know, or
what's unpredictable and uncomfortable but exactly where focus is needed.
Focusing on Yesterday's Metrics
The analytics maturity spectrum runs from descriptive to predictive to prescriptive:
• Historical Reporting (descriptive): "What happened"
• Real-Time Reporting (descriptive): "What are we doing right now"
• Modelling (predictive): "What if we did this or that"
• Predictive Analytics (predictive): "What can we expect"
• Prescriptive Analytics (prescriptive): "What is the optimal solution"
6. No easy answer to this. Try to understand how various systems count, resolve that
with your own needs, then accept the compromises you have no control over.
• Example:
Online media and marketing are immature. Systems that measure them are
immature. Don’t assume systems and their outputs are fully baked and results
can be taken at face value.
Reality check: they aren't, and they can't be.
A page view in one system isn't a page view in another. Definitions and methods
that define metrics such as page views, visits, unique visitors, and ad deliveries
are different in every system. Without a standard definition for an ad impression
you can be sure there aren't standards for all the peripheral metrics.
Counts can differ between products from the same vendor by as much as 30%.
Misunderstanding Metrics
7. Many well-intentioned project owners kill a project's value by becoming the
bottleneck to their own success. Well-meaning professionals, usually the systems'
power users, create processes and procedures that place themselves at the center
of everything regarding the systems: report creation, running ad hoc queries,
report distribution, and troubleshooting.
This often prevents others from taking real advantage of the systems' intelligence.
The only solution is to train others, then step back and let them at it.
Mistakes and misinterpretations will happen, but the benefits of widespread
adoption across the enterprise will always outweigh them.
Bottlenecking Value to the Organization
8. When it comes to analytics, there's nothing we love more than lots of pretty
pictures and dazzling graphics.
There is nothing wrong with pleasing visualization tools in data presentation.
Too many worry about pictures first, data and analysis second. Pictures cloud their
minds and vision, which is exactly why vendors put them there. Graphics are great
at grabbing attention, but not always great at putting data into action.
What looks good in reports should be a means to the end, not the end in itself.
Overvaluing Data Visualization
http://www.wheels.org/monkeywrench/?p=413
9. One reason analytics projects lose focus is they begin compromised.
Too many follow the conference table consensus approach.
The Conference Table Compromise
Every department gets a seat at the table.
Everyone contributes suggestions.
The final product is a compendium of all requests.
Although this method tends to create lots of good
feelings, it rarely results in the best set of metrics with
which to run the business. You will find it tracking silly
pet metrics that someone at the table thought were
important but are completely irrelevant.
To keep projects focused, decide which metrics are important and stay,
and which are distracting and go.
10. When budgets are tight and all are clamoring for better
analytics, it's understandable that not everyone reads or fully
comprehends the fine print associated with vendor "partnerships."
The nuances of data ownership may seem innocuous, but
there are consequences.
Many will use analytics services to build databases of
anonymous consumer profiles and their behavior to use in ad
targeting when those consumers visit other sites without
compensation to the publishers whose sites were harvested.
Be careful with this one…
Compromising Data Ownership
11. What good are analysis and insight if you can't act on them?
• Almost all analytics systems bill themselves as actionable.
Many claim they're real time. Learn what they really mean.
• Few systems can enable an enterprise to take immediate, tactical steps to
leverage data for value. For most, "actionable" means the system can generate
reports, such as user navigation patterns publishers can mull over in meetings,
then plan the changes needed to improve.
• While that may meet the definition of actionable, it doesn't necessarily mean
real-time or even right-time action. Bottom line: Understand, don't assume.
Confusing Insight With Action
12. Common Analytic Mistakes – Analytic Fundamentals
Topics in this section:
• Data Quality and Context
• Never Compare Apples to Oranges
• Don't Overstate (Alarm) Unnecessarily
• Calibrate Your Time Series
• Always Make Your Point Clearly (and Colors Matter)
• Statistical Significance
• Correlation vs. Causation
• Improper Use of Averages
• There is Such a Thing as Too Little Data!
Roadmap: Common Organizational Blunders → Analytic Fundamentals → Common Measurement Errors → Statistical Mistakes → Visualization Faults → Summary
13. While a poor font choice can ruin a meeting, a poor interpretation of statistics or
data can kill you (or someone else, for that matter). Proper use of data is a viciously
complicated topic. See the section on Statistical Mistakes for examples.
If your findings would lead to the wrong conclusions, not presenting the data at all
would be a better choice. Here is a simple rule for designers:
Data Quality and Context
Your project isn’t ready if you have spent more time choosing a font or
color scheme than choosing your data.
14. • Real data is ugly – but there’s no substitute for real data
• Provenance - Critical Questions to Ask
• Who collected it?
• Why was it collected?
• What is its context in a broader subject?
• What is its context in the field of research that created it?
• How was it collected?
• What are its limitations?
• Which mathematical transformations are appropriate?
• Which methods of display are appropriate?
Data Quality and Context
15. Cleaning and formatting a single data set is hard. What if you're:
• Building a live visualization that will run with many different data sets?
• Short on time to manually clean each data set?
There is no substitute for real data. Mocked-up sample data doesn't help you plan
for data discrepancies, null values, outliers, or real-world problems.
• Use several random samples of real data if you cannot access an entire data set.
• Invalid and missing data are a guarantee. If your data won't be cleaned before
being graphed, do not clean your sample data either.
• Real data may be so large it overwhelms the visualization generating it.
Be sure that if you use a sample of data you correctly scale the sample size up or
down before creating a final visualization.
There’s no substitute for real data
16. Four different segments are being compared,
but they are calibrated wrong. On the surface
this is hard to detect.
• The clear part is that there is very
little overlap between Search Traffic and
Referral Traffic. But Mobile is a platform.
Never Compare Apples to Oranges
Traffic (conversions in this case) is most likely in both Referrals and Search. It is unclear what to
make of that orange stacked bar. The graph is showing conversions already included in Search and
Referral (double counting) and because you have no idea what it is, it is impossible to know what
action to take.
Would you recommend a higher investment in Mobile based on this graph?
• The same for Social Media. It is likely that the Social Media conversions are already included in
Referrals and in Mobile. The green bar in the graph is useless.
Is a massive increase in investment in Social Media an imprecise conclusion?
17. What do you think is wrong with this graph?
It artificially inflates the importance of a
change in the metric that might not be all that
important. In this case the data is not
statistically significant, but there is no way we
can know that just from the data. Yet the scale
used for the y-axis implies that something
huge has happened.
Don't Overstate or Alarm Unnecessarily
Try to avoid being so dramatic in your presentation. It causes people to read things into
the performance that most likely are not there. Starting the y-axis at zero may not be
necessary every time, but dramatizing this 1.5-point difference is a waste of everyone's time.
Another important thing. Label your x axis. Please.
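The distortion above can be quantified. This is a minimal sketch with hypothetical numbers: a 1.5-point change drawn on an axis that starts at 34 looks like a 4x jump even though the true change is about 4 percent.

```python
def apparent_ratio(before, after, axis_start):
    """Ratio of the drawn bar/line heights when the y-axis starts above zero."""
    return (after - axis_start) / (before - axis_start)

before, after = 34.5, 36.0                         # a 1.5-point difference
true_ratio = after / before                        # ~1.04: barely any change
drawn_ratio = apparent_ratio(before, after, axis_start=34.0)

print(round(true_ratio, 2), drawn_ratio)           # 1.04 vs. 4.0
```

The further the axis start creeps toward the data, the larger the drawn ratio grows, which is exactly the effect the slide warns against.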
18. This chart shows nine months of
performance… by day! The "trend" is
completely useless.
• Looking at individual days over such a
long time period can hide insights and
important changes. It can be near
impossible to find anything of value.
• Try switching to the exact same time
period, but by week. Now some kind of
trend appears, especially towards the end
of the graph (even this simple insight
was hidden before).
Calibrate Your Time Series
See: http://square.github.io/crossfilter for another good example.
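Re-binning a daily series to weekly is a one-liner. This sketch uses 63 days of hypothetical noisy data with a mild upward drift; the weekly totals smooth out the day-to-day noise that hides the trend.

```python
import random

random.seed(7)
# 63 hypothetical daily observations (9 weeks): noise plus a mild upward drift
daily = [100 + random.randint(-30, 30) + day // 3 for day in range(63)]

# Re-bin to weekly totals: 9 points instead of 63
weekly = [sum(daily[i:i + 7]) for i in range(0, len(daily), 7)]

print(len(daily), len(weekly))   # 63 daily points become 9 weekly points
```

No information is thrown away (the weekly totals sum to the same grand total); the aggregation just matches the granularity to the length of the period being examined.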
19. What do you think the two colors in this graph
represent? And how can only 29 percent of the
organizations have more than one person?
Problem one is that "red" denotes "good" in
this case and "green" represents "bad."
Here's something very, very simple you should
understand: Red is bad and Green is good.
Always. Period. People instinctively think this
way. So show "good" in green and "bad" in red.
It will communicate your point clearly and
faster.
Always Make Your Point Clearly
Problem two, much worse, is that it is harder than it should be to understand this data. The first stacked
bar reads: 71 percent of the organizations have more than one person. So what is the 29 percent?
If the question is how many people are directly responsible for improving conversion rates and 71
percent have more than one person, does the remaining 29 percent have exactly one person, or no
one at all? Unclear (and frustrating).
20. We all make this mistake. We create a table like the one below. We create a "heat map" in the table
highlighting where conversions rates appear good. We declare Organic to be the winner, Direct is
close behind. Then the other two. And we recommend doing more SEO.
Statistical Significance
None of this data may be significant – the fact that the numbers seem so different might not mean
anything. It is entirely possible that it is completely immaterial that Direct is 34% and Email is 10%,
or that Referral is 7%.
We should evaluate the raw numbers to see if the percentage is meaningful at all.
• The data in the Direct row could represent conversions out of 10 visits while all the Referral data
could represent conversions from 1,000,000 visits. Compute statistical significance to
identify which comparison sets we can be confident are different, and in which cases we simply
don't have enough confidence.
Do you see the problem?
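One way to check this is a two-proportion z-test, sketched below with hypothetical counts. A 17% rate on 12 visits looks far better than a 7% rate on 1,000 visits, yet the difference is not significant; the same 10-point gap on two large samples is.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Normal-approximation p-value via the error function
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 2 of 12 visits convert (17%) vs. 70 of 1,000 (7%): a big-looking gap,
# but the tiny sample means it may easily be noise.
p_small = two_proportion_p_value(2, 12, 70, 1000)

# The same 10-point gap on two samples of 1,000 IS significant.
p_large = two_proportion_p_value(170, 1000, 70, 1000)

print(round(p_small, 2), p_large < 0.01)
```

With the small sample the p-value is around 0.2, so the "winner" in the heat map may be nothing more than chance.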
21. Confusing correlation and causation is
one of the most overlooked problems.
In the Cheese and Employment Status
percentage graph, it is clear that
retired Redditors prefer cheddar
cheese and freelance Redditors prefer
brie.
• This does not mean that once an
individual retires he or she will
develop a sudden affinity for
cheddar.
• Nor does becoming a freelancer
cause one to suddenly prefer brie.
Correlation vs. Causation
22. Improper use of averages
Averages can be a great way to get a quick
overview of some key areas, but use it wisely.
For example, average order value is a useful
metric. If we were to look at only the average
order value by month it’s enlightening because it
shows an increase over time, which indicates a
move in the right direction.
However, it’s more useful to look at average order
value by department by month, because this
shows us where the increase in average order
value is coming from; the women’s shoes
department.
If we looked only at average order value by
month, we might focus marketing across all
departments, which is not the most efficient
allocation of resources.
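The split described above can be sketched in a few lines; the order values and department names here are hypothetical.

```python
# Hypothetical (department, order value) pairs
orders = [
    ("womens_shoes", 120), ("womens_shoes", 140), ("womens_shoes", 160),
    ("menswear", 60), ("menswear", 62), ("menswear", 58),
]

# Overall average order value: looks fine, explains nothing
overall = sum(value for _, value in orders) / len(orders)

# Average order value by department: shows where the lift really comes from
by_dept = {}
for dept, value in orders:
    by_dept.setdefault(dept, []).append(value)
dept_avg = {dept: sum(vals) / len(vals) for dept, vals in by_dept.items()}

print(overall, dept_avg)   # 100.0 overall, but 140 vs. 60 by department
```

The overall figure of 100 hides the fact that one department averages more than twice the other, which is exactly the insight needed to allocate marketing spend.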
23. Another "simple" mistake. We get excited about having data, especially if we are new at
this. We get our tables and charts together and we start reporting data and having a
lot of fun.
This is very dangerous. You see, there is such a thing as too little data.
You don't want to wait until you've collected millions of rows of data to make
any decision, but the table here is nearly useless. So can you do anything with
data like this?
• Do other useful things. Look at your search keyword reports.
Do you see a few people coming on keywords you SEOed the site for?
Look at the keywords your site is showing up in through Google search results.
Are they the ones you were expecting?
• Even better… spend time with competitive intelligence tools like Insights for Search, Ad Planner, and
others to seek clues from your competitors and your industry ecosystem. At this stage you can learn
a lot more from their data than your own…
There is Such a Thing as Too Little Data!
24. Common Analytic Mistakes – Common Measurement Errors
Topics in this section:
• Drawing Conclusions from Incomplete Information
• Assuming a Lab is a Reasonable Substitute
• Forgetting the Real Experts are Your Customers
• Gaming the System
• Sampling Problems
• Intelligence is not binary
• Keeping it Simple - Too Simple
• Beware the long tail
• Simpson's paradox
Roadmap: Common Organizational Blunders → Analytic Fundamentals → Common Measurement Errors → Statistical Mistakes → Visualization Faults → Summary
25. Data may not tell the full story. For example your analytics show visitors spend a relatively high
amount of time on a particular page.
• Is that page great – or is it problematic? Maybe visitors simply love the content.
• Or, maybe they are getting stuck due to a problem with the page.
Your call center statistics show average call time has decreased.
• Is a decrease in average call time good news or bad news?
• When calls end more quickly, costs go down, but have you actually satisfied callers or left them
disgruntled, dissatisfied, and on their way to your competition?
Never draw conclusions from any analysis that does not tell the whole story.
Use bar charts to visualize relative sizes. If you see one bar that is twice as long as the other bar, you
expect the underlying quantity to be twice as big. This relative sizing fails if you do not start your bar
chart axis at 0.
Rule of thumb: if you want to illustrate a small change, use a line chart whenever your
y-axis starts anywhere other than 0.
Conclusions from Incomplete Information
26. The chart on the left compares Redditors
who like dogs vs. Redditors who prefer cats.
With the y-axis starting at 9,000, it looks like
dog lovers outnumber cat lovers by three
times. However, the graph on the right is a
more accurate representation of the data.
There are not quite twice as many Redditors
who prefer dogs to cats.
A limitation of bar charts that start at 0 is
they do not show small percent
differences. If you need to change the start
of your axis in order to highlight small
changes, switch to a line chart.
Conclusions from Incomplete Information
27. Usability groups and panels are certainly useful and have their place; the problem
is the sample sizes are small and the testing takes place in a controlled
environment. You bring people into a lab and tell them what you want them to do.
• Does that small group of eight participants represent your broader audience?
• Does measuring and observing them when they do what we tell them to do
provide the same results as real users who do what they want to do?
Observation is helpful, but applying science to the voice of customer and
measuring the customer experience through the lens of customer satisfaction is a
better way to achieve successful results.
Assuming a Lab is a Reasonable Substitute
28. Experts, like usability groups, have their place.
But who knows customer intentions, needs, and attitudes better than actual
customers?
When you really want to know, go to the source.
It takes more time and work, but the results are much more valuable.
Experts and consultants certainly have their place, but their advice and
recommendations must be driven by customer needs as much if not more than by
organizational needs.
The Real Experts are Your Customers
29. Many feedback and measurement systems create bias and inaccuracy. How? Ask the wrong people,
bias their decisions, or give them incentives for participation. Measuring correctly means creating as
little measurement bias as possible while generating as little measurement noise as possible.
• Avoid incenting people to complete surveys, especially when there is no need.
Never ask for personal data; some will decline to participate if only for privacy concerns.
• Never measure with the intent to prove a point. We may, unintentionally, create customer
measurements to prove our opinions are correct or support our theories, but to what end?
• Customer measurements must measure from the customers’ perspective and through the
customers’ eyes, not through a lens of preconceived views.
Gaming the System
30. Sampling works well when sampling is done correctly. Sample selection and
sample size are critical to creating a:
• credible,
• reliable,
• accurate,
• precise, and predictive methodology.
Sampling is a science in and of itself. You need samples representative of the larger
population that are randomly selected.
See the section on Statistical Mistakes for more on this.
Sampling Problems
31. Taking a binary approach to measuring satisfaction – in effect, asking whether a
customer is or is not satisfied – leads to simplistic and inaccurate measurement.
Intelligence is not binary.
• People are not just smart or stupid.
• People are not just tall or short.
• Customers are not just satisfied or dissatisfied.
“Yes” and “no” do not accurately explain or define levels or nuances of customer
satisfaction. The degree of satisfaction with the experience is what determines the
customer’s level of loyalty and positive word of mouth.
Claiming 97% of your customers are satisfied certainly makes for a catchy marketing
slogan but is far from a metric you can use to manage your business forward.
If you cannot trust and use the results, why do the research?
Intelligence is not binary
32. The “keep it simple” approach does not work for measuring customer
satisfaction (or for measuring anything regarding customer attitudes
and behaviors.)
• Customers are complex; they make decisions based on a number of
criteria, most rational, some less so. Asking three or four questions
does not create a usable metric or help to develop actionable
intelligence.
Measuring customer satisfaction by itself will not provide the best
view. Use a complete satisfaction measurement system – one that
includes future behaviors and predictive metrics.
Many will take this simple approach and make major strategic decisions
based on a limited and therefore flawed approach to measurement.
Great managers do not make decisions based on hunches or limited
data; “directionally accurate” is simply not good enough.
Keeping it Simple - Too Simple
http://experiencematters.wordpress.com/tag/lowes/page/2/
33. In statistics, a long tail of some
distributions of numbers is the portion
of the distribution having a large
number of occurrences far from the
"head" or central part of the
distribution.
A probability distribution is said to
have a long tail if a larger share of the
population rests within its tail than
would under a normal distribution.
Beware the long tail
A long-tail distribution will arise when many values are unusually far from the
mean, which increases the magnitude of the skewness of the distribution.
Top 10,000 Popular Keywords
34. The term long tail has gained popularity in describing the retailing strategy of
selling a large number of unique items with relatively small quantities sold of each
in addition to selling fewer popular items in large quantities.
The distribution and inventory costs of businesses successfully applying this
strategy allow them to realize significant profit out of selling small volumes of hard-
to-find items to many customers instead of only selling large volumes of a reduced
number of popular items.
The total sales of this large number of "non-hit items" is called "the long tail".
• See also:
• Black swan theory
• Kolmogorov's zero–one law, also known as a tail event
• Mass customization
• Micropublishing
• Swarm intelligence
Source: http://en.wikipedia.org/wiki/Long_tail
Beware the long tail (cont’d)
35. Simpson's paradox, or the Yule–Simpson effect, is a paradox where a trend that
appears in different groups of data disappears when these groups are combined,
and the reverse trend appears for the aggregate data.
Encountered in social-science and medical-science statistics, this effect is
confounding when frequency data are unduly given causal interpretations.
Using professional baseball as an example it is possible for one player to hit for a
higher batting average than another player during a given year, and to do so again
during the next year, but to have a lower batting average when the two years are
combined.
Simpson's paradox
Player 1995 Avg. 1996 Avg. Combined Avg.
Derek Jeter 12/48 0.250 183/582 0.314 195/630 0.310
David Justice 104/411 0.253 45/140 0.321 149/551 0.270
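The table can be verified directly: Justice beats Jeter in each individual year, yet Jeter wins when the two years are pooled, because the players' at-bats are distributed very differently across the years.

```python
# Hits and at-bats from the table above: {year: (hits, at_bats)}
jeter   = {"1995": (12, 48),   "1996": (183, 582)}
justice = {"1995": (104, 411), "1996": (45, 140)}

def avg(hits, at_bats):
    return hits / at_bats

def combined(player):
    hits = sum(h for h, _ in player.values())
    at_bats = sum(ab for _, ab in player.values())
    return hits / at_bats

# Justice wins each year taken separately...
assert avg(*justice["1995"]) > avg(*jeter["1995"])   # .253 > .250
assert avg(*justice["1996"]) > avg(*jeter["1996"])   # .321 > .314

# ...but Jeter wins once the two years are combined.
assert combined(jeter) > combined(justice)           # .310 > .270
```

The paradox lives entirely in the unequal denominators: Jeter's strong 1996 carries 582 at-bats of weight, while Justice's strong 1996 carries only 140.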
36. This phenomenon occurs where there are large differences in the number of at-
bats between the years. The same situation applies to calculating batting averages
for the first half of the baseball season, and during the second half, and then
combining all of the data for the season's batting average.
Simpson's paradox (cont’d)
Player 1995 Avg. 1996 Avg. Combined Avg.
Derek Jeter 12/48 0.250 183/582 0.314 195/630 0.310
David Justice 104/411 0.253 45/140 0.321 149/551 0.270
If weighting is used, this phenomenon disappears. The table below has been
normalized to the largest at-bat totals so the same things are compared.
Player 1995 Avg. 1996 Avg. Combined Avg.
Derek Jeter 12/48*411 0.250 183/582*582 0.314 285.75/993 0.288
David Justice 104/411*411 0.253 45/140*582 0.321 291/993 0.293
37. Common Analytic Mistakes – Statistical Mistakes
Topics in this section:
• Expecting too much certainty
• Misunderstanding probability
• Mistakes in thinking about causation
• Problematical choice of measure
• Errors in sampling
• Over-interpretation
• Mistakes involving limitations of hypothesis tests/confidence intervals
• Using an inappropriate model or research design
Roadmap: Common Organizational Blunders → Analytic Fundamentals → Common Measurement Errors → Statistical Mistakes → Visualization Faults → Summary
38. One consequence of not taking uncertainty seriously enough is that we often write
results in terms that misleadingly suggest certainty.
For example, some might conclude from a study that a hypothesis is true or has
been proved, when it would be more correct to say that the evidence supports the
hypothesis or is consistent with the hypothesis.
Another mistake is misinterpreting results of statistical analyses in a deterministic
rather than probabilistic (also called stochastic) manner.
Expecting too much certainty
" ... as far as the propositions of mathematics refer to reality, they are not certain; and as far as they are
certain, they do not refer to reality."
Albert Einstein , Geometry and Experience,
Lecture before the Prussian Academy of Sciences, January 27, 1921
Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
39. There are four perspectives on probability that are commonly used:
• Classical,
• Empirical (or Frequentist),
• Subjective, and
• Axiomatic.
Using one perspective when another is intended can lead to serious errors.
Common misunderstanding: If there are only two possible outcomes, and you don't
know which is true, the probability of each of these outcomes is 1/2.
In fact, probabilities in such "binary outcome" situations could be anything from 0
to 1. For example, if the outcomes of interest are "has cancer" and "does not have
cancer," the probabilities of having cancer are (in most cases) much less than 1/2.
The empirical (frequentist) perspective allows us to estimate such probabilities.
Misunderstanding probability
Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
40. Confusing correlation with causation.
Example:
Students' shoe sizes and scores on a standard reading exam. They are correlated, but saying that larger
shoe size causes higher reading scores is as absurd as saying that high reading scores cause larger shoe
size. There is a clear lurking variable, namely, age. As the child gets older, both their shoe size and reading
ability increase. Do not Interpret causality deterministically when the evidence is statistical.
Mistakes in thinking about causation
Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
41. In most research, one or more outcome variables are measured. Statistical analysis
is done on the outcome measures, and conclusions are drawn from the statistical
analysis. The analysis itself involves a choice of measure, called a summary statistic.
Misleading results occur when inadequate attention is paid to the choice of either
outcome variables or summary statistics.
• Example: What is a good outcome variable for deciding whether cancer
treatment in a country has been improving?
A first thought might be "number of deaths in the country from cancer in
one year." But number of deaths might increase simply because the
population is increasing. Or it might go down if cancer incidence is
decreasing. "Percent of the population that dies of cancer in one year"
would take care of the first problem, but not the second.
In this case rate is a better measure than a count.
Problematical choice of measure
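The count-versus-rate pitfall can be shown numerically. The populations and death counts below are hypothetical: the count of deaths rises, yet the rate falls, so "treatment is getting worse" would be exactly the wrong conclusion.

```python
# Hypothetical figures: population grows faster than cancer deaths
year_1 = {"population": 1_000_000, "cancer_deaths": 2_000}
year_2 = {"population": 1_500_000, "cancer_deaths": 2_400}

# The raw count went UP...
count_rose = year_2["cancer_deaths"] > year_1["cancer_deaths"]

# ...but the rate (deaths per capita) went DOWN
rate_1 = year_1["cancer_deaths"] / year_1["population"]   # 0.0020
rate_2 = year_2["cancer_deaths"] / year_2["population"]   # 0.0016

print(count_rose, rate_1, rate_2)
```

The rate controls for population growth, which is why it is the better outcome variable here; it still does not control for changing cancer incidence, as the slide notes.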
42. Believing a "random sample will be representative of the population".
In fact, this statement is false -- a random sample might, by chance, turn out to be
anything but representative. For example, it is possible that if you toss a coin ten
times, all the tosses will come up heads.
A slightly better explanation, which is partly true: "Random sampling eliminates bias
by giving all individuals an equal chance to be chosen."
But here is the most important reason why random sampling matters:
the mathematical theorems which justify most statistical procedures apply only to
random samples.
Errors in sampling
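The coin-toss example is easy to make concrete: the all-heads sample is rare but perfectly possible, and its probability can be checked exhaustively.

```python
# Probability that 10 fair coin tosses all come up heads
p_all_heads = 0.5 ** 10          # about 0.001: rare, but entirely possible

# Exhaustive check: of the 2**10 equally likely sequences,
# exactly one is all heads.
outcomes = 2 ** 10
all_heads_sequences = 1

print(p_all_heads, all_heads_sequences / outcomes)
```

So roughly one random sample in a thousand is as unrepresentative as a sample can be; randomness guarantees the procedure is unbiased, not that any particular sample is representative.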
43. • Extrapolation to a larger population than the one studied
Example: running a marketing experiment with
undergraduates enrolled in marketing classes and
drawing a conclusion about people in general.
• Extrapolation beyond the range of data
Similar to extrapolating to a larger population, but
concerns the values of the variables rather than the
individuals.
• Ignoring Ecological Validity
Involves the setting (i.e., the "ecology") rather than the
individuals studied, or it may involve extrapolation to a
population having characteristics very different from the
population that is relevant for application.
Over-interpretation
44. • Using overly strong language in stating results
Statistical procedures do not prove results. They only give us information on
whether or not the data support or are consistent with a particular conclusion.
There is always uncertainty involved. Acknowledge this uncertainty.
• Considering statistical significance but not practical significance
Example: Suppose that a well-designed, well-carried out, and carefully analyzed
study shows that there is a statistically significant difference in life span between
people engaging in a certain exercise regime at least five hours a week for at least
two years and those not following the exercise regime.
If the difference in average life span between the two groups is three days...
Over-interpretation (cont’d)
Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
45. Type I Error
• Rejecting the null hypothesis when it is in fact true is known as a Type I error.
Many people decide, before doing a hypothesis test, on a maximum p-value for
which they will reject the null hypothesis. This value is often denoted α (alpha)
and is the significance level.
Type II Error
• Not rejecting the null hypothesis when in fact the alternate hypothesis is true is
called a Type II error.
An analogy helpful in understanding the two types of error is to consider a
defendant in a trial. The null hypothesis is "defendant is not guilty;" the alternate is
"defendant is guilty."
• Type I error would correspond to convicting an innocent person;
• Type II error would correspond to setting a guilty person free.
Test/Hypothesis confidence intervals
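The significance level α has a direct operational meaning: when the null hypothesis is true, the test commits a Type I error about α of the time. This sketch simulates that, using the fact that p-values are (approximately) uniform under a true null.

```python
import random

random.seed(0)
alpha = 0.05                      # the chosen significance level

# Under a true null hypothesis, p-values are (approximately) uniform on [0, 1].
trials = 100_000
type_i_errors = sum(1 for _ in range(trials) if random.random() < alpha)

rate = type_i_errors / trials
print(round(rate, 3))             # close to 0.05
```

In the trial analogy, α is the fraction of innocent defendants the procedure would convict; lowering α protects the innocent (fewer Type I errors) at the cost of freeing more of the guilty (more Type II errors).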
46. Two drugs are known to be equally effective for a certain condition.
• Drug 1 has been used for decades with no reports of the side effects
• Drug 2 may cause serious side-effects
The null hypothesis is "the incidence of the side effect in both drugs is the same",
The alternate hypothesis is "the incidence of the side effect in Drug 2 is greater
than that in Drug 1."
Falsely rejecting the null hypothesis when it is in fact true (a Type I error) would have
no great consequences for the consumer. But a Type II error (failing to reject
the null hypothesis when in fact the alternate is true, and so deciding that
Drug 2 is no more harmful than Drug 1 when it is in fact more harmful) could have
serious consequences.
Setting a large significance level is, in this case, appropriate.
Test/Hypothesis confidence intervals
47. Each inference technique (hypothesis test or confidence interval) you select has
model assumptions. Different techniques have very different model assumptions.
The validity of the technique depends on whether the model assumptions fit the
context of the data being analyzed.
• Common Mistakes Involving Model Assumptions
• Using a two-sample test comparing means when cases are paired
• Comparisons of treatments applied to people, animals, etc.
(Intent to Treat; Comparisons involving Drop-outs)
• Fixed vs Random Factors
• Analyzing Data without Regard to How the Data was Collected
• Dividing a Continuous Variable into Categories ("Chopped Data“, Cohorts)
• Pseudo-replication
• Mistakes in Regression
• Dealing with Missing Data
Inappropriate model or research design
Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
48. Common Analytic Mistakes – Visualization Faults
Topics in this section:
• Bad Math, Bad Geography
• Misrepresenting Data
• Serving the Presentation Without the Data
• Pie Charts – Why?
• Using Non-Solid Lines in a Line Chart
• Bar Charts with an Erroneous Scale; Arranging Data Non-Intuitively
• Obscuring Your Data; Making the Reader Do More Work
• Misrepresenting Data Using Different Colors on a Heat Map
• Making it Hard to Compare Data; Showing Too Much Detail
• Not Explaining the Interactivity
• Keep It Simple
Roadmap: Common Organizational Blunders → Analytic Fundamentals → Common Measurement Errors → Statistical Mistakes → Visualization Faults → Summary
49. Visualization is a tool to aid analysis, not a substitute for analytical skill.
It is not a substitute for statistics:
Really understanding your data generally requires a combination of
• analytical skills,
• domain expertise, and
• effort.
Strategy:
• Be careful about promising real insight.
Work with a statistician or a domain expert if you need to offer reliable conclusions.
• Small design decisions - the color palette you use, or how you represent a particular variable
can skew the conclusions a visualization suggests.
• If using visualizations for analysis, try a variety of options, rather than relying on a single view
Visualization is not analysis
50. The infographic seems informative enough.
The self-perceptions of "Baby Boomers" are an
interesting data set to visualize. The problem,
however, is that this graphic represents 243% of
responses.
This is not necessarily indicative of faulty data;
it is a poor representation of the data.
Display the phrases individually, with size
determined by percentage, so the phrases can be
compared to each other more easily. Removing the
percentages from a single body also clarifies that
the percentages aren't mutually exclusive.
Bad Math
51. Credibility? Gone in an instant.
Bad Geography - Just Plain Dumb
Presented without comment… speechless, how does this happen?
52. And don't do something like this… ever.
Make sure all representations are accurate. For
example, bubbles should be scaled according to
area, not diameter.
Misrepresenting Data
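The area-versus-diameter rule is one line of arithmetic; a minimal Python sketch (the helper name is illustrative):

```python
import math

def bubble_radius(value, scale=1.0):
    """Radius such that bubble AREA, not diameter, is proportional to value."""
    # area = pi * r**2  =>  r = sqrt(value / pi)
    return math.sqrt(value / math.pi) * scale

# A value 4x larger should yield 4x the area (2x the radius),
# not 4x the radius (which would be 16x the area).
r1 = bubble_radius(10)
r4 = bubble_radius(40)
area_ratio = (r4 / r1) ** 2
```

Scaling the radius directly by the value is the mistake this slide warns about: it exaggerates large values quadratically.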
53. The graph is clearly trying to make the point that
Obamacare enrollment numbers for March 27 are well
below the March 31 goal. Take a closer look.
How come the first bar is less than half the size of the
second one? This is a common problem.
Without a scale (there isn't one in this
example), the comparison among data points can be
very deceiving.
Misrepresenting Data (Why?)
• If you’re working with data points that are really far or really close together, be sure to
pay attention to the scale, so that data is accurately portrayed.
• If there is no scale, compare different slices, tiles, bubbles, or bars against each other.
Do the numbers match what is drawn?
54. Which comes first: the presentation or the data? In an effort to make a more
“interesting” or “cool” design, don’t allow the presentation layer of a visualization
to become more important than the data itself.
In this example, a considerable amount of work went into the design, and parts of it
are informative, like the summary counters at the top left. However, without a scale
or axis, the time series on the bottom right is meaningless, and the 3D chart in the
center is even more opaque. Tooltips (pop-ups) would help, if they were there.
Serving the Presentation Without the Data
55. They are useful on rare occasions.
But most of the time they actually do not communicate anything of value.
Pie Charts – Why?
Beyond the obvious point
made by the line graph in
the background (we are
storing more data now than
we used to), this graph
seems to tell us “…don’t
know if any of this matters,
so we’re going to print
everything.”
Sometimes it's just better to
use a table… really.
56. A common mistake with pie charts is to divide them into
percentages that simply do not add up.
The basic rule of a pie chart is that the sum of all percentages
included should be 100%.
• In this example, the percentages fall short of 100%, and the
segment sizes do not match their values. This happens due to
rounding errors, or when non-mutually-exclusive categories
are plotted on the same chart. Unless the included categories are
mutually exclusive, their percentages cannot be plotted
together in the same pie.
Pie Charts - Error in Chart Percentages
Here is another gem…
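A pre-plot sanity check catches both failure modes before a chart is drawn. A minimal sketch (the helper name, tolerance, and data are illustrative assumptions, not from the deck):

```python
def validate_pie_segments(segments, tolerance=0.5):
    """Reject pie-chart data whose percentages do not sum to ~100%.

    segments: mapping of label -> percentage. A large shortfall or
    overshoot usually means rounding errors or non-mutually-exclusive
    categories.
    """
    total = sum(segments.values())
    if abs(total - 100.0) > tolerance:
        raise ValueError(
            f"segments sum to {total:.1f}%, not 100%; "
            "consider a bar chart for non-mutually-exclusive categories"
        )
    return True

# Mutually exclusive shares that sum to 100 pass:
validate_pie_segments({"Desktop": 58.0, "Mobile": 35.0, "Tablet": 7.0})

# Overlapping survey answers (respondents could pick several) do not:
try:
    validate_pie_segments({"Active": 61, "Fun-loving": 52, "Young at heart": 75})
except ValueError as err:
    problem = str(err)
```

The check is deliberately dumb: it cannot tell rounding error from overlapping categories, but either way the data should not go into a single pie.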
57. Always ask yourself when considering a potential design:
Why is this better than a bar chart? If you’re visualizing a single quantitative
measure over a single categorical dimension, there is rarely a better option.
Pie Charts – Why?
• Line charts are preferred when using time-based data
• Scatter plots are best for exploring correlations between two linear measures.
• Bubble charts support more data points with a wider range of values
• Tree maps support hierarchical categories
If you have to use a pie chart:
• Don’t include more than five segments
• Place the largest section at 12 o’clock, going clockwise.
• Place the second largest section at 12 o’clock, going counterclockwise.
• The remaining sections can be placed below, continuing counterclockwise.
58. Comparison is a valuable way to
showcase differences, but it's useless
if your viewer can’t easily compare.
Make sure all data is presented in a
way that allows the reader to
compare data side-by-side.
Making it Hard to Compare Data
Is this clear to you?
59. • Using Non-Solid Lines in a Line Chart
Dashed and dotted lines can be distracting. Instead, use a solid line and colors that are easy to
distinguish from each other.
• Making the Reader Do More Work
Make it as easy as possible to understand data by aiding the reader with graphic elements. For
example, add a trend line to a scatter plot to highlight trends.
• Obscuring Your Data
Make sure no data is lost or obstructed by design. For example, use transparency in a standard
area chart so the viewer can see all of the data.
• Using Different Colors on a Heat Map
Some colors stand out more than others, giving unnecessary weight to that data. Instead, use a
single color with varying shades or a spectrum between two analogous colors to show intensity.
Other Common Mistakes
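The single-hue heat map advice amounts to interpolating from white toward one base color, so shade tracks intensity and nothing else. A minimal sketch (the helper name and hue are illustrative):

```python
def single_hue_shade(value, vmin, vmax, hue_rgb=(0, 70, 140)):
    """Map a value onto shades of a single hue: light for low values,
    dark for high values. Avoids the multi-color heat map problem,
    where some hues grab attention regardless of the underlying number."""
    t = (value - vmin) / (vmax - vmin)  # normalize to [0, 1]
    # Linearly interpolate each RGB channel from white toward the base hue.
    return tuple(round(255 + (c - 255) * t) for c in hue_rgb)

# One heat map row rendered as RGB shades of a single blue:
row = [single_hue_shade(v, 0, 10) for v in (0, 2, 5, 8, 10)]
```

Because only lightness varies, the viewer's eye ranks cells by intensity alone, which is exactly what a heat map is supposed to communicate.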
60. • Arranging the Data Non-Intuitively
Content should be presented in a logical and intuitive way to guide
readers through the data. Order categories alphabetically, sequentially,
or by value.
• Not Explaining Interactivity
Enabling users to use and interact with a visualization makes it more
engaging. If you don’t tell them how to use that interactivity you risk
limiting them to the initial view. How you label the interactivity is just
as important as doing it in the first place.
• Informing the user at the top of the visualization is good practice.
• Call out the interaction on or near the tools that use it.
• Use a common design convention, such as underlining words to
indicate a hyperlink.
See http://www.visualizing.org/full-screen/39118
Other Common Mistakes (cont’d)
61. Hard to resist… with a dataset
full of usable categorical and
numerical fields, the temptation
is to show everything at once
and let users drill down to the
finest level of detail.
The visualization is superfluous;
the user could simply look at the
dataset itself if they wanted to
see the finest level of detail.
Show enough detail to tell a story,
but not so much that the story is
convoluted and hidden.
Showing Too Much Detail
62. We all want specific, relevant answers.
The closer you can get to providing
exactly what is wanted, the less effort
we expend looking for answers.
Irrelevant data makes finding the
relevant information more difficult;
irrelevant data is just noise.
• Showing several closely related graphs
can be a nice compromise between
showing too much in one graph and
not showing enough overall.
• A few clean, clear graphs are better
than a single complicated view.
• Try to represent your data in the
simplest way possible to avoid this.
Showing Too Much Detail (cont’d)
63. Data visualization is about simplicity. Embellished or artistic representations may look
engaging but usually distract from the actual data. Look at the example below.
Keep it Simple
• Why is the first image blue and the rest are red?
• The number in the second image is against the
paintbrush and not against the head while in all
other columns it is against the head. Is this
meaningful?
• We might just appreciate different figures and
think about the real-life characters represented by
them and move on without understanding the
data. Really…
• It’s important that visual representation of data is
free of the pitfalls that make data representation
ambiguous and irrelevant.
65. Thank You…
Mr. Parnitzke is a hands-on technology executive, trusted partner, advisor, software publisher, and widely
recognized database management and enterprise architecture thought leader. Over his career he has served
in executive, technical, publisher (commercial software), and practice management roles across a wide range
of industries. Now a highly sought-after technology management advisor and hands-on practitioner, he counts
many of the Fortune 500 as well as emerging businesses among his customers, and he is known for taking
complex challenges and solving them across all levels of the customer's organization, delivering distinctive
value and lasting relationships.
Contact:
j.parnitzke@comcast.net
Blogs:
Applied Enterprise Architecture (pragmaticarchitect.wordpress.com)
Essential Analytics (essentialanalytics.wordpress.com)
The Corner Office (cornerofficeguy.wordpress.com)
Data management professional (jparnitzke.wordpress.com)
The program office (theprogramoffice.wordpress.com)
Data Science Page (http://www.datasciencecentral.com/profile/JamesParnitzke)
Editor's Notes
Retail Example:
According to recent research by IDC Retail Insights, the omnichannel shopper is the gold-standard consumer. An omnichannel shopper will:
spend, on average, 15 to 30 percent more than someone using just one channel, and
outspend simple multichannel shoppers by over 20 percent.
What’s more, multichannel shoppers exhibit strong loyalty and are more likely to influence others to endorse a retailer.
So, the hard questions are really in generating this kind of lift, and traditional simple focused marketing is not enough…
One reason analytics projects lose focus is they begin compromised. When it's time to decide what metrics a company should track, too many follow the conference table consensus approach. They worry more about consensus than about value and accuracy.
When budgets are tight and all are clamoring for better analytics, it's understandable that not everyone reads or fully comprehends the fine print associated with some vendors' "partnerships." In these models, the vendor may reserve ownership rights of the data, data aggregates, and the metadata derived from providing analytics services.
It's not uncommon for companies to use subsidized analytics services to create aggregated research products they sell back to the marketplace (without compensation to publishers whose sites were harvested for the data).
Cleaning and formatting a single data set is hard enough, but what if you're building a live visualization that will run with many different data sets? Do you have time to manually clean each data set? Your first instinct may be to grab some demo data and use that to build your visualization; your library may even come with standard sample data.
How come only 29 percent of the organizations have more than one person! That is bad. Wait. That did not make sense. Back to read the question. Then the graph. Then the legend. Then back to the question. Then the legend…
Measuring customer satisfaction by itself will not provide the best view forward. Using a complete satisfaction measurement system – including future behaviors and predictive metrics such as likelihood to return to the site or likelihood to purchase again – generates leading indicators that complement and illuminate lagging indicators.
Simpson's Paradox disappears when causal relations are understood and accounted for in the analysis.
If you agree that increasing age (for school children) causes increasing foot size, and therefore increasing shoe size, then you expect a correlation between age and shoe size. Correlation is symmetric, so shoe size and age are correlated. But it would be absurd to say that shoe size causes age.
It is true that sampling randomly will eliminate systematic bias. This is the best plausible explanation that is acceptable to someone with little mathematical background. This statement could easily be misinterpreted as the myth above.
Example: An experiment designed to study whether an abstract or concrete approach works better for teaching abstract concepts used computer-delivered instruction. This was done to avoid confounding variables such as the effect of the teacher. However, the study then lacked ecological validity for most real-life classroom use in instruction.
Another example:
Does an increase in tire pressure cause an increase in tread wear? What is the X? Tire pressure. What is the Y? Tread wear.
State the null (Ho) and alternative (Ha) hypotheses. The null hypothesis is "r = 0" (there is no correlation). Null hypothesis (Ho): there is no relationship between pressure and tread wear. Alternative hypothesis (Ha): there is a relationship between pressure and tread wear.
Gather data, run the analysis, and determine the p-value. Run a correlation (r = .554, p-value = .0228).
Determine the alpha risk. The confidence interval was 95%, therefore the alpha risk is 5% (or 0.05).
What does the p-value tell us? (Reject or accept the null.) Reject the null, because the p-value (.0228) is lower than the alpha risk (0.05).
It's a central tenet of the field that data visualization can yield meaningful insight. While there’s a great deal of truth to this, it’s important to remember that visualization is a tool to aid analysis, not a substitute for analytical skill. It’s also not a substitute for statistics: your chart may highlight differences or correlations between data points, but to reliably draw conclusions from these insights often requires a more rigorous statistical approach. (The reverse can also be true - as Anscombe’s Quartet demonstrates, visualizations can reveal differences statistics hide.) Really understanding your data generally requires a combination of analytical skills, domain expertise, and effort. Don’t expect your visualizations to do this work for you, and make sure you manage the expectations of your clients and your CEO when creating or commissioning visualizations.
Tools and strategies
Unless you’re a data analyst, be very careful about promising real insight. Consider working with a statistician or a domain expert if you need to offer reliable conclusions
Small design decisions - the color palette you use, or how you represent a particular variable - can skew the conclusions a visualization suggests. If you’re using visualizations for analysis, try a variety of options, rather than relying on a single view
Stephen Few’s Now You See It offers a good practical introduction to using visualization for business analysis, including suggestions for developers on how to design analytically-valid visualization tools
You don’t have to be a mathematics major to see what is wrong with an aggregate response of 243%.