Picture this: you've collected, cleaned, and analyzed your data (no small feat), but as you sit staring at your computer screen you think, "What does this actually mean?" If you've had a moment like this, you're not alone! One of the most difficult things about working with data is interpreting the output, because sound interpretation depends on selecting the appropriate analysis method for the type of data you have.
This lecture will:
- Teach you how to identify different data types
- Explain the right way to select data analysis methods
- Show you the core data interpretation skills you need to succeed
You can watch this lecture here: https://youtu.be/SirK0SSBeZg
Interpreting Data Like a Pro - Dawn of the Data Age Lecture Series
1. Dawn of the Data Age Lecture Series
Interpreting Data Like a Pro
2. Hi. I’m Luciano Pesci…
Co-Founder & CEO, EMPERITAS
● A Services as a Subscription team of economists and data scientists delivering bi-weekly Customer
Lifetime Value intelligence so our clients can beat their competitors for the most profitable customers.
Founder & Director, Utah Community Research Group, Univ. of Utah
● Teach microeconomics, statistics, applied research & data analytics, & American economic history.
● Teach data science for Westminster and developed their 3-class MBA emphasis in data science.
3. Today’s Lecture Outline
● Teach you how to identify data types & context.
● Explain the right way to select analysis methods.
● Show you the core data interpretation skills.
5. Defining Data Differently
● There are many ways to define data, each of
which requires a different approach when you use it:
○ Origin - How it was created.
○ Totality - If it’s a sample or a census.
○ Scope - Whether it’s been captured over time.
○ Measurement - How it was quantified.
6. What’s The Origin Story?
● Understanding the origin of your data is key
to grasping its context:
○ Experiments produce data with strong causal
patterns, but it's costly to collect & analyze.
○ Survey data is easy to get, but it shows intent or
attitude, not necessarily actual outcomes.
○ Observational data is mostly captured by machines
and shows actual outcomes, but it’s very rigid.
7. What’s the Totality?
● If you have data on every possible unit in a
population of interest, then it’s Census data.
● In most cases you’ll only have a Sample which
can be used to infer patterns about the larger
(unknowable) population.
8. Scoping Time?
● If your data contains different variables, all
measured at the same time, it’s Cross-Sectional.
○ Most data that you encounter will be cross-sectional.
● If your data contains multiple measurements of
the same variable over time, it’s Time Series.
9. Data Measurement?
● All data fits into 4 basic types
based on how it was measured:
○ Nominal & Ordinal = CATEGORICAL.
○ Interval & Ratio = CONTINUOUS.
● Identifying data types is a
critical skill to develop.
○ Analysis selection &
interpretation depend on it.
11. Categorical or Continuous?
● The biggest difference when selecting analysis is
based on whether the data is categorical
(nominal, ordinal) or continuous (interval, ratio).
○ So much of what you can or can’t do is determined by
the data’s measurement type.
○ Time Series vs Cross-Sectional is another important
distinction that radically changes your approach.
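To make the distinction concrete, here's a minimal Python sketch (the DataFrame and its column names are hypothetical, invented for illustration) showing how a quick dtype check can flag which columns are candidates for categorical vs. continuous methods:

```python
import pandas as pd

# Hypothetical festival survey data; column names are assumptions for illustration.
df = pd.DataFrame({
    "nps_group": ["Promoter", "Passive", "Promoter", "Detractor"],   # nominal
    "years_attended_band": ["0-9", "10-19", "0-9", "20+"],           # ordinal
    "likelihood_to_recommend": [10, 8, 9, 4],                        # interval (0-10)
    "tickets_per_visit": [2, 4, 6, 2],                               # ratio
})

# A quick heuristic: numeric dtypes are candidates for continuous methods,
# object/category dtypes for categorical methods. Always confirm by eye --
# numeric codes (e.g., 1 = Yes, 2 = No) are still categorical.
for col in df.columns:
    kind = "continuous?" if pd.api.types.is_numeric_dtype(df[col]) else "categorical?"
    print(f"{col}: {df[col].dtype} -> {kind}")
```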
12. Looking for Differences
● Tests of difference, like comparing medians
or means, are a good way to find unique
subgroups within the data.
○ The variable being compared should be continuous,
though categorical variables can be used to define
the subgroups you're testing (see the sketch below).
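As a hedged illustration, here's what a simple test of difference might look like in Python with scipy; the data and group names are simulated stand-ins, not the lecture's actual dataset:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical example: compare spend (continuous) between two customer
# subgroups defined by a categorical variable (e.g., NPS group).
promoters = rng.normal(loc=50, scale=15, size=200)    # simulated spend, group 1
detractors = rng.normal(loc=42, scale=15, size=120)   # simulated spend, group 2

# Welch's t-test compares means without assuming equal variances.
t_stat, p_mean = stats.ttest_ind(promoters, detractors, equal_var=False)

# The Mann-Whitney U test compares distributions via ranks (a median-style
# test) and is safer when the data are skewed or contain outliers.
u_stat, p_rank = stats.mannwhitneyu(promoters, detractors)

print(f"Welch t-test p={p_mean:.4f}, Mann-Whitney p={p_rank:.4f}")
```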
13. Looking for Similarities
● Measures of association (like correlation)
are a good way to find patterns that move
together in the data.
● While correlation doesn’t equal causation,
theory can help you understand when
correlations are likely to be real or not.*
*Source: www.tylervigen.com/spurious-correlations
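A minimal sketch of measuring association in Python, on simulated data: Pearson's r assumes two continuous variables, while Spearman's rank-based version also works for ordinal data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical question: do customers who attend more years also spend more?
years = rng.integers(1, 30, size=300)
spend = 20 * years + rng.normal(0, 100, size=300)  # simulated positive relationship

# Pearson's r measures linear association; Spearman's rho uses ranks.
r, p_r = stats.pearsonr(years, spend)
rho, p_rho = stats.spearmanr(years, spend)
print(f"Pearson r={r:.2f} (p={p_r:.3g}), Spearman rho={rho:.2f} (p={p_rho:.3g})")
```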
15. Looking At It Both Ways
● You should look at tables & visualizations.
○ Each tells a unique part of the data’s story.
● 3 very specific things to find (when possible
based on your data type):
○ Shape of the distribution
○ Center of the distribution
○ Spread of the distribution
16. Understanding Shape
● The shape of any ordinal, interval or ratio
data is important to its interpretation.
○ Can show multimodality and/or outliers.
● This is much easier to see through a
visualization than from a table of numbers.
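For illustration, a short matplotlib sketch with simulated bimodal data shows why shape jumps out of a histogram in a way it never does from a table of numbers:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)

# Simulated bimodal data (e.g., two distinct customer segments).
values = np.concatenate([rng.normal(2, 0.5, 500), rng.normal(6, 1.0, 500)])

plt.hist(values, bins=40, edgecolor="black")
plt.title("Histogram reveals multimodality at a glance")
plt.xlabel("Value")
plt.ylabel("Count")
plt.show()
```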
17. Understanding Center
● The central value of interval or ratio data tells
you what value to expect when making predictions
(though it isn't always the most frequent value).
○ You should calculate both the median & mean
(see the sketch below).
■ If they differ, that's a sign you have skew
in the data, possibly from outliers.
18. Understanding Spread
● The spread of interval and ratio data tells an
important story about how precise your predictions can be.
○ Calculate the Interquartile Range (IQR).
■ Subtract the 1st quartile from the 3rd quartile;
the result spans the middle 50% of the data.
○ You can also calculate the variance and standard
deviation (see the sketch below).
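A minimal numpy sketch of these spread calculations on simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(100, 15, size=1000)  # simulated continuous measurements

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1          # spans the middle 50% of the data
var = x.var(ddof=1)    # sample variance
sd = x.std(ddof=1)     # sample standard deviation

print(f"IQR={iqr:.1f} (Q1={q1:.1f}, Q3={q3:.1f}), var={var:.1f}, sd={sd:.1f}")
```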
19. 5-Number Summary
● Between a visualization and the 5-Number
Summary you get most of the information
you need to interpret what’s going on with
your variable.
○ This will show you the min value, quartiles,
median/mean, and max value.
○ The only thing that’s missing is the number of
observations (n-count).
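In Python, pandas' describe() produces essentially this summary on simulated data, and it even includes the n-count the slide notes is otherwise missing:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
s = pd.Series(rng.lognormal(3, 1, size=500), name="clv")

# describe() returns the 5-number summary (min, 25%, 50%, 75%, max)
# plus the mean, standard deviation, and the observation count.
print(s.describe())
```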
21. The Example Data’s Origin
● The example data comes from a survey of festival goers (aka customers)
and was linked to observational data about their
lifetime ticket sales.
● It’s a cross-sectional sample (n=3,834) since we
don’t have every festival customer’s feedback and
the data was captured at a single moment in time.
22. Inspecting Your Data File
● Before you start summarizing and visualizing
your data, open the raw file and look around.
● Make sure you can identify what the rows
are, and what each column measures.
○ When in doubt, ask for a data map or data dictionary.
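A minimal pandas sketch of this first-look inspection (the file name is hypothetical; substitute your own raw export):

```python
import pandas as pd

# Hypothetical file name for illustration.
df = pd.read_csv("festival_survey.csv")

print(df.shape)         # how many rows (units) and columns (variables)
print(df.head())        # eyeball the first few rows
print(df.dtypes)        # what each column was read in as
print(df.isna().sum())  # missing values per column -- worth knowing early
```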
23. Ordinal Data: Years Attended
68% of festival customers have been attending for less than 10 years.
1 in 10 have been attending for more than 20 years.
24. Interval Data: Likelihood to Recommend
5-Number Summary (plus mean):
Min            0
1st Quartile   9
Median        10
Mean           9.2
3rd Quartile  10
Max           10
~80% of festival customers are likely to recommend (9's & 10's).
25. Making It Ordinal: Net Promoter Groups*
● It’s always possible to transform
data from continuous to categorical,
but not the other way around.
○ Likelihood to recommend can be
transformed into categorical
groups to create a simpler metric:
■ Net Promoter Score (NPS).
*Source: https://hbr.org/2003/12/the-one-number-you-need-to-grow
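A short sketch of this continuous-to-categorical transformation in pandas, using the standard NPS cut points (0-6 Detractor, 7-8 Passive, 9-10 Promoter) on made-up responses:

```python
import pandas as pd

# Hypothetical likelihood-to-recommend responses on the standard 0-10 scale.
ltr = pd.Series([10, 9, 10, 8, 7, 10, 9, 6, 10, 3])

# Bin the continuous scale into the three standard NPS groups.
groups = pd.cut(ltr, bins=[-1, 6, 8, 10],
                labels=["Detractor", "Passive", "Promoter"])

# NPS = % Promoters minus % Detractors.
shares = groups.value_counts(normalize=True)
nps = 100 * (shares["Promoter"] - shares["Detractor"])
print(groups.value_counts())
print(f"NPS = {nps:.0f}")
```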
26. From Interval to Ordinal Data: NPS
● The Festival’s NPS is: 75%
● We could use these groups for
testing differences, like in their
Customer Lifetime Value.
○ This is often why you want to create
categorical data from continuous data.
27. Ratio Data: # Tickets Purchased Per Visit
5-Number Summary (plus mean):
Min            0
1st Quartile   2
Median         4
Mean           5.8
3rd Quartile   6
Max          400
The presence of outliers hides an important pattern
in this data. To see it, we will drop outliers who
purchase more than 13 tickets per visit.*
*You should ALWAYS note when you drop outliers from analysis.
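A simulated sketch of this trim-and-recompute step, using the same 13-ticket cutoff the slide describes:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)

# Simulated tickets-per-visit with a few extreme bulk purchasers appended.
tickets = pd.Series(np.append(rng.integers(1, 9, size=500), [60, 150, 400]))

print(tickets.describe())         # summary WITH outliers

trimmed = tickets[tickets <= 13]  # same cutoff the slide uses
print(trimmed.describe())         # summary with outliers dropped
print(f"Dropped {len(tickets) - len(trimmed)} observations")  # always report this
```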
28. Ratio Data: # Tickets Purchased Per Visit
5-Number Summary (plus mean):
Min            0
1st Quartile   2
Median         4
Mean           4.4
3rd Quartile   6
Max           12
With outliers removed, the mean falls to ~4 tickets, and we
can see multimodality at even-numbered purchase counts.
People don't like to go to the festival alone.
29. Ratio Data: Customer Lifetime Value
5-Number Summary (plus mean):
Min               $0
1st Quartile    $124
Median          $336
Mean          $1,510
3rd Quartile  $1,125
Max         $479,878

As with tickets purchased, the presence of outliers is
obscuring any detail in the visualization.
The maximum value of $479,878 is suspiciously high
(though it turns out to be an accurate value, despite
being 55 standard deviations above the mean*).
*Values more than 3 Standard Deviations from the mean are considered outliers.
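A minimal sketch of that 3-standard-deviation flag on simulated CLV-like data:

```python
import numpy as np

rng = np.random.default_rng(7)
clv = rng.lognormal(5.5, 1.2, size=2000)  # simulated right-skewed CLV

# Flag values more than 3 standard deviations from the mean.
z = (clv - clv.mean()) / clv.std(ddof=1)
outliers = clv[np.abs(z) > 3]
print(f"{len(outliers)} of {len(clv)} values flagged as outliers")
# Caveat: the mean and SD are themselves inflated by the outliers,
# which is one reason IQR-based rules are often preferred for skewed data.
```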
30. Ratio Data: Customer Lifetime Value
5-Number Summary (plus mean):
Min              $0
1st Quartile   $112
Median         $249
Mean           $486
3rd Quartile   $642
Max          $2,624
Dropping outliers above $2,624 causes the mean
to fall from $1,510 to $486.
This shows EXTREME leverage in the data.
31. Ratio Data: Customer Lifetime Value
Like most data, the festival’s Customer Lifetime Value
exhibits the Pareto Principle (aka the 80/20 rule).
This means 80% of all CLV comes from 20% of customers.
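A quick simulated check of that 80/20 pattern (the Pareto shape parameter ~1.16 is the textbook value that yields roughly an 80/20 split; the data here is generated, not the festival's):

```python
import numpy as np

rng = np.random.default_rng(8)
clv = rng.pareto(a=1.16, size=5000)  # a ~1.16 produces roughly an 80/20 split

# Share of total CLV held by the top 20% of customers.
top20 = np.sort(clv)[::-1][: len(clv) // 5]
share = top20.sum() / clv.sum()
print(f"Top 20% of customers hold {share:.0%} of total CLV")
```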
32. What We Learned About CLV
● Most festival customers have been attending for less than 10 years, but
there's a small group that's been coming for more than 20.
● Festival customers are unlikely to come alone; they'll typically buy 4 tickets
per visit, and virtually all are likely to recommend the festival.
● The average CLV is $486 and 80% of all CLV
comes from just 20% of festival customers.
33. Next Step: Analytics & Predictive Modeling
● The next step for this data would be
multivariate analytics.
○ Tests of difference & measures of association.
○ Present discounted value of future ticket sales.
● After that, we could use all of the data
to build a predictive model for CLV.
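As a preview of the present-discounted-value idea, a minimal sketch with hypothetical cash flows and discount rate (illustrative numbers only, not the festival's):

```python
# Present discounted value of expected future ticket sales for one customer,
# assuming a flat annual cash flow and a constant discount rate.
def present_value(cash_flows, rate):
    """PDV = sum of CF_t / (1 + r)^t for t = 1..T."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows, start=1))

# E.g., $100 of expected ticket sales per year for 5 years at a 10% rate.
print(f"PDV = ${present_value([100] * 5, 0.10):,.2f}")
```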
34. JOIN US FOR THE NEXT LECTURE
Turning Analytics into Actionable Insights, Thursday, October 19th, 2017
emperitas.com/lecture