Big Data Research Methods – Contemporary Analysis

Big Data & Research Methods
PRESENTED BY
Grant Stanley, CEO
Tadd Wood, Chief Data Scientist
Contemporary Analysis
1209 Harney Street, Suite 200
Omaha, NE 68102

INTRO
The process of research is as
important as the results.
• Correct research methods improve results,
• And allow others to collaborate and improve
your work.
Contemporary Analysis canworksmart.com

INTRO
We’ll explore the dangers of:
• Spurious Correlation
• Sampling Errors
• Model Selection
• Heteroscedasticity
• Overfitting
• Lack of Background
• Solutions instead of Theories
• Lack of the Scientific Method
• Correlation vs. Causation

INTRO
Big Data can’t just be about
collecting, processing & storing
more data.
It has to be put to use. We need to
conduct research, build models,
and develop reports.

THE DANGER OF FALSE POSITIVES
The car has little impact without
the highway or interstate.
If we take Big Data beyond
engineering, we are building
the equivalent of the highway
or interstate for the computer &
Internet.

SPURIOUS RELATIONSHIPS
A spurious relationship exists when
two or more events or variables
have no direct causal connection,
yet one is wrongly inferred,
due to either coincidence
or the presence of a third,
unseen factor.

Big Data Errors: Spurious Correlations
[Figure: number of spurious correlations (y-axis, up to 140,000) against number of variables (x-axis, 500 to 2000).]
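
A minimal simulation of why the count of spurious correlations climbs as variables are added. Everything here is an assumption for illustration: the variables are independent random noise, each has 10 observations, and 0.632 is used as roughly the two-sided 5% critical value of |r| for 10 observations. The helper names are hypothetical.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_variables, n_observations=10, threshold=0.632, seed=0):
    """Count pairs of unrelated random variables with |r| above threshold."""
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_observations)]
            for _ in range(n_variables)]
    pairs = list(itertools.combinations(range(n_variables), 2))
    hits = sum(1 for i, j in pairs
               if abs(pearson(data[i], data[j])) > threshold)
    return hits, len(pairs)


# The variables are pure noise, so every "hit" is spurious by construction.
for n in (10, 20, 40, 80):
    hits, pairs = count_spurious(n)
    print(f"{n:>3} variables: {pairs:>5} pairs, {hits:>4} with |r| > 0.632")
```

The number of pairs grows as n(n-1)/2, so even a fixed false-positive rate yields a rapidly growing pile of correlations that mean nothing.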

Maine’s divorce rate with US margarine consumption
Year    Divorce rate in Maine, divorces per 1000 people (US Census)    Consumption of margarine (US), per capita in pounds (USDA)
2000    5      8.2
2001    4.7    7
2002    4.6    6.5
2003    4.4    5.3
2004    4.3    5.2
2005    4.1    4
2006    4.2    4.6
2007    4.2    4.5
2008    4.2    4.2
2009    4.1    3.7
Correlation: 0.992558
[Figure: line chart of both series, 2000 to 2009.]
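
The correlation on this slide can be reproduced from the ten yearly values it lists. The sketch below uses the standard Pearson formula; the variable names are illustrative.

```python
import math

# Values as printed on the slide, 2000 through 2009.
divorce_rate_maine = [5.0, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1]
margarine_lbs_per_capita = [8.2, 7.0, 6.5, 5.3, 5.2, 4.0, 4.6, 4.5, 4.2, 3.7]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson(divorce_rate_maine, margarine_lbs_per_capita)
print(f"Pearson r = {r:.4f}")  # close to the 0.992558 shown on the slide
```

A near-perfect r from ten data points says nothing about whether margarine and divorce are connected. Both series simply trend downward over the same decade.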

SAMPLING
There are two reasons for
sampling a population:
• Collecting and processing all of the data
is too costly or impossible.
• To ensure that the results are representative
of the population.

SAMPLING
Sampling still matters in Big Data.
Data is not information. It is simply
a representation of information.
You have to think about what the
data you are using represents.

SAMPLING
Is smartphone data representative of the population?
Gender by Platform
Platform    Male    Female
iPhone      57%     43%
Android     73%     27%
Age by Platform
Platform    17 or younger    18 - 24    25 - 34    35 - 44    45+
iPhone      7%               17%        21%        21%        32%
Android     13%              12%        30%        21%        25%
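
One rough way to put a number on "is this representative?" is a chi-square goodness-of-fit statistic against the population shares. The sketch below reads the chart as 57% male for iPhone and 73% male for Android. The 50/50 population split and the sample size of 1000 per platform are assumptions made up for illustration, not figures from the slides.

```python
def chi_square_gof(observed_counts, expected_shares):
    """Chi-square goodness-of-fit statistic for observed counts."""
    total = sum(observed_counts.values())
    statistic = 0.0
    for group, observed in observed_counts.items():
        expected = total * expected_shares[group]
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Assumption: the general population is split evenly by gender.
population_shares = {"male": 0.5, "female": 0.5}

# Assumption: 1000 hypothetical users per platform, shares from the slide.
iphone_counts = {"male": 570, "female": 430}
android_counts = {"male": 730, "female": 270}

CRITICAL_5_PERCENT_DF1 = 3.841  # chi-square critical value, 1 degree of freedom

for name, counts in (("iPhone", iphone_counts), ("Android", android_counts)):
    stat = chi_square_gof(counts, population_shares)
    verdict = "not representative" if stat > CRITICAL_5_PERCENT_DF1 else "plausible"
    print(f"{name}: chi-square = {stat:.1f} ({verdict} on gender)")
```

Under these assumptions both platforms skew male well beyond what sampling noise would explain, Android more so than iPhone.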

MODEL SELECTION
OLS is not a catch-all.
You have to know your data.
Is it continuous, discrete, binary,
ordinal, or categorical? Is your
data symmetric or asymmetric? Are
there outliers?
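
As a rough illustration of "know your data," the lookup below maps the type of the outcome variable to a commonly used starting model. The mapping is a conventional rule of thumb, not a method from the slides, and the function name is hypothetical. Skew and outliers still need to be checked separately.

```python
# Common starting points by outcome type. These are conventions, not rules.
STARTING_MODELS = {
    "continuous": "ordinary least squares (OLS) regression",
    "discrete": "Poisson regression (for counts)",
    "binary": "logistic regression",
    "ordinal": "ordered logistic regression",
    "categorical": "multinomial logistic regression",
}


def suggest_model(outcome_type):
    """Return a conventional first model to try for the given outcome type."""
    try:
        return STARTING_MODELS[outcome_type.lower()]
    except KeyError:
        raise ValueError(f"unknown outcome type: {outcome_type!r}") from None


for kind in ("continuous", "binary", "ordinal"):
    print(f"{kind:>11}: start with {suggest_model(kind)}")
```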

HETEROSCEDASTICITY
Heteroscedasticity refers to
the circumstance in which the
variability of a variable is unequal
across the range of values of a
second variable that predicts it.
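
A small simulation can make the definition concrete. The data below are synthetic: the noise standard deviation is assumed to grow in proportion to x, a straight line is fitted by least squares, and the residual variance is compared between the lower and upper halves of x. This is only a rough split-sample check, not a formal test.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b * mean_x, b


def variance(values):
    """Sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(42)
xs = [rng.uniform(1, 10) for _ in range(400)]
# Assumption: noise spread grows with x, which is what makes it heteroscedastic.
ys = [2.0 + 3.0 * x + rng.gauss(0, 0.5 * x) for x in xs]

a, b = fit_line(xs, ys)
residuals_by_x = sorted((x, y - (a + b * x)) for x, y in zip(xs, ys))
half = len(residuals_by_x) // 2
lower = [res for _, res in residuals_by_x[:half]]
upper = [res for _, res in residuals_by_x[half:]]

ratio = variance(upper) / variance(lower)
print(f"fitted line: y = {a:.2f} + {b:.2f}x")
print(f"residual variance ratio (upper half / lower half): {ratio:.1f}")
```

A ratio near 1 would be consistent with equal variance. A ratio well above 1 means the prediction error widens as x grows, so a single error band for the whole range would mislead.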

HETEROSCEDASTICITY
Predicting equipment pricing based on machine hours
[Figure: market price (y-axis) against hours on machine (x-axis), with the fitted line $\hat{Y} = a + bx$ and points T1, T2, and T3 marked.]
[Figure: six panels. Top row: Unbiased & Homoscedastic; Biased & Homoscedastic; Biased & Homoscedastic. Bottom row: Unbiased & Heteroscedastic; Biased & Heteroscedastic; Biased & Heteroscedastic.]

OVERFITTING
Overfitting occurs when a
statistical model captures
more than just the underlying
relationships.
The model is fitted to as much
data as possible including random
errors, outliers, and noise.

OVERFITTING
An overfitted model nearly
perfectly matches the training
set, but does not perform well
with new data. While an overfitted
model looks great, it will have poor
predictive performance.

OVERFITTING
The mark of a good model isn’t
how well it performs on the data
used to build the model, but on
fresh data outside of the training
data set.
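
The point about fresh data can be sketched with a toy example. The data are synthetic and assumed for illustration: the label depends on whether x exceeds 0.5, with 20% of labels flipped as noise. A 1-nearest-neighbor "memorizer" stands in for an overfitted model, and a single fitted cut-off stands in for a simple one.

```python
import random


def make_data(n, rng, noise=0.2):
    """Label is 1 when x > 0.5, then flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data


def accuracy(predict, data):
    """Share of points whose predicted label matches the recorded label."""
    return sum(predict(x) == y for x, y in data) / len(data)


def fit_memorizer(train):
    """1-nearest-neighbor: copies the label of the closest training point."""
    return lambda x: min(train, key=lambda point: abs(point[0] - x))[1]


def fit_threshold(train):
    """Simple model: one cut-off chosen to maximize training accuracy."""
    cutoffs = [i / 20 for i in range(21)]
    best = max(cutoffs, key=lambda c: accuracy(lambda x: int(x > c), train))
    return lambda x: int(x > best)


rng = random.Random(1)
train, test = make_data(300, rng), make_data(300, rng)
memorizer, simple = fit_memorizer(train), fit_threshold(train)

for name, model in (("memorizer", memorizer), ("threshold", simple)):
    print(f"{name}: train = {accuracy(model, train):.2f}, "
          f"test = {accuracy(model, test):.2f}")
```

The memorizer scores perfectly on the data it was built from because it has stored the noise along with the signal. Only the test column shows how each model would behave on new cases.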

OVERFITTING
Overfitting Example: Training Classification Table
                               General Election (Predicted)
General Election (Observed)    Did not vote    Voted     Percentage Correct
Did not vote                   132423          3         99.99773%
Voted                          0               411099    100%
Overall Correct Percentage                               100%

OVERFITTING
Overfitting Example: Prediction Classification Table
                               General Election (Predicted)
General Election (Observed)    Did not vote    Voted     Percentage Correct
Did not vote                   35726           4068      90%
Voted                          45924           77199     63%

OVERFITTING
Overfitting Example: Variables
Variable  B (Coefficient)  Standard Error  Wald  Significance  95% C.I. for EXP(B): Lower  Upper
NumberOfPastRaces 63.840 106.208 .361 .548 .000 1.35E+118
Primary_03072000_Voter -66.218 106.264 .388 .533 .000 4.95E+61
General_1107200_Voter -61.971 106.219 .340 .560 .000 3.16E+63
Special_05082001_Voter -58.129 111.165 .273 .601 .000 2.39E+69
General_11062001_Voter -60.658 106.181 .326 .568 .000 1.09E+64
Primary_05072002_Voter -57.806 99.816 .335 .563 .000 7.23E+59
General_11052002_Voter -63.208 106.206 .354 .552 .000 8.94E+62
Special_05062003_Voter -66.393 106.249 .390 .532 .000 4.03E+61
General_11042003_Voter -64.056 106.209 .364 .546 .000 3.85E+62
Primary_03022004_Voter -63.836 106.204 .361 .548 .000 4.76E+62
Special_02052005_Voter -58.510 111.784 .274 .601 .000 5.50E+69
General_11082005_Voter -65.617 106.238 .381 .537 .000 8.56E+61
Special_02072006_Voter -56.952 305.188 .035 .852 .000 1.10E+235
Primary_05022006_Voter -64.696 106.220 .371 .542 .000 2.08E+62
General_11072006_Voter -64.074 106.210 .364 .546 .000 3.79E+62
Primary_05082007_Voter -65.976 106.233 .386 .535 .000 5.93E+61
Primary_09112007_Voter -57.949 15652.399 .000 .997 .000 —
General_11062007_Voter -67.465 106.231 .403 .525 .000 1.33E+61
General_12112007_Voter -75.855 106.274 .509 .475 .000 3.29E+57
Primary_03042008_Voter -62.602 106.214 .347 .556 .000 1.67E+63
General_11042008_Voter -64.100 106.220 .364 .546 .000 3.77E+62
Primary_05052009_Voter -57.094 98.053 .339 .560 .000 4.56E+58
Primary_09152009_Voter -54.792 7118.311 .000 .994 .000 —
General_11032009_Voter -55.176 98.071 .317 .574 .000 3.28E+59
Primary_05042010_Voter -65.564 106.234 .381 .537 .000 8.97E+61
Primary_07132010_Voter -56.331 45432.804 .000 .999 .000 —
Primary_09072010_Voter -57.607 3684.807 .000 .998 .000 —
General_11022010_Voter -63.431 106.214 .357 .550 .000 7.28E+62
Primary_05032011_Voter -57.848 136.939 .178 .673 .000 2.75E+91
General_11082011_Voter -54.865 98.255 .312 .577 .000 6.42E+59
Primary_03062012_Voter -55.419 95.847 .334 .563 .000 3.29E+57
Primary_05072013_Voter -58.652 110.873 .280 .597 .000 8.00E+68
General_11052013_Voter -62.617 106.196 .348 .555 .000 1.58E+63
Constant -115.093 212.413 .294 .588
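
The Wald and confidence-interval columns can be recomputed from B and the standard error, which shows why these coefficients are untrustworthy. The sketch assumes the usual logistic-regression output conventions, Wald = (B / SE)^2 and a 95% interval of exp(B ± 1.96 × SE). Small differences from the table come from rounding in the printed values.

```python
import math


def wald_and_ci(b, se, z=1.96):
    """Wald statistic and 95% confidence interval for EXP(B)."""
    wald = (b / se) ** 2
    lower = math.exp(b - z * se)
    upper = math.exp(b + z * se)
    return wald, lower, upper


# NumberOfPastRaces row from the overfitted model table.
wald, lower, upper = wald_and_ci(b=63.840, se=106.208)
print(f"Wald = {wald:.3f}")
print(f"95% C.I. for EXP(B): {lower:.3g} to {upper:.3g}")
```

A standard error larger than the coefficient itself gives a Wald statistic below 1 and an interval running from essentially zero to more than 10^118. The model cannot tell whether the variable raises or lowers the odds of voting.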

OVERFITTING
Simple Model Example: Variables
Variable  B (Coefficient)  Standard Error  Wald  Significance  95% C.I. for EXP(B): Lower  Upper
Age_life_bin_1 .344 .019 312.341 .000 1.358 1.466
Age_life_bin_2 .282 .017 266.954 .000 1.282 1.372
Age_life_bin_3 .180 .017 109.330 .000 1.158 1.239
Age_life_bin_4 .133 .018 53.146 .000 1.102 1.184
Age_life_bin_5 .055 .019 8.719 .003 1.019 1.096
Age_life_bin_7 -.342 .029 139.262 .000 .671 .752
Age_life_bin_8 -1.949 .029 4636.533 .000 .135 .151
Party_affiliation_D .523 .037 202.630 .000 1.570 1.814
Party_affiliation_R .692 .027 656.239 .000 1.895 2.106
NumberOfPastRaces .480 .002 63659.304 .000 1.611 1.623
Constant -1.332 .017 6041.871 .000

OVERFITTING
Simple Model Example: Training Classification Table
                               General Election (Predicted)
General Election (Observed)    Did not vote    Voted     Percentage Correct
Did not vote                   95397           37029     72%
Voted                          43439           367660    89%

OVERFITTING
Simple Model Example: Prediction Classification Table
                               General Election (Predicted)
General Election (Observed)    Did not vote    Voted     Percentage Correct
Did not vote                   72167           9483      88%
Voted                          15131           66136     81%
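
Overall accuracy makes the contrast between the two models easier to see. The counts below are copied from the four classification tables above. The layout (rows observed, columns predicted) is stated only on the first table and is assumed to hold for the other three.

```python
def overall_accuracy(table):
    """Share of correct predictions; rows are observed, columns predicted."""
    correct = table[0][0] + table[1][1]
    total = sum(sum(row) for row in table)
    return correct / total


# ((did not vote -> did not vote, did not vote -> voted),
#  (voted -> did not vote,        voted -> voted))
tables = {
    ("overfitted", "training"): ((132423, 3), (0, 411099)),
    ("overfitted", "prediction"): ((35726, 4068), (45924, 77199)),
    ("simple", "training"): ((95397, 37029), (43439, 367660)),
    ("simple", "prediction"): ((72167, 9483), (15131, 66136)),
}

for (model, dataset), table in tables.items():
    print(f"{model:>10} {dataset:>10}: {overall_accuracy(table):.1%}")
```

The overfitted model drops from nearly 100% on the training data to about 69% on the prediction data. The simple model holds at roughly 85% on both.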

OVERFITTING
Big Data Errors: Spurious Correlations
[Figure, repeated from earlier: number of spurious correlations (y-axis, up to 140,000) against number of variables (x-axis, 500 to 2000).]

OVERFITTING
Overstuffing Example: Variables
Variable  B (Coefficient)  Standard Error  Wald  Significance  95% C.I. for EXP(B): Lower  Upper
Age_life_bin_1 .331 .020 286.120 .000 1.339 1.446
Age_life_bin_2 .281 .017 263.325 .000 1.281 1.371
Age_life_bin_3 .184 .017 113.157 .000 1.162 1.243
Age_life_bin_4 .134 .018 53.857 .000 1.103 1.185
Age_life_bin_5 .058 .019 9.629 .002 1.022 1.099
Age_life_bin_7 -.348 .029 143.259 .000 .667 .748
Age_life_bin_8 -1.959 .029 4687.305 .000 .133 .149
Party_affiliation_D .513 .037 194.040 .000 1.554 1.796
Party_affiliation_R .684 .027 637.417 .000 1.879 2.089
NumberOfPastRaces .478 .002 62834.614 .000 1.608 1.620
Residential_Zip_3 -.364 .127 8.181 .004 .541 .892
Residential_Zip_7 .360 .063 32.902 .000 1.268 1.622
Residential_Zip_8 .428 .218 3.834 .050 1.000 2.354
Residential_Zip_16 -.125 .023 28.277 .000 .843 .924
Residential_Zip_17 .127 .058 4.797 .029 1.013 1.272
Residential_Zip_18 -.356 .044 64.141 .000 .642 .764
Residential_Zip_19 -.283 .026 117.878 .000 .716 .793
Residential_Zip_21 .115 .037 9.801 .002 1.044 1.206
Residential_Zip_22 .113 .026 19.024 .000 1.064 1.178
Residential_Zip_25 -.182 .024 59.045 .000 .796 .873
Residential_Zip_26 .074 .032 5.248 .022 1.011 1.148
Residential_Zip_27 -.132 .033 16.081 .000 .821 .935
Residential_Zip_28 -.077 .023 11.484 .001 .885 .968
Residential_Zip_29 -.160 .038 17.765 .000 .791 .918
Residential_Zip_30 -.191 .044 18.638 .000 .758 .901
Residential_Zip_33 -.059 .030 3.945 .047 .889 .999
Residential_Zip_35 .104 .026 15.662 .000 1.054 1.168
Residential_Zip_41 .140 .018 57.675 .000 1.109 1.193
Residential_Zip_42 .156 .039 16.010 .000 1.083 1.262
Residential_Zip_45 .138 .024 32.782 .000 1.095 1.204
Residential_Zip_46 -.065 .018 12.838 .000 .904 .971
Residential_Zip_48 .261 .022 136.998 .000 1.243 1.357
Residential_Zip_50 .164 .025 41.633 .000 1.121 1.239
Residential_Zip_51 .157 .031 26.169 .000 1.102 1.243
Residential_Zip_53 .114 .033 11.628 .001 1.050 1.197
Residential_Zip_54 .104 .029 13.215 .000 1.049 1.174
Residential_Zip_56 .116 .032 13.238 .000 1.055 1.196
Residential_Zip_59 .094 .032 8.647 .003 1.032 1.170
Local_School_District_6 -.375 .055 47.296 .000 .618 .765
Local_School_District_7 .078 .016 23.389 .000 1.047 1.115
Constant -1.332 .018 5513.792 .000

OVERFITTING
Overstuffing Example: Training Classification Table
                               General Election (Predicted)
General Election (Observed)    Did not vote    Voted     Percentage Correct
Did not vote                   93029           39397     70%
Voted                          36228           374871    91%

LACK OF BACKGROUND
The farther we are from the work,
the more likely we are to be tricked
by the data.
We owe it to the end user to
get out of the library, and try to
understand what we are modeling.

SOLUTIONS INSTEAD OF THEORIES
There is an element of data
science that should be frustrating,
confusing, & despair inducing.
It should make us stand back in
awe of the complexity of the world,
and not the simplicity to which we
can reduce it.

“The great thing about economics
is that we admit that we know
nothing about anything.”
- Thomas Piketty, author of “Capital in the Twenty-First Century”

As we learn more, we realize
there’s more to learn.
The hallmark of genius is the sharp
awareness of what is and what is
not possible. We become aware of
complexity, ambiguity and nuance.

CORRELATION & CAUSATION
The anthem of the Big Data
age is “correlation does not
imply causation.”

The problem is that this statement
is tautological. It is always correct,
and can never be wrong.

Don’t let people use it as a kill
switch to discussion.
• True causation is pretty rare. There are few
things where, if I do this, this will happen.
• Research should create discussions, not shut
them down. Models can’t explain everything.
There is always an “X” variable that captures
the unknown.

FAILING TO AUDIT
Primary reasons that we fail to
have our work peer-reviewed:
• Lack of funding to “repeat” work.
• We hide behind the complexity of our work.

FAILING TO AUDIT
Other tools:
• R Markdown: for creating webpages and
documents in R
• IPython notebooks: for creating websites and
documents interactively in Python
• Galaxy Project: for creating reproducible
workflows. (Favorable for people with less
scripting experience.)

TRAINING
We offer
training on:
• Data Visualization
• Managerial Statistics
• Predictive Modeling
You will be
introduced to:
• R
• SPSS
• Tableau
• MySQL
• Git

TRAINING
Training sessions last 3 days.
We will work through projects,
practice different approaches,
and discuss which approach is best for
different scenarios.

QUESTIONS?
Grant Stanley, CEO
Contemporary Analysis
1209 Harney Street, Suite 200
Omaha, NE 68102
grant@canworksmart.com
(402) 679-8398
Questions? Learn more.
