Big Data 
& Research Methods 
PRESENTED BY 
Grant Stanley, CEO 
Tadd Wood, Chief Data Scientist 
Contemporary Analysis 
1209 Harney Street, Suite 200 
Omaha, NE 68102
INTRO 
The process of research is as
important as the results.
• Correct research methods improve results.
• They also allow others to collaborate on and
improve your work.
Contemporary Analysis canworksmart.com
INTRO 
We’ll explore the dangers of:
• Spurious Correlation
• Sampling Errors
• Model Selection
• Heteroscedasticity
• Overfitting
• Lack of Background
• Solutions instead of Theories
• Lack of the Scientific Method
• Correlation vs. Causation
INTRO 
Big Data can’t just be about 
collecting, processing & storing 
more data. 
It has to be put to use. We need to 
conduct research, build models, 
and develop reports. 
THE DANGER OF FALSE POSITIVES 
The car has little impact without 
the highway or interstate. 
If we take Big Data beyond 
engineering, we are building 
the equivalent of the highway 
or interstate for the computer & 
Internet. 
SPURIOUS RELATIONSHIPS 
A spurious relationship exists when
two or more events or variables
have no direct causal connection,
yet a connection may be wrongly
inferred because of either coincidence
or the presence of a third,
unseen factor.
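The "third, unseen factor" case can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not an analysis from the deck: the variables `x`, `y`, and `z`, the noise levels, and the seed are all assumptions chosen for illustration. Here `z` drives both `x` and `y`, which never influence each other, and the partial correlation (the standard formula for controlling for one variable) shows the apparent link disappearing once `z` is accounted for.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after controlling for the third variable z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Synthetic data (assumed): z is the unseen factor behind both x and y.
rng = random.Random(1)
z = [rng.gauss(0, 1) for _ in range(2000)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

print("corr(x, y):        ", round(pearson(x, y), 3))
print("corr(x, y) given z:", round(partial_corr(x, y, z), 3))
```

The raw correlation is strong while the partial correlation sits near zero, which is the signature of a relationship produced by a shared driver rather than by a direct causal connection.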
SPURIOUS RELATIONSHIPS 
Figure: Big Data Errors: Spurious Correlations. Series: correlations and spurious correlations (axis values shown: 20,000, 80,000, 140,000) against number of variables (500, 1000, 1500, 2000).
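The growth of coincidental correlations with the number of variables can be reproduced on pure noise. The sketch below is an illustration under stated assumptions, not the calculation behind the chart: the function name `count_spurious`, the sample size of 20 observations, the 0.5 cut-off, and the seed are all hypothetical choices. Every column is independent random noise, so any pair that clears the cut-off is spurious by construction.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(n_vars, n_obs=20, threshold=0.5, seed=0):
    """Count pairs of independent noise columns with |r| above the threshold."""
    rng = random.Random(seed)
    cols = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    hits = 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            if abs(pearson(cols[i], cols[j])) > threshold:
                hits += 1
    return hits


for n_vars in (10, 20, 40, 80):
    n_pairs = n_vars * (n_vars - 1) // 2  # pairs grow roughly with n_vars squared
    print(n_vars, "variables,", n_pairs, "pairs,",
          count_spurious(n_vars), "spurious correlations")
```

Because the number of pairs grows with the square of the number of variables, the count of chance correlations climbs much faster than the variable count itself.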
SPURIOUS RELATIONSHIPS 
Maine’s divorce rate with US margarine consumption
Figure: line chart of both series, 2000 to 2009 (divorce rate in Maine and per capita consumption of margarine in the US).

Year                                           2000  2001  2002  2003  2004  2005  2006  2007  2008  2009
Divorce rate in Maine,
divorces per 1000 people (US Census)           5     4.7   4.6   4.4   4.3   4.1   4.2   4.2   4.2   4.1
Consumption of margarine (US),
per capita in pounds (USDA)                    8.2   7     6.5   5.3   5.2   4     4.6   4.5   4.2   3.7

Correlation 0.992558
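The correlation on this slide can be recomputed from the two series it lists. The sketch below uses only the values printed on the slide; the helper name `pearson` is an assumption, and the calculation is the ordinary Pearson coefficient (covariance divided by the product of the standard deviations).

```python
from statistics import fmean, pstdev

# Values as printed on the slide, 2000 through 2009.
divorce_rate = [5, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1]
margarine_lbs = [8.2, 7, 6.5, 5.3, 5.2, 4, 4.6, 4.5, 4.2, 3.7]


def pearson(x, y):
    """Pearson correlation: population covariance over the product of std devs."""
    mx, my = fmean(x), fmean(y)
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))


r = pearson(divorce_rate, margarine_lbs)
print("correlation:", round(r, 4))
```

The result agrees with the slide's figure to within rounding. With only ten yearly points and two series that both trend downward, a coefficient this high says nothing about margarine causing divorce.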
SAMPLING 
There are two reasons for
sampling a population:
• Collecting and processing data on the whole
population is too costly or impossible.
• To ensure that the results are representative
of the population.
SAMPLING 
Sampling still matters in Big Data. 
Data is not information. It is simply 
a representation of information. 
You have to think about what the 
data you are using represents. 
SAMPLING 
Is smartphone data representative of the population?
Figure: Gender by Platform and Age by Platform, iPhone vs. Android (stacked bars, 0% to 100%).
Gender by Platform: iPhone 57% male, 43% female; Android 73% male, 27% female.
Age by Platform (two values per age group, one per platform): 17 or younger 7% and 13%; 18 - 24 12% and 17%; 25 - 34 21% and 30%; 35 - 44 21% and 21%; 45+ 32% and 25%.
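A short calculation shows how a skewed sample distorts an estimate. This is a sketch under stated assumptions: only the 73% / 27% gender split comes from the slide, while the outcome rates by gender and the 50/50 population mix are hypothetical numbers invented for illustration. The correction shown is post-stratification weighting, a standard survey technique that the deck itself does not mention.

```python
def weighted_rate(rate_by_group, share_by_group):
    """Overall rate implied by group-level rates and group shares."""
    return sum(rate_by_group[g] * share_by_group[g] for g in rate_by_group)


# Hypothetical outcome rates (assumed, not from the deck).
rate_by_gender = {"male": 0.30, "female": 0.50}
# Assumed population mix.
population_share = {"male": 0.50, "female": 0.50}
# Gender mix of the 73% male platform shown on the slide.
sample_share = {"male": 0.73, "female": 0.27}

true_rate = weighted_rate(rate_by_gender, population_share)
naive_rate = weighted_rate(rate_by_gender, sample_share)

# Post-stratification: weight each group by population share / sample share.
weights = {g: population_share[g] / sample_share[g] for g in sample_share}
reweighted_rate = sum(
    rate_by_gender[g] * sample_share[g] * weights[g] for g in sample_share
)

print("population rate:     ", round(true_rate, 3))
print("naive sample rate:   ", round(naive_rate, 3))
print("reweighted estimate: ", round(reweighted_rate, 3))
```

Reweighting only repairs the imbalance on variables you know about and have measured. If the platform's users differ from the population in unmeasured ways, the bias remains.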
MODEL SELECTION 
Ordinary least squares (OLS) is not a catch-all.
You have to know your data.
Is it continuous, discrete, binary,
ordinal, or categorical? Is your
data symmetric or asymmetric? Are
there outliers?
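One way to see why the data type matters is to fit OLS to a binary outcome. The sketch below is a toy illustration on assumed data (ten points, outcome 0 for the low half and 1 for the high half); `fit_ols` and `predict` are hypothetical helper names. The straight line fitted by least squares produces "probabilities" below 0 and above 1.

```python
def fit_ols(x, y):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope


# Toy binary outcome (assumed data): 0 for x below 5, 1 otherwise.
x = list(range(10))
y = [0] * 5 + [1] * 5

intercept, slope = fit_ols(x, y)


def predict(value):
    """Prediction of the fitted line at a given x."""
    return intercept + slope * value


for value in (0, 4, 5, 9):
    print("x =", value, "predicted:", round(predict(value), 3))
```

The predictions at the extremes fall outside the range a probability can take, which is one reason models built for binary outcomes, such as logistic regression, are usually preferred for data like this.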
HETEROSCEDASTICITY 
Heteroscedasticity refers to 
the circumstance in which the 
variability of a variable is unequal 
across the range of values of a 
second variable that predicts it. 
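A rough check for this condition is to fit a line and compare the spread of the residuals at low and high values of the predictor. The sketch below is an assumption-laden illustration, not the deck's method: the machine hours, prices, noise levels, and the helper names `fit_ols` and `variance_ratio` are all hypothetical, and the split-in-half comparison is a crude diagnostic rather than a formal test.

```python
import random


def fit_ols(x, y):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope


def variance_ratio(x, y):
    """Residual variance in the upper half of x divided by the lower half."""
    intercept, slope = fit_ols(x, y)
    pairs = sorted(zip(x, y))  # order observations by the predictor
    resid = [yi - (intercept + slope * xi) for xi, yi in pairs]
    half = len(resid) // 2
    low = sum(r * r for r in resid[:half]) / half
    high = sum(r * r for r in resid[half:]) / (len(resid) - half)
    return high / low


# Synthetic data (assumed): price falls with hours on the machine.
rng = random.Random(42)
hours = [rng.uniform(100, 10000) for _ in range(400)]
# Noise that grows with hours: heteroscedastic.
price_hetero = [90000 - 5 * h + rng.gauss(0, 0.5 * h) for h in hours]
# Constant noise: homoscedastic.
price_homo = [90000 - 5 * h + rng.gauss(0, 2500) for h in hours]

print("ratio, noise grows with hours:", round(variance_ratio(hours, price_hetero), 2))
print("ratio, constant noise:        ", round(variance_ratio(hours, price_homo), 2))
```

A ratio near 1 is consistent with equal variability across the range. A ratio far from 1 suggests the spread depends on the predictor, so the usual OLS standard errors should not be trusted without adjustment.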
HETEROSCEDASTICITY 
Predicting equipment pricing based on machine hours
Figure: market price (vertical axis) against hours on machine (horizontal axis), with the fitted line $\hat{Y} = a + bx$ and markers T1, T2, T3.
Figure: six panels. Top row: Unbiased & Homoscedastic, Biased & Homoscedastic, Biased & Homoscedastic. Bottom row: Unbiased & Heteroscedastic, Biased & Heteroscedastic, Biased & Heteroscedastic.
OVERFITTING 
Overfitting occurs when a
statistical model captures
more than just the underlying
relationships.
The model is fitted so closely to
the training data that it also fits
random errors, outliers, and noise.
OVERFITTING 
An overfitted model nearly 
perfectly matches the training 
set, but does not perform well 
with new data. While an overfitted 
model looks great, it will have poor 
predictive performance. 
OVERFITTING 
The mark of a good model isn’t 
how well it performs on the data 
used to build the model, but on 
fresh data outside of the training 
data set. 
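The gap between training and fresh-data performance is easy to reproduce. The sketch below is a toy illustration on synthetic data, not the voter model from the deck: the data generator, the 20% label noise, and every function name are assumptions. It compares a model that memorizes the training set (a 1-nearest-neighbour lookup) with a one-parameter threshold rule.

```python
import random


def make_data(n, rng, noise=0.2):
    """Points x in [0, 1]; label is 1 when x > 0.5, flipped with probability noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x > 0.5 else 0
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data


def memorizer_predict(train, x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def fit_threshold(train):
    """Pick the cut-off from a coarse grid with the best training accuracy."""
    grid = [i / 20 for i in range(21)]
    return max(grid, key=lambda t: sum((x > t) == bool(y) for x, y in train))


def accuracy(predict, data):
    """Share of points whose predicted label matches the observed label."""
    return sum(predict(x) == y for x, y in data) / len(data)


rng = random.Random(7)
train, test = make_data(500, rng), make_data(500, rng)
cut = fit_threshold(train)


def memorizer(x):
    return memorizer_predict(train, x)


def simple(x):
    return 1 if x > cut else 0


print("memorizer: train", accuracy(memorizer, train), "test", accuracy(memorizer, test))
print("simple:    train", accuracy(simple, train), "test", accuracy(simple, test))
```

The memorizer is perfect on the data it was built from because it has stored the noise along with the signal. The simple rule scores lower in training but holds up on the fresh set, which is the comparison that matters.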
OVERFITTING 
Overfitting Example: Training Classification Table 
                             General Election (Predicted)
General Election (Observed)  Did not vote  Voted   Percentage Correct
Did not vote                 132423        3       99.99773%
Voted                        0             411099  100%
Overall Correct Percentage                         100%
OVERFITTING 
Overfitting Example: Prediction Classification Table 
                             General Election (Predicted)
General Election (Observed)  Did not vote  Voted   Percentage Correct
Did not vote                 35726         4068    90%
Voted                        45924         77199   63%
Overall Correct Percentage                         69%
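The percentages in these classification tables follow directly from the counts. The sketch below recomputes them from the figures printed on the two overfitting slides; the function name `classification_summary` and the nested-dictionary layout are assumptions made for illustration.

```python
def classification_summary(table):
    """table[observed][predicted] holds counts.

    Returns (percent correct per observed class, overall percent correct).
    """
    per_class = {
        observed: 100 * row[observed] / sum(row.values())
        for observed, row in table.items()
    }
    correct = sum(row[observed] for observed, row in table.items())
    total = sum(sum(row.values()) for row in table.values())
    return per_class, 100 * correct / total


# Counts as printed on the slides.
training = {
    "Did not vote": {"Did not vote": 132423, "Voted": 3},
    "Voted": {"Did not vote": 0, "Voted": 411099},
}
prediction = {
    "Did not vote": {"Did not vote": 35726, "Voted": 4068},
    "Voted": {"Did not vote": 45924, "Voted": 77199},
}

for name, table in (("training", training), ("prediction", prediction)):
    per_class, overall = classification_summary(table)
    print(name, {k: round(v, 2) for k, v in per_class.items()},
          "overall", round(overall, 2))
```

Read side by side, the two results show the pattern the deck describes: near-perfect accuracy on the training data and a drop to roughly 69% on data the model had not seen.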
OVERFITTING 
Overfitting Example: Variables 
95% C.I. for EXP(B) 
Variable B (Coefficients) Standard Error Wald Significance Lower Upper 
NumberOfPastRaces 63.840 106.208 .361 .548 .000 1.35E+118 
Primary_03072000_Voter -66.218 106.264 .388 .533 .000 4.95E+61 
General_1107200_Voter -61.971 106.219 .340 .560 .000 3.16E+63 
Special_05082001_Voter -58.129 111.165 .273 .601 .000 2.39E+69 
General_11062001_Voter -60.658 106.181 .326 .568 .000 1.09E+64 
Primary_05072002_Voter -57.806 99.816 .335 .563 .000 7.23E+59 
General_11052002_Voter -63.208 106.206 .354 .552 .000 8.94E+62 
Special_05062003_Voter -66.393 106.249 .390 .532 .000 4.03E+61 
General_11042003_Voter -64.056 106.209 .364 .546 .000 3.85E+62 
Primary_03022004_Voter -63.836 106.204 .361 .548 .000 4.76E+62 
Special_02052005_Voter -58.510 111.784 .274 .601 .000 5.50E+69 
General_11082005_Voter -65.617 106.238 .381 .537 .000 8.56E+61 
Special_02072006_Voter -56.952 305.188 .035 .852 .000 1.10E+235 
Primary_05022006_Voter -64.696 106.220 .371 .542 .000 2.08E+62 
General_11072006_Voter -64.074 106.210 .364 .546 .000 3.79E+62 
Primary_05082007_Voter -65.976 106.233 .386 .535 .000 5.93E+61 
Primary_09112007_Voter -57.949 15652.399 .000 .997 .000 — 
General_11062007_Voter -67.465 106.231 .403 .525 .000 1.33E+61 
General_12112007_Voter -75.855 106.274 .509 .475 .000 3.29E+57 
Primary_03042008_Voter -62.602 106.214 .347 .556 .000 1.67E+63 
General_11042008_Voter -64.100 106.220 .364 .546 .000 3.77E+62 
Primary_05052009_Voter -57.094 98.053 .339 .560 .000 4.56E+58 
Primary_09152009_Voter -54.792 7118.311 .000 .994 .000 — 
General_11032009_Voter -55.176 98.071 .317 .574 .000 3.28E+59 
Primary_05042010_Voter -65.564 106.234 .381 .537 .000 8.97E+61 
Primary_07132010_Voter -56.331 45432.804 .000 .999 .000 — 
Primary_09072010_Voter -57.607 3684.807 .000 .998 .000 — 
General_11022010_Voter -63.431 106.214 .357 .550 .000 7.28E+62 
Primary_05032011_Voter -57.848 136.939 .178 .673 .000 2.75E+91 
General_11082011_Voter -54.865 98.255 .312 .577 .000 6.42E+59 
Primary_03062012_Voter -55.419 95.847 .334 .563 .000 3.29E+57 
Primary_05072013_Voter -58.652 110.873 .280 .597 .000 8.00E+68 
General_11052013_Voter -62.617 106.196 .348 .555 .000 1.58E+63 
Constant -115.093 212.413 .294 .588
OVERFITTING 
Simple Model Example: Variables 
95% C.I. for EXP(B) 
Variable B (Coefficients) Standard Error Wald Significance Lower Upper 
Age_life_bin_1 .344 .019 312.341 .000 1.358 1.466 
Age_life_bin_2 .282 .017 266.954 .000 1.282 1.372 
Age_life_bin_3 .180 .017 109.330 .000 1.158 1.239 
Age_life_bin_4 .133 .018 53.146 .000 1.102 1.184 
Age_life_bin_5 .055 .019 8.719 .003 1.019 1.096 
Age_life_bin_7 -.342 .029 139.262 .000 .671 .752 
Age_life_bin_8 -1.949 .029 4636.533 .000 .135 .151 
Party_affiliation_D .523 .037 202.630 .000 1.570 1.814 
Party_affiliation_R .692 .027 656.239 .000 1.895 2.106 
NumberOfPastRaces .480 .002 63659.304 .000 1.611 1.623 
Constant -1.332 .017 6041.871 .000
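The columns of these variable tables are linked by standard formulas for logistic regression output, which the EXP(B) heading suggests is the model type used here (an inference, since the deck does not name it). The sketch below assumes the usual definitions: EXP(B) is the odds ratio, the 95% interval is exp(B ± 1.96 × SE), and the Wald statistic is (B / SE)². The function name `odds_ratio_summary` is hypothetical, and small differences from the slide values are expected because B and SE are printed rounded.

```python
import math


def odds_ratio_summary(b, se, z=1.96):
    """Odds ratio, approximate 95% interval, and Wald statistic for one coefficient."""
    return {
        "exp_b": math.exp(b),
        "lower": math.exp(b - z * se),
        "upper": math.exp(b + z * se),
        "wald": (b / se) ** 2,
    }


# Party_affiliation_R from the simple model: B = .692, SE = .027.
simple_row = odds_ratio_summary(0.692, 0.027)
# NumberOfPastRaces from the overfitted model: B = 63.840, SE = 106.208.
overfit_row = odds_ratio_summary(63.840, 106.208)

print("simple model: ", {k: round(v, 3) for k, v in simple_row.items()})
print("overfit model:", {k: f"{v:.3g}" for k, v in overfit_row.items()})
```

In the simple model the interval is narrow and excludes 1, so the direction of the effect is clear. In the overfitted model the standard error dwarfs the coefficient, and the interval runs from effectively zero to an astronomically large number, which means the estimate carries no usable information.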
OVERFITTING 
Simple Model Example: Training Classification Table 
                             General Election (Predicted)
General Election (Observed)  Did not vote  Voted   Percentage Correct
Did not vote                 95397         37029   72%
Voted                        43439         367660  89%
Overall Correct Percentage                         85%
OVERFITTING 
Simple Model Example: Prediction Classification Table 
                             General Election (Predicted)
General Election (Observed)  Did not vote  Voted   Percentage Correct
Did not vote                 72167         9483    88%
Voted                        15131         66136   81%
Overall Correct Percentage                         85%
OVERFITTING 
Overstuffing Example: Variables 
95% C.I. for EXP(B) 
Variable B (Coefficients) Standard Error Wald Significance Lower Upper 
Age_life_bin_1 .331 .020 286.120 .000 1.339 1.446 
Age_life_bin_2 .281 .017 263.325 .000 1.281 1.371 
Age_life_bin_3 .184 .017 113.157 .000 1.162 1.243 
Age_life_bin_4 .134 .018 53.857 .000 1.103 1.185 
Age_life_bin_5 .058 .019 9.629 .002 1.022 1.099 
Age_life_bin_7 -.348 .029 143.259 .000 .667 .748 
Age_life_bin_8 -1.959 .029 4687.305 .000 .133 .149 
Party_affiliation_D .513 .037 194.040 .000 1.554 1.796 
Party_affiliation_R .684 .027 637.417 .000 1.879 2.089 
NumberOfPastRaces .478 .002 62834.614 .000 1.608 1.620 
Residential_Zip_3 -.364 .127 8.181 .004 .541 .892 
Residential_Zip_7 .360 .063 32.902 .000 1.268 1.622 
Residential_Zip_8 .428 .218 3.834 .050 1.000 2.354 
Residential_Zip_16 -.125 .023 28.277 .000 .843 .924 
Residential_Zip_17 .127 .058 4.797 .029 1.013 1.272 
Residential_Zip_18 -.356 .044 64.141 .000 .642 .764 
Residential_Zip_19 -.283 .026 117.878 .000 .716 .793 
Residential_Zip_21 .115 .037 9.801 .002 1.044 1.206 
Residential_Zip_22 .113 .026 19.024 .000 1.064 1.178 
Residential_Zip_25 -.182 .024 59.045 .000 .796 .873 
Residential_Zip_26 .074 .032 5.248 .022 1.011 1.148 
Residential_Zip_27 -.132 .033 16.081 .000 .821 .935 
Residential_Zip_28 -.077 .023 11.484 .001 .885 .968 
Residential_Zip_29 -.160 .038 17.765 .000 .791 .918 
Residential_Zip_30 -.191 .044 18.638 .000 .758 .901 
Residential_Zip_33 -.059 .030 3.945 .047 .889 .999 
Residential_Zip_35 .104 .026 15.662 .000 1.054 1.168 
Residential_Zip_41 .140 .018 57.675 .000 1.109 1.193 
Residential_Zip_42 .156 .039 16.010 .000 1.083 1.262 
Residential_Zip_45 .138 .024 32.782 .000 1.095 1.204 
Residential_Zip_46 -.065 .018 12.838 .000 .904 .971 
Residential_Zip_48 .261 .022 136.998 .000 1.243 1.357 
Residential_Zip_50 .164 .025 41.633 .000 1.121 1.239 
Residential_Zip_51 .157 .031 26.169 .000 1.102 1.243 
Residential_Zip_53 .114 .033 11.628 .001 1.050 1.197 
Residential_Zip_54 .104 .029 13.215 .000 1.049 1.174 
Residential_Zip_56 .116 .032 13.238 .000 1.055 1.196 
Residential_Zip_59 .094 .032 8.647 .003 1.032 1.170 
Local_School_District_6 -.375 .055 47.296 .000 .618 .765 
Local_School_District_7 .078 .016 23.389 .000 1.047 1.115 
Local_School_District_9 -.501 .057 77.534 .000 .542 .677 
Local_School_District_10 -.255 .033 61.473 .000 .727 .826 
Constant -1.332 .018 5513.792 .000
OVERFITTING 
Overstuffing Example: Training Classification Table 
                             General Election (Predicted)
General Election (Observed)  Did not vote  Voted   Percentage Correct
Did not vote                 93029         39397   70%
Voted                        36228         374871  91%
Overall Correct Percentage                         86%
LACK OF BACKGROUND 
The farther we are from the work, 
the more likely we are to be tricked 
by the data. 
We owe it to the end user to 
get out of the library, and try to 
understand what we are modeling. 
SOLUTIONS INSTEAD OF THEORIES 
There is an element of data
science that should be frustrating,
confusing, & despair-inducing.
It should make us stand back in
awe of the complexity of the world,
and not of the simplicity to which
we can reduce it.
SOLUTIONS INSTEAD OF THEORIES 
“The great thing about economics, 
is that we admit that we know 
nothing about anything” 
- Thomas Piketty author of “Capital in the Twenty-First Century” 
SOLUTIONS INSTEAD OF THEORIES 
As we learn more, we realize 
there’s more to learn. 
The hallmark of genius is the sharp 
awareness of what is and what is 
not possible. We become aware of 
complexity, ambiguity and nuance. 
CORRELATION & CAUSATION 
The anthem of the Big Data 
age is “correlation does not 
imply causation.” 
CORRELATION & CAUSATION 
The problem is that this statement 
is tautological. It is always correct, 
and can never be wrong. 
CORRELATION & CAUSATION 
Don’t let people use it as a kill
switch for discussion.
• True causation is pretty rare. There are few
things where, if I do this, that will happen.
• Research should create discussions, not shut
them down. Models can’t explain everything.
There is always an “X” variable that captures
the unknown.
FAILING TO AUDIT 
Primary reasons that we fail to 
have our work peer-reviewed: 
• Lack of funding to “repeat” work. 
• We hide behind the complexity of our work. 
FAILING TO AUDIT 
Other tools:
• R Markdown: for creating webpages and
documents in R
• IPython notebooks: for creating websites and
documents interactively in Python
• Galaxy Project: for creating reproducible
workflows. (Well suited to people with less
scripting experience.)
TRAINING 
We offer training on:
• Data Visualization
• Managerial Statistics
• Predictive Modeling
You will be introduced to:
• R
• SPSS
• Tableau
• MySQL
• Git
TRAINING 
Training sessions last 3 days.
We will work through projects,
practice different approaches,
and discuss which approach is best
for different scenarios.
QUESTIONS? 
Grant Stanley, CEO 
Contemporary Analysis 
1209 Harney Street, Suite 200 
Omaha, NE 68102 
grant@canworksmart.com 
(402) 679-8398 
Questions & Learn more.

Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 

Recently uploaded (20)

原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 

Big Data Research Methods – Contemporary Analysis

  • 1. Big Data & Research Methods PRESENTED BY Grant Stanley, CEO Tadd Wood, Chief Data Scientist Contemporary Analysis 1209 Harney Street, Suite 200 Omaha, NE 68102
  • 2. Big Data & Research Methods INTRO The process of research is as important as the results. • Correct research methods improve results, • And allow others to collaborate and improve your work. Contemporary Analysis canworksmart.com
  • 3. INTRO We’ll explore the dangers of: • Spurious Correlation • Sampling Errors • Model Selection • Heteroscedasticity • Overfitting • Lack of Background • Solutions instead of Theories • Lack of the Scientific Method • Correlation vs. Causation
  • 4. INTRO Big Data can’t just be about collecting, processing & storing more data. It has to be put to use. We need to conduct research, build models, and develop reports.
  • 5. THE DANGER OF FALSE POSITIVES The car has little impact without the highway or interstate. If we take Big Data beyond engineering, we are building the equivalent of the highway or interstate for the computer & Internet.
  • 6. SPURIOUS RELATIONSHIPS A spurious relationship occurs when two or more events or variables have no direct causal connection, yet a connection is wrongly inferred because of either coincidence or the presence of a third, unseen factor.
  • 7. SPURIOUS RELATIONSHIPS Big Data Errors: Spurious Correlations [Chart: CORRELATIONS and SPURIOUS plotted against VARIABLES (500 to 2000); vertical axis from 20,000 to 140,000.]
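As a rough illustration of why spurious correlations multiply as variables are added, the sketch below generates purely random, independent variables and counts how many pairs still look strongly correlated. It is not the calculation behind the chart; the sample size of 10 observations, the 0.8 threshold, and the function names are all assumptions chosen for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_strong_pairs(n_vars, n_obs=10, threshold=0.8, seed=0):
    """Count variable pairs with |r| >= threshold among independent noise.

    Every variable is independent Gaussian noise, so any strong
    correlation found here is spurious by construction.
    """
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = strong = 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            pairs += 1
            if abs(pearson(data[i], data[j])) >= threshold:
                strong += 1
    return pairs, strong


for n_vars in (10, 50, 100):
    pairs, strong = count_strong_pairs(n_vars)
    print(f"{n_vars:>4} variables: {pairs:>5} pairs, {strong} with |r| >= 0.8")
```

The number of pairs grows as n(n-1)/2, so even a small chance of a strong correlation per pair produces many false leads once the variable count is large.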
  • 8. SPURIOUS RELATIONSHIPS Maine’s divorce rate with US margarine consumption
      Year    Divorce rate in Maine,                  Consumption of margarine (US),
              divorces per 1000 people (US Census)    per capita in pounds (USDA)
      2000    5                                       8.2
      2001    4.7                                     7
      2002    4.6                                     6.5
      2003    4.4                                     5.3
      2004    4.3                                     5.2
      2005    4.1                                     4
      2006    4.2                                     4.6
      2007    4.2                                     4.5
      2008    4.2                                     4.2
      2009    4.1                                     3.7
      Correlation: 0.992558
      [Chart: divorce rate in Maine (divorces per 1000 people) and per capita consumption of margarine (US, pounds), 2000 to 2009.]
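The correlation reported on this slide can be checked from the ten yearly values it lists. The sketch below is a plain Pearson correlation written out by hand; the variable names are ours, and the pairing of values with years assumes they are listed in order from 2000 to 2009.

```python
import math

# Values as listed on the slide, assumed to be in year order 2000..2009.
divorce_rate_maine = [5.0, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1]
margarine_lbs_per_capita = [8.2, 7.0, 6.5, 5.3, 5.2, 4.0, 4.6, 4.5, 4.2, 3.7]


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson(divorce_rate_maine, margarine_lbs_per_capita)
print(f"Pearson r = {r:.3f}")
```

A coefficient this close to 1 says only that the two series moved together over ten years; it says nothing about whether one drives the other.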
  • 9. SAMPLING There are two reasons for sampling a population: • Collecting and processing data on the whole population is too costly or impossible. • To ensure that the results are representative of the population.
  • 10. SAMPLING Sampling still matters in Big Data. Data is not information. It is simply a representation of information. You have to think about what the data you are using represents.
  • 11. SAMPLING Is smartphone data representative of the population?
      Gender by Platform: iPhone 57% male, 43% female; Android 73% male, 27% female.
      Age by Platform (two values per age group, one per platform): 17 or younger 7%, 13%; 18 - 24 12%, 17%; 25 - 34 21%, 30%; 35 - 44 21%, 21%; 45+ 32%, 25%.
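To make the sampling concern concrete, the sketch below shows how a skewed sample shifts an average. The 73% / 27% gender split is taken from the slide. The per-group values and the 50/50 population split are hypothetical, invented only for the illustration.

```python
def weighted_mean(group_means, shares):
    """Average of group means, weighted by each group's share."""
    return sum(group_means[g] * shares[g] for g in group_means)


# Hypothetical average of some measured behaviour in each group.
group_means = {"male": 10.0, "female": 20.0}

# Gender split of the sample, as shown on the slide for one platform.
sample_shares = {"male": 0.73, "female": 0.27}

# Assumed split of the population we actually want to describe.
population_shares = {"male": 0.50, "female": 0.50}

sample_estimate = weighted_mean(group_means, sample_shares)
population_value = weighted_mean(group_means, population_shares)

print(f"estimate from skewed sample: {sample_estimate:.1f}")
print(f"value for the population:    {population_value:.1f}")
```

However many records the sample contains, the gap stays the same, because it comes from who is in the data and not from how much data there is.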
  • 12. MODEL SELECTION OLS is not a catch-all. You have to know your data. Is it continuous, discrete, binary, ordinal, or categorical? Is your data symmetric or asymmetric? Are there outliers?
  • 13. MODEL SELECTION
  • 14. HETEROSCEDASTICITY Heteroscedasticity refers to the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it.
  • 15. HETEROSCEDASTICITY Predicting equipment pricing based on machine hours [Chart: MARKET PRICE against HOURS ON MACHINE, with points T1, T2 and T3 and the fitted line Ŷ = a + bx.]
  • 16. [Chart panels: Unbiased & Homoscedastic; Biased & Homoscedastic; Biased & Homoscedastic; Unbiased & Heteroscedastic; Biased & Heteroscedastic; Biased & Heteroscedastic.]
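One rough way to see heteroscedasticity in a fitted line Ŷ = a + bx is to compare the spread of the residuals at low and high values of x. The sketch below does this on simulated data. The price and hours figures are hypothetical, and the split-sample comparison is in the spirit of a Goldfeld-Quandt check, not the formal test.

```python
import random


def ols_fit(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    a = mean_y - b * mean_x
    return a, b


def residual_variance_ratio(x, y):
    """Residual spread in the upper half of x divided by the lower half.

    A ratio near 1 suggests roughly constant variance; a ratio far
    from 1 suggests the variance changes with x.
    """
    a, b = ols_fit(x, y)
    ordered = sorted(zip(x, y))
    residuals = [yi - (a + b * xi) for xi, yi in ordered]
    half = len(residuals) // 2
    low, high = residuals[:half], residuals[half:]

    def mean_square(values):
        return sum(v * v for v in values) / len(values)

    return mean_square(high) / mean_square(low)


rng = random.Random(42)
hours = [rng.uniform(100, 10_000) for _ in range(2000)]

# Hypothetical pricing: same trend in both series, different noise.
noisy_with_hours = [100_000 - 5 * h + rng.gauss(0, 0.5 * h) for h in hours]
constant_noise = [100_000 - 5 * h + rng.gauss(0, 2000) for h in hours]

hetero_ratio = residual_variance_ratio(hours, noisy_with_hours)
homo_ratio = residual_variance_ratio(hours, constant_noise)
print(f"noise grows with hours: ratio = {hetero_ratio:.2f}")
print(f"constant noise:         ratio = {homo_ratio:.2f}")
```

In the first series the fitted line is still reasonable on average, but predictions for high-hour machines are far less reliable than those for low-hour machines, which a single overall error figure would hide.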
  • 17. OVERFITTING Overfitting occurs when a statistical model captures more than just the underlying relationships. The model is fitted to as much data as possible including random errors, outliers, and noise.
  • 18. OVERFITTING An overfitted model nearly perfectly matches the training set, but does not perform well with new data. While an overfitted model looks great, it will have poor predictive performance.
  • 19. OVERFITTING The mark of a good model isn’t how well it performs on the data used to build the model, but on fresh data outside of the training data set.
  • 20. OVERFITTING Overfitting Example: Training Classification Table
                                      General Election (Predicted)
      General Election (Observed)     Did not vote    Voted     Percentage Correct
      Did not vote                    132423          3         99.99773%
      Voted                           0               411099    100%
      Overall Correct Percentage                                100%
  • 21. OVERFITTING Overfitting Example: Prediction Classification Table
                                      General Election (Predicted)
      General Election (Observed)     Did not vote    Voted     Percentage Correct
      Did not vote                    35726           4068      90%
      Voted                           45924           77199     63%
      Overall Correct Percentage                                69%
  • 22. OVERFITTING Overfitting Example: Variables
                                                                                           95% C.I. for EXP(B)
      Variable                  B (Coefficients)  Standard Error  Wald       Significance  Lower   Upper
      NumberOfPastRaces         63.840            106.208         .361       .548          .000    1.35E+118
      Primary_03072000_Voter    -66.218           106.264         .388       .533          .000    4.95E+61
      General_1107200_Voter     -61.971           106.219         .340       .560          .000    3.16E+63
      Special_05082001_Voter    -58.129           111.165         .273       .601          .000    2.39E+69
      General_11062001_Voter    -60.658           106.181         .326       .568          .000    1.09E+64
      Primary_05072002_Voter    -57.806           99.816          .335       .563          .000    7.23E+59
      General_11052002_Voter    -63.208           106.206         .354       .552          .000    8.94E+62
      Special_05062003_Voter    -66.393           106.249         .390       .532          .000    4.03E+61
      General_11042003_Voter    -64.056           106.209         .364       .546          .000    3.85E+62
      Primary_03022004_Voter    -63.836           106.204         .361       .548          .000    4.76E+62
      Special_02052005_Voter    -58.510           111.784         .274       .601          .000    5.50E+69
      General_11082005_Voter    -65.617           106.238         .381       .537          .000    8.56E+61
      Special_02072006_Voter    -56.952           305.188         .035       .852          .000    1.10E+235
      Primary_05022006_Voter    -64.696           106.220         .371       .542          .000    2.08E+62
      General_11072006_Voter    -64.074           106.210         .364       .546          .000    3.79E+62
      Primary_05082007_Voter    -65.976           106.233         .386       .535          .000    5.93E+61
      Primary_09112007_Voter    -57.949           15652.399       .000       .997          .000    -
      General_11062007_Voter    -67.465           106.231         .403       .525          .000    1.33E+61
      General_12112007_Voter    -75.855           106.274         .509       .475          .000    3.29E+57
      Primary_03042008_Voter    -62.602           106.214         .347       .556          .000    1.67E+63
      General_11042008_Voter    -64.100           106.220         .364       .546          .000    3.77E+62
      Primary_05052009_Voter    -57.094           98.053          .339       .560          .000    4.56E+58
      Primary_09152009_Voter    -54.792           7118.311        .000       .994          .000    -
      General_11032009_Voter    -55.176           98.071          .317       .574          .000    3.28E+59
      Primary_05042010_Voter    -65.564           106.234         .381       .537          .000    8.97E+61
      Primary_07132010_Voter    -56.331           45432.804       .000       .999          .000    -
      Primary_09072010_Voter    -57.607           3684.807        .000       .998          .000    -
      General_11022010_Voter    -63.431           106.214         .357       .550          .000    7.28E+62
      Primary_05032011_Voter    -57.848           136.939         .178       .673          .000    2.75E+91
      General_11082011_Voter    -54.865           98.255          .312       .577          .000    6.42E+59
      Primary_03062012_Voter    -55.419           95.847          .334       .563          .000    3.29E+57
      Primary_05072013_Voter    -58.652           110.873         .280       .597          .000    8.00E+68
      General_11052013_Voter    -62.617           106.196         .348       .555          .000    1.58E+63
      Constant                  -115.093          212.413         .294       .588
  • 23. OVERFITTING Simple Model Example: Variables
                                                                                           95% C.I. for EXP(B)
      Variable                  B (Coefficients)  Standard Error  Wald       Significance  Lower   Upper
      Age_life_bin_1            .344              .019            312.341    .000          1.358   1.466
      Age_life_bin_2            .282              .017            266.954    .000          1.282   1.372
      Age_life_bin_3            .180              .017            109.330    .000          1.158   1.239
      Age_life_bin_4            .133              .018            53.146     .000          1.102   1.184
      Age_life_bin_5            .055              .019            8.719      .003          1.019   1.096
      Age_life_bin_7            -.342             .029            139.262    .000          .671    .752
      Age_life_bin_8            -1.949            .029            4636.533   .000          .135    .151
      Party_affiliation_D       .523              .037            202.630    .000          1.570   1.814
      Party_affiliation_R       .692              .027            656.239    .000          1.895   2.106
      NumberOfPastRaces         .480              .002            63659.304  .000          1.611   1.623
      Constant                  -1.332            .017            6041.871   .000
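The Wald and EXP(B) columns in these variable tables can be approximated from B and its standard error. The sketch below assumes the usual logistic-regression conventions, Wald = (B / SE)² and a 95% interval of exp(B ± z·SE); the slides do not state the formulas, so treat both as assumptions. Inputs are the rounded values shown in the tables, so results differ slightly from the printed columns.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wald_summary(b, se):
    """Return (Wald statistic, odds ratio, CI lower, CI upper) for one term."""
    wald = (b / se) ** 2
    odds_ratio = math.exp(b)
    lower = math.exp(b - Z_95 * se)
    upper = math.exp(b + Z_95 * se)
    return wald, odds_ratio, lower, upper


rows = {
    # (B, standard error) copied from the slides
    "overfitted: NumberOfPastRaces": (63.840, 106.208),
    "simple: Age_life_bin_1": (0.344, 0.019),
}

for name, (b, se) in rows.items():
    wald, odds_ratio, lower, upper = wald_summary(b, se)
    print(f"{name}: Wald={wald:.3f}, CI for EXP(B)=[{lower:.3g}, {upper:.3g}]")
```

In the overfitted model the standard error is larger than the coefficient itself, so the interval for the odds ratio runs from effectively zero to an astronomically large number. In the simple model the interval is narrow and excludes 1.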
  • 24. OVERFITTING Simple Model Example: Training Classification Table
                                      General Election (Predicted)
      General Election (Observed)     Did not vote    Voted     Percentage Correct
      Did not vote                    95397           37029     72%
      Voted                           43439           367660    89%
      Overall Correct Percentage                                85%
  • 25. OVERFITTING Simple Model Example: Prediction Classification Table
                                      General Election (Predicted)
      General Election (Observed)     Did not vote    Voted     Percentage Correct
      Did not vote                    72167           9483      88%
      Voted                           15131           66136     81%
      Overall Correct Percentage                                85%
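The overall percentages in the classification tables can be recomputed from the cell counts, which makes the gap between training and prediction performance explicit. The sketch below uses the counts from the overfitting and simple-model slides; the dictionary layout and names are ours.

```python
def accuracy(table):
    """Share of cases on the diagonal of an observed-by-predicted table."""
    correct = sum(table[label][label] for label in table)
    total = sum(sum(row.values()) for row in table.values())
    return correct / total


def make_table(nn, nv, vn, vv):
    """Build table[observed][predicted] from the four cell counts."""
    return {
        "did_not_vote": {"did_not_vote": nn, "voted": nv},
        "voted": {"did_not_vote": vn, "voted": vv},
    }


tables = {
    "overfitted": {
        "training": make_table(132423, 3, 0, 411099),
        "prediction": make_table(35726, 4068, 45924, 77199),
    },
    "simple": {
        "training": make_table(95397, 37029, 43439, 367660),
        "prediction": make_table(72167, 9483, 15131, 66136),
    },
}

gaps = {}
for model, split in tables.items():
    train_acc = accuracy(split["training"])
    pred_acc = accuracy(split["prediction"])
    gaps[model] = train_acc - pred_acc
    print(f"{model}: training {train_acc:.0%}, prediction {pred_acc:.0%}")
```

The overfitted model loses about 31 percentage points on fresh data, while the simple model scores 85% on both.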
  • 26. OVERFITTING Big Data Errors: Spurious Correlations [Chart: CORRELATIONS and SPURIOUS plotted against VARIABLES (500 to 2000); vertical axis from 20,000 to 140,000.]
  • 27. OVERFITTING Overstuffing Example: Variables
                                                                                           95% C.I. for EXP(B)
      Variable                  B (Coefficients)  Standard Error  Wald       Significance  Lower   Upper
      Age_life_bin_1            .331              .020            286.120    .000          1.339   1.446
      Age_life_bin_2            .281              .017            263.325    .000          1.281   1.371
      Age_life_bin_3            .184              .017            113.157    .000          1.162   1.243
      Age_life_bin_4            .134              .018            53.857     .000          1.103   1.185
      Age_life_bin_5            .058              .019            9.629      .002          1.022   1.099
      Age_life_bin_7            -.348             .029            143.259    .000          .667    .748
      Age_life_bin_8            -1.959            .029            4687.305   .000          .133    .149
      Party_affiliation_D       .513              .037            194.040    .000          1.554   1.796
      Party_affiliation_R       .684              .027            637.417    .000          1.879   2.089
      NumberOfPastRaces         .478              .002            62834.614  .000          1.608   1.620
      Residential_Zip_3         -.364             .127            8.181      .004          .541    .892
      Residential_Zip_7         .360              .063            32.902     .000          1.268   1.622
      Residential_Zip_8         .428              .218            3.834      .050          1.000   2.354
      Residential_Zip_16        -.125             .023            28.277     .000          .843    .924
      Residential_Zip_17        .127              .058            4.797      .029          1.013   1.272
      Residential_Zip_18        -.356             .044            64.141     .000          .642    .764
      Residential_Zip_19        -.283             .026            117.878    .000          .716    .793
      Residential_Zip_21        .115              .037            9.801      .002          1.044   1.206
      Residential_Zip_22        .113              .026            19.024     .000          1.064   1.178
      Residential_Zip_25        -.182             .024            59.045     .000          .796    .873
      Residential_Zip_26        .074              .032            5.248      .022          1.011   1.148
      Residential_Zip_27        -.132             .033            16.081     .000          .821    .935
      Residential_Zip_28        -.077             .023            11.484     .001          .885    .968
      Residential_Zip_29        -.160             .038            17.765     .000          .791    .918
      Residential_Zip_30        -.191             .044            18.638     .000          .758    .901
      Residential_Zip_33        -.059             .030            3.945      .047          .889    .999
      Residential_Zip_35        .104              .026            15.662     .000          1.054   1.168
      Residential_Zip_41        .140              .018            57.675     .000          1.109   1.193
      Residential_Zip_42        .156              .039            16.010     .000          1.083   1.262
      Residential_Zip_45        .138              .024            32.782     .000          1.095   1.204
      Residential_Zip_46        -.065             .018            12.838     .000          .904    .971
      Residential_Zip_48        .261              .022            136.998    .000          1.243   1.357
      Residential_Zip_50        .164              .025            41.633     .000          1.121   1.239
      Residential_Zip_51        .157              .031            26.169     .000          1.102   1.243
      Residential_Zip_53        .114              .033            11.628     .001          1.050   1.197
      Residential_Zip_54        .104              .029            13.215     .000          1.049   1.174
      Residential_Zip_56        .116              .032            13.238     .000          1.055   1.196
      Residential_Zip_59        .094              .032            8.647      .003          1.032   1.170
      Local_School_District_6   -.375             .055            47.296     .000          .618    .765
      Local_School_District_7   .078              .016            23.389     .000          1.047   1.115
      Local_School_District_9   -.501             .057            77.534     .000          .542    .677
      Local_School_District_10  -.255             .033            61.473     .000          .727    .826
      Constant                  -1.332            .018            5513.792   .000
  • 28. OVERFITTING Overstuffing Example: Training Classification Table
                                      General Election (Predicted)
      General Election (Observed)     Did not vote    Voted     Percentage Correct
      Did not vote                    93029           39397     70%
      Voted                           36228           374871    91%
      Overall Correct Percentage                                86%
  • 29. LACK OF BACKGROUND The farther we are from the work, the more likely we are to be tricked by the data. We owe it to the end user to get out of the library, and try to understand what we are modeling.
  • 30. SOLUTIONS INSTEAD OF THEORIES There is an element of data science that should be frustrating, confusing, & despair-inducing. It should make us stand back in awe of the complexity of the world, and not the simplicity to which we can reduce it.
  • 31. SOLUTIONS INSTEAD OF THEORIES “The great thing about economics, is that we admit that we know nothing about anything” - Thomas Piketty, author of “Capital in the Twenty-First Century”
  • 32. SOLUTIONS INSTEAD OF THEORIES As we learn more, we realize there’s more to learn. The hallmark of genius is the sharp awareness of what is and what is not possible. We become aware of complexity, ambiguity and nuance.
  • 33. CORRELATION & CAUSATION The anthem of the Big Data age is “correlation does not imply causation.”
  • 34. CORRELATION & CAUSATION The problem is that this statement is tautological. It is always correct, and can never be wrong.
  • 35. CORRELATION & CAUSATION Don’t let people use it as a kill switch to discussion. • True causation is pretty rare. There are few things where, if I do this, this will happen. • Research should create discussions, not shut them down. Models can’t explain everything. There is always an “X” variable that captures the unknown.
  • 36. SOLUTIONS INSTEAD OF THEORIES
  • 37. FAILING TO AUDIT Primary reasons that we fail to have our work peer-reviewed: • Lack of funding to “repeat” work. • We hide behind the complexity of our work.
  • 38. FAILING TO AUDIT
  • 39. FAILING TO AUDIT Other tools: • R Markdown: for creating webpages and documents in R • IPython notebooks: for creating websites and documents interactively in Python • Galaxy Project: for creating reproducible workflows. (Favorable for people with less scripting experience.)
  • 40. TRAINING We offer training on: • Data Visualization • Managerial Statistics • Predictive Modeling You will be introduced to: • R • SPSS • Tableau • MySQL • Git
  • 41. TRAINING Training sessions last 3 days. We will work through projects, practice different approaches, and discuss which approach is best for different scenarios.
  • 42. QUESTIONS? Grant Stanley, CEO Contemporary Analysis 1209 Harney Street, Suite 200 Omaha, NE 68102 grant@canworksmart.com (402) 679-8398