Leonardo Auslender – Ch. 1. Copyright 2004. 2019-02-07
DOI: 10.13140/RG.2.2.20892.33928
Contents
Definition
Data sets and Examples
UEDA: Univariate EDA
Definitions of central tendency and variability.
Data set descriptions.
Transformations.
Statistical Inference – Probability.
Univariate Data Distributions – Normality.
CLT – Statistical Tests.
Sampling (under construction)
Statistical Puzzles
Bayesian Inference (under construction).
BEDA
Continuous Variables – Correlations - Causation
Nominal Variables – Chi Square – Odds
Simpson’s Paradox
MEDA
Principal Components
Factor Analysis
Clustering and Segmentation
Canonical Discriminant Analysis
Missing Value imputation
Outliers and Variable Transformations.
EDA: definition, purpose and usefulness.
EDA is typically (or should be) the initial step taken to view and try to
comprehend a data set, assumed to be of rectangular form: columns are called
variables, rows observations. The aim is comprehension of individual
variables and of relationships across variables.
Given the size of present databases, with hundreds if not thousands of
variables (attributes) and possibly millions of observations, it is hard, if
not impossible, to obtain a full conceptual understanding of the
informational content of the data. Since many applications lead to (some)
modeling, it is also possible and desirable to perform EDA on the outcome of
these procedures. We can therefore envision EDA applied twice, prior to and
after model creation, possibly restarting the modeling effort.
Variables are random when their values cannot be known with certainty, i.e.,
they are not deterministic. For instance, it is not possible to know the
next outcome of a roulette bet, but we “know” the probability of success of
such an outcome.
When the probability is not knowable, we speak of uncertainty; otherwise, of risk.
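The roulette example can be made concrete with a short simulation. The sketch below is an illustration in Python and is not part of the original slides. It assumes a European wheel with 37 pockets and a straight bet on a single number, so the probability of success is known (1/37) even though each individual outcome is not. The function name `simulate_straight_bets`, the chosen number and the seed are hypothetical.

```python
import random

POCKETS = 37                 # assumption: European wheel, pockets 0..36
P_WIN = 1 / POCKETS          # known probability of a straight (single-number) bet


def simulate_straight_bets(n_spins, chosen=17, seed=0):
    """Return the fraction of spins on which the chosen number comes up."""
    rng = random.Random(seed)
    wins = sum(1 for _ in range(n_spins) if rng.randrange(POCKETS) == chosen)
    return wins / n_spins


if __name__ == "__main__":
    freq = simulate_straight_bets(100_000)
    print(f"known probability: {P_WIN:.4f}, observed frequency: {freq:.4f}")
```

This is the "risk" case: no single spin can be predicted, but the long-run frequency settles near the known probability. Under "uncertainty" there would be no `P_WIN` to compare against.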
In a large-data setting, data sets contain hundreds (if not more) variables,
with the following taxonomy:
1) Numeric.
2) Character, considered nominal, i.e., no order or magnitude.
3) Ordinal.
4) Ratio: ratio of numeric variables.
5) Id:
   a. Id proper, nominal variables: patient id, SSN, etc.
   b. Indices: time index, patient visit number, etc.
Can compute                                              Nominal  Ordinal  Interval  Ratio
Frequency distribution                                   Yes      Yes      Yes       Yes
Median and percentiles                                   No       Yes      Yes       Yes
Add or subtract                                          No       No       Yes       Yes
Mean, standard deviation, standard error of the mean     No       No       Yes       Yes
Ratio, or coefficient of variation                       No       No       No        Yes
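The table of permitted computations can be encoded as a small lookup. The following is a minimal Python sketch, not taken from the slides; the names `SCALES`, `MIN_SCALE`, `can_compute` and `allowed_statistics` are hypothetical. It relies on the fact that, in the table, each computation becomes available at some scale and stays available at every scale to its right.

```python
# Measurement scales ordered from least to most structure, as in the table.
SCALES = ["nominal", "ordinal", "interval", "ratio"]

# Lowest scale at which each computation becomes meaningful (read off the table).
MIN_SCALE = {
    "frequency distribution": "nominal",
    "median and percentiles": "ordinal",
    "add or subtract": "interval",
    "mean, standard deviation, standard error": "interval",
    "ratio, coefficient of variation": "ratio",
}


def can_compute(statistic, scale):
    """True if `statistic` is meaningful for a variable measured on `scale`."""
    return SCALES.index(scale) >= SCALES.index(MIN_SCALE[statistic])


def allowed_statistics(scale):
    """All computations from the table that are meaningful on `scale`."""
    return [s for s in MIN_SCALE if can_compute(s, scale)]


if __name__ == "__main__":
    for scale in SCALES:
        print(scale, "->", allowed_statistics(scale))
```

A check of this kind could be run before the univariate step, for instance to avoid reporting a mean for a nominal id variable.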
Proposed EDA steps
Variables (attributes) can be analyzed in themselves (univariate), in relation to
other single variables (bivariate) and as a whole (multivariate).
Conceptualization becomes more difficult as we progress from one variable to
many, and requires answers to questions such as: which one is truly bigger?
We can envision 5 steps in EDA:
1) Data set definition: at least the number of observations, the variables and their taxonomy.
2) UEDA, univariate: central location and dispersion measures.
3) BEDA, bivariate: mostly correlation and contingency tables.
4) MEDA, multivariate: missing-value analysis and imputation, principal
   components and clustering.
5) Outliers and variable transformation or engineering, which involve UEDA, BEDA
   and MEDA. E.g., log(Var1) is a univariate transformation; PCA is a multivariate
   transformation.
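To illustrate the univariate transformation log(Var1) mentioned in the last step, here is a minimal Python sketch that is not part of the original slides. The values in `var1` are made up to be right-skewed, and `skewness` is a hypothetical helper computing the moment-based (population) skewness.

```python
import math
import statistics


def skewness(values):
    """Population (moment) skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed variable (e.g. amounts spent); not from the slides' data.
var1 = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
log_var1 = [math.log(v) for v in var1]   # univariate transformation log(Var1)

if __name__ == "__main__":
    print(f"skewness before: {skewness(var1):.2f}, after log: {skewness(log_var1):.2f}")
```

For this made-up variable the log pulls in the long right tail, so the skewness drops. Whether a log is appropriate for a real variable depends on its range (it requires positive values) and on the analysis that follows.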
Statistical Inference
Statistical inference embraces all of EDA and modeling; it allows for
comparisons and statements about similarity (or lack of it) in
many different situations.
All data sets are assumed to be representative samples from an
infinite population, which can be unrealistic. If we are interested in
the heights of first graders in a specific school at a
specific time, the data are the entire population: there is no
uncertainty about the recorded heights, and thus statistical inference
is not required.
If the interest is in height changes across time (and thus we work with
samples), then we use statistical inference to infer information
about the overall population (perhaps ideal, not real).
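The distinction between having the whole population and working from a sample can be sketched in a few lines of Python. This example is not from the slides: the heights are simulated, the numbers (mean 120 cm, standard deviation 5 cm, sample size 50) are arbitrary assumptions, and `mean_and_standard_error` is a hypothetical helper.

```python
import math
import random
import statistics

# Hypothetical heights (cm); illustrative numbers only, not data from the slides.
rng = random.Random(42)
population = [rng.gauss(120, 5) for _ in range(10_000)]   # stand-in for the wider population
sample = rng.sample(population, 50)                        # what we actually observe


def mean_and_standard_error(values):
    """Sample mean and its standard error, s / sqrt(n)."""
    return statistics.fmean(values), statistics.stdev(values) / math.sqrt(len(values))


if __name__ == "__main__":
    est, se = mean_and_standard_error(sample)
    # Rough 95% interval using the normal approximation (mean +/- 1.96 SE).
    print(f"estimate: {est:.1f} cm, interval: {est - 1.96 * se:.1f} to {est + 1.96 * se:.1f}")
    print(f"population mean (normally unknown): {statistics.fmean(population):.1f} cm")
```

When the data are the entire population of interest, the mean is simply computed and there is nothing to infer; the standard error only matters when the sample stands in for something larger.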
Data set 1: Definition by way of Example
• Health insurance company: ophthalmologic insurance claims.
• Is a claim valid or fraudulent?
• Present operation: manual review of history and circumstances.
• Alternative: scoring analytical system.
• Data mining solution: use data on past claims to verify fraud.
Alphabetic List of Variables and Attributes
#  Variable         Type  Len  Format   Informat  Label
3  DOCTOR_VISITS    Num   8    BEST12.  F12.      Total visits to a doctor
1  FRAUD            Num   8    BEST12.  F12.      Fraudulent Activity yes/no
5  MEMBER_DURATION  Num   8                       Membership duration
4  NO_CLAIMS        Num   8    BEST12.  F12.      No of claims made recently
7  NUM_MEMBERS      Num   8                       Number of members covered
6  OPTOM_PRESC      Num   8    BEST12.  F12.      Number of opticals claimed
2  TOTAL_SPEND      Num   8    BEST12.  F12.      Total spent on opticals
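SAS example for the fraud data set:
ods html;                          /* send output to the HTML destination */
proc contents data = fraud.fraud;  /* list variables and their attributes */
run;
ods html close;                    /* close the HTML destination */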
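For readers without SAS, the same kind of data set definition (number of observations, number of variables, type of each variable) can be sketched in Python. This is only a rough analogue of the listing PROC CONTENTS produces, not an equivalent. The variable names come from the fraud data set above, but the rows and values are made up, and `describe` is a hypothetical helper.

```python
# Hypothetical rows using variable names from the fraud data set; the values are made up.
rows = [
    {"FRAUD": 0, "TOTAL_SPEND": 120.0, "DOCTOR_VISITS": 2, "NO_CLAIMS": 0},
    {"FRAUD": 1, "TOTAL_SPEND": 980.5, "DOCTOR_VISITS": 7, "NO_CLAIMS": 3},
    {"FRAUD": 0, "TOTAL_SPEND": 45.0, "DOCTOR_VISITS": 1, "NO_CLAIMS": 0},
]


def describe(rows):
    """Return number of observations, number of variables, and a type per variable."""
    variables = sorted(rows[0])        # alphabetic order, as in the listing above
    types = {
        name: "Num" if all(isinstance(r[name], (int, float)) for r in rows) else "Char"
        for name in variables
    }
    return len(rows), len(variables), types


if __name__ == "__main__":
    n_obs, n_vars, types = describe(rows)
    print(f"observations: {n_obs}, variables: {n_vars}")
    for name, kind in types.items():
        print(f"{name:15s} {kind}")
```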
Data set 2 (DS2): Babies’ deaths (health care example).
• Babies’ death or survivability.
• Determine basic statistics for mostly binary variables.
• Data anomalies?
• Is present data representative or anomalous?
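For binary (0/1) variables the basic statistics are especially simple: the mean is the proportion of ones, and the population variance is p(1 − p). The Python sketch below illustrates this on made-up values; it does not use the actual DS2 data, and `binary_summary` is a hypothetical helper.

```python
import statistics

# Hypothetical 0/1 indicator (1 = event occurred); values are made up for illustration.
death = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]


def binary_summary(values):
    """Count, number of ones, proportion of ones, and variance p * (1 - p)."""
    p = statistics.fmean(values)          # mean of a 0/1 variable = proportion of ones
    return {"n": len(values), "ones": sum(values), "proportion": p, "variance": p * (1 - p)}


if __name__ == "__main__":
    print(binary_summary(death))
```

A first anomaly check on such a variable could simply confirm that every value is 0 or 1 before the summary is trusted.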
Alphabetic List of Variables and Attributes
#   Variable    Type  Len  Label
16  Const       Num   8
18  H           Num   8
17  M1          Num   8
7   abort       Num   8    Past abortion
1   death       Num   8    Death
9   dyslab      Num   8    Labor progress
5   gestage     Num   8    Gestational age
8   hydramnios  Num   8    Too much amniotic fluid
6   isoimm      Num   8    Iso immunization
15  malpres     Num   8    Mal Presented
11  nomonit     Num   8    No Monitor
2   nonwhite    Num   8    Non-White
4   nullip      Num   8    Null Parity
10  placord     Num   8    Placental - cord anomaly
14  prerupt     Num   8    PROM
3   teenages    Num   8    Early Age
12  twint       Num   8    Twin, Triplet
13  ward        Num   8    Public Ward
Data Set HMEQ (DS 3)
Reports characteristics and delinquency information for 5,960
home equity loans. A home equity loan is a loan in which the obligor
uses the equity of his or her home as the underlying collateral.
The data set has the following variables:
◾ BAD: 1 = applicant defaulted on loan or seriously delinquent; 0 = applicant paid loan
◾ LOAN: Amount of the loan request
◾ MORTDUE: Amount due on existing mortgage
◾ VALUE: Value of current property
◾ REASON: DebtCon = debt consolidation; HomeImp = home improvement
◾ JOB: Occupational categories
◾ YOJ: Years at present job
◾ DEROG: Number of major derogatory reports
◾ DELINQ: Number of delinquent credit lines
◾ CLAGE: Age of oldest credit line in months
◾ NINQ: Number of recent credit inquiries
◾ CLNO: Number of credit lines
◾ DEBTINC: Debt-to-income ratio
Home Work Questions:
Statistical Inference.
NEXT:
UEDA: Single-variable analysis.
BEDA: Analysis of pairs of variables, typically correlation and
chi-square analysis.
MEDA: Focuses on dimension reduction. The dimensions studied
are variables and observations:
• Methods to group variables (PCA, FA).
• Methods to group observations (cluster analysis).
Plus:
• Missing value imputation.
• Multivariate transformations.
• Multivariate outlier detection.
1 eda

  • 1. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-12019-02-07 DOI: 10.13140/RG.2.2.20892.33928
  • 2. Leonardo Auslender –Ch. 1 Copyright 2004 -22019-02-07 Contents Definition Data sets and Examples UEDA: Univariate EDA Definitions of central tendency and variability. Data set descriptions. Transformations. Statistical Inference – Probability. Univariate Data Distributions – Normality. CLT – Statistical Tests. Sampling (under construction) Statistical Puzzles Bayesian Inference (under construction). BEDA Continuous Variables – Correlations - Causation Nominal Variables – Chi Square – Odds Simpson’s Paradox
  • 3. Leonardo Auslender –Ch. 1 Copyright 2004 -32019-02-07 Contents (cont.) MEDA Principal Components Factor Analysis Clustering and Segmentation Canonical Discrimination Analysis Missing Value imputation Outliers and Variable Transformations.
  • 4. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-42019-02-07 EDA: definition, purpose and usefulness. Typically (or should be) initial step to view and try to comprehend data set , assumed to be of rectangular form. Columns called variables, rows observations. Comprehension of individual and across individual variables. Given size of present data bases with hundreds if not thousands or more variables or attributes and possibly millions of observations, it is either hard or not possible to obtain full conceptual understanding of the informational content imbued in data. Since many applications lead to (some) modeling, also possible and desirable to perform EDA on outcome of these procedures  we can envision EDA applied twice, prior to and after model creation, and thus possibly restarting modeling effort. Variables are random when their values cannot be known with certainty, i.e., they are not deterministic variables. For instance, not possible to know the next outcome of roulette bet, we “know” probability of success of such outcome. When probability is not knowable  uncertainty. Otherwise, risk.
  • 5. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-52019-02-07 Data sets in large data set setting, contain hundreds (if not) more variables, with taxonomy: 1) Numeric, 2) Character, considered nominal, i.e., no cardinality. 3) Ordinal 4) Ratio: ratio of numeric variables. 5) Id: a. Id proper, nominal variables: patient id, SSN, etc. b. Indices: Time index, patient visit number, etc. Can compute Nominal Ordinal Interval Ratio frequency distribution. Yes Yes Yes Yes median and percentiles. No Yes Yes Yes add or subtract. No No Yes Yes mean, standard deviation, standard error of the mean. No No Yes Yes ratio, or coefficient of variation. No No No Yes
  • 6. Leonardo Auslender –Ch. 1 Copyright 2004 -62019-02-07 Proposed EDA steps Variables (attributes) can be analyzed in themselves (univariate), in relation to other single variables (bivariate) and as a whole (multivariate). Conceptualization is more difficult as we progress from one to many, and requires answers to e.g., which one is truly bigger? We can envision 5 steps in EDA: Data Set definition: at least number of observations, variables and taxonomy. UEDA Univariate (central location and dispersion measures). BEDA Bivariate (mostly correlation and contingency tables) MEDA Multivariate: Missing values analysis and imputation, principal components and clustering. Outliers and variable transformation or Engineering (that involve UEDA, BEDA and MEDA. E.g., log (Var1) is Univariate transformation. PCA is multivariate Transformation).
  • 7. Statistical Inference. It embraces all of EDA and modeling, and allows for comparisons and statements about similarity (or lack of it) in many different situations. All data sets are assumed to be representative samples from an infinite population, which can be unrealistic. If we are interested in the heights of first graders in a specific school at a specific time, the data are the entire population, there is no uncertainty about the recorded heights, and statistical inference is not required. If the interest is in height changes across time (and we thus work with samples), then we use statistical inference to infer information about the overall population (perhaps ideal, not real).
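The contrast between describing a full population and inferring from a sample can be sketched with simulated heights. All numbers below are generated for illustration and are not data from the deck; the mean of 120 cm, the spread of 5 cm and the sample size of 25 are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

# Hypothetical heights (cm) of all 200 first graders in one school: the population.
population = [round(random.gauss(120, 5), 1) for _ in range(200)]

# With the entire population recorded, the mean is a plain description:
# there is nothing to infer.
population_mean = statistics.fmean(population)

# With only a sample, the mean is an estimate and carries sampling error,
# summarized here by the standard error of the mean.
sample = random.sample(population, 25)
sample_mean = statistics.fmean(sample)
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f} (standard error {standard_error:.2f})")
```

The standard error formula used here matches the infinite-population assumption mentioned in the slide; sampling a large share of a small finite population would call for a correction that this sketch omits.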
  • 9. Data set 1: Definition by way of Example
      • Health insurance company: Ophthalmologic Insurance Claims.
      • Is the claim valid or fraudulent?
      • Present operation: manual review of history and circumstances.
      • Alternative: scoring analytical system.
      • Data Mining Solution: use data on past claims to verify fraud.
  • 10. Alphabetic List of Variables and Attributes

      #  Variable         Type  Len  Format   Informat  Label
      3  DOCTOR_VISITS    Num   8    BEST12.  F12.      Total visits to a doctor
      1  FRAUD            Num   8    BEST12.  F12.      Fraudulent Activity yes/no
      5  MEMBER_DURATION  Num   8                       Membership duration
      4  NO_CLAIMS        Num   8    BEST12.  F12.      No of claims made recently
      7  NUM_MEMBERS      Num   8                       Number of members covered
      6  OPTOM_PRESC      Num   8    BEST12.  F12.      Number of opticals claimed
      2  TOTAL_SPEND      Num   8    BEST12.  F12.      Total spent on opticals

      SAS EXAMPLE for fraud data set:

      ods html;
      proc contents data = fraud.fraud;
      run;
      ods html close;
  • 11. Data set 2 (DS2): Babies' deaths (health care example).
      • Babies' death or survivability.
      • Determine basic statistics for mostly binary variables.
      • Data anomalies?
      • Is present data representative or anomalous?
  • 12. Alphabetic List of Variables and Attributes

      #   Variable    Type  Len  Label
      16  Const       Num   8
      18  H           Num   8
      17  M1          Num   8
      7   abort       Num   8    past abortion
      1   death       Num   8    Death
      9   dyslab      Num   8    Labor progress
      5   gestage     Num   8    Gestational AGe
      8   hydramnios  Num   8    Too much amniotic fluid
      6   isoimm      Num   8    Iso immunization
      15  malpres     Num   8    Mal Presented
      11  nomonit     Num   8    No Monitor
      2   nonwhite    Num   8    Non-White
      4   nullip      Num   8    Null Parity
      10  placord     Num   8    Placental - cord anomaly
      14  prerupt     Num   8    PROM
      3   teenages    Num   8    Early Age
      12  twint       Num   8    Twin, Triplet
      13  ward        Num   8    Public Ward
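For mostly binary variables such as those listed for DS2, the basic statistics reduce to counts and proportions. The sketch below assumes a 0/1 coding with `None` for missing values. The function name and the sample values are made up and are not taken from the actual data set.

```python
def binary_summary(values):
    """Basic statistics for a 0/1 variable; missing values (None) are skipped."""
    observed = [v for v in values if v is not None]
    # Any value other than 0 or 1 is flagged as a data anomaly.
    if any(v not in (0, 1) for v in observed):
        raise ValueError("not a binary 0/1 variable")
    n = len(observed)
    ones = sum(observed)
    p = ones / n if n else float("nan")
    return {
        "n": n,
        "n_missing": len(values) - n,
        "ones": ones,
        "proportion": p,        # the mean of a 0/1 variable
        "variance": p * (1 - p),  # variance of a 0/1 variable with mean p
    }


# Made-up values for a variable like `death`, for illustration only.
death = [0, 0, 1, 0, None, 0, 1, 0, 0, 0]
summary = binary_summary(death)
print(summary)
```

Raising an error on values outside 0 and 1 is one simple way of surfacing the data anomalies that the DS2 slide asks about.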
  • 13. Data Set HMEQ (DS3). Reports characteristics and delinquency information for 5,960 home equity loans. A home equity loan is a loan in which the obligor uses the equity of his or her home as the underlying collateral. The data set has the following variables:
      ◾ BAD: 1 = applicant defaulted on loan or seriously delinquent; 0 = applicant paid loan
      ◾ LOAN: Amount of the loan request
      ◾ MORTDUE: Amount due on existing mortgage
      ◾ VALUE: Value of current property
      ◾ REASON: DebtCon = debt consolidation; HomeImp = home improvement
      ◾ JOB: Occupational categories
      ◾ YOJ: Years at present job
      ◾ DEROG: Number of major derogatory reports
      ◾ DELINQ: Number of delinquent credit lines
      ◾ CLAGE: Age of oldest credit line in months
      ◾ NINQ: Number of recent credit inquiries
      ◾ CLNO: Number of credit lines
      ◾ DEBTINC: Debt-to-income ratio
  • 14. Homework Questions:
  • 15. Statistical Inference.
  • 16. NEXT:
      UEDA: Single variable analysis.
      BEDA: Analysis of pairs of variables, typically correlation and chi-square analysis.
      MEDA: Focuses on dimension reduction. The dimensions studied are variables and observations:
         Methods to group variables (PCA, FA).
         Methods to group observations (cluster analysis).
      Plus:
         Missing value imputation.
         Multivariate transformations.
         Multivariate outlier detection.