Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-1
2019-02-07
DOI: 10.13140/RG.2.2.20892.33928
Contents
Definition
Data sets and Examples
UEDA: Univariate EDA
Definitions of central tendency and variability.
Data set descriptions.
Transformations.
Statistical Inference – Probability.
Univariate Data Distributions – Normality.
CLT – Statistical Tests.
Sampling (under construction)
Statistical Puzzles
Bayesian Inference (under construction).
BEDA
Continuous Variables – Correlations - Causation
Nominal Variables – Chi Square – Odds
Simpson’s Paradox
Contents (cont.)
MEDA
Principal Components
Factor Analysis
Clustering and Segmentation
Canonical Discriminant Analysis
Missing Value imputation
Outliers and Variable Transformations.
EDA: definition, purpose and usefulness.
EDA is typically (or should be) the initial step to view and try to comprehend a
data set, assumed to be of rectangular form: columns are called variables, rows
observations. The aim is comprehension of individual variables and of
relationships across variables.
Given the size of present databases, with hundreds if not thousands of variables
(attributes) and possibly millions of observations, it is hard or impossible to
obtain a full conceptual understanding of the informational content of the data.
Since many applications lead to (some) modeling, it is also possible and
desirable to perform EDA on the outcome of these procedures. We can thus envision
EDA applied twice, prior to and after model creation, possibly restarting the
modeling effort.
Variables are random when their values cannot be known with certainty, i.e.,
they are not deterministic. For instance, it is not possible to know the next
outcome of a roulette bet, but we "know" the probability of success of such an
outcome.
When the probability is not knowable, we speak of uncertainty; otherwise, of risk.
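To make the roulette remark concrete, the following is a minimal sketch that treats a bet as a random variable whose next outcome is unknown but whose probability is known (risk, in the terms used above). It assumes a European wheel with 37 pockets, which the slides do not specify, and the function names are hypothetical.

```python
import random
from fractions import Fraction

POCKETS = 37  # assumption: European wheel (0-36); the slides do not name a wheel


def win_probability(numbers_covered: int, pockets: int = POCKETS) -> Fraction:
    """Known probability that a bet covering `numbers_covered` pockets wins."""
    return Fraction(numbers_covered, pockets)


def simulate_win_rate(numbers_covered: int, spins: int,
                      pockets: int = POCKETS, seed: int = 0) -> float:
    """Share of winning spins in a simulation.

    Each individual spin stays unpredictable; only the long-run share
    approaches the known probability.
    """
    rng = random.Random(seed)
    wins = sum(rng.randrange(pockets) < numbers_covered for _ in range(spins))
    return wins / spins


if __name__ == "__main__":
    print("Known probability of a single-number bet:", win_probability(1))
    print("Simulated win rate:", simulate_win_rate(1, spins=100_000))
```

Under uncertainty, by contrast, there would be no agreed value to return from `win_probability` in the first place.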
Data sets in a large data set setting contain hundreds of variables, if not
more, with the following taxonomy:
1) Numeric.
2) Character: considered nominal, i.e., no cardinality.
3) Ordinal.
4) Ratio: ratio of numeric variables.
5) Id: a. Id proper, nominal variables: patient id, SSN, etc.
       b. Indices: time index, patient visit number, etc.

Can compute                           Nominal  Ordinal  Interval  Ratio
Frequency distribution                Yes      Yes      Yes       Yes
Median and percentiles                No       Yes      Yes       Yes
Add or subtract                       No       No       Yes       Yes
Mean, standard deviation,
  standard error of the mean          No       No       Yes       Yes
Ratio, or coefficient of variation    No       No       No        Yes
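The table above can be read as a lookup: each computation becomes meaningful from some measurement level upward. The sketch below encodes that reading; the names `LEVELS`, `MIN_LEVEL`, `can_compute` and `allowed_statistics` are hypothetical and only restate the table.

```python
# Measurement levels in increasing order of structure, as in the table above.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Lowest level at which each computation is meaningful (taken from the table).
MIN_LEVEL = {
    "frequency distribution": "nominal",
    "median and percentiles": "ordinal",
    "add or subtract": "interval",
    "mean, standard deviation, standard error": "interval",
    "ratio, coefficient of variation": "ratio",
}


def can_compute(statistic: str, level: str) -> bool:
    """Return True if `statistic` is meaningful for a variable of `level`."""
    return LEVELS.index(level) >= LEVELS.index(MIN_LEVEL[statistic])


def allowed_statistics(level: str) -> list[str]:
    """List every computation from the table that is valid at `level`."""
    return [s for s in MIN_LEVEL if can_compute(s, level)]


if __name__ == "__main__":
    for level in LEVELS:
        print(level, "->", allowed_statistics(level))
```

A check of this kind could run before UEDA to avoid, for example, averaging a nominal code such as a patient id.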
Proposed EDA steps
Variables (attributes) can be analyzed in themselves (univariate), in relation to
other single variables (bivariate), and as a whole (multivariate).
Conceptualization becomes more difficult as we progress from one to many, and
requires answers to questions such as: which one is truly bigger?
We can envision 5 steps in EDA:
1) Data set definition: at least number of observations, variables and taxonomy.
2) UEDA, univariate: central location and dispersion measures.
3) BEDA, bivariate: mostly correlation and contingency tables.
4) MEDA, multivariate: missing values analysis and imputation, principal
   components and clustering.
5) Outliers and variable transformation or engineering, which involve UEDA, BEDA
   and MEDA. E.g., log(Var1) is a univariate transformation; PCA is a
   multivariate transformation.
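As a rough illustration of the univariate step and of log(Var1) as a univariate transformation, the sketch below computes central location and dispersion before and after taking logs. The values in `var1` are made up for illustration, and `describe` and `log_transform` are hypothetical helper names.

```python
import math
import statistics


def describe(values):
    """Basic UEDA measures: central location and dispersion."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation
    }


def log_transform(values):
    """Univariate transformation log(Var1); requires strictly positive values."""
    return [math.log(v) for v in values]


# Made-up, strongly right-skewed values (not taken from any of the data sets).
var1 = [1.0, 10.0, 100.0, 1000.0]

if __name__ == "__main__":
    print("raw:", describe(var1))
    print("log:", describe(log_transform(var1)))
```

On the raw scale the mean sits far above the median; after the log transformation the two coincide for these values, which is the kind of effect a univariate transformation is meant to achieve.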
Statistical Inference
It embraces all of EDA and modeling, and allows for comparisons and statements
about similarity (or lack of it) in many different situations.
All data sets are assumed to be representative samples from an infinite
population, which can be unrealistic. If we are interested in the heights of
first graders in a specific school at a specific time, the data is the entire
population: there is no sampling uncertainty about the recorded heights, and thus
statistical inference is not required.
If interest is in height changes across time (and we thus work with samples),
then we use statistical inference to infer information about the overall
population (perhaps ideal, not real).
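The contrast between having the whole population and working from a sample can be sketched as follows. The heights are simulated, not real data, and the school size, sample size and function names are assumptions made only for this illustration.

```python
import math
import random
import statistics


def population_mean(heights):
    """With the entire population recorded, the mean is simply computed."""
    return statistics.fmean(heights)


def sample_mean_with_se(sample):
    """With a sample, the mean is an estimate and carries a standard error."""
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, se


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated heights in cm for every first grader of one school (made up).
    school = [rng.gauss(120.0, 5.0) for _ in range(200)]
    print("Population mean, no inference needed:", round(population_mean(school), 2))

    # If only 30 children had been measured, inference would be required.
    sample = rng.sample(school, 30)
    mean, se = sample_mean_with_se(sample)
    print("Sample mean:", round(mean, 2), "standard error:", round(se, 2))
```

The standard error only has meaning in the second case, where the 30 measured children stand in for a larger group.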
Data set 1: Definition by way of Example
• Health insurance company: Ophthalmologic Insurance Claims
• Is claim valid or fraudulent?
• Present operation:
  • Manual review of history and circumstances
• Alternative:
  • Scoring analytical system.
• Data Mining Solution: Use data on past claims to verify fraud
Alphabetic List of Variables and Attributes
#  Variable         Type  Len  Format   Informat  Label
3  DOCTOR_VISITS    Num   8    BEST12.  F12.      Total visits to a doctor
1  FRAUD            Num   8    BEST12.  F12.      Fraudulent Activity yes/no
5  MEMBER_DURATION  Num   8                       Membership duration
4  NO_CLAIMS        Num   8    BEST12.  F12.      No of claims made recently
7  NUM_MEMBERS      Num   8                       Number of members covered
6  OPTOM_PRESC      Num   8    BEST12.  F12.      Number of opticals claimed
2  TOTAL_SPEND      Num   8    BEST12.  F12.      Total spent on opticals
SAS EXAMPLE for fraud data set:
ods html;
proc contents data = fraud.fraud;
run;
ods html close;
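For readers without SAS, the idea behind the listing above (position, name and type of each variable, sorted alphabetically) can be approximated in a few lines of Python. This is only a sketch, not an equivalent of PROC CONTENTS: the `contents` function is hypothetical, it infers types from the first record only, and the record values are made up.

```python
def contents(rows):
    """Return (position, name, type) per variable, sorted alphabetically by name.

    `rows` is a list of dicts, one per observation. Types are inferred from
    the first observation only, which is a simplification.
    """
    first = rows[0]
    listing = [
        (position, name, "Num" if isinstance(value, (int, float)) else "Char")
        for position, (name, value) in enumerate(first.items(), start=1)
    ]
    return sorted(listing, key=lambda item: item[1])


# Made-up records using three variable names from the fraud data set.
fraud_rows = [
    {"FRAUD": 0, "TOTAL_SPEND": 120.0, "DOCTOR_VISITS": 3},
    {"FRAUD": 1, "TOTAL_SPEND": 980.0, "DOCTOR_VISITS": 11},
]

if __name__ == "__main__":
    print("observations:", len(fraud_rows))
    for position, name, kind in contents(fraud_rows):
        print(position, name, kind)
```

As in the SAS output, the position column keeps the original variable order while the rows are listed alphabetically.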
Data set 2 (DS2): Babies’ deaths (health
care example).
• Babies’ death or survivability.
• Determine basic statistics for mostly binary
variables.
• Data anomalies?
• Is present data representative or
anomalous?
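For binary (0/1) variables such as those in this data set, the basic statistics reduce to a count and a proportion, and a first anomaly check is simply whether any value falls outside {0, 1}. The sketch below shows this under the assumption that variables are coded 0/1; `binary_summary` is a hypothetical helper and the example values are made up.

```python
def binary_summary(values):
    """Count, proportion of ones, variance, and values that are not 0 or 1."""
    anomalies = [v for v in values if v not in (0, 1)]
    valid = [v for v in values if v in (0, 1)]
    # For a 0/1 variable the mean is the proportion of ones.
    p = sum(valid) / len(valid) if valid else float("nan")
    return {
        "n": len(valid),
        "proportion": p,
        "variance": p * (1 - p),  # population variance of a 0/1 variable
        "anomalies": anomalies,
    }


if __name__ == "__main__":
    # Made-up values for a variable such as `death`; the 2 is a coding anomaly.
    print(binary_summary([0, 1, 1, 0, 2]))
```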
Alphabetic List of Variables and Attributes
#   Variable    Type  Len  Label
16  Const       Num   8
18  H           Num   8
17  M1          Num   8
7   abort       Num   8    past abortion
1   death       Num   8    Death
9   dyslab      Num   8    Labor progress
5   gestage     Num   8    Gestational AGe
8   hydramnios  Num   8    Too much amniotic fluid
6   isoimm      Num   8    Iso immunization
15  malpres     Num   8    Mal Presented
11  nomonit     Num   8    No Monitor
2   nonwhite    Num   8    Non-White
4   nullip      Num   8    Null Parity
10  placord     Num   8    Placental - cord anomaly
14  prerupt     Num   8    PROM
3   teenages    Num   8    Early Age
12  twint       Num   8    Twin, Triplet
13  ward        Num   8    Public Ward
Data Set HMEQ (DS 3)
Reports characteristics and delinquency information for 5,960
home equity loans. A home equity loan is a loan where the obligor uses the
equity of his or her home as the underlying collateral.
The data set has the following variables:
◾ BAD: 1 = applicant defaulted on loan or seriously delinquent; 0 = applicant paid loan
◾ LOAN: Amount of the loan request
◾ MORTDUE: Amount due on existing mortgage
◾ VALUE: Value of current property
◾ REASON: DebtCon = debt consolidation; HomeImp = home improvement
◾ JOB: Occupational categories
◾ YOJ: Years at present job
◾ DEROG: Number of major derogatory reports
◾ DELINQ: Number of delinquent credit lines
◾ CLAGE: Age of oldest credit line in months
◾ NINQ: Number of recent credit inquiries
◾ CLNO: Number of credit lines
◾ DEBTINC: Debt-to-income ratio
Home Work Questions:
Statistical Inference.
NEXT:
UEDA: Single variable analysis.
BEDA: Analysis of pairs of variables, typically correlation and
chi-square analysis.
MEDA: Focuses on dimension reduction. The dimensions studied
are variables and observations:
  Methods to group variables (PCA, FA).
  Methods to group observations (cluster analysis).
Plus:
  Missing value imputation.
  Multivariate transformations.
  Multivariate outlier detection.

1 EDA.pdf

  • 1. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-1 2019-02-07 DOI: 10.13140/RG.2.2.20892.33928
  • 2. Leonardo Auslender –Ch. 1 Copyright 2004 -2 2019-02-07 Contents Definition Data sets and Examples UEDA: Univariate EDA Definitions of central tendency and variability. Data set descriptions. Transformations. Statistical Inference – Probability. Univariate Data Distributions – Normality. CLT – Statistical Tests. Sampling (under construction) Statistical Puzzles Bayesian Inference (under construction). BEDA Continuous Variables – Correlations - Causation Nominal Variables – Chi Square – Odds Simpson’s Paradox
  • 3. Leonardo Auslender –Ch. 1 Copyright 2004 -3 2019-02-07 Contents (cont.) MEDA Principal Components Factor Analysis Clustering and Segmentation Canonical Discrimination Analysis Missing Value imputation Outliers and Variable Transformations.
  • 4. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-4 2019-02-07 EDA: definition, purpose and usefulness. Typically (or should be) initial step to view and try to comprehend data set , assumed to be of rectangular form. Columns called variables, rows observations. Comprehension of individual and across individual variables. Given size of present data bases with hundreds if not thousands or more variables or attributes and possibly millions of observations, it is either hard or not possible to obtain full conceptual understanding of the informational content imbued in data. Since many applications lead to (some) modeling, also possible and desirable to perform EDA on outcome of these procedures  we can envision EDA applied twice, prior to and after model creation, and thus possibly restarting modeling effort. Variables are random when their values cannot be known with certainty, i.e., they are not deterministic variables. For instance, not possible to know the next outcome of roulette bet, we “know” probability of success of such outcome. When probability is not knowable  uncertainty. Otherwise, risk.
  • 5. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-5 2019-02-07 Data sets in large data set setting, contain hundreds (if not) more variables, with taxonomy: 1) Numeric, 2) Character, considered nominal, i.e., no cardinality. 3) Ordinal 4) Ratio: ratio of numeric variables. 5) Id: a. Id proper, nominal variables: patient id, SSN, etc. b. Indices: Time index, patient visit number, etc. Can compute Nominal Ordinal Interval Ratio frequency distribution. Yes Yes Yes Yes median and percentiles. No Yes Yes Yes add or subtract. No No Yes Yes mean, standard deviation, standard error of the mean. No No Yes Yes ratio, or coefficient of variation. No No No Yes
  • 6. Proposed EDA steps. Variables (attributes) can be analyzed in themselves (univariate), in relation to other single variables (bivariate) and as a whole (multivariate). Conceptualization becomes more difficult as we progress from one to many, and requires answers to questions such as: which one is truly bigger? We can envision 5 steps in EDA: 1) Data set definition: at least the number of observations, the variables and their taxonomy. 2) UEDA, univariate: central location and dispersion measures. 3) BEDA, bivariate: mostly correlation and contingency tables. 4) MEDA, multivariate: missing values analysis and imputation, principal components and clustering. 5) Outliers and variable transformation or engineering, which involve UEDA, BEDA and MEDA. E.g., log(Var1) is a univariate transformation; PCA is a multivariate transformation.
  • 7. Statistical Inference. It embraces all of EDA and modeling, and allows for comparisons and statements about similarity (or lack of it) in many different situations. All data sets are assumed to be representative samples from an infinite population, which can be unrealistic. If we are interested in the heights of first graders in a specific school at a specific time, the data is the entire population: there is no uncertainty about the recorded heights, and thus statistical inference is not required. If interest is in height changes across time (and we thus work with samples), then we use statistical inference to infer information about the overall population (perhaps ideal, not real).
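      To make the sample-versus-population distinction concrete, the sketch below computes a sample mean together with its standard error, the usual measure of how far the sample mean may sit from the population mean. The height values are invented for illustration and the function name mean_and_standard_error is an assumption, not something taken from the slides.

```python
import math
import statistics


def mean_and_standard_error(sample):
    """Return (mean, standard error of the mean) for a sample of numbers."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # statistics.stdev uses the n - 1 denominator (sample standard deviation).
    sd = statistics.stdev(sample)
    return mean, sd / math.sqrt(n)


if __name__ == "__main__":
    # Hypothetical heights in cm, treated as a sample from a larger population.
    heights_cm = [112.0, 118.5, 115.0, 120.5, 117.0, 114.0]
    mean, se = mean_and_standard_error(heights_cm)
    print(f"sample mean = {mean:.2f} cm, standard error = {se:.2f} cm")
```

      If the same six values were the entire population of interest, the mean would be known exactly and the standard error would have no role, which is the point made on this slide.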
  • 9. Data set 1: definition by way of example. • Health insurance company: ophthalmologic insurance claims. • Is a claim valid or fraudulent? • Present operation: manual review of history and circumstances. • Alternative: scoring analytical system. • Data mining solution: use data on past claims to detect fraud.
  • 10. Alphabetic List of Variables and Attributes
      #  Variable         Type  Len  Format   Informat  Label
      3  DOCTOR_VISITS    Num   8    BEST12.  F12.      Total visits to a doctor
      1  FRAUD            Num   8    BEST12.  F12.      Fraudulent Activity yes/no
      5  MEMBER_DURATION  Num   8                       Membership duration
      4  NO_CLAIMS        Num   8    BEST12.  F12.      No of claims made recently
      7  NUM_MEMBERS      Num   8                       Number of members covered
      6  OPTOM_PRESC      Num   8    BEST12.  F12.      Number of opticals claimed
      2  TOTAL_SPEND      Num   8    BEST12.  F12.      Total spent on opticals
      SAS example for the fraud data set:
          ods html;
          proc contents data = fraud.fraud;
          run;
          ods html close;
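      For readers without SAS, a rough Python analogue of the variable listing is sketched below. It is not a reimplementation of proc contents: it only reports each column's position, name and a Num/Char type, sorted alphabetically by name. The function describe_variables and the two claim records are hypothetical, with invented values for three of the variables listed on this slide.

```python
def describe_variables(rows):
    """Return (position, name, type) per column, sorted alphabetically by name.

    `rows` is a list of dicts sharing the same keys; position is the
    1-based column order in the first row.
    """
    listing = []
    for position, name in enumerate(rows[0], start=1):
        numeric = all(isinstance(row[name], (int, float)) for row in rows)
        listing.append((position, name, "Num" if numeric else "Char"))
    return sorted(listing, key=lambda item: item[1])


# Two invented records; only the column names come from the slide.
claims = [
    {"FRAUD": 0, "TOTAL_SPEND": 120.0, "DOCTOR_VISITS": 2},
    {"FRAUD": 1, "TOTAL_SPEND": 950.0, "DOCTOR_VISITS": 7},
]

if __name__ == "__main__":
    print("#  Variable        Type")
    for position, name, kind in describe_variables(claims):
        print(f"{position:<2} {name:<15} {kind}")
```

      As in the listing on this slide, the rows come out in alphabetical order while the # column keeps each variable's original position in the data set.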
  • 11. Data set 2 (DS2): Babies’ deaths (health care example). • Babies’ death or survival. • Determine basic statistics for mostly binary variables. • Are there data anomalies? • Is the present data representative or anomalous?
  • 12. Alphabetic List of Variables and Attributes
      #   Variable    Type  Len  Label
      16  Const       Num   8
      18  H           Num   8
      17  M1          Num   8
      7   abort       Num   8    past abortion
      1   death       Num   8    Death
      9   dyslab      Num   8    Labor progress
      5   gestage     Num   8    Gestational AGe
      8   hydramnios  Num   8    Too much amniotic fluid
      6   isoimm      Num   8    Iso immunization
      15  malpres     Num   8    Mal Presented
      11  nomonit     Num   8    No Monitor
      2   nonwhite    Num   8    Non-White
      4   nullip      Num   8    Null Parity
      10  placord     Num   8    Placental - cord anomaly
      14  prerupt     Num   8    PROM
      3   teenages    Num   8    Early Age
      12  twint       Num   8    Twin, Triplet
      13  ward        Num   8    Public Ward
  • 13. Data Set HMEQ (DS3). Reports characteristics and delinquency information for 5,960 home equity loans. A home equity loan is a loan where the obligor uses the equity of his or her home as the underlying collateral. The data set has the following variables:
      ◾ BAD: 1 = applicant defaulted on loan or seriously delinquent; 0 = applicant paid loan
      ◾ LOAN: Amount of the loan request
      ◾ MORTDUE: Amount due on existing mortgage
      ◾ VALUE: Value of current property
      ◾ REASON: DebtCon = debt consolidation; HomeImp = home improvement
      ◾ JOB: Occupational categories
      ◾ YOJ: Years at present job
      ◾ DEROG: Number of major derogatory reports
      ◾ DELINQ: Number of delinquent credit lines
      ◾ CLAGE: Age of oldest credit line in months
      ◾ NINQ: Number of recent credit inquiries
      ◾ CLNO: Number of credit lines
      ◾ DEBTINC: Debt-to-income ratio
  • 14. Home Work Questions:
  • 15. Statistical Inference.
  • 16. NEXT: UEDA: single-variable analysis. BEDA: analysis of pairs of variables, typically correlation and chi-square analysis. MEDA: focuses on dimension reduction. The dimensions studied are variables and observations: methods to group variables (PCA, FA) and methods to group observations (cluster analysis), plus missing value imputation, multivariate transformations and multivariate outlier detection.