C11BD Big Data Analytics
Data Exploration II
Dr Dhanan Utomo
Assistant Professor in Business & Management
Edinburgh Business School
Outline
• Quick recap
• Feature selection
• Introduction to advanced data exploration
• Visualising relationships between features
• Covariance and correlation
Learning Outcomes
• Understand how to indicate which descriptive features may be useful
for predicting a target feature
Quick recap
Where are we in the course:
Week  Topic                                    Lecturer: Dubai  Lecturer: Edinburgh
3     Data Exploration I                       Dr Tarek Kandil  Dr Paulus Aditjandra
4     Data Exploration II                      Dr Tarek Kandil  Dr Dhanan Utomo
5     Intro to modelling and machine learning  Dr Tarek Kandil  Dr Dhanan Utomo
Assessments:
• Week 5: Coursework 1 is due on Thursday, 15th February
• Week 5: Coursework 2 will be introduced
Recap on Coursework 1
• Individual assignment that counts towards 40% of your course mark
• Refer to the instructions on Canvas
Feature Selection
• Designing features
• Feature selection methods
Designing Features
A feature is any measurable input that can be used in a predictive
model (e.g. salary or bank balance in predicting loan default)
Three key data considerations are particularly important when we are
designing features:
1. Data availability
• Are values for that feature available in the database?
• For a derived feature, are the values available to compute the derived
feature?
2. Timing
• When will data become available for a feature?
3. Longevity
• Data may become stale, e.g. a salary recorded at the time of a loan
application may have changed since then
Feature design and implementation is an iterative process
The features in an ABT (analytics base table) can be either
• Raw features: Concrete, directly measurable features stored in the
database
• Derived features: Calculated from raw features, e.g. body mass index,
calculated as mass divided by the square of height
Feature engineering is the process of selecting, manipulating and
transforming raw data into derived features.
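As a minimal sketch of deriving a feature from raw ones, the snippet below adds a body mass index value to a few rows. The rows, the field names `mass_kg` and `height_m`, and the helper `add_bmi` are hypothetical and only illustrate the idea.

```python
# Hypothetical raw features for three people (values are illustrative only).
raw_rows = [
    {"mass_kg": 70.0, "height_m": 1.75},
    {"mass_kg": 95.0, "height_m": 1.98},
    {"mass_kg": 54.0, "height_m": 1.60},
]


def add_bmi(row):
    """Return a copy of the row with a derived BMI feature added."""
    derived = dict(row)
    # BMI is mass in kilograms divided by the square of height in metres.
    derived["bmi"] = row["mass_kg"] / row["height_m"] ** 2
    return derived


abt_rows = [add_bmi(row) for row in raw_rows]
for row in abt_rows:
    print(row["mass_kg"], row["height_m"], round(row["bmi"], 1))
```

The raw rows are left unchanged, so the derived feature can be recomputed whenever the raw values are refreshed.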
Feature Selection
Why feature selection? Fewer features result in
• Simpler models
• Shorter training times
• Less overfitting, therefore better generalization
Too many versus too few features
• Too few features may not have sufficient discriminative power, which
can result in more false positives/negatives (underfitting)
• Too many features introduce noise into the training data and can lead
to overfitting, as well as increased computational complexity
Types of features
• Informative – features correlated with the output variables/targets
• Features with unique values or very small deviation – should be
removed
• Redundant features – have correlations with other features
• Irrelevant features – noise, have no correlation with output variables
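The four types above can be told apart with simple statistics. The sketch below is one possible way to do it, assuming a tiny made-up dataset and arbitrary thresholds (0.95 for redundancy, 0.5 for informativeness); the feature names, the thresholds and the helper functions are all illustrative assumptions.

```python
from statistics import mean, pstdev


def correlation(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)
    return cov / (pstdev(a) * pstdev(b))


# Hypothetical toy dataset: four descriptive features and one target.
features = {
    "salary": [20, 35, 50, 65, 80, 95],
    "salary_x2": [40, 70, 100, 130, 160, 190],  # exact multiple of salary
    "branch_code": [7, 7, 7, 7, 7, 7],          # same value everywhere
    "shoe_size": [42, 38, 44, 39, 43, 40],
}
target = [1, 2, 3, 4, 5, 6]


def classify(features, target, redundant_at=0.95, informative_at=0.5):
    """Label each feature as constant, redundant, informative or irrelevant."""
    labels, kept = {}, []
    for name, values in features.items():
        if pstdev(values) == 0:
            labels[name] = "constant"  # no variation, nothing to learn from
        elif any(abs(correlation(values, features[k])) >= redundant_at for k in kept):
            labels[name] = "redundant"  # duplicates a feature already kept
        elif abs(correlation(values, target)) >= informative_at:
            labels[name] = "informative"
            kept.append(name)
        else:
            labels[name] = "irrelevant"
    return labels


labels = classify(features, target)
for name, label in labels.items():
    print(f"{name}: {label}")
```

The constant check runs first because correlation is undefined for a feature with zero standard deviation.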
What is feature selection?
• The process of selecting a subset of the most relevant features for
use in model construction
• Remove features without loss of information
• Keep the features that describe the variance in a dataset
Approaches to feature selection
• Filter methods
• Wrapper methods
• Embedded methods
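Filter methods score features without training a model, wrapper methods score candidate feature subsets by training and evaluating a model on them, and embedded methods select features as part of model training. A wrapper method can be sketched without any library. The example below assumes greedy forward selection scored by the leave-one-out accuracy of a 1-nearest-neighbour classifier; both choices, the toy data and all function names are illustrative assumptions, not the method prescribed in this course.

```python
def loo_accuracy(rows, labels, cols):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the chosen columns."""
    correct = 0
    for i, row in enumerate(rows):
        # Find the closest other instance using only the chosen columns.
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[c] - rows[j][c]) ** 2 for c in cols),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def forward_select(rows, labels, n_features):
    """Greedily add the feature that most improves the score; stop when none does."""
    selected, best_score = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        score, col = max((loo_accuracy(rows, labels, selected + [c]), c) for c in remaining)
        if score <= best_score:
            break
        selected.append(col)
        remaining.remove(col)
        best_score = score
    return selected, best_score


# Hypothetical data: column 0 separates the classes, column 1 is noise.
rows = [(1.0, 5.0), (1.2, 1.0), (0.9, 3.0), (3.0, 1.2), (3.1, 4.8), (2.9, 3.1)]
labels = [0, 0, 0, 1, 1, 1]

selected, best_score = forward_select(rows, labels, n_features=2)
print("selected columns:", selected, "score:", best_score)
```

Because every candidate subset requires a fresh round of model evaluation, wrapper methods are usually the most expensive of the three approaches.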
Introduction to Advanced Data Exploration
Advanced Data Exploration
In data exploration, we used descriptive statistics and data
visualisation techniques to examine the characteristics of individual
features
In advanced data exploration, we examine and analyse the relationships
between pairs of features, in order to assist in
• Indicating which descriptive features might be useful for predicting a
target feature
• Finding pairs of descriptive features that are closely related
Case study: The details of a professional basketball team
Visualising Relationships Between Features
• Continuous features
• Categorical features
• Categorical vs continuous features
Visualising Relationships Between Features:
Continuous Features
For visualising pairs of continuous features, use a scatter plot
• A scatter plot is based on two axes: The horizontal axis represents
one feature and the vertical axis represents a second
• Each instance in a dataset is represented by a point on the plot
determined by the values for that instance of the two features
involved
A scatter plot matrix (SPLOM) shows scatter plots for a whole
collection of features arranged into a matrix
This is useful for exploring the relationships between groups of
features (e.g. all of the continuous features in an ABT)
The effectiveness of scatter plot matrices diminishes once the set
contains more than about eight features, because the individual plots
become too small to read
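To see why a SPLOM gets crowded so quickly, it helps to count its cells. The sketch below assumes a matrix with one row and one column per feature; the helper `splom_size` and the placeholder feature names are hypothetical.

```python
from itertools import combinations


def splom_size(feature_names):
    """Return (total cells, distinct feature pairs) for a scatter plot matrix."""
    n = len(feature_names)
    distinct_pairs = list(combinations(feature_names, 2))
    return n * n, len(distinct_pairs)


for n in (4, 8, 12):
    names = [f"f{i}" for i in range(1, n + 1)]  # placeholder feature names
    cells, pairs = splom_size(names)
    print(f"{n} features: {cells} cells, {pairs} distinct pairs")
```

The number of cells grows with the square of the number of features, so each plot shrinks rapidly on a fixed-size page.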
Visualising Relationships Between Features:
Categorical Features
For visualising pairs of categorical features, use a collection of small
multiple bar plots (small multiples visualisation)
1. Draw a simple bar plot indicating the densities of the different
levels of the first feature
2. For each level of the second feature, draw a bar plot of the first
feature using only the instances in the dataset for which the
second feature has that level
If the two features have a strong relationship, the bar plots for each
level of the second feature will look noticeably different to one
another and to the overall bar plot for the first feature
If there is no relationship, then we expect that the levels of the first
feature will be evenly distributed amongst the instances having the
different levels of the second feature, so all bar plots will look much
the same
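The numbers behind such a small multiples visualisation are just conditional relative frequencies. The sketch below computes them for a made-up pair of categorical features; the instances, the level names and the helper `densities` are hypothetical, and no plotting is done.

```python
from collections import Counter

# Hypothetical instances: (level of first feature, level of second feature).
instances = [
    ("guard", "yes"), ("guard", "yes"), ("guard", "no"),
    ("forward", "yes"), ("forward", "no"), ("forward", "no"),
    ("centre", "no"), ("centre", "no"),
]

first_levels = sorted({first for first, _ in instances})
second_levels = sorted({second for _, second in instances})


def densities(values, levels):
    """Relative frequency of each level: the bar heights of one bar plot."""
    counts = Counter(values)
    total = sum(counts.values())
    return {level: counts[level] / total for level in levels}


# Step 1: overall bar plot of the first feature.
overall = densities([first for first, _ in instances], first_levels)

# Step 2: one bar plot of the first feature per level of the second feature.
small_multiples = {
    level: densities([f for f, s in instances if s == level], first_levels)
    for level in second_levels
}

print("overall:", overall)
for level, bars in small_multiples.items():
    print(f"second feature = {level}:", bars)
```

If the per-level dictionaries differ clearly from each other and from `overall`, the bar plots would look different too, which suggests a relationship between the two features.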
For visualising pairs of categorical features where the number of levels
for one of the features being compared is no more than three, stacked
bar plots can be used as an alternative to the small multiples
1. Show a bar plot of the first feature above another bar plot that
shows the relative distribution of the levels of the second feature
within each level of the first
2. With relative distributions used, the bars in the second bar plot
cover the full range of the space available
If the two features are unrelated, we expect to see the same proportion
of each level of the second feature within the bars for each level of the
first
Visualising Relationships Between Features:
Categorical vs Continuous Features
For visualising pairs of categorical and continuous features, use a small
multiples approach that draws a histogram of the values of the
continuous feature for each level of the categorical feature
Each histogram includes only those instances in the dataset that have
the associated level of the categorical feature
If the features are unrelated (or independent), the histograms for each
level should be very similar
If the features are related, however, the shapes and/or the central
tendencies of the histograms will be different
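A minimal sketch of the data behind these small multiple histograms follows. It assumes made-up instances and fixed bin edges (180 to 220 in four bins); the values, the bin settings and the helper `histogram` are illustrative only.

```python
from collections import defaultdict

# Hypothetical (categorical level, continuous value) instances.
instances = [
    ("guard", 185), ("guard", 188), ("guard", 191),
    ("forward", 198), ("forward", 201), ("forward", 205),
    ("centre", 208), ("centre", 211),
]


def histogram(values, low, high, n_bins):
    """Count values in n_bins equal-width bins spanning [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for value in values:
        # A value equal to `high` is placed in the last bin.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Group the continuous values by level of the categorical feature.
by_level = defaultdict(list)
for level, value in instances:
    by_level[level].append(value)

# Use the same bins for every level so the histograms are comparable.
histograms = {
    level: histogram(values, low=180, high=220, n_bins=4)
    for level, values in by_level.items()
}

for level, counts in histograms.items():
    print(level, counts)
```

Sharing the bin edges across levels matters: with per-level bins, histograms of very different groups could look deceptively alike.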
Another approach to visualising the relationship between a categorical
feature and a continuous feature is to use a collection of box plots
For each level of the categorical feature a box plot of the
corresponding values of the continuous feature is drawn
When a relationship exists between the two features, the box plots
should show differing central tendencies and variations
When no relationship exists, the box plots should all appear similar
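Each box plot summarises one group by its quartiles and extremes. The sketch below computes those summaries per level for made-up data; the values and the helper `five_number_summary` are hypothetical, and the exact quartile values depend on the quantile method (here the default of `statistics.quantiles`).

```python
from statistics import median, quantiles

# Hypothetical continuous values grouped by level of a categorical feature.
by_level = {
    "guard": [183, 185, 188, 190, 191],
    "forward": [196, 198, 201, 203, 205],
    "centre": [206, 208, 210, 211, 213],
}


def five_number_summary(values):
    """Minimum, quartiles and maximum: the numbers a box plot displays."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, Q2, Q3
    return {
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "q3": q3,
        "max": max(values),
    }


summaries = {level: five_number_summary(values) for level, values in by_level.items()}
for level, summary in summaries.items():
    print(level, summary)
```

Clearly separated medians and non-overlapping boxes across levels, as in this toy data, are the pattern to look for when a relationship exists.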
Covariance and Correlation
• Measuring covariance and correlation
• Covariance and correlation matrix
Measuring Covariance and Correlation
In addition to visually inspecting scatter plots, formal measures of the
relationship between two continuous features can be calculated using
covariance and correlation
Covariance values fall into the range (−∞, ∞)
• Negative values indicate a negative relationship
• Positive values indicate a positive relationship
• Values near zero indicate that there is little or no linear
relationship between the features
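The slides do not state the formula, so for reference: the sample covariance between two features a and b over n instances is conventionally defined as below, where \bar{a} and \bar{b} are the feature means. Some sources divide by n rather than n − 1.

```latex
\mathrm{cov}(a, b) = \frac{1}{n - 1} \sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b})
```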
Calculating the covariance between the HEIGHT feature and the
WEIGHT and AGE features in the case study gives
• cov(HEIGHT, WEIGHT) = 241.72, which indicates a strong
positive relationship between the height and weight of a player
• cov(HEIGHT, AGE) = 19.7, which indicates a much smaller positive
relationship between height and age
A problem with using covariance is that its value depends on the
units of the features it measures (it is expressed in the product of
their units)
Comparing the covariance between pairs of features only makes sense
if each pair of features is composed of the same mixture of units
Correlation is a normalised form of covariance that ranges between −1
and 1
$$\mathrm{corr}(a, b) = \frac{\mathrm{cov}(a, b)}{\sigma_a \sigma_b}$$
where cov(a, b) is the covariance between features a and b, and σ_a
and σ_b are the standard deviations of a and b, respectively
• Correlation is dimensionless
• Values close to -1 indicate a very strong negative correlation (or
covariance)
• Values close to 1 indicate a very strong positive correlation
• Values around 0 indicate little or no linear correlation; this does
not by itself mean that the features are independent
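The effect of the normalisation can be checked numerically. The sketch below computes sample covariance and correlation from scratch for made-up heights and weights (not the case-study data) and repeats the calculation with height in metres instead of centimetres; all names and values are illustrative assumptions.

```python
from math import sqrt


def covariance(a, b):
    """Sample covariance (divides by n - 1)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def correlation(a, b):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(a, b) / sqrt(covariance(a, a) * covariance(b, b))


# Hypothetical heights and weights for five players.
height_cm = [180, 185, 190, 200, 210]
weight_kg = [78, 84, 88, 97, 110]
height_m = [h / 100 for h in height_cm]  # same heights, different unit

print("cov (cm, kg):", covariance(height_cm, weight_kg))
print("cov (m, kg): ", covariance(height_m, weight_kg))
print("corr (cm, kg):", correlation(height_cm, weight_kg))
print("corr (m, kg): ", correlation(height_m, weight_kg))
```

Changing the unit rescales the covariance by a factor of 100 but leaves the correlation unchanged, which is why correlations can be compared across pairs of features and covariances generally cannot.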
Measuring the covariance and correlation between the HEIGHT feature
and the WEIGHT and AGE features in the case study gives
• cov(HEIGHT, WEIGHT) = 241.72
• cov(HEIGHT, AGE) = 19.7
• corr(HEIGHT, WEIGHT) = 0.898
• corr(HEIGHT, AGE) = 0.345
Covariance and Correlation Matrix
There are typically multiple continuous features between which we
would like to explore relationships
Two tools that can be useful for this are the covariance matrix and the
correlation matrix
The scatter plot matrix (SPLOM) is a visualisation of the correlation
matrix
This can be made more obvious by including the correlation
coefficients in SPLOMs in the cells above the diagonal
Covariance and Correlation Discussion
Correlation is a good measure of the relationship between two
continuous features, but it is not by any means perfect
• The correlation measure responds only to linear relationships
between features
• Peculiarities in a dataset can affect the calculation of the correlation
between two features, illustrated very clearly in the famous example
of Anscombe's quartet by Francis Anscombe
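The first limitation is easy to demonstrate. In the sketch below, which uses made-up values and a hypothetical `correlation` helper, one feature is an exact linear function of x and another is exactly x squared; both are fully determined by x, yet only the linear one shows up in the correlation.

```python
from math import sqrt


def correlation(a, b):
    """Pearson correlation; the 1/(n - 1) factors cancel, so they are omitted."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]  # perfectly linear in x
y_squared = [v ** 2 for v in x]    # perfectly determined by x, but not linear

print("linear:   ", correlation(x, y_linear))
print("quadratic:", correlation(x, y_squared))
```

A correlation near zero therefore rules out a linear relationship only; a scatter plot is still needed to spot curved patterns like the quadratic one.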
Anscombe's quartet: A series of four pairs of features that all have
practically the same correlation value of about 0.816, even though they
exhibit very different relationships
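The claim can be checked directly. The sketch below uses the commonly reproduced values of the quartet and a hypothetical `correlation` helper; it is a quick verification, not part of the original slides.

```python
from math import sqrt


def correlation(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Anscombe's quartet: datasets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {name: correlation(x, y) for name, (x, y) in quartet.items()}
for name, r in correlations.items():
    print(f"dataset {name}: r = {r:.4f}")
```

The four values agree to about three decimal places, so the correlation alone cannot distinguish a clean linear trend from a curve or from a pattern driven by a single outlier.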
Perhaps the most important thing to remember in relation to
correlation is that correlation does not necessarily imply causation
Just because the values of two features are correlated does not mean
that an actual causal relationship exists between the two
Based on correlation tests alone, one could conclude, e.g., that the
presence of swallows causes hot weather; in reality, swallows migrate
to warmer countries, so the warm weather explains the swallows rather
than the other way round
Textbook References
Fundamentals of Machine Learning for Predictive Data Analytics:
Algorithms, Worked Examples, and Case Studies by JD Kelleher, B Mac
Namee and A D’Arcy (2015)
• Designing and implementing features (pp. 77-91)
• Visualising relationships between features (pp. 127-135)
• Measuring covariance and correlation (pp. 136-140)
Tutorial
Please prepare for the tutorial session and go through the tasks as
specified on Canvas.
• Dubai on-campus session at 18:00 (Dubai) on Wednesday
• Edinburgh on-campus session at 09:00 (GMT) on Friday
50

More Related Content

Similar to C11BD 22-23 data ana-Exploration II.pptx

Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Rebecca Bilbro
 
30thSep2014
30thSep201430thSep2014
30thSep2014
Mia liu
 

Similar to C11BD 22-23 data ana-Exploration II.pptx (20)

database management system
database management systemdatabase management system
database management system
 
AIML_UNIT 2 _PPT_HAND NOTES_MPS.pdf
AIML_UNIT 2 _PPT_HAND NOTES_MPS.pdfAIML_UNIT 2 _PPT_HAND NOTES_MPS.pdf
AIML_UNIT 2 _PPT_HAND NOTES_MPS.pdf
 
Steering Model Selection with Visual Diagnostics
Steering Model Selection with Visual DiagnosticsSteering Model Selection with Visual Diagnostics
Steering Model Selection with Visual Diagnostics
 
ML-Unit-4.pdf
ML-Unit-4.pdfML-Unit-4.pdf
ML-Unit-4.pdf
 
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
 
WIA 2019 - Steering Model Selection with Visual Diagnostics
WIA 2019 - Steering Model Selection with Visual DiagnosticsWIA 2019 - Steering Model Selection with Visual Diagnostics
WIA 2019 - Steering Model Selection with Visual Diagnostics
 
Unit 2_DBMS_10.2.22.pptx
Unit 2_DBMS_10.2.22.pptxUnit 2_DBMS_10.2.22.pptx
Unit 2_DBMS_10.2.22.pptx
 
M5.pptx
M5.pptxM5.pptx
M5.pptx
 
Introduction to image processing and pattern recognition
Introduction to image processing and pattern recognitionIntroduction to image processing and pattern recognition
Introduction to image processing and pattern recognition
 
30thSep2014
30thSep201430thSep2014
30thSep2014
 
Analysis Of Attribute Revelance
Analysis Of Attribute RevelanceAnalysis Of Attribute Revelance
Analysis Of Attribute Revelance
 
Relational database (Unit 2)
Relational database (Unit 2)Relational database (Unit 2)
Relational database (Unit 2)
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
Data Reduction
Data ReductionData Reduction
Data Reduction
 
laptop price prediction presentation
laptop price prediction presentationlaptop price prediction presentation
laptop price prediction presentation
 
Singular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptxSingular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptx
 
EDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptxEDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptx
 
Characterization and Comparison
Characterization and ComparisonCharacterization and Comparison
Characterization and Comparison
 
DATA MINING.pptx
DATA MINING.pptxDATA MINING.pptx
DATA MINING.pptx
 
6.SE_Requirements Modeling.ppt
6.SE_Requirements Modeling.ppt6.SE_Requirements Modeling.ppt
6.SE_Requirements Modeling.ppt
 

Recently uploaded

❤️Ratnagiri Call Girls 💯Call Us 🔝 7014168258 🔝 💃 Top Class Call Girl Service ...
❤️Ratnagiri Call Girls 💯Call Us 🔝 7014168258 🔝 💃 Top Class Call Girl Service ...❤️Ratnagiri Call Girls 💯Call Us 🔝 7014168258 🔝 💃 Top Class Call Girl Service ...
❤️Ratnagiri Call Girls 💯Call Us 🔝 7014168258 🔝 💃 Top Class Call Girl Service ...
Call Girls
 
👉Jalandhar Call Girl Service👉📞 98724-41143 👉📞 Just📲 NISHA -RANA-Call Girls In...
👉Jalandhar Call Girl Service👉📞 98724-41143 👉📞 Just📲 NISHA -RANA-Call Girls In...👉Jalandhar Call Girl Service👉📞 98724-41143 👉📞 Just📲 NISHA -RANA-Call Girls In...
👉Jalandhar Call Girl Service👉📞 98724-41143 👉📞 Just📲 NISHA -RANA-Call Girls In...
Rashmi Entertainment
 
Pathways to Equality: The Role of Men and Women in Gender Equity
Pathways to Equality:          The Role of Men and Women in Gender EquityPathways to Equality:          The Role of Men and Women in Gender Equity
Pathways to Equality: The Role of Men and Women in Gender Equity
Atharv Kurhade
 
❤️ Chandigarh Call Girls Service ☎️99158-51334☎️ Escort service in Chandigarh...
❤️ Chandigarh Call Girls Service ☎️99158-51334☎️ Escort service in Chandigarh...❤️ Chandigarh Call Girls Service ☎️99158-51334☎️ Escort service in Chandigarh...
❤️ Chandigarh Call Girls Service ☎️99158-51334☎️ Escort service in Chandigarh...
rajveerescorts2022
 
Obat Penggugur Kandungan Cytotec Dan Gastrul Harga Indomaret
Obat Penggugur Kandungan Cytotec Dan Gastrul Harga IndomaretObat Penggugur Kandungan Cytotec Dan Gastrul Harga Indomaret
Obat Penggugur Kandungan Cytotec Dan Gastrul Harga Indomaret
Cara Menggugurkan Kandungan 087776558899
 
Call Girls Service In Jalandhar💯Call Us 🔝 8146719683🔝 💃 Top Class ☎️ Call Gir...
Call Girls Service In Jalandhar💯Call Us 🔝 8146719683🔝 💃 Top Class ☎️ Call Gir...Call Girls Service In Jalandhar💯Call Us 🔝 8146719683🔝 💃 Top Class ☎️ Call Gir...
Call Girls Service In Jalandhar💯Call Us 🔝 8146719683🔝 💃 Top Class ☎️ Call Gir...
daljeetkaur2026
 
Goa Call Girls Service +9316020077 Call GirlsGoa By Russian Call Girlsin Goa
Goa Call Girls Service  +9316020077 Call GirlsGoa By Russian Call Girlsin GoaGoa Call Girls Service  +9316020077 Call GirlsGoa By Russian Call Girlsin Goa
Goa Call Girls Service +9316020077 Call GirlsGoa By Russian Call Girlsin Goa
Real Sex Provide In Goa
 
Independent Call Girl in 😋 Goa +9316020077 Goa Call Girl
Independent Call Girl in 😋 Goa  +9316020077 Goa Call GirlIndependent Call Girl in 😋 Goa  +9316020077 Goa Call Girl
Independent Call Girl in 😋 Goa +9316020077 Goa Call Girl
Real Sex Provide In Goa
 
Spauldings classification ppt by Dr C P PRINCE
Spauldings classification ppt by Dr C P PRINCESpauldings classification ppt by Dr C P PRINCE
Spauldings classification ppt by Dr C P PRINCE
DR.PRINCE C P
 
OBAT PENGGUGUR KANDUNGAN 081466799220 PIL ABORSI CYTOTEC PELUNTUR JANIN
OBAT PENGGUGUR KANDUNGAN 081466799220 PIL ABORSI CYTOTEC PELUNTUR JANINOBAT PENGGUGUR KANDUNGAN 081466799220 PIL ABORSI CYTOTEC PELUNTUR JANIN
OBAT PENGGUGUR KANDUNGAN 081466799220 PIL ABORSI CYTOTEC PELUNTUR JANIN
JUAL OBAT GASTRUL MISOPROSTOL 081466799220 PIL ABORSI CYTOTEC 1 2 3 4 5 6 7 BULAN TERPERCAYA
 
Cash Payment 😋 +9316020077 Goa Call Girl No Advance *Full Service
Cash Payment 😋  +9316020077 Goa Call Girl No Advance *Full ServiceCash Payment 😋  +9316020077 Goa Call Girl No Advance *Full Service
Cash Payment 😋 +9316020077 Goa Call Girl No Advance *Full Service
Real Sex Provide In Goa
 
❤️Jhansi Call Girls Service Just Call 🍑👄7014168258 🍑👄 Top Class Call Girl Ser...
❤️Jhansi Call Girls Service Just Call 🍑👄7014168258 🍑👄 Top Class Call Girl Ser...❤️Jhansi Call Girls Service Just Call 🍑👄7014168258 🍑👄 Top Class Call Girl Ser...
❤️Jhansi Call Girls Service Just Call 🍑👄7014168258 🍑👄 Top Class Call Girl Ser...
Call Girls
 

Recently uploaded (20)

Russian Call Girls Delhi 🧍🏼‍♀️🧍🏼‍♀️(91X0X0X912🧍🏼‍♀️🧍🏼‍♀️ Russian Call Girls S...
Russian Call Girls Delhi 🧍🏼‍♀️🧍🏼‍♀️(91X0X0X912🧍🏼‍♀️🧍🏼‍♀️ Russian Call Girls S...Russian Call Girls Delhi 🧍🏼‍♀️🧍🏼‍♀️(91X0X0X912🧍🏼‍♀️🧍🏼‍♀️ Russian Call Girls S...
Russian Call Girls Delhi 🧍🏼‍♀️🧍🏼‍♀️(91X0X0X912🧍🏼‍♀️🧍🏼‍♀️ Russian Call Girls S...
 
❤️Ratnagiri Call Girls 💯Call Us 🔝 7014168258 🔝 💃 Top Class Call Girl Service ...
❤️Ratnagiri Call Girls 💯Call Us 🔝 7014168258 🔝 💃 Top Class Call Girl Service ...❤️Ratnagiri Call Girls 💯Call Us 🔝 7014168258 🔝 💃 Top Class Call Girl Service ...
❤️Ratnagiri Call Girls 💯Call Us 🔝 7014168258 🔝 💃 Top Class Call Girl Service ...
 
👉Jalandhar Call Girl Service👉📞 98724-41143 👉📞 Just📲 NISHA -RANA-Call Girls In...
👉Jalandhar Call Girl Service👉📞 98724-41143 👉📞 Just📲 NISHA -RANA-Call Girls In...👉Jalandhar Call Girl Service👉📞 98724-41143 👉📞 Just📲 NISHA -RANA-Call Girls In...
👉Jalandhar Call Girl Service👉📞 98724-41143 👉📞 Just📲 NISHA -RANA-Call Girls In...
 
Coach Dan Quinn Commanders Feather T Shirts
Coach Dan Quinn Commanders Feather T ShirtsCoach Dan Quinn Commanders Feather T Shirts
Coach Dan Quinn Commanders Feather T Shirts
 
Pathways to Equality: The Role of Men and Women in Gender Equity
Pathways to Equality:          The Role of Men and Women in Gender EquityPathways to Equality:          The Role of Men and Women in Gender Equity
Pathways to Equality: The Role of Men and Women in Gender Equity
 
👉 Srinagar Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Girl S...
👉 Srinagar Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Girl S...👉 Srinagar Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Girl S...
👉 Srinagar Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Girl S...
 
❤️ Chandigarh Call Girls Service ☎️99158-51334☎️ Escort service in Chandigarh...
❤️ Chandigarh Call Girls Service ☎️99158-51334☎️ Escort service in Chandigarh...❤️ Chandigarh Call Girls Service ☎️99158-51334☎️ Escort service in Chandigarh...
❤️ Chandigarh Call Girls Service ☎️99158-51334☎️ Escort service in Chandigarh...
 
Obat Penggugur Kandungan Cytotec Dan Gastrul Harga Indomaret
Obat Penggugur Kandungan Cytotec Dan Gastrul Harga IndomaretObat Penggugur Kandungan Cytotec Dan Gastrul Harga Indomaret
Obat Penggugur Kandungan Cytotec Dan Gastrul Harga Indomaret
 
Call Girls Service In Jalandhar💯Call Us 🔝 8146719683🔝 💃 Top Class ☎️ Call Gir...
Call Girls Service In Jalandhar💯Call Us 🔝 8146719683🔝 💃 Top Class ☎️ Call Gir...Call Girls Service In Jalandhar💯Call Us 🔝 8146719683🔝 💃 Top Class ☎️ Call Gir...
Call Girls Service In Jalandhar💯Call Us 🔝 8146719683🔝 💃 Top Class ☎️ Call Gir...
 
CALCIUM - ELECTROLYTE IMBALANCE (HYPERCALCEMIA & HYPOCALCEMIA).pdf
CALCIUM - ELECTROLYTE IMBALANCE (HYPERCALCEMIA & HYPOCALCEMIA).pdfCALCIUM - ELECTROLYTE IMBALANCE (HYPERCALCEMIA & HYPOCALCEMIA).pdf
CALCIUM - ELECTROLYTE IMBALANCE (HYPERCALCEMIA & HYPOCALCEMIA).pdf
 
Goa Call Girls Service +9316020077 Call GirlsGoa By Russian Call Girlsin Goa
Goa Call Girls Service  +9316020077 Call GirlsGoa By Russian Call Girlsin GoaGoa Call Girls Service  +9316020077 Call GirlsGoa By Russian Call Girlsin Goa
Goa Call Girls Service +9316020077 Call GirlsGoa By Russian Call Girlsin Goa
 
Independent Call Girl in 😋 Goa +9316020077 Goa Call Girl
Independent Call Girl in 😋 Goa  +9316020077 Goa Call GirlIndependent Call Girl in 😋 Goa  +9316020077 Goa Call Girl
Independent Call Girl in 😋 Goa +9316020077 Goa Call Girl
 
Spauldings classification ppt by Dr C P PRINCE
Spauldings classification ppt by Dr C P PRINCESpauldings classification ppt by Dr C P PRINCE
Spauldings classification ppt by Dr C P PRINCE
 
RESPIRATORY ALKALOSIS & RESPIRATORY ACIDOSIS.pdf
RESPIRATORY ALKALOSIS & RESPIRATORY ACIDOSIS.pdfRESPIRATORY ALKALOSIS & RESPIRATORY ACIDOSIS.pdf
RESPIRATORY ALKALOSIS & RESPIRATORY ACIDOSIS.pdf
 
OBAT PENGGUGUR KANDUNGAN 081466799220 PIL ABORSI CYTOTEC PELUNTUR JANIN
OBAT PENGGUGUR KANDUNGAN 081466799220 PIL ABORSI CYTOTEC PELUNTUR JANINOBAT PENGGUGUR KANDUNGAN 081466799220 PIL ABORSI CYTOTEC PELUNTUR JANIN
OBAT PENGGUGUR KANDUNGAN 081466799220 PIL ABORSI CYTOTEC PELUNTUR JANIN
 
Cash Payment 😋 +9316020077 Goa Call Girl No Advance *Full Service
Cash Payment 😋  +9316020077 Goa Call Girl No Advance *Full ServiceCash Payment 😋  +9316020077 Goa Call Girl No Advance *Full Service
Cash Payment 😋 +9316020077 Goa Call Girl No Advance *Full Service
 
❤️Jhansi Call Girls Service Just Call 🍑👄7014168258 🍑👄 Top Class Call Girl Ser...
❤️Jhansi Call Girls Service Just Call 🍑👄7014168258 🍑👄 Top Class Call Girl Ser...❤️Jhansi Call Girls Service Just Call 🍑👄7014168258 🍑👄 Top Class Call Girl Ser...
❤️Jhansi Call Girls Service Just Call 🍑👄7014168258 🍑👄 Top Class Call Girl Ser...
 
TEST BANK For Robbins & Kumar Basic Pathology, 11th Edition by Vinay Kumar, A...
TEST BANK For Robbins & Kumar Basic Pathology, 11th Edition by Vinay Kumar, A...TEST BANK For Robbins & Kumar Basic Pathology, 11th Edition by Vinay Kumar, A...
TEST BANK For Robbins & Kumar Basic Pathology, 11th Edition by Vinay Kumar, A...
 
Test bank for community public health nursing evidence for practice 4TH editi...
Test bank for community public health nursing evidence for practice 4TH editi...Test bank for community public health nursing evidence for practice 4TH editi...
Test bank for community public health nursing evidence for practice 4TH editi...
 
The Events of Cardiac Cycle - Wigger's Diagram
The Events of Cardiac Cycle - Wigger's DiagramThe Events of Cardiac Cycle - Wigger's Diagram
The Events of Cardiac Cycle - Wigger's Diagram
 

C11BD 22-23 data ana-Exploration II.pptx

  • 1. C11BD Big Data Analytics Data Exploration II Dr Dhanan Utomo Assistant Professor in Business & Management Edinburgh Business School
  • 2. Outline • Quick recap • Feature selection • Introduction to advanced data exploration • Visualising relationships between features • Covariance and correlation 2
  • 3. Learning Outcomes • Understand how to indicate which descriptive features may be useful for predicting a target feature 3
  • 4. Quick recap Where are we in the course: Assessments: Week 5: Coursework 1 is due on Thursday, 15th February Week 5: Coursework 2 will be introduced Weeks Topic Lecturer: Dubai Lecturer: Edinburgh 3 Data Exploration I Dr Tarek Kandil Dr Paulus Aditjandra 4 Data Exploration II Dr Tarek Kandil Dr Dhanan Utomo 5 Intro to modelling and machine learning Dr Tarek Kandil Dr Dhanan Utomo 4
  • 5. Recap on Coursework 1 • Individual assignment count towards 40% of your course mark • Refer to the instructions on Canvas 5
  • 6. Feature Selection • Designing features • Feature selection methods 6
  • 7. Designing Features A feature is any measurable input that can be used in a predictive model (e.g. salary or bank balance in predicting loan default) 7
  • 8. Designing Features A feature is any measurable input that can be used in a predictive model (e.g. salary or bank balance in predicting loan default) Three key data considerations are particularly important when we are designing features: 1. Data availability • Are values for that feature available in the database? • For a derived feature, are the values available to compute the derived feature? 2. Timing • When will data become available for a feature? 3. Longevity • Data may become stale, e.g. salary may differ from that at time of making a loan application Feature design and implementation is an iterative process 8
  • 9. Designing Features The features in an ABT (analytics base table) can be either • Raw features: Concrete features, direct measurable features, stored in the database • Derived features: Calculated from raw features, e.g. body mass index, calculated as a ratio mass versus height 9
  • 10. Designing Features The features in an ABT can be either • Raw features: Concrete features, direct measurable features, stored in the database • Derived features: Calculated from raw features, e.g. body mass index, calculated as a ratio mass versus height Feature engineering is the process of selecting, manipulating and transforming raw data into derived features. 10
  • 11. Designing Features The features in an ABT can be either • Raw features: Concrete features, direct measurable features, stored in the database • Derived features: Calculated from raw features, e.g. body mass index, calculated as a ratio mass versus height Feature engineering is the process of selecting, manipulating and transforming raw data into derived features. 11
  • 12. Features Selection Why feature selection? Less features result in • Simpler models • Shorter training times • Less overfitting, therefore better generalization 12
  • 13. Features Selection Why feature selection? Less features result in • Simpler models • Shorter training times • Less overfitting, therefore better generalization Too many versus too few features • Too few features may result in more false positives/negatives (underfitting) • Too few features do not have sufficient discriminative power • Too many result in noise in the training data, and potentially overfitting, in addition to increases computational complexity 13
  • 14. Features Selection Types of features • Informative – features correlated with the output variables/targets • Features with unique values or very small deviation – should be removed • Redundant features – have correlations with other features • Irrelevant features – noise, have no correlation with output variables 14
  • 15. Features Selection What is feature selection? • The process of selecting a subset of the most relevant features for use in model construction • Remove features without loss of information • Keep the features that describe the variance in a dataset Approaches to feature selection • Filter methods • Wrapper methods • Embedded methods 15
  • 18. Introduction to Advanced Data Exploration
  • 19. Advanced Data Exploration In data exploration, we looked at descriptive statistics and data visualisation techniques of the characteristics of individual features In advanced data exploration, techniques can be considered to enable the examination and analysis of relationships between pairs of features, in order to assist in • Indicating which descriptive features might be useful for predicting a target feature • Finding pairs of descriptive features that are closely related 22
  • 20. Advanced Data Exploration Case study: The details of a professional basketball team 23
  • 21. Visualising Relationships Between Features • Continuous features • Categorical features • Categorical vs continuous features
  • 22. Visualising Relationships Between Features: Continuous Features For visualising pairs of continuous features, use a scatter plot • A scatter plot is based on two axes: The horizontal axis represents one feature and the vertical axis represents a second • Each instance in a dataset is represented by a point on the plot determined by the values for that instance of the two features involved 25
  • 23. Visualising Relationships Between Features: Continuous Features A scatter plot matrix (SPLOM) shows scatter plots for a whole collection of features arranged into a matrix This is useful for exploring the relationships between groups of features (e.g. all of the continuous features in an ABT) Effectiveness of scatter plot matrices diminishes once the number of features in the set goes beyond eight because the graphs become too small 26
  • 24. Visualising Relationships Between Features: Continuous Features 27
  • 25. Visualising Relationships Between Features: Categorical Features For visualising pairs of categorical features, use a collection of small multiple bar plots (small multiples visualisation) 1. Draw a simple bar plot indicating the densities of the different levels of the first feature 2. For each level of the second feature, draw a bar plot of the first feature using only the instances in the dataset for which the second feature has that level 28
  • 26. Visualising Relationships Between Features: Categorical Features For visualising pairs of categorical features, use a collection of small multiple bar plots (small multiples visualisation) 1. Draw a simple bar plot indicating the densities of the different levels of the first feature 2. For each level of the second feature, draw a bar plot of the first feature using only the instances in the dataset for which the second feature has that level. If the two features have a strong relationship, the bar plots for each level of the second feature will look noticeably different from one another and from the overall bar plot for the first feature. If there is no relationship, we expect the levels of the first feature to appear in roughly the same proportions among the instances at each level of the second feature, so all bar plots will look much the same
  • 27. Visualising Relationships Between Features: Categorical Features
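The two steps above can be sketched numerically before any plotting. This is a minimal illustration, assuming two parallel lists of category labels; the function names and the toy values are hypothetical and are not taken from the case-study dataset.

```python
from collections import Counter

def level_densities(values):
    """Return the share of instances at each level (shares sum to 1)."""
    counts = Counter(values)
    total = len(values)
    return {level: count / total for level, count in counts.items()}

def small_multiples(first, second):
    """Densities of `first` overall, then within each level of `second`."""
    panels = {"all": level_densities(first)}
    for level in sorted(set(second)):
        subset = [f for f, s in zip(first, second) if s == level]
        panels[level] = level_densities(subset)
    return panels

# Hypothetical toy data (not the case-study values).
position = ["guard", "guard", "guard", "forward",
            "forward", "center", "center", "center"]
shoe_sponsor = ["yes", "yes", "yes", "no", "yes", "no", "no", "no"]

panels = small_multiples(position, shoe_sponsor)
for name, densities in panels.items():
    print(name, densities)
```

Each entry of `panels` holds the bar heights for one small chart. Using densities rather than raw counts keeps the charts comparable even though the subsets have different sizes.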
  • 28. Visualising Relationships Between Features: Categorical Features For visualising pairs of categorical features where one of the features being compared has no more than three levels, stacked bar plots can be used as an alternative to small multiples 1. Show a bar plot of the first feature above another bar plot that shows the relative distribution of the levels of the second feature within each level of the first 2. Because relative distributions are used, the bars in the second bar plot cover the full range of the space available. If two features are unrelated, we expect to see the same proportion of each level of the second feature within the bars for each level of the first
  • 29. Visualising Relationships Between Features: Categorical Features
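The relative distributions behind a 100% stacked bar plot can be computed as below. This is a sketch under assumptions: `relative_distribution` is a hypothetical helper, and the toy data is invented so that the two features are unrelated.

```python
from collections import Counter, defaultdict

def relative_distribution(first, second):
    """For each level of `first`, the proportion of each level of `second`."""
    grouped = defaultdict(Counter)
    for f, s in zip(first, second):
        grouped[f][s] += 1
    return {
        f: {s: n / sum(counts.values()) for s, n in counts.items()}
        for f, counts in grouped.items()
    }

# Hypothetical toy data: sponsorship is split evenly at every career stage.
career_stage = ["rookie"] * 4 + ["mid-career"] * 4 + ["veteran"] * 4
shoe_sponsor = ["yes", "yes", "no", "no"] * 3

stacks = relative_distribution(career_stage, shoe_sponsor)
for stage, shares in stacks.items():
    print(stage, shares)
```

Here every bar splits in the same proportions, which is the pattern expected for unrelated features. Visibly different splits across the bars would suggest a relationship.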
  • 30. Visualising Relationships Between Features: Categorical vs Continuous Features For visualising pairs of categorical and continuous features, use a small multiples approach that draws a histogram of the values of the continuous feature for each level of the categorical feature. Each histogram includes only those instances in the dataset that have the associated level of the categorical feature. If the features are unrelated (or independent), the histograms for each level should be very similar. If the features are related, however, the shapes and/or the central tendencies of the histograms will be different
  • 31. Visualising Relationships Between Features: Categorical vs Continuous Features
  • 32. Visualising Relationships Between Features: Categorical vs Continuous Features
  • 33. Visualising Relationships Between Features: Categorical vs Continuous Features Another approach to visualising the relationship between a categorical feature and a continuous feature is to use a collection of box plots. For each level of the categorical feature, a box plot of the corresponding values of the continuous feature is drawn. When a relationship exists between the two features, the box plots should show differing central tendencies and variations. When no relationship exists, the box plots should all appear similar
  • 34. Visualising Relationships Between Features: Categorical vs Continuous Features
  • 35. Visualising Relationships Between Features: Categorical vs Continuous Features
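The numbers a box plot draws for each level can be computed directly. This is a minimal sketch: `box_stats_by_level` is a hypothetical helper and the heights are invented toy values. It uses `statistics.quantiles` with the "inclusive" method, which is only one of several quartile conventions, so plotting libraries may place the box edges slightly differently.

```python
from collections import defaultdict
from statistics import quantiles

def box_stats_by_level(categories, values):
    """Return (Q1, median, Q3) of `values` for each level of `categories`."""
    groups = defaultdict(list)
    for level, value in zip(categories, values):
        groups[level].append(value)
    stats = {}
    for level, group in groups.items():
        q1, median, q3 = quantiles(group, n=4, method="inclusive")
        stats[level] = (q1, median, q3)
    return stats

# Hypothetical toy heights in cm (not the case-study values).
position = ["guard"] * 5 + ["forward"] * 5 + ["center"] * 5
height = [183, 185, 188, 190, 191,
          193, 196, 198, 200, 203,
          205, 208, 210, 211, 213]

stats = box_stats_by_level(position, height)
for level, (q1, median, q3) in stats.items():
    print(f"{level}: Q1={q1}, median={median}, Q3={q3}")
```

In this toy data the boxes do not overlap at all (each Q3 sits below the next level's Q1), which is the kind of separation that suggests a strong relationship.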
  • 36. Covariance and Correlation • Measuring covariance and correlation • Covariance and correlation matrix
  • 37. Measuring Covariance and Correlation In addition to visually inspecting scatter plots, formal measures of the relationship between two continuous features can be calculated using covariance and correlation. Covariance values fall into the range (−∞, ∞) • Negative values indicate a negative relationship • Positive values indicate a positive relationship • Values near zero indicate that there is little or no linear relationship between the features
  • 38. Measuring Covariance and Correlation Calculating the covariance between the HEIGHT feature and the WEIGHT and AGE features in the case study • cov(HEIGHT, WEIGHT) = 241.72 indicates a strong positive relationship between the height and weight of a player • cov(HEIGHT, AGE) = 19.7 indicates a much smaller positive relationship between height and age. A problem with covariance is that its value depends on the units of the features it measures (it is expressed in the product of those units). Comparing the covariance between pairs of features only makes sense if each pair of features is composed of the same mixture of units
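The unit problem can be seen in a short sketch. It assumes the sample covariance with an n − 1 divisor (the slides do not state which divisor is used), and the heights and weights are invented toy values.

```python
def covariance(a, b):
    """Sample covariance of two equal-length sequences (n - 1 divisor, assumed)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

# Hypothetical toy data (not the case-study values).
height_cm = [180, 190, 200, 210]
weight_kg = [75, 85, 95, 110]
height_m = [h / 100 for h in height_cm]

cov_cm = covariance(height_cm, weight_kg)  # units: cm * kg
cov_m = covariance(height_m, weight_kg)    # units: m * kg

print(cov_cm, cov_m)
```

The relationship between the two features is identical in both cases, yet the covariance shrinks by a factor of 100 when height is expressed in metres. This is why covariances for pairs with different unit mixtures cannot be compared directly.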
  • 39. Measuring Covariance and Correlation Correlation is a normalised form of covariance that ranges between −1 and 1: $\mathrm{corr}(a, b) = \dfrac{\mathrm{cov}(a, b)}{\sigma_a \sigma_b}$, where $\mathrm{cov}(a, b)$ is the covariance between features $a$ and $b$, and $\sigma_a$ and $\sigma_b$ are the standard deviations of $a$ and $b$, respectively • Correlation is dimensionless • Values close to −1 indicate a very strong negative correlation • Values close to 1 indicate a very strong positive correlation • Values around 0 indicate little or no linear relationship (this does not by itself mean the features are independent)
  • 40. Measuring Covariance and Correlation Measuring the covariance and correlation between the HEIGHT feature and the WEIGHT and AGE features in the case study • cov(HEIGHT, WEIGHT) = 241.72 • cov(HEIGHT, AGE) = 19.7 • corr(HEIGHT, WEIGHT) = 0.898 • corr(HEIGHT, AGE) = 0.345
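The formula can be sketched directly in code. This is an illustration under assumptions: it uses sample statistics (n − 1 divisor) for both the covariance and the standard deviations, and the data are invented toy values.

```python
from statistics import stdev

def covariance(a, b):
    """Sample covariance (n - 1 divisor, assumed)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

def correlation(a, b):
    """corr(a, b) = cov(a, b) / (sigma_a * sigma_b)."""
    return covariance(a, b) / (stdev(a) * stdev(b))

# Hypothetical toy data (not the case-study values).
height_cm = [180, 190, 200, 210]
weight_kg = [75, 85, 95, 110]
height_m = [h / 100 for h in height_cm]

corr_cm = correlation(height_cm, weight_kg)
corr_m = correlation(height_m, weight_kg)
print(corr_cm, corr_m)
```

Dividing by the two standard deviations cancels the units, so the value is the same whether height is recorded in centimetres or metres and always lies between −1 and 1.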
  • 41. Covariance and Correlation Matrix There are typically multiple continuous features between which we would like to explore relationships. Two tools that can be useful for this are the covariance matrix and the correlation matrix. The scatter plot matrix (SPLOM) is a visualisation of the correlation matrix. This can be made more obvious by including the correlation coefficients in the cells above the diagonal of the SPLOM
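Both matrices can be built from the pairwise covariance alone. This is a minimal sketch using nested dictionaries keyed by feature name; the helper names and the toy values are hypothetical, and the n − 1 divisor is an assumption.

```python
from math import sqrt

def covariance(a, b):
    """Sample covariance (n - 1 divisor, assumed)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

def covariance_matrix(features):
    """One row and one column per feature; the diagonal holds the variances."""
    names = list(features)
    return {r: {c: covariance(features[r], features[c]) for c in names}
            for r in names}

def correlation_matrix(features):
    """Normalise each covariance by the two features' standard deviations."""
    cov = covariance_matrix(features)
    return {r: {c: cov[r][c] / sqrt(cov[r][r] * cov[c][c]) for c in cov}
            for r in cov}

# Hypothetical toy data (not the case-study values).
features = {
    "HEIGHT": [180, 190, 200, 210],
    "WEIGHT": [75, 85, 95, 110],
    "AGE": [30, 22, 27, 25],
}

cov = covariance_matrix(features)
corr = correlation_matrix(features)
for name, row in corr.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

Both matrices are symmetric, and the diagonal of the correlation matrix is always 1 because each feature is perfectly correlated with itself.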
  • 43. Covariance and Correlation Discussion Correlation is a good measure of the relationship between two continuous features, but it is not by any means perfect • The correlation measure responds only to linear relationships between features • Peculiarities in a dataset can affect the calculation of the correlation between two features, illustrated very clearly in the famous example of Anscombe's quartet by Francis Anscombe
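The first limitation is easy to demonstrate with a constructed example. In this sketch the second feature is an exact function of the first (y = x²), yet the correlation is zero because the relationship is not linear. The data are synthetic and chosen only to make the point.

```python
from math import sqrt

def correlation(a, b):
    """Pearson correlation computed from sums of deviations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / sqrt(saa * sbb)

# Synthetic data: y is fully determined by x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

r = correlation(x, y)
print(r)
```

A correlation near zero therefore rules out only a linear relationship. A scatter plot of these points would immediately show the curve.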
  • 44. Covariance and Correlation Discussion Anscombe's quartet: A series of four pairs of features that all have the same correlation value of 0.816, even though they exhibit very different relationships
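The claim can be checked numerically. The sketch below uses the quartet's values as commonly published (they are not listed on the slides, so treat the listing as an assumption to verify against a trusted copy) and computes the Pearson correlation for each pair.

```python
from math import sqrt

def correlation(a, b):
    """Pearson correlation computed from sums of deviations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / sqrt(saa * sbb)

# Anscombe's quartet as commonly published; sets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {name: correlation(x, y) for name, (x, y) in quartet.items()}
for name, r in correlations.items():
    print(name, round(r, 3))
```

The four coefficients agree to about three decimal places, so the summary statistic alone cannot distinguish the four shapes. Only a plot reveals the curve in set II and the outliers in sets III and IV.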
  • 45. Covariance and Correlation Discussion Perhaps the most important thing to remember about correlation is that correlation does not necessarily imply causation. Just because the values of two features are correlated does not mean that an actual causal relationship exists between the two. Based on correlation tests alone, one could conclude that, e.g., the presence of swallows causes hot weather; in fact, swallows migrate to warmer countries
  • 46. Textbook References Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies by JD Kelleher, B Mac Namee and A D’Arcy (2015) • Designing and implementing features (pp. 77-91) • Visualising relationships between features (pp. 127-135) • Measuring covariance and correlation (pp. 136-140)
  • 47. Tutorial Please prepare for the tutorial session and go through the tasks as specified on Canvas. • Dubai on-campus session at 18:00 (Dubai) on Wednesday • Edinburgh on-campus session at 09:00 (GMT) on Friday

Editor's Notes

  1. Refer to Chapter 3 of the prescribed textbook by Kelleher et al.
  2. MENTIMETER
  3. PDF & Word
  4. Simple example of prices of properties in a city showing the total area and the total price of the house
  5. Simple example of prices of properties in a city showing the total area and the total price of the house
  6. Add a new column and calculate the cost per square foot (derived feature)
  7. E.g. house prices in a city. Informative: e.g. square footage, council bracket, distance to nearest school. Unique values: e.g. Is it in a council? Redundant: e.g. council tax/month relates to council bracket. Irrelevant: e.g. odd or even house number.
  8. Filter methods: perform various statistical tests between feature and response (target) to identify which features are more relevant than others. Wrapper methods: add/remove features to a baseline model and compare the performance of the model; use an optimization algorithm to search for the optimal feature set, with the model's performance as the objective function. Embedded methods: the algorithm has its own built-in feature selection methods.
  9. POSITION that the player normally plays (guard, center, or forward); CAREER STAGE of the player (rookie, mid-career, or veteran); average weekly SPONSORSHIP EARNINGS of each player; whether the player has a SHOE SPONSOR (yes or no).
  10. Fig (a) shows an example scatter plot for the HEIGHT and WEIGHT features from the professional basketball team dataset. There is a broadly linear pattern running diagonally across the scatter plot. This suggests a strong, positive, linear relationship between the HEIGHT and WEIGHT features: as height increases, so does weight. We say that features with this kind of relationship are positively covariant. Fig (b) shows a scatter plot for the SPONSORSHIP EARNINGS and AGE features. Here the opposite pattern appears: sponsorship earnings increase as age decreases. These features are therefore strongly negatively covariant. Fig (c) shows a scatter plot of the HEIGHT and AGE features. There is clearly no linear pattern, so these features are not strongly covariant either positively or negatively.
  11. Each row and column represent the feature named in the cells along the diagonal. The cells above and below the diagonal show scatter plots of the features in the row and column that meet at that cell.
  12. FIRST SET: The bar plot on the left shows the distribution of the different levels of the CAREER STAGE feature across the entire dataset.  The two plots on the right show the distributions for those players with and without a shoe sponsor.  Since all three plots show very similar distributions, we can conclude that no real relationship exists between these two features and that players of any career stage are equally likely to have a shoe sponsor or not. SECOND SET: In this case, the three plots are very different, so we can conclude that there is a relationship between these two features. It seems that players who play in the guard position are much more likely to have a shoe sponsor than forwards or centers. When using small multiples, it is important that all the small charts are kept consistent because this ensures that only genuine differences within the data are highlighted, rather than differences that arise from formatting. For example, the scales of the axes must always be kept consistent, as should the order of the bars in the individual bar plots.  It is also important that densities are shown rather than frequencies as the overall bar plots on the left of each visualization cover much more of the dataset than the other two plots, so frequency-based plots would look very uneven.
  13. These are two examples of stacked bar plots. In figure (a) on the left, a bar plot of the CAREER STAGE feature is shown above a 100% stacked bar plot showing how the levels of the SHOE SPONSOR feature are distributed in instances having each level of CAREER STAGE. The distributions of the levels of SHOE SPONSOR are almost the same for each level of CAREER STAGE, and therefore we can conclude that there is no relationship between these two features. In figure (b) on the right, the POSITION and SHOE SPONSOR features are shown. In this case we can see that the distributions of the levels of the SHOE SPONSOR feature are not the same for each position. From this we can again conclude that guards are more likely to have a shoe sponsor than players in the other positions.
  14. AGE follows a uniform distribution across a range from about 19 to about 35. These histograms show a slight tendency for centers to be a little older than guards and forwards, but the relationship does not appear very strong, as each of the smaller histograms is similar to the overall uniform distribution of the AGE feature.
  15. HEIGHT follows a normal distribution centered around a mean of approximately 194. The three smaller histograms depart from this distribution and suggest that centers tend to be taller than forwards, who in turn tend to be taller than guards.
  16. Fig (a) shows a box plot for AGE across the full dataset, while Fig (b) shows individual box plots for AGE for each level of the POSITION feature. This visualization shows a slight indication that centers tend to be older than forwards and guards, but the three box plots overlap significantly, suggesting that this relationship is not very strong.
  17. In Fig (a), the box plot for the HEIGHT feature across the entire dataset is plotted, while Fig (b) shows the individual height box plots for each level of the POSITION feature. Fig (b) is typical of a series of box plots showing a strong relationship between a continuous and a categorical feature. We can see that the average height of centers is above that of forwards, which in turn is above that of guards. Although the whiskers show that there is some overlap between the three groups, they do appear to be well separated.
  18. A covariance matrix contains a row and column for each feature, and each element of the matrix lists the covariance between the corresponding pair of features. As a result, the elements along the main diagonal list the covariance between a feature and itself, in other words, the variance of the feature. Use the PerformanceAnalytics package in R to generate the SPLOMs, which will include the correlation coefficient between each pair of continuous features in the top half of the matrix.
  19. In this figure, the cells above the diagonal show the correlation coefficients for each pair of features. The font sizes of the correlation coefficients are scaled according to the absolute value of the strength of the correlation to draw attention to those pairs of features with the strongest relationships.
  20. Francis Anscombe (English statistician). Curvature (top right) and outliers (bottom) drastically throw off the summary statistics. This illustrates how important it is to always plot the data rather than rely on summary statistics alone.