1. C11BD Big Data Analytics
Data Exploration II
Dr Dhanan Utomo
Assistant Professor in Business & Management
Edinburgh Business School
2. Outline
• Quick recap
• Feature selection
• Introduction to advanced data exploration
• Visualising relationships between features
• Covariance and correlation
3. Learning Outcomes
• Understand how to indicate which descriptive features may be useful
for predicting a target feature
4. Quick recap
Where are we in the course:
Assessments:
Week 5: Coursework 1 is due on Thursday, 15th February
Week 5: Coursework 2 will be introduced
Weeks | Topic | Lecturer: Dubai | Lecturer: Edinburgh
3 | Data Exploration I | Dr Tarek Kandil | Dr Paulus Aditjandra
4 | Data Exploration II | Dr Tarek Kandil | Dr Dhanan Utomo
5 | Intro to modelling and machine learning | Dr Tarek Kandil | Dr Dhanan Utomo
5. Recap on Coursework 1
• The individual assignment counts towards 40% of your course mark
• Refer to the instructions on Canvas
7. Designing Features
A feature is any measurable input that can be used in a predictive
model (e.g. salary or bank balance in predicting loan default)
8. Designing Features
A feature is any measurable input that can be used in a predictive
model (e.g. salary or bank balance in predicting loan default)
Three key data considerations are particularly important when we are
designing features:
1. Data availability
• Are values for that feature available in the database?
• For a derived feature, are the values available to compute the derived
feature?
2. Timing
• When will data become available for a feature?
3. Longevity
• Data may become stale, e.g. salary may differ from that at time of making a
loan application
Feature design and implementation is an iterative process
9. Designing Features
The features in an ABT (analytics base table) can be either
• Raw features: Concrete, directly measurable features stored
in the database
• Derived features: Calculated from raw features, e.g. body mass index,
calculated as the ratio of mass to height squared
10. Designing Features
The features in an ABT can be either
• Raw features: Concrete, directly measurable features stored
in the database
• Derived features: Calculated from raw features, e.g. body mass index,
calculated as the ratio of mass to height squared
Feature engineering is the process of selecting, manipulating and
transforming raw data into derived features.
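The BMI example can be sketched as a derived feature in a pandas DataFrame (a minimal sketch; the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical raw features for three people.
df = pd.DataFrame({
    "mass_kg": [70.0, 85.0, 60.0],
    "height_m": [1.75, 1.80, 1.65],
})

# Derived feature: body mass index = mass / height squared.
df["bmi"] = df["mass_kg"] / df["height_m"] ** 2
```

The derived column lives in the ABT alongside the raw features it was computed from, which is why data availability for the raw inputs matters.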
12. Feature Selection
Why feature selection? Fewer features result in
• Simpler models
• Shorter training times
• Less overfitting, therefore better generalization
12
13. Feature Selection
Why feature selection? Fewer features result in
• Simpler models
• Shorter training times
• Less overfitting, therefore better generalization
Too many versus too few features
• Too few features may result in more false positives/negatives
(underfitting)
• Too few features do not have sufficient discriminative power
• Too many features add noise to the training data and can lead to
overfitting, in addition to increased computational complexity
14. Feature Selection
Types of features
• Informative – features correlated with the output variables/targets
• Features with unique values or very small deviation – should be
removed
• Redundant features – strongly correlated with other features
• Irrelevant features – noise, have no correlation with output variables
15. Feature Selection
What is feature selection?
• The process of selecting a subset of the most relevant features for
use in model construction
• Remove features without loss of information
• Keep the features that describe the variance in a dataset
Approaches to feature selection
• Filter methods
• Wrapper methods
• Embedded methods
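As a sketch of the filter approach only, each feature can be scored by the absolute value of its correlation with the target, keeping the top-ranked ones (the synthetic data and feature names below are illustrative; wrapper and embedded methods would instead query a model's performance):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic features: one informative, one redundant copy, one pure noise.
informative = rng.normal(size=n)
redundant = informative + rng.normal(scale=0.1, size=n)  # correlated with 'informative'
noise = rng.normal(size=n)                               # irrelevant
target = 2.0 * informative + rng.normal(scale=0.5, size=n)

X = np.column_stack([informative, redundant, noise])
names = ["informative", "redundant", "noise"]

# Filter method: statistical score per feature, independent of any model.
scores = np.array([abs(np.corrcoef(X[:, j], target)[0, 1])
                   for j in range(X.shape[1])])
ranking = sorted(zip(names, scores), key=lambda p: -p[1])
```

Note that this filter scores the redundant copy almost as highly as the informative feature, which is why redundancy between features needs a separate check (e.g. pairwise correlations).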
19. Advanced Data Exploration
In data exploration, we looked at descriptive statistics and data
visualisation techniques of the characteristics of individual features
In advanced data exploration, techniques can be considered to enable
the examination and analysis of relationships between pairs of
features, in order to assist in
• Indicating which descriptive features might be useful for predicting a
target feature
• Finding pairs of descriptive features that are closely related
22. Visualising Relationships Between Features:
Continuous Features
For visualising pairs of continuous features, use a scatter plot
• A scatter plot is based on two axes: The horizontal axis represents
one feature and the vertical axis represents a second
• Each instance in a dataset is represented by a point on the plot
determined by the values for that instance of the two features
involved
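A scatter plot of this kind can be sketched as follows (hypothetical height/weight values; plotting is attempted only if matplotlib is installed, since the numbers alone already show the relationship):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical player data: weight broadly increases with height.
height = rng.normal(194, 10, size=50)
weight = 0.9 * height - 80 + rng.normal(0, 5, size=50)

try:
    import matplotlib
    matplotlib.use("Agg")  # render off-screen
    import matplotlib.pyplot as plt

    # One point per instance, positioned by its two feature values.
    fig, ax = plt.subplots()
    ax.scatter(height, weight)
    ax.set_xlabel("HEIGHT (cm)")
    ax.set_ylabel("WEIGHT (kg)")
    fig.savefig("height_vs_weight.png")
except ImportError:
    pass  # matplotlib not installed; skip the drawing
```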
23. Visualising Relationships Between Features:
Continuous Features
A scatter plot matrix (SPLOM) shows scatter plots for a whole
collection of features arranged into a matrix
This is useful for exploring the relationships between groups of
features (e.g. all of the continuous features in an ABT)
The effectiveness of scatter plot matrices diminishes once the number
of features in the set goes beyond eight, because the individual plots
become too small
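pandas ships a SPLOM helper, `pandas.plotting.scatter_matrix` (a sketch with synthetic features; the drawing itself needs matplotlib, so it is guarded here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100

# Synthetic continuous features from a hypothetical ABT.
df = pd.DataFrame({
    "height": rng.normal(194, 10, n),
    "age": rng.integers(19, 36, n).astype(float),
})
df["weight"] = 0.9 * df["height"] - 80 + rng.normal(0, 5, n)

try:
    import matplotlib
    matplotlib.use("Agg")
    from pandas.plotting import scatter_matrix

    # One scatter plot per pair of features, histograms on the diagonal.
    axes = scatter_matrix(df, figsize=(6, 6))
except ImportError:
    pass
```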
25. Visualising Relationships Between Features:
Categorical Features
For visualising pairs of categorical features, use a collection of small
multiple bar plots (small multiples visualisation)
1. Draw a simple bar plot indicating the densities of the different
levels of the first feature
2. For each level of the second feature, draw a bar plot of the first
feature using only the instances in the dataset for which the
second feature has that level
26. Visualising Relationships Between Features:
Categorical Features
For visualising pairs of categorical features, use a collection of small
multiple bar plots (small multiples visualisation)
1. Draw a simple bar plot indicating the densities of the different
levels of the first feature
2. For each level of the second feature, draw a bar plot of the first
feature using only the instances in the dataset for which the
second feature has that level
If the two features have a strong relationship, the bar plots for each
level of the second feature will look noticeably different to one
another and to the overall bar plot for the first feature
If there is no relationship, then we expect that the levels of the first
feature will be evenly distributed amongst the instances having the
different levels of the second feature, so all bar plots will look much
the same
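The densities behind such small multiples can be computed with `pandas.crosstab` (hypothetical POSITION and SHOE SPONSOR data; `normalize="columns"` yields one density column per small-multiple bar plot):

```python
import pandas as pd

# Hypothetical categorical data: guards are mostly sponsored.
df = pd.DataFrame({
    "position": ["guard"] * 6 + ["center"] * 4 + ["forward"] * 5,
    "sponsor":  ["yes"] * 5 + ["no"] * 5 + ["yes"] + ["no"] * 4,
})

# Density of POSITION within each level of SHOE SPONSOR
# (each column is the data behind one small-multiple bar plot).
densities = pd.crosstab(df["position"], df["sponsor"], normalize="columns")

# The overall bar plot on the left of the small multiples.
overall = df["position"].value_counts(normalize=True)
```

Here the "yes" and "no" columns differ markedly from each other and from the overall distribution, which is the visual signature of a relationship between the two features.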
28. Visualising Relationships Between Features:
Categorical Features
For visualising pairs of categorical features where the number of levels
for one of the features being compared is no more than three, stacked
bar plots can be used as an alternative to the small multiples
1. Show a bar plot of the first feature above another bar plot that
shows the relative distribution of the levels of the second feature
within each level of the first
2. With relative distributions used, the bars in the second bar plot
cover the full range of the space available
If two features are unrelated, it is expected to see the same proportion
of each level of the second feature within the bars for each level of the
first
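A 100% stacked bar plot can be sketched from a row-normalised crosstab (hypothetical CAREER STAGE and SHOE SPONSOR data; plotting is guarded in case matplotlib is absent):

```python
import pandas as pd

# Hypothetical data: sponsorship proportions are similar across stages.
df = pd.DataFrame({
    "career_stage": ["rookie"] * 4 + ["mid-career"] * 6 + ["veteran"] * 5,
    "sponsor": ["yes", "no", "yes", "no",
                "yes", "yes", "no", "no", "yes", "no",
                "yes", "no", "yes", "no", "no"],
})

# Relative distribution of SHOE SPONSOR within each CAREER STAGE level:
# every row sums to 1, so each stacked bar covers the full height.
props = pd.crosstab(df["career_stage"], df["sponsor"], normalize="index")

try:
    import matplotlib
    matplotlib.use("Agg")
    ax = props.plot(kind="bar", stacked=True)  # 100% stacked bar plot
except ImportError:
    pass
```

With proportions this similar across levels, the stacked bars look nearly identical, suggesting no relationship between the two features.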
30. Visualising Relationships Between Features:
Categorical vs Continuous Features
For visualising pairs of categorical and continuous features, use a small
multiples approach that draws a histogram of the values of the
continuous feature for each level of the categorical feature
Each histogram includes only those instances in the dataset that have
the associated level of the categorical feature
If the features are unrelated (or independent), the histograms for each
level should be very similar
If the features are related, however, the shapes and/or the central
tendencies of the histograms will be different
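The per-level histograms can be sketched with pandas' `hist(by=...)` (synthetic heights per position; plotting is guarded in case matplotlib is absent):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical heights (cm) with different central tendencies per position.
df = pd.DataFrame({
    "position": ["guard"] * 40 + ["forward"] * 40 + ["center"] * 40,
    "height": np.concatenate([
        rng.normal(185, 5, 40),   # guards: shortest on average
        rng.normal(195, 5, 40),
        rng.normal(205, 5, 40),   # centers: tallest on average
    ]),
})

try:
    import matplotlib
    matplotlib.use("Agg")
    # One histogram of HEIGHT per level of POSITION (the small multiples).
    axes = df["height"].hist(by=df["position"])
except ImportError:
    pass

# The differing central tendencies also show up numerically.
means = df.groupby("position")["height"].mean()
```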
33. Visualising Relationships Between Features:
Categorical vs Continuous Features
Another approach to visualising the relationship between a categorical
feature and a continuous feature is to use a collection of box plots
For each level of the categorical feature a box plot of the
corresponding values of the continuous feature is drawn
When a relationship exists between the two features, the box plots
should show differing central tendencies and variations
When no relationship exists, the box plots should all appear similar
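pandas' `DataFrame.boxplot(column=..., by=...)` draws one box plot per level (synthetic ages; the numeric five-number summaries mirror what the box plots show):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical AGE values per position: heavily overlapping distributions.
df = pd.DataFrame({
    "position": ["guard"] * 30 + ["forward"] * 30 + ["center"] * 30,
    "age": np.concatenate([
        rng.uniform(19, 33, 30),
        rng.uniform(19, 34, 30),
        rng.uniform(21, 35, 30),   # centers only slightly older
    ]),
})

try:
    import matplotlib
    matplotlib.use("Agg")
    ax = df.boxplot(column="age", by="position")  # one box plot per level
except ImportError:
    pass

# Quartiles per level (count, mean, 25%/50%/75%, ...) mirror the box plots.
summary = df.groupby("position")["age"].describe()
```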
37. Measuring Covariance and Correlation
In addition to visually inspecting scatter plots, formal measures of the
relationship between two continuous features can be calculated using
covariance and correlation
Covariance values fall in the range (−∞, ∞)
• Negative values indicate a negative relationship
• Positive values indicate a positive relationship
• Values near zero indicate that there is little or no relationship
between the features
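The covariance calculation itself is short. The sketch below (hypothetical values) computes the sample covariance as the mean of products of deviations from each feature's mean, checked against NumPy's built-in:

```python
import numpy as np

# Hypothetical player data: heights strongly related to weights,
# shirt numbers essentially arbitrary.
height = np.array([183.0, 191.0, 196.0, 201.0, 208.0])
weight = np.array([85.0, 92.0, 98.0, 104.0, 112.0])
shirt_number = np.array([7.0, 23.0, 4.0, 31.0, 12.0])

def cov(a, b):
    """Sample covariance: sum of products of deviations over n - 1."""
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

cov_hw = cov(height, weight)        # large positive -> positive relationship
cov_hs = cov(height, shirt_number)  # much smaller in magnitude
```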
38. Measuring Covariance and Correlation
Calculating the covariance between the HEIGHT feature and the
WEIGHT and AGE features in the case study gives
• cov(HEIGHT; WEIGHT) = 241.72, indicating a strong
positive relationship between the height and weight of a player
• cov(HEIGHT; AGE) = 19.7, indicating a much smaller positive
relationship between height and age
A problem with using covariance is that its value is expressed in terms
of the units of the features it is calculated from, so covariances are
hard to interpret: comparing the covariance between pairs of features
only makes sense if each pair is composed of the same mixture of units
39. Measuring Covariance and Correlation
Correlation is a normalised form of covariance that ranges between -1
and 1
corr(a, b) = cov(a, b) / (σa σb)
where cov(a, b) is the covariance between features a and b, and σa
and σb are the standard deviations of a and b, respectively
• Correlation is dimensionless
• Values close to -1 indicate a very strong negative correlation
• Values close to 1 indicate a very strong positive correlation
• Values around 0 indicate little or no linear correlation (note that
zero correlation does not necessarily mean the features are independent)
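The normalisation corr(a, b) = cov(a, b) / (σa σb) can be verified numerically (hypothetical height/weight values; sample statistics with ddof=1 throughout):

```python
import numpy as np

height = np.array([183.0, 191.0, 196.0, 201.0, 208.0])
weight = np.array([85.0, 92.0, 98.0, 104.0, 112.0])

# corr(a, b) = cov(a, b) / (sigma_a * sigma_b)
cov_hw = np.cov(height, weight)[0, 1]
corr_hw = cov_hw / (np.std(height, ddof=1) * np.std(weight, ddof=1))

# Same value from NumPy's built-in correlation matrix.
assert abs(corr_hw - np.corrcoef(height, weight)[0, 1]) < 1e-12
```

Unlike the covariance, the result is dimensionless and bounded, so correlations for different feature pairs can be compared directly.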
40. Measuring Covariance and Correlation
Measuring the covariance and correlation between the HEIGHT feature
and the WEIGHT and AGE features in the case study
• cov(HEIGHT; WEIGHT) = 241.72
• cov(HEIGHT; AGE) = 19.7
• corr(HEIGHT; WEIGHT) = 0.898
• corr(HEIGHT; AGE) = 0.345
41. Covariance and Correlation Matrix
There are typically multiple continuous features between which we
would like to explore relationships
Two tools that can be useful for this are the covariance matrix and the
correlation matrix
The scatter plot matrix (SPLOM) is a visualisation of the correlation
matrix
This can be made more obvious by including the correlation
coefficients in SPLOMs in the cells above the diagonal
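With pandas, both matrices come straight from the DataFrame via the standard `cov` and `corr` methods (synthetic data for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 100

df = pd.DataFrame({"height": rng.normal(194, 10, n)})
df["weight"] = 0.9 * df["height"] - 80 + rng.normal(0, 5, n)
df["age"] = rng.uniform(19, 35, n)

cov_matrix = df.cov()     # main diagonal holds each feature's variance
corr_matrix = df.corr()   # main diagonal is 1.0 by definition
```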
43. Covariance and Correlation Discussion
Correlation is a good measure of the relationship between two
continuous features, but it is not by any means perfect
• The correlation measure responds only to linear relationships
between features
• Peculiarities in a dataset can affect the calculation of the correlation
between two features, illustrated very clearly in the famous example
of Anscombe's quartet by Francis Anscombe
44. Covariance and Correlation Discussion
Anscombe's quartet: A series of four pairs of features that all have the
same correlation value of 0.816, even though they exhibit very
different relationships
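This can be checked directly using the published values for Anscombe's datasets I (linear) and II (curved) (the numbers below are the well-known published values; a sketch assuming NumPy):

```python
import numpy as np

# Anscombe's quartet, datasets I and II (Anscombe, 1973).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
               7.24, 4.26, 10.84, 4.82, 5.68])   # roughly linear
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
               6.13, 3.10, 9.13, 7.26, 4.74])    # clearly curved

r1 = np.corrcoef(x, y1)[0, 1]
r2 = np.corrcoef(x, y2)[0, 1]
# Both correlations come out near 0.816 despite the very different shapes,
# which is why the data should always be plotted as well.
```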
45. Covariance and Correlation Discussion
Perhaps the most important thing to remember in relation to
correlation is that correlation does not necessarily imply causation
Just because the values of two features are correlated does not mean
that an actual causal relationship exists between the two
Based on correlation tests alone, one could conclude that, e.g. the
presence of swallows causes hot weather; in reality, swallows
migrate to warmer countries
46. Textbook References
Fundamentals of Machine Learning for Predictive Data Analytics:
Algorithms, Worked Examples, and Case Studies by JD Kelleher, B Mac
Namee and A D’Arcy (2015)
• Designing and implementing features (pp. 77-91)
• Visualising relationships between features (pp. 127-135)
• Measuring covariance and correlation (pp. 136-140)
47. Tutorial
Please prepare for the tutorial session and go through the tasks as
specified on Canvas.
• Dubai on-campus session at 18:00 (Dubai) on Wednesday
• Edinburgh on-campus session at 09:00 (GMT) on Friday
Editor's Notes
Refer to Chapter 3 of the prescribed textbook by Kelleher et al.
Simple example of prices of properties in a city showing the total area and the total price of the house
Add a new column and calculate the cost per square foot (derived feature)
E.g. house prices in city
Informative, e.g. square footage, council bracket, distance to nearest school
Unique values, e.g. Is it in a council?
Redundant, e.g. council tax/month relate to council bracket
Irrelevant, e.g. odd or even house number
Filter methods
Perform various statistical tests between feature & response (target) to identify which features are more relevant than others
Wrapper methods
Add/remove features to baseline model and compare the performance of the model
Use an optimization algorithm to search for the optimal feature set, and use a model’s performance as objective function
Embedded methods
Algorithm has its own built-in feature selection methods
POSITION that the player normally plays (guard, center, or forward)
CAREER STAGE of the player (rookie, mid-career, or veteran)
average weekly SPONSORSHIP EARNINGS of each player
whether the player has a SHOE SPONSOR (yes or no)
Fig (a) shows an example scatter plot for the HEIGHT and WEIGHT features from the professional basketball team dataset. Broadly linear pattern diagonally across the scatter plot. This suggests that there is a strong, positive, linear relationship between the HEIGHT and WEIGHT features—as height increases, so does weight. We say that features with this kind of relationship are positively covariant.
Fig (b) shows a scatter plot for the SPONSORSHIP EARNINGS and AGE features. The opposite of the previous figure occurs: sponsorship earnings decrease as age increases. These features are therefore strongly negatively covariant.
Fig (c) shows a scatter plot of the HEIGHT and AGE features. There is clearly no linear pattern and these features are therefore not strongly covariant either positively or negatively.
Each row and column represent the feature named in the cells along the diagonal. The cells above and below the diagonal show scatter plots of the features in the row and column that meet at that cell.
FIRST SET:
The bar plot on the left shows the distribution of the different levels of the CAREER STAGE feature across the entire dataset.
The two plots on the right show the distributions for those players with and without a shoe sponsor.
Since all three plots show very similar distributions, we can conclude that no real relationship exists between these two features and that players of any career stage are equally likely to have a shoe sponsor or not.
SECOND SET:
In this case, the three plots are very different, so we can conclude that there is a relationship between these two features. It seems that players who play in the guard position are much more likely to have a shoe sponsor than forwards or centers.
When using small multiples, it is important that all the small charts are kept consistent because this ensures that only genuine differences within the data are highlighted, rather than differences that arise from formatting. For example, the scales of the axes must always be kept consistent, as should the order of the bars in the individual bar plots.
It is also important that densities are shown rather than frequencies as the overall bar plots on the left of each visualization cover much more of the dataset than the other two plots, so frequency-based plots would look very uneven.
These are two examples of stacked bar plots. In figure (a) on the left, a bar plot of the CAREER STAGE feature is shown above a 100% stacked bar plot showing how the levels of the SHOE SPONSOR feature are distributed in instances having each level of CAREER STAGE. The distributions of the levels of SHOE SPONSOR are almost the same for each level of CAREER STAGE, and therefore we can conclude that there is no relationship between these two features.
In figure (b) on the right, the POSITION and SHOE SPONSOR features are shown. In this case we can see that distributions of the levels of the SHOE SPONSOR feature are not the same for each position. From this we can again conclude that guards are more likely to have a shoe sponsor than players in the other positions.
AGE follows a uniform distribution across a range from about 19 to about 35
These histograms show a slight tendency for centers to be a little older than guards and forwards, but the relationship does not appear very strong as each of the smaller histograms is similar to the overall uniform distribution of the AGE feature
HEIGHT follows a normal distribution centered around a mean of approximately 194
The three smaller histograms depart from this distribution and suggest that centers tend to be taller than forwards, who in turn tend to be taller than guards
Fig (a) shows a box plot for AGE across the full dataset, while Fig (b) shows individual box plots for AGE for each level of the POSITION feature.
This visualization shows a slight indication that centers tend to be older than forwards and guards, but the three box plots overlap significantly, suggesting that this relationship is not very strong.
In Fig (a), the box plot for the HEIGHT feature across the entire dataset is plotted, while Fig (b) shows the individual height box plots for each level of the POSITION feature.
Fig (b) is typical of a series of box plots showing a strong relationship between a continuous and a categorical feature. We can see that the average height of centers is above that of forwards, which in turn is above that of guards. Although the whiskers show that there is some overlap between the three groups, they do appear to be well separated.
A covariance matrix contains a row and column for each feature, and each element of the matrix lists the covariance between the corresponding pairs of features. As a result, the elements along the main diagonal list the covariance between a feature and itself, in other words, the variance of the feature.
Use the PerformanceAnalytics package in R to generate the SPLOMs, which will include the correlation coefficient between each pair of continuous features in the top half of the matrix
In this figure, the cells above the diagonal show the correlation coefficients for each pair of features. The font sizes of the correlation coefficients are scaled according to the absolute value of the strength of the correlation to draw attention to those pairs of features with the strongest relationships.
Francis Anscombe (English statistician)
The effects of curvature (top right) and outliers (bottom) drastically throw off the summary statistics
Illustrates how important it is to always plot the data, rather than relying on summary statistics only