1. C11BD Big Data Analytics
Data Exploration II
Dr Dhanan Utomo
Assistant Professor in Business & Management
Edinburgh Business School
2. Outline
• Quick recap
• Feature selection
• Introduction to advanced data exploration
• Visualising relationships between features
• Covariance and correlation
3. Learning Outcomes
• Understand how to indicate which descriptive features may be useful
for predicting a target feature
4. Quick recap
Where are we in the course:
Assessments:
Week 5: Coursework 1 is due on Thursday, 15th February
Week 5: Coursework 2 will be introduced
Weeks | Topic | Lecturer: Dubai | Lecturer: Edinburgh
3 | Data Exploration I | Dr Tarek Kandil | Dr Paulus Aditjandra
4 | Data Exploration II | Dr Tarek Kandil | Dr Dhanan Utomo
5 | Intro to modelling and machine learning | Dr Tarek Kandil | Dr Dhanan Utomo
5. Recap on Coursework 1
• The individual assignment counts towards 40% of your course mark
• Refer to the instructions on Canvas
7. Designing Features
A feature is any measurable input that can be used in a predictive
model (e.g. salary or bank balance in predicting loan default)
8. Designing Features
A feature is any measurable input that can be used in a predictive
model (e.g. salary or bank balance in predicting loan default)
Three key data considerations are particularly important when we are
designing features:
1. Data availability
• Are values for that feature available in the database?
• For a derived feature, are the values available to compute the derived
feature?
2. Timing
• When will data become available for a feature?
3. Longevity
• Data may become stale, e.g. salary may differ from that at time of making a
loan application
Feature design and implementation is an iterative process
9. Designing Features
The features in an ABT (analytics base table) can be either
• Raw features: Concrete, directly measurable features stored
in the database
• Derived features: Calculated from raw features, e.g. body mass index,
calculated as the ratio of mass to height squared
10. Designing Features
The features in an ABT can be either
• Raw features: Concrete, directly measurable features stored
in the database
• Derived features: Calculated from raw features, e.g. body mass index,
calculated as the ratio of mass to height squared
Feature engineering is the process of selecting, manipulating and
transforming raw data into derived features.
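The BMI example can be sketched as a derived feature in a pandas DataFrame (a minimal sketch; the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical raw features for three people.
df = pd.DataFrame({
    "mass_kg": [70.0, 85.0, 60.0],
    "height_m": [1.75, 1.80, 1.65],
})

# Derived feature: body mass index = mass / height squared.
df["bmi"] = df["mass_kg"] / df["height_m"] ** 2
```

The derived column lives in the ABT alongside the raw features it was computed from, which is why data availability for the raw inputs matters.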
12. Feature Selection
Why feature selection? Fewer features result in
• Simpler models
• Shorter training times
• Less overfitting, therefore better generalization
12
13. Feature Selection
Why feature selection? Fewer features result in
• Simpler models
• Shorter training times
• Less overfitting, therefore better generalization
Too many versus too few features
• Too few features may result in more false positives/negatives
(underfitting)
• Too few features do not have sufficient discriminative power
• Too many features add noise to the training data and can lead to
overfitting, in addition to increased computational complexity
14. Feature Selection
Types of features
• Informative – features correlated with the output variables/targets
• Features with unique values or very small deviation – should be
removed
• Redundant features – strongly correlated with other features
• Irrelevant features – noise, have no correlation with output variables
15. Feature Selection
What is feature selection?
• The process of selecting a subset of the most relevant features for
use in model construction
• Remove features without loss of information
• Keep the features that describe the variance in a dataset
Approaches to feature selection
• Filter methods
• Wrapper methods
• Embedded methods
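As a sketch of the filter approach only, each feature can be scored by the absolute value of its correlation with the target, keeping the top-ranked ones (the synthetic data and feature names below are illustrative; wrapper and embedded methods would instead query a model's performance):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic features: one informative, one redundant copy, one pure noise.
informative = rng.normal(size=n)
redundant = informative + rng.normal(scale=0.1, size=n)  # correlated with 'informative'
noise = rng.normal(size=n)                               # irrelevant
target = 2.0 * informative + rng.normal(scale=0.5, size=n)

X = np.column_stack([informative, redundant, noise])
names = ["informative", "redundant", "noise"]

# Filter method: statistical score per feature, independent of any model.
scores = np.array([abs(np.corrcoef(X[:, j], target)[0, 1])
                   for j in range(X.shape[1])])
ranking = sorted(zip(names, scores), key=lambda p: -p[1])
```

Note that this filter scores the redundant copy almost as highly as the informative feature, which is why redundancy between features needs a separate check (e.g. pairwise correlations).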
19. Advanced Data Exploration
In data exploration, we looked at descriptive statistics and data
visualisation techniques of the characteristics of individual features
In advanced data exploration, techniques can be considered to enable
the examination and analysis of relationships between pairs of
features, in order to assist in
• Indicating which descriptive features might be useful for predicting a
target feature
• Finding pairs of descriptive features that are closely related
22. Visualising Relationships Between Features:
Continuous Features
For visualising pairs of continuous features, use a scatter plot
• A scatter plot is based on two axes: The horizontal axis represents
one feature and the vertical axis represents a second
• Each instance in a dataset is represented by a point on the plot
determined by the values for that instance of the two features
involved
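A scatter plot of this kind can be sketched as follows (hypothetical height/weight values; plotting is attempted only if matplotlib is installed, since the numbers alone already show the relationship):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical player data: weight broadly increases with height.
height = rng.normal(194, 10, size=50)
weight = 0.9 * height - 80 + rng.normal(0, 5, size=50)

try:
    import matplotlib
    matplotlib.use("Agg")  # render off-screen
    import matplotlib.pyplot as plt

    # One point per instance, positioned by its two feature values.
    fig, ax = plt.subplots()
    ax.scatter(height, weight)
    ax.set_xlabel("HEIGHT (cm)")
    ax.set_ylabel("WEIGHT (kg)")
    fig.savefig("height_vs_weight.png")
except ImportError:
    pass  # matplotlib not installed; skip the drawing
```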
23. Visualising Relationships Between Features:
Continuous Features
A scatter plot matrix (SPLOM) shows scatter plots for a whole
collection of features arranged into a matrix
This is useful for exploring the relationships between groups of
features (e.g. all of the continuous features in an ABT)
The effectiveness of scatter plot matrices diminishes once the number
of features in the set goes beyond eight, because the individual plots
become too small
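pandas ships a SPLOM helper, `pandas.plotting.scatter_matrix` (a sketch with synthetic features; the drawing itself needs matplotlib, so it is guarded here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100

# Synthetic continuous features from a hypothetical ABT.
df = pd.DataFrame({
    "height": rng.normal(194, 10, n),
    "age": rng.integers(19, 36, n).astype(float),
})
df["weight"] = 0.9 * df["height"] - 80 + rng.normal(0, 5, n)

try:
    import matplotlib
    matplotlib.use("Agg")
    from pandas.plotting import scatter_matrix

    # One scatter plot per pair of features, histograms on the diagonal.
    axes = scatter_matrix(df, figsize=(6, 6))
except ImportError:
    pass
```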
25. Visualising Relationships Between Features:
Categorical Features
For visualising pairs of categorical features, use a collection of small
multiple bar plots (small multiples visualisation)
1. Draw a simple bar plot indicating the densities of the different
levels of the first feature
2. For each level of the second feature, draw a bar plot of the first
feature using only the instances in the dataset for which the
second feature has that level
26. Visualising Relationships Between Features:
Categorical Features
For visualising pairs of categorical features, use a collection of small
multiple bar plots (small multiples visualisation)
1. Draw a simple bar plot indicating the densities of the different
levels of the first feature
2. For each level of the second feature, draw a bar plot of the first
feature using only the instances in the dataset for which the
second feature has that level
If the two features have a strong relationship, the bar plots for each
level of the second feature will look noticeably different to one
another and to the overall bar plot for the first feature
If there is no relationship, then we expect that the levels of the first
feature will be evenly distributed amongst the instances having the
different levels of the second feature, so all bar plots will look much
the same
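The densities behind such small multiples can be computed with `pandas.crosstab` (hypothetical POSITION and SHOE SPONSOR data; `normalize="columns"` yields one density column per small-multiple bar plot):

```python
import pandas as pd

# Hypothetical categorical data: guards are mostly sponsored.
df = pd.DataFrame({
    "position": ["guard"] * 6 + ["center"] * 4 + ["forward"] * 5,
    "sponsor":  ["yes"] * 5 + ["no"] * 5 + ["yes"] + ["no"] * 4,
})

# Density of POSITION within each level of SHOE SPONSOR
# (each column is the data behind one small-multiple bar plot).
densities = pd.crosstab(df["position"], df["sponsor"], normalize="columns")

# The overall bar plot on the left of the small multiples.
overall = df["position"].value_counts(normalize=True)
```

Here the "yes" and "no" columns differ markedly from each other and from the overall distribution, which is the visual signature of a relationship between the two features.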
28. Visualising Relationships Between Features:
Categorical Features
For visualising pairs of categorical features where the number of levels
for one of the features being compared is no more than three, stacked
bar plots can be used as an alternative to the small multiples
1. Show a bar plot of the first feature above another bar plot that
shows the relative distribution of the levels of the second feature
within each level of the first
2. With relative distributions used, the bars in the second bar plot
cover the full range of the space available
If two features are unrelated, it is expected to see the same proportion
of each level of the second feature within the bars for each level of the
first
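A 100% stacked bar plot can be sketched from a row-normalised crosstab (hypothetical CAREER STAGE and SHOE SPONSOR data; plotting is guarded in case matplotlib is absent):

```python
import pandas as pd

# Hypothetical data: sponsorship proportions are similar across stages.
df = pd.DataFrame({
    "career_stage": ["rookie"] * 4 + ["mid-career"] * 6 + ["veteran"] * 5,
    "sponsor": ["yes", "no", "yes", "no",
                "yes", "yes", "no", "no", "yes", "no",
                "yes", "no", "yes", "no", "no"],
})

# Relative distribution of SHOE SPONSOR within each CAREER STAGE level:
# every row sums to 1, so each stacked bar covers the full height.
props = pd.crosstab(df["career_stage"], df["sponsor"], normalize="index")

try:
    import matplotlib
    matplotlib.use("Agg")
    ax = props.plot(kind="bar", stacked=True)  # 100% stacked bar plot
except ImportError:
    pass
```

With proportions this similar across levels, the stacked bars look nearly identical, suggesting no relationship between the two features.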
30. Visualising Relationships Between Features:
Categorical vs Continuous Features
For visualising pairs of categorical and continuous features, use a small
multiples approach that draws a histogram of the values of the
continuous feature for each level of the categorical feature
Each histogram includes only those instances in the dataset that have
the associated level of the categorical feature
If the features are unrelated (or independent), the histograms for each
level should be very similar
If the features are related, however, the shapes and/or the central
tendencies of the histograms will be different
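The per-level histograms can be sketched with pandas' `hist(by=...)` (synthetic heights per position; plotting is guarded in case matplotlib is absent):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical heights (cm) with different central tendencies per position.
df = pd.DataFrame({
    "position": ["guard"] * 40 + ["forward"] * 40 + ["center"] * 40,
    "height": np.concatenate([
        rng.normal(185, 5, 40),   # guards: shortest on average
        rng.normal(195, 5, 40),
        rng.normal(205, 5, 40),   # centers: tallest on average
    ]),
})

try:
    import matplotlib
    matplotlib.use("Agg")
    # One histogram of HEIGHT per level of POSITION (the small multiples).
    axes = df["height"].hist(by=df["position"])
except ImportError:
    pass

# The differing central tendencies also show up numerically.
means = df.groupby("position")["height"].mean()
```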
33. Visualising Relationships Between Features:
Categorical vs Continuous Features
Another approach to visualising the relationship between a categorical
feature and a continuous feature is to use a collection of box plots
For each level of the categorical feature a box plot of the
corresponding values of the continuous feature is drawn
When a relationship exists between the two features, the box plots
should show differing central tendencies and variations
When no relationship exists, the box plots should all appear similar
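pandas' `DataFrame.boxplot(column=..., by=...)` draws one box plot per level (synthetic ages; the numeric five-number summaries mirror what the box plots show):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical AGE values per position: heavily overlapping distributions.
df = pd.DataFrame({
    "position": ["guard"] * 30 + ["forward"] * 30 + ["center"] * 30,
    "age": np.concatenate([
        rng.uniform(19, 33, 30),
        rng.uniform(19, 34, 30),
        rng.uniform(21, 35, 30),   # centers only slightly older
    ]),
})

try:
    import matplotlib
    matplotlib.use("Agg")
    ax = df.boxplot(column="age", by="position")  # one box plot per level
except ImportError:
    pass

# Quartiles per level (count, mean, 25%/50%/75%, ...) mirror the box plots.
summary = df.groupby("position")["age"].describe()
```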
37. Measuring Covariance and Correlation
In addition to visually inspecting scatter plots, formal measures of the
relationship between two continuous features can be calculated using
covariance and correlation
Covariance values fall in the range (−∞, ∞)
• Negative values indicate a negative relationship
• Positive values indicate a positive relationship
• Values near zero indicate that there is little or no relationship
between the features
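The covariance calculation itself is short. The sketch below (hypothetical values) computes the sample covariance as the mean of products of deviations from each feature's mean, checked against NumPy's built-in:

```python
import numpy as np

# Hypothetical player data: heights strongly related to weights,
# shirt numbers essentially arbitrary.
height = np.array([183.0, 191.0, 196.0, 201.0, 208.0])
weight = np.array([85.0, 92.0, 98.0, 104.0, 112.0])
shirt_number = np.array([7.0, 23.0, 4.0, 31.0, 12.0])

def cov(a, b):
    """Sample covariance: sum of products of deviations over n - 1."""
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

cov_hw = cov(height, weight)        # large positive -> positive relationship
cov_hs = cov(height, shirt_number)  # much smaller in magnitude
```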
38. Measuring Covariance and Correlation
Calculating the covariance between the HEIGHT feature and the
WEIGHT and AGE features in the case study gives
• cov(HEIGHT; WEIGHT) = 241.72, indicating a strong
positive relationship between the height and weight of a player
• cov(HEIGHT; AGE) = 19.7, indicating a much smaller positive
relationship between height and age
A problem with using covariance is that its value is expressed in terms
of the units of the features it is calculated from, so covariances are
hard to interpret: comparing the covariance between pairs of features
only makes sense if each pair is composed of the same mixture of units
39. Measuring Covariance and Correlation
Correlation is a normalised form of covariance that ranges between -1
and 1
corr(a, b) = cov(a, b) / (σa σb)
where cov(a, b) is the covariance between features a and b, and σa
and σb are the standard deviations of a and b, respectively
• Correlation is dimensionless
• Values close to -1 indicate a very strong negative correlation
• Values close to 1 indicate a very strong positive correlation
• Values around 0 indicate little or no linear correlation (note that
zero correlation does not necessarily mean the features are independent)
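The normalisation corr(a, b) = cov(a, b) / (σa σb) can be verified numerically (hypothetical height/weight values; sample statistics with ddof=1 throughout):

```python
import numpy as np

height = np.array([183.0, 191.0, 196.0, 201.0, 208.0])
weight = np.array([85.0, 92.0, 98.0, 104.0, 112.0])

# corr(a, b) = cov(a, b) / (sigma_a * sigma_b)
cov_hw = np.cov(height, weight)[0, 1]
corr_hw = cov_hw / (np.std(height, ddof=1) * np.std(weight, ddof=1))

# Same value from NumPy's built-in correlation matrix.
assert abs(corr_hw - np.corrcoef(height, weight)[0, 1]) < 1e-12
```

Unlike the covariance, the result is dimensionless and bounded, so correlations for different feature pairs can be compared directly.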
40. Measuring Covariance and Correlation
Measuring the covariance and correlation between the HEIGHT feature
and the WEIGHT and AGE features in the case study
• cov(HEIGHT; WEIGHT) = 241.72
• cov(HEIGHT; AGE) = 19.7
• corr(HEIGHT; WEIGHT) = 0.898
• corr(HEIGHT; AGE) = 0.345
41. Covariance and Correlation Matrix
There are typically multiple continuous features between which we
would like to explore relationships
Two tools that can be useful for this are the covariance matrix and the
correlation matrix
The scatter plot matrix (SPLOM) is a visualisation of the correlation
matrix
This can be made more obvious by including the correlation
coefficients in SPLOMs in the cells above the diagonal
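With pandas, both matrices come straight from the DataFrame via the standard `cov` and `corr` methods (synthetic data for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 100

df = pd.DataFrame({"height": rng.normal(194, 10, n)})
df["weight"] = 0.9 * df["height"] - 80 + rng.normal(0, 5, n)
df["age"] = rng.uniform(19, 35, n)

cov_matrix = df.cov()     # main diagonal holds each feature's variance
corr_matrix = df.corr()   # main diagonal is 1.0 by definition
```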
43. Covariance and Correlation Discussion
Correlation is a good measure of the relationship between two
continuous features, but it is not by any means perfect
• The correlation measure responds only to linear relationships
between features
• Peculiarities in a dataset can affect the calculation of the correlation
between two features, illustrated very clearly in the famous example
of Anscombe's quartet by Francis Anscombe
44. Covariance and Correlation Discussion
Anscombe's quartet: A series of four pairs of features that all have the
same correlation value of 0.816, even though they exhibit very
different relationships
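This can be checked directly using the published values for Anscombe's datasets I (linear) and II (curved) (the numbers below are the well-known published values; a sketch assuming NumPy):

```python
import numpy as np

# Anscombe's quartet, datasets I and II (Anscombe, 1973).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
               7.24, 4.26, 10.84, 4.82, 5.68])   # roughly linear
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
               6.13, 3.10, 9.13, 7.26, 4.74])    # clearly curved

r1 = np.corrcoef(x, y1)[0, 1]
r2 = np.corrcoef(x, y2)[0, 1]
# Both correlations come out near 0.816 despite the very different shapes,
# which is why the data should always be plotted as well.
```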
45. Covariance and Correlation Discussion
Perhaps the most important thing to remember in relation to
correlation is that correlation does not necessarily imply causation
Just because the values of two features are correlated does not mean
that an actual causal relationship exists between the two
Based on correlation tests alone, one could conclude that, e.g. the
presence of swallows causes hot weather; in reality, swallows
migrate to warmer countries
46. Textbook References
Fundamentals of Machine Learning for Predictive Data Analytics:
Algorithms, Worked Examples, and Case Studies by JD Kelleher, B Mac
Namee and A D’Arcy (2015)
• Designing and implementing features (pp. 77-91)
• Visualising relationships between features (pp. 127-135)
• Measuring covariance and correlation (pp. 136-140)
47. Tutorial
Please prepare for the tutorial session and go through the tasks as
specified on Canvas.
• Dubai on-campus session at 18:00 (Dubai) on Wednesday
• Edinburgh on-campus session at 09:00 (GMT) on Friday
Editor's Notes
Refer to Chapter 3 of the prescribed textbook by Kelleher et al.
Simple example of prices of properties in a city showing the total area and the total price of the house
Add a new column and calculate the cost per square foot (derived feature)
E.g. house prices in city
Informative, e.g. square footage, council bracket, distance to nearest school
Unique values, e.g. Is it in a council?
Redundant, e.g. council tax/month relate to council bracket
Irrelevant, e.g. odd or even house number
Filter methods
Perform various statistical tests between feature & response (target) to identify which features are more relevant than others
Wrapper methods
Add/remove features to baseline model and compare the performance of the model
Use an optimization algorithm to search for the optimal feature set, and use a model’s performance as objective function
Embedded methods
Algorithm has its own built-in feature selection methods
POSITION that the player normally plays (guard, center, or forward)
CAREER STAGE of the player (rookie, mid-career, or veteran)
average weekly SPONSORSHIP EARNINGS of each player
whether the player has a SHOE SPONSOR (yes or no)
Fig (a) shows an example scatter plot for the HEIGHT and WEIGHT features from the professional basketball team dataset. Broadly linear pattern diagonally across the scatter plot. This suggests that there is a strong, positive, linear relationship between the HEIGHT and WEIGHT features—as height increases, so does weight. We say that features with this kind of relationship are positively covariant.
Fig (b) shows a scatter plot for the SPONSORSHIP EARNINGS and AGE features. The opposite of the previous figure occurs: sponsorship earnings decrease as age increases. These features are therefore strongly negatively covariant.
Fig (c) shows a scatter plot of the HEIGHT and AGE features. There is clearly no linear pattern and these features are therefore not strongly covariant either positively or negatively.
Each row and column represent the feature named in the cells along the diagonal. The cells above and below the diagonal show scatter plots of the features in the row and column that meet at that cell.
FIRST SET:
The bar plot on the left shows the distribution of the different levels of the CAREER STAGE feature across the entire dataset.
The two plots on the right show the distributions for those players with and without a shoe sponsor.
Since all three plots show very similar distributions, we can conclude that no real relationship exists between these two features and that players of any career stage are equally likely to have a shoe sponsor or not.
SECOND SET:
In this case, the three plots are very different, so we can conclude that there is a relationship between these two features. It seems that players who play in the guard position are much more likely to have a shoe sponsor than forwards or centers.
When using small multiples, it is important that all the small charts are kept consistent because this ensures that only genuine differences within the data are highlighted, rather than differences that arise from formatting. For example, the scales of the axes must always be kept consistent, as should the order of the bars in the individual bar plots.
It is also important that densities are shown rather than frequencies as the overall bar plots on the left of each visualization cover much more of the dataset than the other two plots, so frequency-based plots would look very uneven.
These are two examples of stacked bar plots. In figure (a) on the left, a bar plot of the CAREER STAGE feature is shown above a 100% stacked bar plot showing how the levels of the SHOE SPONSOR feature are distributed in instances having each level of CAREER STAGE. The distributions of the levels of SHOE SPONSOR are almost the same for each level of CAREER STAGE, and therefore we can conclude that there is no relationship between these two features.
In figure (b) on the right, the POSITION and SHOE SPONSOR features are shown. In this case we can see that distributions of the levels of the SHOE SPONSOR feature are not the same for each position. From this we can again conclude that guards are more likely to have a shoe sponsor than players in the other positions.
AGE follows a uniform distribution across a range from about 19 to about 35
These histograms show a slight tendency for centers to be a little older than guards and forwards, but the relationship does not appear very strong as each of the smaller histograms is similar to the overall uniform distribution of the AGE feature
HEIGHT follows a normal distribution centered around a mean of approximately 194
The three smaller histograms depart from this distribution and suggest that centers tend to be taller than forwards, who in turn tend to be taller than guards
Fig (a) shows a box plot for AGE across the full dataset, while Fig (b) shows individual box plots for AGE for each level of the POSITION feature.
This visualization shows a slight indication that centers tend to be older than forwards and guards, but the three box plots overlap significantly, suggesting that this relationship is not very strong.
In Fig (a), the box plot for the HEIGHT feature across the entire dataset is plotted, while Fig (b) shows the individual height box plots for each level of the POSITION feature.
Fig (b) is typical of a series of box plots showing a strong relationship between a continuous and a categorical feature. We can see that the average height of centers is above that of forwards, which in turn is above that of guards. Although the whiskers show that there is some overlap between the three groups, they do appear to be well separated.
A covariance matrix contains a row and column for each feature, and each element of the matrix lists the covariance between the corresponding pairs of features. As a result, the elements along the main diagonal list the covariance between a feature and itself, in other words, the variance of the feature.
Use the PerformanceAnalytics package in R to generate the SPLOMs, which will include the correlation coefficient between each pair of continuous features in the top half of the matrix
In this figure, the cells above the diagonal show the correlation coefficients for each pair of features. The font sizes of the correlation coefficients are scaled according to the absolute value of the strength of the correlation to draw attention to those pairs of features with the strongest relationships.
Francis Anscombe (English statistician)
The effects of curvature (top right) and outliers (bottom) drastically throw off the summary statistics
Illustrates how important it is to always plot the data, rather than relying on summary statistics only