Data Science Chapter 4: Machine Learning 101

Excerpts from Data Science (Chapter 4)
by Kelleher, J.D. and Tierney, B. (2018)
Data Science
Machine Learning 101
Mpumelelo Ndlovu
January 7, 2023

1
“The real challenge in using ML is to
find the algorithm whose learning
bias is the best match for a particular
data set”
Kelleher & Tierney, (2018)
Mpumelelo Ndlovu | Data Science: Machine Learning 101 | Kelleher & Tierney (2018) | Chapter 4

2
Plan of Talk
Introduction
Classification of Algorithms
Prediction Models
Regression Models
Neural Networks and Deep Learning
Decision Trees
Bias in Data Science
Conclusion
References and Bibliography

3
Introduction
☞ Chapter 3 introduced the computing infrastructure used by
data scientists.
☞ Chapter 4 introduces data science as a partnership between the
data scientist and the computer.
☞ As shown in Figure 1 below, the data scientist does most of the
heavy lifting.
☞ The sequence of decisions taken by the data scientist is
determined by the CRISP-DM¹ model described in Chapter 2 of
this book.
☞ Machine learning is the field of study that develops and evaluates
the algorithms used by the computer to identify and extract
patterns in data.
☞ Machine learning algorithms are mainly applied during the
Modelling phase of the CRISP-DM framework.
¹ https://www.ibm.com/docs/en/spss-modeler/saas?topic=dm-crisp-help-overview

4
The Data Scientist - Computer Partnership
Data Science
The Data Scientist
Defines the problem
Designs the data set
Prepares the data
Decides on the type of data analysis
Evaluates & interprets the result
The Computer
Processes data
Searches for patterns in the data
Figure 1 : The Scientist and the Computer

5
The Modelling Phase
☞ The Modelling stage of the CRISP-DM process is split into two phases:
Phase 1 of the Modelling Stage
1. The algorithm is applied to the data set to identify useful patterns.
2. The patterns can be represented in many ways, called models, which
give this stage of the CRISP-DM framework its name.
3. These models include decision trees, regression models, and neural
networks.
Phase 2 of the Modelling Stage
1. The models, the output of Phase 1, are used for data analysis.
2. Sometimes the structure of the model itself can reveal what the
important attributes are, for example the factors that strongly correlate
with stroke.
3. A model can also be used to classify or label new examples, such as new
types of spam emails.

6
Supervised Learning
The majority of algorithms can be classified as either
supervised learning or unsupervised learning.
Supervised Learning Algorithms
- learn a function that maps the values of the attributes
that describe an instance to the value of a target attribute;
- the output pattern is a function that maps
the input attributes to the values of the
target attribute;
- each instance in the data set must be labelled
with the value of the target attribute;
- search through many candidate functions to find the one that
best maps inputs to outputs;
- use a learning bias to limit the set of functions
they prefer.
Examples:
- Regression (linear, polynomial)
- Decision trees
- Random Forests
- Classification (KNN, Logistic regression, Naive Bayes, SVM)

7
Unsupervised Learning
Unsupervised Learning Algorithms
- there is no target attribute, so there is no need
to spend time labelling the instances in the data
set with a target value;
- without a target attribute, learning is more
difficult;
- these algorithms search for regularities in
the data;
- the main challenge for clustering is to measure
the similarity between the instances in the
data set;
- Table 3 below lists some of the common
similarity measures.
Examples:
- Clustering (K-Means)
- Association-rule Mining (Apriori algorithm)
- Dimensionality Reduction (PCA)
- Gaussian Mixture Models (GMM)
Table 2 : Unsupervised Learning

8
Unsupervised Learning
Common Similarity Measures
Similarity Measure | Notes
Euclidean Distance | straight-line distance; used where all attributes are numeric and have similar ranges
Jaccard Similarity | compares two sets of data by the members they share and the members that are distinct
Cosine Similarity / Adjusted Cosine Similarity | suitable for numeric attributes with different ranges, which need to be normalized before calculating similarity
Weighted Similarity | takes the importance of the attributes into account by ranking them before calculating similarity
Table 3 : Similarity Measures
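A minimal sketch of three of the measures in Table 3, using only the Python standard library. The function names and the sample vectors are illustrative assumptions and do not come from the chapter; weighted similarity is omitted because the slide does not say how the attribute ranks are turned into weights.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard_similarity(a, b):
    """Shared members divided by all distinct members of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical instances, not taken from the chapter.
p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]
print(euclidean_distance(p, q))                               # 5.0
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))   # 0.5
print(cosine_similarity(p, [2.0, 4.0, 6.0]))                  # close to 1: same direction
```

Cosine similarity ignores the length of the vectors, which is why it tolerates attributes with different ranges better than Euclidean distance does.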

9
Learning Prediction Models
✍ Prediction algorithms estimate the value of a target attribute
based on the values of other attributes
✍ Prediction models are produced by supervised learning
algorithms.
✍ Prediction is the most popular type of problem that ML is used for.
✍ One concept that is fundamental to prediction problems is
correlation analysis.
Figure 2 : Correlation Analysis (Source: Scribbr²)
² https://www.scribbr.com/statistics/correlation-coefficient/

10
Correlation Analysis
✍ A correlation describes the strength of the association between 2 attributes.
✍ The Pearson correlation coefficient (r) is the most common measure of
the strength of the linear relationship between 2 numeric attributes; its
values range from −1 to 1.
✍ A coefficient of r = 0 means that the attributes are not linearly correlated;
r = +1 means perfect positive correlation; and r = −1 indicates
that the 2 attributes have a perfect negative correlation.
✍ Identifying the attributes that are highly correlated with the target
attribute is key to understanding the possible causes of an issue.
✍ Like correlation analysis, prediction techniques involve analysing
the relationships between attributes.
✍ If a strong correlation exists between an input attribute and a target
attribute, then the ML algorithm is likely to generate an accurate
prediction model; if the correlation is weak, an accurate model is less likely.

11
Pearson Correlation Analysis
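- To calculate the correlation for the population:
$$\rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} \qquad (1)$$
- To calculate the estimate from a sample:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (2)$$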
- Table 4 below shows the general guidelines for interpreting Pearson
coefficients:
r ≈ ±0.7 | r ≈ ±0.5 | r ≈ ±0.3 | r ≈ 0
Strong linear relationship | Moderate linear relationship | Weak relationship | No relationship
Table 4 : Interpreting Pearson Coefficients
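The sample Pearson coefficient can be sketched in a few lines of Python. This is an illustration, not code from the chapter: the function names and sample values are hypothetical, and the hard cut-offs in `interpret` are one possible reading of the approximate guidelines in Table 4.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    if denominator == 0:
        raise ValueError("correlation is undefined when an attribute is constant")
    return numerator / denominator

def interpret(r):
    """Map |r| to a label; the exact thresholds are an assumption."""
    strength = abs(r)
    if strength >= 0.7:
        return "strong linear relationship"
    if strength >= 0.5:
        return "moderate linear relationship"
    if strength >= 0.3:
        return "weak relationship"
    return "no relationship"

# Hypothetical values: daily exercise minutes against a risk score.
exercise = [130, 80, 60, 30, 0]
risk = [0.05, 0.11, 0.12, 0.15, 0.18]
r = pearson_r(exercise, risk)
print(round(r, 3), interpret(r))
```

A negative coefficient here would mean that higher exercise values go with lower risk values; the sign gives the direction and the magnitude gives the strength.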

12
The BMI Example
The Data Set
ID | Height (m) | Weight (kg) | Shoe Size | Exercise (minutes/day) | Diabetes (% Likelihood)
1 | 1.70 | 70 | 5 | 130 | 0.05
2 | 1.77 | 88 | 9 | 80 | 0.11
3 | 1.85 | 112 | 11 | 0 | 0.18
Table 5 : Diabetes Data Set

13
The BMI Example
✍ A popular real-life example in correlation analysis is the
Body Mass Index (BMI), which is used to classify people as
underweight, normal weight, or overweight.
✍ BMI:
- takes a number of attributes (weight and height) and maps them to a new,
derived value;
- makes it easy to calculate the correlation between BMI and a
person’s other attributes.
✍ Diabetes has a higher correlation with BMI than with weight or
height independently.
✍ During data preparation, it is also important to check the effect of
combinations of attributes, such as BMI.
✍ Another benefit of ML is that ML algorithms can learn interactions
between attributes and create useful derived attributes.
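As a small illustration of a derived attribute, the sketch below computes BMI for the three rows of Table 5 using the standard definition, weight in kilograms divided by the square of height in metres. The slides do not spell out this formula, and the dictionary layout and function name are assumptions made for the example.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

# The three instances listed in Table 5.
rows = [
    {"id": 1, "height_m": 1.70, "weight_kg": 70},
    {"id": 2, "height_m": 1.77, "weight_kg": 88},
    {"id": 3, "height_m": 1.85, "weight_kg": 112},
]

# Add the derived attribute to each instance during data preparation.
for row in rows:
    row["bmi"] = round(bmi(row["weight_kg"], row["height_m"]), 2)

print([row["bmi"] for row in rows])  # [24.22, 28.09, 32.72]
```

Once the derived column exists, it can be fed into a correlation calculation or a regression model in the same way as any original attribute.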

14
Linear Regression
✍ Regression models are preferred when the data set consists of
numeric attributes.
✍ The first step is to hypothesize the structure of the relationship between
the attributes and express it as a parameterized mathematical model
called a regression function.
Figure 3 : Regression Functions

15
The Regression Function
✍ The regression function converts inputs into outputs.
✍ The best approach in linear regression analysis is to assume a
simple model first before considering one with more inputs.
✍ A simple, single-input regression function models the
relationship between 2 attributes, X and Y:
$$Y = w_0 + w_1 X \qquad (3)$$
✍ The variables w0 and w1 are the parameters of the regression
function, where w0 is the Y intercept and w1 is the gradient of the
line.
✍ Modifying these parameters changes how the function maps
from the input X to the output Y.
✍ Finding the parameters is equivalent to defining the line that best fits
our data by minimizing the overall error.
✍ In high school mathematics this equation is usually written as:
$$y = mx + c \qquad (4)$$

16
Regression Analysis
Calculating overall error
Sum of Squared Errors (SSE)
➊ The regression function is applied to the data set to estimate the target
attribute using the input attribute(s).
➋ The error of the function is calculated per instance by subtracting the
estimated value of the target attribute from the actual target value.
➌ The error for each instance is squared to eliminate
negative values, and the squared values are summed.
➥ Equation 5 below shows the formula for the Sum of Squared
Errors (SSE) for a data set with n instances; target_i is the value of the target
attribute for instance i and prediction_i is the value predicted by the
function for the same instance.
$$\mathrm{SSE} = \sum_{i=1}^{n} (\mathrm{target}_i - \mathrm{prediction}_i)^2 \qquad (5)$$
➥ The strategy of fitting a linear function by minimizing the SSE is known
as least squares.
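The three steps above translate almost directly into code. The sketch below is illustrative only; the function name and the sample numbers are assumptions, not values from the chapter.

```python
def sum_of_squared_errors(targets, predictions):
    """Sum over all instances of (target - prediction) squared."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return sum((t - p) ** 2 for t, p in zip(targets, predictions))

# Hypothetical target values and model predictions for three instances.
targets = [3.0, 5.0, 7.0]
predictions = [2.5, 5.0, 8.0]

# Per-instance errors are 0.5, 0.0 and -1.0; squaring removes the sign.
print(sum_of_squared_errors(targets, predictions))  # 1.25
```

Because each error is squared, an instance that is twice as far from the prediction contributes four times as much to the total.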

17
The Regression Function
An Implementation Example
✍ Replacing X with the BMI attribute and Y with the diabetes attribute from
Table 5 in equation 3, and finding the best-fit line using the least-squares
approach, produces equation 6:
$$\mathrm{Diabetes} = -7.38431 + 0.55593 \times \mathrm{BMI} \qquad (6)$$
✍ Here −7.38431 is the Y intercept w0 and 0.55593 is the gradient w1.
✍ If BMI is 24, the model (Diabetes = −7.38431 + 0.55593 × 24) produces
a prediction of 5.96%.
✍ The least-squares method calculates a weighted average over the
instances based on their distance from the best-fit line.
✍ The farther an instance is from the line, the larger its squared residual
and the more weight is given to the instance. This skews the
algorithm towards outliers if they were not removed during data
preparation.
✍ Multiple linear regression functions extend the linear regression model
by taking more input attributes, each with its own weight.
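A hedged sketch of the least-squares idea for a single input attribute: `fit_line` uses the standard closed-form solution for simple linear regression (the slides do not show this formula), and `predict` applies Y = w0 + w1·X. The toy points are hypothetical and are not the data behind equation 6; only the two coefficients in the last call are taken from that equation.

```python
def fit_line(xs, ys):
    """Closed-form least-squares estimates (w0, w1) for y = w0 + w1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    w0 = mean_y - w1 * mean_x
    return w0, w1

def predict(w0, w1, x):
    """Apply the regression function to one input value."""
    return w0 + w1 * x

# Hypothetical points that lie exactly on y = 1 + 2x.
w0, w1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(w0, w1)  # 1.0 2.0

# Using the coefficients reported in equation 6 for a BMI of 24.
print(round(predict(-7.38431, 0.55593, 24), 2))  # 5.96
```

With noisy data the fitted line no longer passes through every point; it is the line with the smallest sum of squared errors.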

18
Neural Networks
✍ A neural network is a set of interconnected neurons, each of which takes
numeric values as input and maps them to a single output.
✍ Unlike a multiple linear regression model, a neuron passes its output
through an activation function.
✍ Figure 4 below shows a neural network with a single hidden layer.
✍ Table 6 below lists some of the most common non-linear activation
functions.
✍ The activation function takes the single-value output of the multi-input
linear regression function and maps it to a non-linear output.
✍ Each neuron in a neural network:
➊ Multiplies each input by a weight
➋ Adds together the results of the multiplications
➌ Passes the result to the activation function
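The three steps a neuron performs can be sketched as a single function. This is a minimal illustration: the function names, inputs and weights are hypothetical, and no bias term is included because the slide does not mention one.

```python
import math

def logistic(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, activation=math.tanh):
    """Multiply each input by its weight, add the results, apply the activation."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum)

# Hypothetical inputs and weights; the weighted sum is approximately 0.1.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, 0.1]
print(neuron(inputs, weights))                        # tanh of the weighted sum
print(neuron(inputs, weights, activation=logistic))   # same sum, logistic output
```

Swapping the activation changes the range of the output (tanh gives values between −1 and 1, logistic between 0 and 1) without changing the weighted sum.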

19
A Simple Neural Network
Figure 4 : Neural Network (inputs x1 to x4 in the input layer, one hidden layer, and a single output in the output layer)

20
Common Activation Functions
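Name | Function | Derivative
Logistic | $f(x) = \frac{1}{1 + e^{-x}}$ | $f'(x) = f(x)(1 - f(x))$
Tanh | $f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ | $f'(x) = 1 - f(x)^2$
ReLU | $f(x) = 0$ if $x < 0$; $f(x) = x$ if $x \ge 0$ | $f'(x) = 0$ if $x < 0$; $f'(x) = 1$ if $x > 0$
Softmax | $f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$ | $\frac{\partial f(x_i)}{\partial x_i} = \frac{e^{x_i}}{\sum_j e^{x_j}} - \frac{(e^{x_i})^2}{(\sum_j e^{x_j})^2}$
Table 6 : Non-linear Activation Functions.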
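The functions in Table 6 are short enough to write out directly. The sketch below is illustrative: the names are hypothetical, the value of the ReLU derivative at exactly 0 is a convention chosen here, and `numerical_derivative` is a helper added only to sanity-check the closed-form derivatives.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_prime(x):
    f = logistic(x)
    return f * (1.0 - f)

def tanh_prime(x):
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return x if x >= 0 else 0.0

def relu_prime(x):
    # The derivative is undefined at 0; returning 0.0 there is a convention.
    return 1.0 if x > 0 else 0.0

def softmax(xs):
    """Turn a list of scores into positive values that sum to 1."""
    shift = max(xs)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def numerical_derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x), used only as a check."""
    return (f(x + h) - f(x - h)) / (2 * h)

print(abs(logistic_prime(0.3) - numerical_derivative(logistic, 0.3)) < 1e-6)   # True
print(abs(tanh_prime(0.3) - numerical_derivative(math.tanh, 0.3)) < 1e-6)      # True
print(softmax([1.0, 2.0, 3.0]))
```

Subtracting the largest score inside `softmax` does not change the result, because the same factor cancels in the numerator and the denominator.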

21
Understanding a Neural Network
✍ The neural network in figure 5 is organised into 3 layers: the input layer,
the hidden layer, and the output layer.
✍ Nodes h1 to hn are the neurons that make up the hidden layer, a layer which
is neither the input nor the output.
✍ The arrows represent the flow of information. Feed-forward neural
networks have no loops; all the connections point forward.
✍ A fully-connected network is one where every neuron is connected to all
the neurons in the next layer.
✍ Most of the work in developing neural networks involves finding the best
network layout: the number of hidden layers, the types of activation functions
used, and the direction of the connections.
✍ The labels on the arrows (ω1 to ωn) represent the weights and the f node
represents the activation function.
✍ The output of figure 5 with a tanh activation function would be
$$\mathrm{output} = \tanh(\omega_1 h_1 + \omega_2 h_2 + \omega_3 h_3 + \dots + \omega_n h_n)$$

22
Weights in a Neural Network
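Figure 5 : Neural Network with Weights (inputs x1 to xn feed the hidden neurons h1 to hn; the hidden outputs are weighted by ω1 to ωn, summed (Σ), and passed through the activation function f to give the output)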

23
Training Neural Networks
The Weight-update Rule
✍ Training a neural network involves finding the correct weights
(ω1, ω2, ω3, ..., ωn) using the weight-update rule.
✍ At a high level, the weight-update rule works like this:
❶ If the error is 0, then don’t change the weights.
❷ If the error is positive, increase the weights for all connections
where the input is positive and reduce the weights for connections
where the input is negative.
❸ If the error is negative, increase the weights for all connections
where the input is negative and reduce the weights for connections
where the input is positive.
✍ The major challenge with the weight-update rule is that it is difficult to
calculate the error for neurons in the earlier layers of deep neural networks.
✍ The standard way to train a neural network is to use the
backpropagation algorithm, a supervised learning algorithm illustrated in
figure 6.
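One common way to write the weight-update rule described above is to add error × input, scaled by a small learning rate, to each weight. The learning rate and the definition error = target − output are assumptions of this sketch (the slide only describes the direction of each change); the names and numbers are hypothetical.

```python
def update_weights(weights, inputs, error, learning_rate=0.1):
    """Return new weights after one update.

    error is assumed to be (target - output). The product error * input is
    positive when both have the same sign, so the weight is increased, and
    negative when the signs differ, so the weight is reduced.
    """
    if len(weights) != len(inputs):
        raise ValueError("each input needs exactly one weight")
    return [w + learning_rate * error * x for w, x in zip(weights, inputs)]

weights = [0.5, -0.2]
inputs = [1.0, -2.0]

print(update_weights(weights, inputs, error=0.0))   # unchanged
print(update_weights(weights, inputs, error=0.1))   # first weight up, second down
print(update_weights(weights, inputs, error=-0.1))  # first weight down, second up
```

The learning rate keeps each adjustment small, so that a single instance with a large error does not overwrite what earlier instances have taught the network.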

24
The Backpropagation Algorithm
Figure 6 : Backpropagation Simplified (inputs x1 to x4 in the input layer, a hidden layer of five neurons h, and a summed (Σ) output in the output layer; the error is propagated backwards from the output layer)

25
Backpropagation in Action
✍ The algorithm propagates the error measured for each training
instance back through the network, starting from the output layer, as shown in
figure 6.
✍ The main steps of the algorithm are as follows:
❶ Calculate the error for the neurons in the output layer and use the
weight-update rule to adjust the weights coming into those neurons
❷ Share the error calculated in a neuron with the neurons in the
preceding layer
❸ Work back through the layers repeating steps 1 and 2 above
✍ The idea is to reduce, not eliminate, the error; this avoids overfitting and allows
the network to generalize to new instances that are not in the data set.
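The steps above can be sketched for a very small network. This is a simplified illustration under several assumptions that the slides do not specify: one hidden layer of tanh neurons, a single linear output neuron, a bias weight on every neuron, a fixed learning rate, and the usual gradient-descent form of error sharing (output error × connecting weight × slope of tanh). The data set is hypothetical.

```python
import math
import random

def init_network(n_inputs, n_hidden, seed=0):
    """Random starting weights; the last weight of each neuron is a bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return hidden, output

def forward(network, x):
    """Return the hidden activations and the network output for input x."""
    hidden, output = network
    xb = list(x) + [1.0]  # inputs plus a constant bias input
    h = [math.tanh(sum(w * v for w, v in zip(ws, xb))) for ws in hidden]
    out = sum(w * v for w, v in zip(output, h + [1.0]))
    return h, out

def train_step(network, x, target, lr):
    """One backpropagation update for a single training instance."""
    hidden, output = network
    h, out = forward(network, x)
    xb, hb = list(x) + [1.0], h + [1.0]
    err_out = target - out  # step 1: error at the output neuron
    # Step 2: share the error with each hidden neuron in proportion to the
    # connecting weight and the slope of tanh at that neuron.
    err_hidden = [err_out * output[j] * (1 - h[j] ** 2) for j in range(len(h))]
    # Weight-update rule applied layer by layer: weight += lr * error * input.
    new_output = [w + lr * err_out * v for w, v in zip(output, hb)]
    new_hidden = [[w + lr * err_hidden[j] * v for w, v in zip(hidden[j], xb)]
                  for j in range(len(hidden))]
    return new_hidden, new_output

def sse(network, data):
    """Sum of squared errors of the network over a data set."""
    return sum((t - forward(network, x)[1]) ** 2 for x, t in data)

def train(data, n_hidden=3, lr=0.1, epochs=2000, seed=0):
    network = init_network(len(data[0][0]), n_hidden, seed)
    for _ in range(epochs):
        for x, target in data:
            network = train_step(network, x, target, lr)
    return network

# Hypothetical data set: the target is the sum of the two inputs.
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 2.0)]
before = sse(init_network(2, 3, seed=0), DATA)
after = sse(train(DATA), DATA)
print(f"SSE before training: {before:.4f}, after training: {after:.4f}")
```

In practice training is usually stopped based on the error on held-out data rather than after a fixed number of passes, which is one way of reducing the error without overfitting.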

26
Deep Learning
✍ Deep learning networks are neural networks with more than one
hidden layer.
✍ Figure 7 below has 3 hidden layers of 5 neurons each, and 5 layers
overall.
✍ The layers do not need to have the same number of neurons: in
figure 7, the input layer has 3 neurons, each of the 3 hidden layers has 5
neurons, and the output layer has 3 neurons.
✍ Figure 7 is also a feed-forward network since it has no loops.
✍ Visit Kaggle³ for a comprehensive deep learning cheat sheet.
✍ Table 7 gives a summary of some of the common deep learning
networks and their applications.
³ https://www.kaggle.com/getting-started/151100

27
Deep Neural Networks
Figure 7 : Deep Neural Network

28
Common Deep Learning Networks
Network | Key Features | Applications
Convolutional Neural Networks (CNN) | one or more convolutional layers; one or more fully connected layers | image recognition; classification
Recurrent Neural Network (RNN) | connections between units form a directed cycle | time series prediction; text generation
Long Short-Term Memory (LSTM) | type of RNN; remembers information for longer than a standard RNN | time series prediction; text generation
Deep Belief Network (DBN) | has connections between layers but not within a layer | unsupervised learning tasks to reduce the dimensionality of features
Self-Organising Maps (SOM) | convert input data to a low-dimensional space | visualization
Table 7 : Deep Learning Networks

29
Decision Trees
✍ Linear regression and neural networks work best with numeric inputs,
not with nominal or ordinal data.
✍ Decision trees work well with nominal and ordinal data types.
✍ Figure 8 shows a decision tree for deciding whether an email is spam
or not. The rounded rectangles are attributes.
✍ A decision tree encodes a set of if-then-else rules in a tree structure.
✍ Each path in a decision tree, from root to leaf, defines a classification
rule.
✍ The Iterative Dichotomiser 3 (ID3)⁴ algorithm is considered the father of
all decision tree algorithms.
✍ Decision trees are very sensitive to noise in the data set, so it is
recommended to keep them shallow.
✍ A random forest model is made up of a set of decision trees.
⁴ https://en.wikipedia.org/wiki/ID3_algorithm

30
Decision Tree Example
Figure 8 : A sample decision tree

31
The ID3 Algorithm
✍ The ID3 algorithm recursively builds the decision tree in a depth-first
manner, adding one node at a time, starting with the root node.
✍ ID3 chooses the attribute to test at each node so as to
minimize the number of tests needed to create pure sets, that is, sets in which
all instances have the same value for the target attribute.
✍ The entropy metric can be used to measure the purity of a set.
✍ At each node, ID3 selects the attribute that results in the lowest weighted
entropy after the data set at the node is split using that attribute.
✍ To calculate the weighted entropy of an attribute at a node:
❶ split the data set using the attribute;
❷ calculate the entropy of each of the resulting sets;
❸ weight each entropy by the fraction of the data in the set;
❹ sum up the results.
✍ Decision trees are easy to understand.
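The weighted-entropy calculation can be sketched as follows. This illustrates only the attribute-selection step, not a full ID3 implementation; the function names and the tiny spam data set are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of target values; 0 for a pure set."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def weighted_entropy(rows, attribute, target):
    """Split rows on attribute, then sum each subset's entropy weighted by its size."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    total = len(rows)
    return sum(len(part) / total * entropy(part) for part in partitions.values())

def best_attribute(rows, attributes, target):
    """Pick the attribute whose split gives the lowest weighted entropy."""
    return min(attributes, key=lambda a: weighted_entropy(rows, a, target))

# Hypothetical emails described by two nominal attributes.
emails = [
    {"suspicious_words": "yes", "unknown_sender": "yes", "spam": "yes"},
    {"suspicious_words": "yes", "unknown_sender": "no", "spam": "yes"},
    {"suspicious_words": "no", "unknown_sender": "yes", "spam": "no"},
    {"suspicious_words": "no", "unknown_sender": "no", "spam": "no"},
]

print(weighted_entropy(emails, "suspicious_words", "spam"))  # 0.0 (pure subsets)
print(weighted_entropy(emails, "unknown_sender", "spam"))    # 1.0 (no information)
print(best_attribute(emails, ["suspicious_words", "unknown_sender"], "spam"))
```

ID3 would place the selected attribute at the current node and then repeat the same calculation on each of the resulting subsets.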

32
Bias in Data Science
✍ The major objective of ML is to create models that encode appropriate
generalizations from the data;
✍ Two key factors determine the quality of an ML model:
❶ The data set the algorithm is run on. If the data set is not a true
reflection of real-life events, the model will not be accurate. This is
referred to as sampling bias.
❷ The choice of ML algorithm. ML algorithms use a learning bias (also called
modelling or selection bias) to generalize from a data set. A wrong
choice of algorithm will result in an inappropriate learning bias.
✍ While sampling bias is bad, without learning bias there can be no
learning and the algorithm will only memorize the data.
✍ There is no single best ML algorithm, so the Modelling phase of the CRISP-DM
process involves building multiple models using different algorithms and
choosing the best in terms of accuracy, generalization and other
performance metrics.

33
“The golden rule for evaluating
models is that models should
never be tested on the same data
they were trained on”
Kelleher & Tierney, (2018)

34
Evaluating Models
✍ After generating models, the next step is to create a test plan.
✍ A model that simply memorizes the training data will not perform well on test
data or any other previously unseen data.
✍ The normal practice is to split the data into 3 sets: training data, validation
data, and test data.
✍ The other important aspect of a test plan is choosing the appropriate
evaluation metrics to use during testing.
✍ Table 8 below shows some of the most commonly used metrics and
their applications.
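A three-way split can be sketched with the standard library alone. The 60/20/20 proportions, the fixed seed and the function name are assumptions made for this example; the slides do not prescribe particular fractions.

```python
import random

def train_validation_test_split(rows, train_fraction=0.6,
                                validation_fraction=0.2, seed=42):
    """Shuffle the rows and cut them into three non-overlapping sets."""
    if not 0 < train_fraction + validation_fraction < 1:
        raise ValueError("fractions must leave some data for the test set")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test

# Hypothetical data set of 100 instance IDs.
train, validation, test = train_validation_test_split(range(100))
print(len(train), len(validation), len(test))  # 60 20 20
```

Because the three sets never overlap, the test set contains only instances that the model did not see during training.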

35
Common Evaluation Metrics
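Metric | Applications | Equation
Mean Absolute Error (MAE) | Regression | $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert \hat{y}_i - y_i \rvert$
Root Mean Squared Error (RMSE) | Regression | $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$
Recall/AUC | Classification | $\mathrm{Recall} = \frac{TP}{TP + FN}$
Precision | Classification | $\mathrm{Precision} = \frac{TP}{TP + FP}$
Accuracy | Classification | $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Table 8 : Common Evaluation Metrics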
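The metrics in Table 8 can be computed by hand as a check on any library output. The sketch below is illustrative; the function names and sample values are hypothetical, and the classification helper assumes binary labels where 1 is the positive class.

```python
import math

def mean_absolute_error(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def classification_metrics(y_true, y_pred):
    """Precision, recall and accuracy for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(y_true),
    }

# Hypothetical regression targets and predictions.
print(mean_absolute_error([2.0, 4.0, 6.0], [1.0, 4.0, 8.0]))      # 1.0
print(root_mean_squared_error([2.0, 4.0, 6.0], [1.0, 4.0, 8.0]))  # sqrt(5/3)

# Hypothetical spam labels: 3 true positives, 3 true negatives, 1 FP, 1 FN.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
print(classification_metrics(actual, predicted))
```

RMSE penalizes large errors more heavily than MAE, which is why it is larger than MAE in the regression example.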

36
Conclusion
✓ Chapter 4 started by asserting the partnership between a data scientist
and a computer;
✓ The computer generates a model from a data set prepared by the data
scientist.
✓ The data scientist interprets and evaluates the model;
✓ Model evaluation follows the golden rule: never test a model on the data
it was trained on;
✓ The best model is chosen based on its accuracy, but in the future
data-usage and privacy considerations may also affect model selection;
✓ Chapter 5 discusses converting a business problem to a data science
problem and Chapter 6 will discuss the impact of Privacy Laws on data
science.

37
References and Bibliography
Kelleher, J.D. & Tierney, B. (2018). Data Science.
MIT Press, pp. 101–150.
