4-Introduction to Machine Learning Lecture # 4.pdf

Dr. Ashfaq Ahmad Associate Professor
Department of Electronics & Electrical System
The University of Lahore
Introduction to Machine Learning
Lecture # 4

Lecture
Objectives
Data With Descriptive Statistics
Peek at Your Data
Dimensions of Your Data
Data Type For Each Attribute
Descriptive Statistics
Class Distribution (Classification Only)
Correlations Between Attributes
Dr. Ashfaq Ahmad, Associate Professor, Department of Electronics & Electrical Systems
2
Skew of Univariate Distributions

Overfitting and Underfitting
Generalization in Machine Learning
Statistical Fit
Overfitting in Machine Learning
Underfitting in Machine Learning
A Good Fit in Machine Learning
How To Limit Overfitting
Dr. Ashfaq Ahmad, Associate Professor, Department of Electronics & Electrical Systems
3
Lecture
Objectives

Data
Understanding
With Descriptive
Statistics
➢ In order to get the best results, you must
understand your data.
➢ In this lecture you will learn about 7 ways that can
use in Python to better understand your machine
learning data.
➢ All 7 ways is demonstrated by loading the
Diabetes classification dataset.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 4

Peek At Your Data
✓ There is no substitute for looking at the raw data.
✓ Analyzing the raw data can provide information
that cannot be learned in any other way.
✓ Additionally, it can sow the seeds that may later
grow into ideas on how to better prepare and
manage data for machine learning tasks.
✓ You can review the first 20 rows of your data using
the head() function on the Pandas DataFrame.

Example of Reviewing The First Few Rows of Data
# View first 20 rows
from pandas import read_csv
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
peek = data.head(20)
print(peek)

Example of Reviewing The First Few Rows of Data
You can see that the first column list shows the row number, This is useful for
referring to a particular observation.
preg plas pres skin test mass pedi age class
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
5 5 116 74 0 0 25.6 0.201 30 0
6 3 78 50 32 88 31.0 0.248 26 1
7 10 115 0 0 0 35.3 0.134 29 0
8 2 197 70 45 543 30.5 0.158 53 1
9 8 125 96 0 0 0.0 0.232 54 1
Output of
reviewing
the first
few rows
of data.

Dimensions of Data
❖ The quantity of your data, both in terms of rows and
columns, must be very well understood.
✓ Too many rows and algorithms may take too long to
train.
✓ Too few and perhaps you do not have enough data to
train the algorithms.
✓ The curse of dimensionality can cause some algorithms
to be distracted or to perform poorly when there are too
many features.

An Illustration of
Examining The Shape
of The Data
❑ You can review the shape and size of your dataset by
printing the shape property on the Pandas DataFrame.
❑ # Dimensions of your data
❑ from pandas import read_csv
❑ filename = "diabetes.data.csv"
❑ names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi',
'age', 'class']
❑ data = read_csv(filename, names=names)
❑ shape = data.shape
❑ print(shape)
The results are listed in rows then columns. You can see that the dataset
has 768 rows and 9 columns.
The Results of Examining The Data’s Shape
(768, 9)

Type of Data For Each Attribute
❑ The type of each attribute is important.
❑ Strings may need to be converted to floating point
values or integers to represent categorical or ordinal
values.
❑ You can get an idea of the types of attributes by
peeking at the raw data, as above.
❑ You can also list the data types used by the
DataFrame to characterize each attribute using the
dtypes property.

Example of Reviewing The
Data Types of The Data
# Data Types for Each Attribute
names = ['preg', 'plas', 'pres', 'skin', 'test',
'mass', 'pedi', 'age', 'class']
types = data.dtypes
print(types)
Output of Reviewing The
Data Types of The Data
➢ You can see that most of the attributes are integers
the only mass and pedi are floating point types
preg int64
plas int64
pres int64
skin int64
test int64
mass float64
pedi float64
age int64
class int64
dtype: object

Descriptive
Statistics
❑ Descriptive statistics can give you great perception
into the shape of each attribute.
❑ Often you can create more summaries at the time
of review.
❑ The describe() function on the Pandas DataFrame
lists 8 statistical properties of each attribute. They
are:
➢ Count.
➢ Mean.
➢ Standard Deviation.
➢ Minimum Value.
➢ 25th Percentile.
➢ 50th Percentile (Median).
➢ 75th Percentile.
➢ Maximum Value.

Example of Reviewing A
Statistical Summary of The Data
# Statistical Summary
from pandas import set_option
names = ['preg’, 'plas’, 'pres’, 'skin’, 'test’, 'mass’, 'pedi’, 'age’, 'class']
set_option('display.width', 100)
set_option('precision’, 3)
description = data.describe()
print(description)

Output of
Reviewing A
Statistical
Summary of
The Data
✓ You can see that you do get a lot of data.
✓ It should be noted that the precision of the numbers and
the preferred width of the output can be change by using
the pandas.set option() function.
✓ This is to make it more readable for this example.
✓ When describing your data this way, it is worth taking
some time and reviewing observations from the results.
✓ This might include the presence of NA values for missing
data or surprising distributions for attributes.

Class Distribution (Classification Only)
❑ On classification problems you need to know how balanced the class values are.
❑ Highly imbalanced problems (a lot more observations for one class than another)
are common and may need special handling in the data preparation stage of your
project.
❑ You can quickly get an idea of the distribution of the class attribute in Pandas.

Example of Reviewing A
Class Breakdown of The Data
# Class Distribution
names = ['preg', 'plas', 'pres', 'skin', 'test',
'mass', 'pedi', 'age', 'class']
class_counts = data.groupby('class').size()
print(class_counts)
Output of Reviewing A
Class Breakdown of the Data
❑ You can see that there are nearly
double the number of observations
with class 0 (no onset of diabetes) than
there are with class 1 (onset of
diabetes).
class
0 500
1 268

Correlations Between Attributes
➢ Correlation refers to the relationship between two variables and how they may or may not
change together.
➢ The most common method for calculating correlation is Pearson's Correlation Coefficient,
that assumes a normal distribution of the attributes involved.
➢ A correlation of -1 or 1 shows a full negative or positive correlation, respectively.
➢ Whereas a value of 0 shows no correlation at all. Some machine learning algorithms like
linear and logistic regression can suffer poor performance if there are highly correlated
attributes in your dataset.
➢ As such, it is a good idea to review all the pairwise correlations of the attributes in your
dataset.
➢ You can use the corr () function on the Pandas DataFrame to calculate a correlation
matrix.

Example of Reviewing Correlations of Attributes In The Data
# Pairwise Pearson correlations
from pandas import set_option
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
set_option('display.width', 100)
set_option('precision', 3)
correlations = data.corr(method='pearson')
print(correlations)

Output of Reviewing
Correlations of
Attributes In The
Data
❑ The matrix lists all attributes across the top and down the
side, to give correlation between all pairs of attributes
(twice, because the matrix is symmetrical).
❑ You can see the diagonal line through the matrix from the
top left to bottom right corners of the matrix shows perfect
correlation of each attribute with itself.

Skew of Univariate Distributions
➢ The skewness in statistics is a measure of asymmetry or the deviation of a given random
variable's distribution from a symmetric distribution (like normal Distribution). In Normal
Distribution, we know that: Median = Mode = Mean.
➢ The formula given in most textbooks is Skew = 3 * (Mean – Median) / Standard Deviation
➢ Skew refers to a distribution that is assumed Gaussian (normal or bell curve) that is shifted or
squashed in one direction or another.
➢ Many machine learning algorithms assume a Gaussian distribution.
➢ Knowing that an attribute has a skew may allow you to perform data preparation to correct the
skew and later improve the accuracy of your models.
➢ You can calculate the skew of each attribute using the skew() function on the Pandas
DataFrame.

Example of Reviewing Skew of
Attribute Distributions In The Data
# Skew for each attribute
names = ['preg', 'plas', 'pres', 'skin',
'test', 'mass', 'pedi', 'age', 'class']
skew = data.skew()
print(skew)
Output of Reviewing Skew of
Attribute Distributions In The Data
The skew result show a positive (right) or
negative (left) skew. Values closer to zero
show less skew.

Overfitting
And
Underfitting
The cause of poor
performance in machine
learning is either overfitting
or underfitting the data.
In this lecture we will
discover the concept of
generalization in machine
learning and the problems of
overfitting and underfitting
that go along with it.

➢ In machine learning we describe the learning of the target function from training
data as inductive learning.
➢ Induction refers to learning general concepts from specific examples which is
exactly the problem that supervised machine learning problems aim to solve.
➢ This is different from deduction that is the other way around and seeks to learn
specific concepts from general rules.

➢ Generalization refers to how well the concepts learned by a machine learning
model apply to specific examples not seen by the model when it was learning.
➢ The goal of a good machine learning model is to generalize well from the training
data to any data from the problem domain.
➢ This allows us to make predictions in the future on data the model has never seen.
➢ There is a terminology used in machine learning when we talk about how well a
machine learning model learns and generalizes to new data, namely overfitting
and underfitting.
➢ Overfitting and underfitting are the two biggest causes for poor performance of
machine learning algorithms.

Statistical Fit
❑ In statistics a fit refers to how well you approximate a target function.
❑ This is good terminology to use in machine learning, because supervised machine
learning algorithms seek to approximate the unknown underlying mapping function for
the output variables given the input variables.
❑ Statistics often describe the goodness of fit which refers to measures used to estimate how
well the approximation of the function matches the target function.
❑ Some of these methods are useful in machine learning (e.g. calculating the residual
errors), but some of these techniques assume we know the form of the target function we
are approximating, which is not the case in machine learning.
❑ If we knew the form of the target function, we would use it directly to make predictions,
rather than trying to learn an approximation from samples of noisy training data.

❑ Overfitting refers to a model that models the training data too well.
❑ Overfitting happens when a model learns the detail and noise in the training data
to the extent that it negatively impacts the performance on the model on new
data.
❑ This means that the noise or random fluctuations in the training data is picked up
and learned as concepts by the model.
❑ The problem is that these concepts do not apply to new data and negatively
impact the model's ability to generalize.

❑ Overfitting is more likely with nonparametric and nonlinear models that have
more flexibility when learning a target function.
❑ As such, many nonparametric machine learning algorithms also include
parameters or techniques to limit and constrain how much detail the model learns.
❑ For example, decision trees are a nonparametric machine learning algorithm that
is very flexible and is subject to overfitting training data.
❑ This problem can be addressed by pruning a tree after it has learned in order to
remove some of the detail it has picked up.

Underfitting in Machine Learning
❖ Underfitting refers to a model that can neither model the training data not
generalize to new data.
❖ An underfit machine learning model is not a suitable model and will be obvious
as it will have poor performance on the training data.
❖ Underfitting is often not discussed as it is easy to detect given a good
performance metric.
❖ The remedy is to move on and try alternate machine learning algorithms.
Nevertheless, it does provide good contrast to the problem of concept of
overfitting.

❑ Ideally, you want to select a model at the sweet spot between underfitting and
overfitting. This is the goal but is very difficult to do in practice.
❑ To understand this goal, we can look at the performance of a machine learning
algorithm over time as it is learning a training data.
❑ We can plot the skill on the training data the skill on a test dataset we have held
back from the training process.
❑ Over time, as the algorithm learns, the error for the model on the training data
goes down and so does the error on the test dataset.

❑ If we train for too long, the performance on the training dataset may continue to
decrease because the model is overfitting and learning the irrelevant detail and
noise in the training dataset.
❑ At the same time the error for the test set starts to rise again as the model's ability
to generalize decreases.
❑ The sweet spot is the point just before the error on the test dataset starts to
increase where the model has good skill on both the training dataset and the
unseen test dataset.

❑ You can perform this experiment with your favorite machine learning algorithms.
❑ This is often not useful technique in practice, because by choosing the stopping
point for training using the skill on the test dataset it means that the test set is no
longer unseen or a standalone objective measure.
❑ Some knowledge (a lot of useful knowledge) about that data has leaked into the
training procedure.
❑ There are two additional techniques you can use to help find the sweet spot in
practice: resampling methods and a validation dataset.

❑ Both overfitting and underfitting can lead to poor model performance.
❑ But by far the most common problem in applied machine learning is overfitting.
❑ Overfitting is such a problem because the evaluation of machine learning algorithms on
training data is different from the evaluation, we care the most about, namely how well
the algorithm performs on unseen data.
❑ There are two important techniques that you can use when evaluating machine learning
algorithms to limit overfitting:
✓ Use a resampling technique to estimate model accuracy
✓ Hold back a validation dataset.

❑ The most popular resampling technique is k-fold cross validation.
❑ It allows you to train and test your model k-times on different subsets of training
data and build up an estimate of the performance of a machine learning model on
unseen data.
❑ A validation dataset is simply a subset of your training data that you hold back
from your machine learning algorithms until the very end of your project.
❑ After you have selected and tuned your machine learning algorithms on your
training dataset you can evaluate the learned models on the validation dataset to
get a final objective idea of how the models might perform on unseen data.

❑ Using cross validation is a gold standard in applied machine learning for
estimating model accuracy on unseen data.
❑ If you have the data, using a validation dataset is also an excellent practice.

4-Introduction to Machine Learning Lecture # 4.pdf

4-Introduction to Machine Learning Lecture # 4.pdf

Recommended

Recommended

More Related Content

Similar to 4-Introduction to Machine Learning Lecture # 4.pdf

Similar to 4-Introduction to Machine Learning Lecture # 4.pdf (20)

Recently uploaded

Recently uploaded (20)

4-Introduction to Machine Learning Lecture # 4.pdf