SlideShare a Scribd company logo
1 of 35
Download to read offline
Dr. Ashfaq Ahmad Associate Professor
Department of Electronics & Electrical System
The University of Lahore
Introduction to Machine Learning
Lecture # 4
Lecture
Objectives
Data With Descriptive Statistics
Peek at Your Data
Dimensions of Your Data
Data Type For Each Attribute
Descriptive Statistics
Class Distribution (Classification Only)
Correlations Between Attributes
Dr. Ashfaq Ahmad, Associate Professor, Department of Electronics & Electrical Systems
2
Skew of Univariate Distributions
Overfitting and Underfitting
Generalization in Machine Learning
Statistical Fit
Overfitting in Machine Learning
Underfitting in Machine Learning
A Good Fit in Machine Learning
How To Limit Overfitting
Dr. Ashfaq Ahmad, Associate Professor, Department of Electronics & Electrical Systems
3
Lecture
Objectives
Data
Understanding
With Descriptive
Statistics
➢ In order to get the best results, you must
understand your data.
➢ In this lecture you will learn about 7 ways that can
use in Python to better understand your machine
learning data.
➢ All 7 ways is demonstrated by loading the
Diabetes classification dataset.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 4
Peek At Your Data
✓ There is no substitute for looking at the raw data.
✓ Analyzing the raw data can provide information
that cannot be learned in any other way.
✓ Additionally, it can sow the seeds that may later
grow into ideas on how to better prepare and
manage data for machine learning tasks.
✓ You can review the first 20 rows of your data using
the head() function on the Pandas DataFrame.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 5
Example of Reviewing The First Few Rows of Data
# View first 20 rows
from pandas import read_csv
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
peek = data.head(20)
print(peek)
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 6
Example of Reviewing The First Few Rows of Data
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 7
You can see that the first column list shows the row number, This is useful for
referring to a particular observation.
preg plas pres skin test mass pedi age class
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
5 5 116 74 0 0 25.6 0.201 30 0
6 3 78 50 32 88 31.0 0.248 26 1
7 10 115 0 0 0 35.3 0.134 29 0
8 2 197 70 45 543 30.5 0.158 53 1
9 8 125 96 0 0 0.0 0.232 54 1
Output of
reviewing
the first
few rows
of data.
Dimensions of Data
❖ The quantity of your data, both in terms of rows and
columns, must be very well understood.
✓ Too many rows and algorithms may take too long to
train.
✓ Too few and perhaps you do not have enough data to
train the algorithms.
✓ The curse of dimensionality can cause some algorithms
to be distracted or to perform poorly when there are too
many features.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 8
An Illustration of
Examining The Shape
of The Data
❑ You can review the shape and size of your dataset by
printing the shape property on the Pandas DataFrame.
❑ # Dimensions of your data
❑ from pandas import read_csv
❑ filename = "diabetes.data.csv"
❑ names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi',
'age', 'class']
❑ data = read_csv(filename, names=names)
❑ shape = data.shape
❑ print(shape)
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 9
The results are listed in rows then columns. You can see that the dataset
has 768 rows and 9 columns.
The Results of Examining The Data’s Shape
(768, 9)
Type of Data For Each Attribute
❑ The type of each attribute is important.
❑ Strings may need to be converted to floating point
values or integers to represent categorical or ordinal
values.
❑ You can get an idea of the types of attributes by
peeking at the raw data, as above.
❑ You can also list the data types used by the
DataFrame to characterize each attribute using the
dtypes property.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 10
Example of Reviewing The
Data Types of The Data
# Data Types for Each Attribute
from pandas import read_csv
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test',
'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
types = data.dtypes
print(types)
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 11
Output of Reviewing The
Data Types of The Data
➢ You can see that most of the attributes are integers
the only mass and pedi are floating point types
preg int64
plas int64
pres int64
skin int64
test int64
mass float64
pedi float64
age int64
class int64
dtype: object
Descriptive
Statistics
❑ Descriptive statistics can give you great perception
into the shape of each attribute.
❑ Often you can create more summaries at the time
of review.
❑ The describe() function on the Pandas DataFrame
lists 8 statistical properties of each attribute. They
are:
➢ Count.
➢ Mean.
➢ Standard Deviation.
➢ Minimum Value.
➢ 25th Percentile.
➢ 50th Percentile (Median).
➢ 75th Percentile.
➢ Maximum Value.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 12
Example of Reviewing A
Statistical Summary of The Data
# Statistical Summary
from pandas import read_csv
from pandas import set_option
filename = "diabetes.data.csv"
names = ['preg’, 'plas’, 'pres’, 'skin’, 'test’, 'mass’, 'pedi’, 'age’, 'class']
data = read_csv(filename, names=names)
set_option('display.width', 100)
set_option('precision’, 3)
description = data.describe()
print(description)
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 13
Output of
Reviewing A
Statistical
Summary of
The Data
✓ You can see that you do get a lot of data.
✓ It should be noted that the precision of the numbers and
the preferred width of the output can be change by using
the pandas.set option() function.
✓ This is to make it more readable for this example.
✓ When describing your data this way, it is worth taking
some time and reviewing observations from the results.
✓ This might include the presence of NA values for missing
data or surprising distributions for attributes.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 14
Class Distribution (Classification Only)
❑ On classification problems you need to know how balanced the class values are.
❑ Highly imbalanced problems (a lot more observations for one class than another)
are common and may need special handling in the data preparation stage of your
project.
❑ You can quickly get an idea of the distribution of the class attribute in Pandas.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 15
Example of Reviewing A
Class Breakdown of The Data
# Class Distribution
from pandas import read_csv
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test',
'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
class_counts = data.groupby('class').size()
print(class_counts)
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 16
Output of Reviewing A
Class Breakdown of the Data
❑ You can see that there are nearly
double the number of observations
with class 0 (no onset of diabetes) than
there are with class 1 (onset of
diabetes).
class
0 500
1 268
Correlations Between Attributes
➢ Correlation refers to the relationship between two variables and how they may or may not
change together.
➢ The most common method for calculating correlation is Pearson's Correlation Coefficient,
that assumes a normal distribution of the attributes involved.
➢ A correlation of -1 or 1 shows a full negative or positive correlation, respectively.
➢ Whereas a value of 0 shows no correlation at all. Some machine learning algorithms like
linear and logistic regression can suffer poor performance if there are highly correlated
attributes in your dataset.
➢ As such, it is a good idea to review all the pairwise correlations of the attributes in your
dataset.
➢ You can use the corr () function on the Pandas DataFrame to calculate a correlation
matrix.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 17
Example of Reviewing Correlations of Attributes In The Data
# Pairwise Pearson correlations
from pandas import read_csv
from pandas import set_option
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
set_option('display.width', 100)
set_option('precision', 3)
correlations = data.corr(method='pearson')
print(correlations)
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 18
Output of Reviewing
Correlations of
Attributes In The
Data
❑ The matrix lists all attributes across the top and down the
side, to give correlation between all pairs of attributes
(twice, because the matrix is symmetrical).
❑ You can see the diagonal line through the matrix from the
top left to bottom right corners of the matrix shows perfect
correlation of each attribute with itself.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 19
Skew of Univariate Distributions
➢ The skewness in statistics is a measure of asymmetry or the deviation of a given random
variable's distribution from a symmetric distribution (like normal Distribution). In Normal
Distribution, we know that: Median = Mode = Mean.
➢ The formula given in most textbooks is Skew = 3 * (Mean – Median) / Standard Deviation
➢ Skew refers to a distribution that is assumed Gaussian (normal or bell curve) that is shifted or
squashed in one direction or another.
➢ Many machine learning algorithms assume a Gaussian distribution.
➢ Knowing that an attribute has a skew may allow you to perform data preparation to correct the
skew and later improve the accuracy of your models.
➢ You can calculate the skew of each attribute using the skew() function on the Pandas
DataFrame.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 20
Example of Reviewing Skew of
Attribute Distributions In The Data
# Skew for each attribute
from pandas import read_csv
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin',
'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
skew = data.skew()
print(skew)
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 21
Output of Reviewing Skew of
Attribute Distributions In The Data
The skew result show a positive (right) or
negative (left) skew. Values closer to zero
show less skew.
Overfitting
And
Underfitting
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 22
The cause of poor
performance in machine
learning is either overfitting
or underfitting the data.
In this lecture we will
discover the concept of
generalization in machine
learning and the problems of
overfitting and underfitting
that go along with it.
Generalization in Machine Learning
➢ In machine learning we describe the learning of the target function from training
data as inductive learning.
➢ Induction refers to learning general concepts from specific examples which is
exactly the problem that supervised machine learning problems aim to solve.
➢ This is different from deduction that is the other way around and seeks to learn
specific concepts from general rules.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 23
Generalization in Machine Learning
➢ Generalization refers to how well the concepts learned by a machine learning
model apply to specific examples not seen by the model when it was learning.
➢ The goal of a good machine learning model is to generalize well from the training
data to any data from the problem domain.
➢ This allows us to make predictions in the future on data the model has never seen.
➢ There is a terminology used in machine learning when we talk about how well a
machine learning model learns and generalizes to new data, namely overfitting
and underfitting.
➢ Overfitting and underfitting are the two biggest causes for poor performance of
machine learning algorithms.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 24
Statistical Fit
❑ In statistics a fit refers to how well you approximate a target function.
❑ This is good terminology to use in machine learning, because supervised machine
learning algorithms seek to approximate the unknown underlying mapping function for
the output variables given the input variables.
❑ Statistics often describe the goodness of fit which refers to measures used to estimate how
well the approximation of the function matches the target function.
❑ Some of these methods are useful in machine learning (e.g. calculating the residual
errors), but some of these techniques assume we know the form of the target function we
are approximating, which is not the case in machine learning.
❑ If we knew the form of the target function, we would use it directly to make predictions,
rather than trying to learn an approximation from samples of noisy training data.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 25
Overfitting in Machine Learning
❑ Overfitting refers to a model that models the training data too well.
❑ Overfitting happens when a model learns the detail and noise in the training data
to the extent that it negatively impacts the performance on the model on new
data.
❑ This means that the noise or random fluctuations in the training data is picked up
and learned as concepts by the model.
❑ The problem is that these concepts do not apply to new data and negatively
impact the model's ability to generalize.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 26
Overfitting in Machine Learning
❑ Overfitting is more likely with nonparametric and nonlinear models that have
more flexibility when learning a target function.
❑ As such, many nonparametric machine learning algorithms also include
parameters or techniques to limit and constrain how much detail the model learns.
❑ For example, decision trees are a nonparametric machine learning algorithm that
is very flexible and is subject to overfitting training data.
❑ This problem can be addressed by pruning a tree after it has learned in order to
remove some of the detail it has picked up.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 27
Underfitting in Machine Learning
❖ Underfitting refers to a model that can neither model the training data not
generalize to new data.
❖ An underfit machine learning model is not a suitable model and will be obvious
as it will have poor performance on the training data.
❖ Underfitting is often not discussed as it is easy to detect given a good
performance metric.
❖ The remedy is to move on and try alternate machine learning algorithms.
Nevertheless, it does provide good contrast to the problem of concept of
overfitting.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 28
A Good Fit in Machine Learning
❑ Ideally, you want to select a model at the sweet spot between underfitting and
overfitting. This is the goal but is very difficult to do in practice.
❑ To understand this goal, we can look at the performance of a machine learning
algorithm over time as it is learning a training data.
❑ We can plot the skill on the training data the skill on a test dataset we have held
back from the training process.
❑ Over time, as the algorithm learns, the error for the model on the training data
goes down and so does the error on the test dataset.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 29
A Good Fit in Machine Learning
❑ If we train for too long, the performance on the training dataset may continue to
decrease because the model is overfitting and learning the irrelevant detail and
noise in the training dataset.
❑ At the same time the error for the test set starts to rise again as the model's ability
to generalize decreases.
❑ The sweet spot is the point just before the error on the test dataset starts to
increase where the model has good skill on both the training dataset and the
unseen test dataset.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 30
A Good Fit in Machine Learning
❑ You can perform this experiment with your favorite machine learning algorithms.
❑ This is often not useful technique in practice, because by choosing the stopping
point for training using the skill on the test dataset it means that the test set is no
longer unseen or a standalone objective measure.
❑ Some knowledge (a lot of useful knowledge) about that data has leaked into the
training procedure.
❑ There are two additional techniques you can use to help find the sweet spot in
practice: resampling methods and a validation dataset.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 31
How To Limit Overfitting
❑ Both overfitting and underfitting can lead to poor model performance.
❑ But by far the most common problem in applied machine learning is overfitting.
❑ Overfitting is such a problem because the evaluation of machine learning algorithms on
training data is different from the evaluation, we care the most about, namely how well
the algorithm performs on unseen data.
❑ There are two important techniques that you can use when evaluating machine learning
algorithms to limit overfitting:
✓ Use a resampling technique to estimate model accuracy
✓ Hold back a validation dataset.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 32
How To Limit Overfitting
❑ The most popular resampling technique is k-fold cross validation.
❑ It allows you to train and test your model k-times on different subsets of training
data and build up an estimate of the performance of a machine learning model on
unseen data.
❑ A validation dataset is simply a subset of your training data that you hold back
from your machine learning algorithms until the very end of your project.
❑ After you have selected and tuned your machine learning algorithms on your
training dataset you can evaluate the learned models on the validation dataset to
get a final objective idea of how the models might perform on unseen data.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 33
How To Limit Overfitting
❑ Using cross validation is a gold standard in applied machine learning for
estimating model accuracy on unseen data.
❑ If you have the data, using a validation dataset is also an excellent practice.
DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 34
4-Introduction to Machine Learning Lecture # 4.pdf

More Related Content

Similar to 4-Introduction to Machine Learning Lecture # 4.pdf

Comparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxComparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxPremaGanesh1
 
Why ADaM for a statistician?
Why ADaM for a statistician?Why ADaM for a statistician?
Why ADaM for a statistician?Kevin Lee
 
A course work on R programming for basics to advance statistics and GIS.pdf
A course work on R programming for basics to advance statistics and GIS.pdfA course work on R programming for basics to advance statistics and GIS.pdf
A course work on R programming for basics to advance statistics and GIS.pdfSEEMAB AKHTAR
 
data analysis techniques and statistical softwares
data analysis techniques and statistical softwaresdata analysis techniques and statistical softwares
data analysis techniques and statistical softwaresDr.ammara khakwani
 
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...cscpconf
 
More Stored Procedures and MUMPS for DivConq
More Stored Procedures and  MUMPS for DivConqMore Stored Procedures and  MUMPS for DivConq
More Stored Procedures and MUMPS for DivConqeTimeline, LLC
 
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODELADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODELijcsit
 
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...Seval Çapraz
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Bringing OpenClinica Data into SAS
Bringing OpenClinica Data into SASBringing OpenClinica Data into SAS
Bringing OpenClinica Data into SASRick Watts
 
Machine Learning with WEKA
Machine Learning with WEKAMachine Learning with WEKA
Machine Learning with WEKAbutest
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Researchjim
 
Data Mining in Market Research
Data Mining in Market ResearchData Mining in Market Research
Data Mining in Market Researchbutest
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Researchkevinlan
 
Six sigma statistics
Six sigma statisticsSix sigma statistics
Six sigma statisticsShankaran Rd
 
Edu 8006 8 assignment edu8006
Edu 8006 8 assignment edu8006Edu 8006 8 assignment edu8006
Edu 8006 8 assignment edu8006arnitaetsitty
 
Eli plots visualizing innumerable number of correlations
Eli plots   visualizing innumerable number of correlationsEli plots   visualizing innumerable number of correlations
Eli plots visualizing innumerable number of correlationsLeonardo Auslender
 

Similar to 4-Introduction to Machine Learning Lecture # 4.pdf (20)

Comparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxComparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptx
 
Why ADaM for a statistician?
Why ADaM for a statistician?Why ADaM for a statistician?
Why ADaM for a statistician?
 
A course work on R programming for basics to advance statistics and GIS.pdf
A course work on R programming for basics to advance statistics and GIS.pdfA course work on R programming for basics to advance statistics and GIS.pdf
A course work on R programming for basics to advance statistics and GIS.pdf
 
data analysis techniques and statistical softwares
data analysis techniques and statistical softwaresdata analysis techniques and statistical softwares
data analysis techniques and statistical softwares
 
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
 
50120140504015
5012014050401550120140504015
50120140504015
 
More Stored Procedures and MUMPS for DivConq
More Stored Procedures and  MUMPS for DivConqMore Stored Procedures and  MUMPS for DivConq
More Stored Procedures and MUMPS for DivConq
 
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODELADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL
 
CSL0777-L07.pptx
CSL0777-L07.pptxCSL0777-L07.pptx
CSL0777-L07.pptx
 
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Bringing OpenClinica Data into SAS
Bringing OpenClinica Data into SASBringing OpenClinica Data into SAS
Bringing OpenClinica Data into SAS
 
Machine Learning with WEKA
Machine Learning with WEKAMachine Learning with WEKA
Machine Learning with WEKA
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 
Data Mining in Market Research
Data Mining in Market ResearchData Mining in Market Research
Data Mining in Market Research
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 
Six sigma statistics
Six sigma statisticsSix sigma statistics
Six sigma statistics
 
Hanaa phd presentation 14-4-2017
Hanaa phd  presentation  14-4-2017Hanaa phd  presentation  14-4-2017
Hanaa phd presentation 14-4-2017
 
Edu 8006 8 assignment edu8006
Edu 8006 8 assignment edu8006Edu 8006 8 assignment edu8006
Edu 8006 8 assignment edu8006
 
Eli plots visualizing innumerable number of correlations
Eli plots   visualizing innumerable number of correlationsEli plots   visualizing innumerable number of correlations
Eli plots visualizing innumerable number of correlations
 

Recently uploaded

Linux Systems Programming: Semaphores, Shared Memory, and Message Queues
Linux Systems Programming: Semaphores, Shared Memory, and Message QueuesLinux Systems Programming: Semaphores, Shared Memory, and Message Queues
Linux Systems Programming: Semaphores, Shared Memory, and Message QueuesRashidFaridChishti
 
Software Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfSoftware Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfssuser5c9d4b1
 
Geometric constructions Engineering Drawing.pdf
Geometric constructions Engineering Drawing.pdfGeometric constructions Engineering Drawing.pdf
Geometric constructions Engineering Drawing.pdfJNTUA
 
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfInstruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfEr.Sonali Nasikkar
 
21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docxrahulmanepalli02
 
Seizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networksSeizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networksIJECEIAES
 
The Entity-Relationship Model(ER Diagram).pptx
The Entity-Relationship Model(ER Diagram).pptxThe Entity-Relationship Model(ER Diagram).pptx
The Entity-Relationship Model(ER Diagram).pptxMANASINANDKISHORDEOR
 
Filters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsFilters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsMathias Magdowski
 
Worksharing and 3D Modeling with Revit.pptx
Worksharing and 3D Modeling with Revit.pptxWorksharing and 3D Modeling with Revit.pptx
Worksharing and 3D Modeling with Revit.pptxMustafa Ahmed
 
handbook on reinforce concrete and detailing
handbook on reinforce concrete and detailinghandbook on reinforce concrete and detailing
handbook on reinforce concrete and detailingAshishSingh1301
 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdfAlexander Litvinenko
 
Research Methodolgy & Intellectual Property Rights Series 2
Research Methodolgy & Intellectual Property Rights Series 2Research Methodolgy & Intellectual Property Rights Series 2
Research Methodolgy & Intellectual Property Rights Series 2T.D. Shashikala
 
CLOUD COMPUTING SERVICES - Cloud Reference Modal
CLOUD COMPUTING SERVICES - Cloud Reference ModalCLOUD COMPUTING SERVICES - Cloud Reference Modal
CLOUD COMPUTING SERVICES - Cloud Reference ModalSwarnaSLcse
 
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...drjose256
 
Lab Manual Arduino UNO Microcontrollar.docx
Lab Manual Arduino UNO Microcontrollar.docxLab Manual Arduino UNO Microcontrollar.docx
Lab Manual Arduino UNO Microcontrollar.docxRashidFaridChishti
 
Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1T.D. Shashikala
 
UNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxUNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxkalpana413121
 
21scheme vtu syllabus of visveraya technological university
21scheme vtu syllabus of visveraya technological university21scheme vtu syllabus of visveraya technological university
21scheme vtu syllabus of visveraya technological universityMohd Saifudeen
 
Online crime reporting system project.pdf
Online crime reporting system project.pdfOnline crime reporting system project.pdf
Online crime reporting system project.pdfKamal Acharya
 
What is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsWhat is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsVIEW
 

Recently uploaded (20)

Linux Systems Programming: Semaphores, Shared Memory, and Message Queues
Linux Systems Programming: Semaphores, Shared Memory, and Message QueuesLinux Systems Programming: Semaphores, Shared Memory, and Message Queues
Linux Systems Programming: Semaphores, Shared Memory, and Message Queues
 
Software Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfSoftware Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdf
 
Geometric constructions Engineering Drawing.pdf
Geometric constructions Engineering Drawing.pdfGeometric constructions Engineering Drawing.pdf
Geometric constructions Engineering Drawing.pdf
 
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfInstruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
 
21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx
 
Seizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networksSeizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networks
 
The Entity-Relationship Model(ER Diagram).pptx
The Entity-Relationship Model(ER Diagram).pptxThe Entity-Relationship Model(ER Diagram).pptx
The Entity-Relationship Model(ER Diagram).pptx
 
Filters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsFilters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility Applications
 
Worksharing and 3D Modeling with Revit.pptx
Worksharing and 3D Modeling with Revit.pptxWorksharing and 3D Modeling with Revit.pptx
Worksharing and 3D Modeling with Revit.pptx
 
handbook on reinforce concrete and detailing
handbook on reinforce concrete and detailinghandbook on reinforce concrete and detailing
handbook on reinforce concrete and detailing
 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
 
Research Methodolgy & Intellectual Property Rights Series 2
Research Methodolgy & Intellectual Property Rights Series 2Research Methodolgy & Intellectual Property Rights Series 2
Research Methodolgy & Intellectual Property Rights Series 2
 
CLOUD COMPUTING SERVICES - Cloud Reference Modal
CLOUD COMPUTING SERVICES - Cloud Reference ModalCLOUD COMPUTING SERVICES - Cloud Reference Modal
CLOUD COMPUTING SERVICES - Cloud Reference Modal
 
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
 
Lab Manual Arduino UNO Microcontrollar.docx
Lab Manual Arduino UNO Microcontrollar.docxLab Manual Arduino UNO Microcontrollar.docx
Lab Manual Arduino UNO Microcontrollar.docx
 
Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1
 
UNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxUNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptx
 
21scheme vtu syllabus of visveraya technological university
21scheme vtu syllabus of visveraya technological university21scheme vtu syllabus of visveraya technological university
21scheme vtu syllabus of visveraya technological university
 
Online crime reporting system project.pdf
Online crime reporting system project.pdfOnline crime reporting system project.pdf
Online crime reporting system project.pdf
 
What is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsWhat is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, Functions
 

4-Introduction to Machine Learning Lecture # 4.pdf

  • 1. Dr. Ashfaq Ahmad Associate Professor Department of Electronics & Electrical System The University of Lahore Introduction to Machine Learning Lecture # 4
  • 2. Lecture Objectives Data With Descriptive Statistics Peek at Your Data Dimensions of Your Data Data Type For Each Attribute Descriptive Statistics Class Distribution (Classification Only) Correlations Between Attributes Dr. Ashfaq Ahmad, Associate Professor, Department of Electronics & Electrical Systems 2 Skew of Univariate Distributions
  • 3. Overfitting and Underfitting Generalization in Machine Learning Statistical Fit Overfitting in Machine Learning Underfitting in Machine Learning A Good Fit in Machine Learning How To Limit Overfitting Dr. Ashfaq Ahmad, Associate Professor, Department of Electronics & Electrical Systems 3 Lecture Objectives
  • 4. Data Understanding With Descriptive Statistics ➢ In order to get the best results, you must understand your data. ➢ In this lecture you will learn about 7 ways that can use in Python to better understand your machine learning data. ➢ All 7 ways is demonstrated by loading the Diabetes classification dataset. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 4
  • 5. Peek At Your Data ✓ There is no substitute for looking at the raw data. ✓ Analyzing the raw data can provide information that cannot be learned in any other way. ✓ Additionally, it can sow the seeds that may later grow into ideas on how to better prepare and manage data for machine learning tasks. ✓ You can review the first 20 rows of your data using the head() function on the Pandas DataFrame. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 5
  • 6. Example of Reviewing The First Few Rows of Data # View first 20 rows from pandas import read_csv filename = "diabetes.data.csv" names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] data = read_csv(filename, names=names) peek = data.head(20) print(peek) DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 6
  • 7. Example of Reviewing The First Few Rows of Data DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 7 You can see that the first column list shows the row number, This is useful for referring to a particular observation. preg plas pres skin test mass pedi age class 0 6 148 72 35 0 33.6 0.627 50 1 1 1 85 66 29 0 26.6 0.351 31 0 2 8 183 64 0 0 23.3 0.672 32 1 3 1 89 66 23 94 28.1 0.167 21 0 4 0 137 40 35 168 43.1 2.288 33 1 5 5 116 74 0 0 25.6 0.201 30 0 6 3 78 50 32 88 31.0 0.248 26 1 7 10 115 0 0 0 35.3 0.134 29 0 8 2 197 70 45 543 30.5 0.158 53 1 9 8 125 96 0 0 0.0 0.232 54 1 Output of reviewing the first few rows of data.
  • 8. Dimensions of Data ❖ The quantity of your data, both in terms of rows and columns, must be very well understood. ✓ Too many rows and algorithms may take too long to train. ✓ Too few and perhaps you do not have enough data to train the algorithms. ✓ The curse of dimensionality can cause some algorithms to be distracted or to perform poorly when there are too many features. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 8
  • 9. An Illustration of Examining The Shape of The Data ❑ You can review the shape and size of your dataset by printing the shape property on the Pandas DataFrame. ❑ # Dimensions of your data ❑ from pandas import read_csv ❑ filename = "diabetes.data.csv" ❑ names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] ❑ data = read_csv(filename, names=names) ❑ shape = data.shape ❑ print(shape) DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 9 The results are listed in rows then columns. You can see that the dataset has 768 rows and 9 columns. The Results of Examining The Data’s Shape (768, 9)
  • 10. Type of Data For Each Attribute ❑ The type of each attribute is important. ❑ Strings may need to be converted to floating point values or integers to represent categorical or ordinal values. ❑ You can get an idea of the types of attributes by peeking at the raw data, as above. ❑ You can also list the data types used by the DataFrame to characterize each attribute using the dtypes property. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 10
  • 11. Example of Reviewing The Data Types of The Data # Data Types for Each Attribute from pandas import read_csv filename = "diabetes.data.csv" names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] data = read_csv(filename, names=names) types = data.dtypes print(types) DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 11 Output of Reviewing The Data Types of The Data ➢ You can see that most of the attributes are integers the only mass and pedi are floating point types preg int64 plas int64 pres int64 skin int64 test int64 mass float64 pedi float64 age int64 class int64 dtype: object
  • 12. Descriptive Statistics ❑ Descriptive statistics can give you great perception into the shape of each attribute. ❑ Often you can create more summaries at the time of review. ❑ The describe() function on the Pandas DataFrame lists 8 statistical properties of each attribute. They are: ➢ Count. ➢ Mean. ➢ Standard Deviation. ➢ Minimum Value. ➢ 25th Percentile. ➢ 50th Percentile (Median). ➢ 75th Percentile. ➢ Maximum Value. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 12
  • 13. Example of Reviewing A Statistical Summary of The Data # Statistical Summary from pandas import read_csv from pandas import set_option filename = "diabetes.data.csv" names = ['preg’, 'plas’, 'pres’, 'skin’, 'test’, 'mass’, 'pedi’, 'age’, 'class'] data = read_csv(filename, names=names) set_option('display.width', 100) set_option('precision’, 3) description = data.describe() print(description) DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 13
  • 14. Output of Reviewing A Statistical Summary of The Data ✓ You can see that you do get a lot of data. ✓ It should be noted that the precision of the numbers and the preferred width of the output can be change by using the pandas.set option() function. ✓ This is to make it more readable for this example. ✓ When describing your data this way, it is worth taking some time and reviewing observations from the results. ✓ This might include the presence of NA values for missing data or surprising distributions for attributes. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 14
  • 15. Class Distribution (Classification Only) ❑ On classification problems you need to know how balanced the class values are. ❑ Highly imbalanced problems (a lot more observations for one class than another) are common and may need special handling in the data preparation stage of your project. ❑ You can quickly get an idea of the distribution of the class attribute in Pandas. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 15
  • 16. Example of Reviewing A Class Breakdown of The Data # Class Distribution from pandas import read_csv filename = "diabetes.data.csv" names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] data = read_csv(filename, names=names) class_counts = data.groupby('class').size() print(class_counts) DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 16 Output of Reviewing A Class Breakdown of the Data ❑ You can see that there are nearly double the number of observations with class 0 (no onset of diabetes) than there are with class 1 (onset of diabetes). class 0 500 1 268
  • 17. Correlations Between Attributes ➢ Correlation refers to the relationship between two variables and how they may or may not change together. ➢ The most common method for calculating correlation is Pearson's Correlation Coefficient, that assumes a normal distribution of the attributes involved. ➢ A correlation of -1 or 1 shows a full negative or positive correlation, respectively. ➢ Whereas a value of 0 shows no correlation at all. Some machine learning algorithms like linear and logistic regression can suffer poor performance if there are highly correlated attributes in your dataset. ➢ As such, it is a good idea to review all the pairwise correlations of the attributes in your dataset. ➢ You can use the corr () function on the Pandas DataFrame to calculate a correlation matrix. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 17
  • 18. Example of Reviewing Correlations of Attributes In The Data # Pairwise Pearson correlations from pandas import read_csv from pandas import set_option filename = "diabetes.data.csv" names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] data = read_csv(filename, names=names) set_option('display.width', 100) set_option('precision', 3) correlations = data.corr(method='pearson') print(correlations) DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 18
  • 19. Output of Reviewing Correlations of Attributes In The Data ❑ The matrix lists all attributes across the top and down the side, to give correlation between all pairs of attributes (twice, because the matrix is symmetrical). ❑ You can see the diagonal line through the matrix from the top left to bottom right corners of the matrix shows perfect correlation of each attribute with itself. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 19
  • 20. Skew of Univariate Distributions ➢ The skewness in statistics is a measure of asymmetry or the deviation of a given random variable's distribution from a symmetric distribution (like normal Distribution). In Normal Distribution, we know that: Median = Mode = Mean. ➢ The formula given in most textbooks is Skew = 3 * (Mean – Median) / Standard Deviation ➢ Skew refers to a distribution that is assumed Gaussian (normal or bell curve) that is shifted or squashed in one direction or another. ➢ Many machine learning algorithms assume a Gaussian distribution. ➢ Knowing that an attribute has a skew may allow you to perform data preparation to correct the skew and later improve the accuracy of your models. ➢ You can calculate the skew of each attribute using the skew() function on the Pandas DataFrame. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 20
  • 21. Example of Reviewing Skew of Attribute Distributions In The Data # Skew for each attribute from pandas import read_csv filename = "diabetes.data.csv" names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] data = read_csv(filename, names=names) skew = data.skew() print(skew) DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 21 Output of Reviewing Skew of Attribute Distributions In The Data The skew result show a positive (right) or negative (left) skew. Values closer to zero show less skew.
  • 22. Overfitting And Underfitting DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 22 The cause of poor performance in machine learning is either overfitting or underfitting the data. In this lecture we will discover the concept of generalization in machine learning and the problems of overfitting and underfitting that go along with it.
  • 23. Generalization in Machine Learning ➢ In machine learning we describe the learning of the target function from training data as inductive learning. ➢ Induction refers to learning general concepts from specific examples which is exactly the problem that supervised machine learning problems aim to solve. ➢ This is different from deduction that is the other way around and seeks to learn specific concepts from general rules. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 23
  • 24. Generalization in Machine Learning ➢ Generalization refers to how well the concepts learned by a machine learning model apply to specific examples not seen by the model when it was learning. ➢ The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain. ➢ This allows us to make predictions in the future on data the model has never seen. ➢ There is a terminology used in machine learning when we talk about how well a machine learning model learns and generalizes to new data, namely overfitting and underfitting. ➢ Overfitting and underfitting are the two biggest causes for poor performance of machine learning algorithms. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 24
  • 25. Statistical Fit ❑ In statistics a fit refers to how well you approximate a target function. ❑ This is good terminology to use in machine learning, because supervised machine learning algorithms seek to approximate the unknown underlying mapping function for the output variables given the input variables. ❑ Statistics often describe the goodness of fit which refers to measures used to estimate how well the approximation of the function matches the target function. ❑ Some of these methods are useful in machine learning (e.g. calculating the residual errors), but some of these techniques assume we know the form of the target function we are approximating, which is not the case in machine learning. ❑ If we knew the form of the target function, we would use it directly to make predictions, rather than trying to learn an approximation from samples of noisy training data. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 25
  • 26. Overfitting in Machine Learning ❑ Overfitting refers to a model that models the training data too well. ❑ Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance on the model on new data. ❑ This means that the noise or random fluctuations in the training data is picked up and learned as concepts by the model. ❑ The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 26
  • 27. Overfitting in Machine Learning ❑ Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. ❑ As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns. ❑ For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data. ❑ This problem can be addressed by pruning a tree after it has learned in order to remove some of the detail it has picked up. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 27
  • 28. Underfitting in Machine Learning ❖ Underfitting refers to a model that can neither model the training data not generalize to new data. ❖ An underfit machine learning model is not a suitable model and will be obvious as it will have poor performance on the training data. ❖ Underfitting is often not discussed as it is easy to detect given a good performance metric. ❖ The remedy is to move on and try alternate machine learning algorithms. Nevertheless, it does provide good contrast to the problem of concept of overfitting. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 28
  • 29. A Good Fit in Machine Learning ❑ Ideally, you want to select a model at the sweet spot between underfitting and overfitting. This is the goal but is very difficult to do in practice. ❑ To understand this goal, we can look at the performance of a machine learning algorithm over time as it is learning a training data. ❑ We can plot the skill on the training data the skill on a test dataset we have held back from the training process. ❑ Over time, as the algorithm learns, the error for the model on the training data goes down and so does the error on the test dataset. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 29
  • 30. A Good Fit in Machine Learning ❑ If we train for too long, the performance on the training dataset may continue to decrease because the model is overfitting and learning the irrelevant detail and noise in the training dataset. ❑ At the same time the error for the test set starts to rise again as the model's ability to generalize decreases. ❑ The sweet spot is the point just before the error on the test dataset starts to increase where the model has good skill on both the training dataset and the unseen test dataset. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 30
  • 31. A Good Fit in Machine Learning ❑ You can perform this experiment with your favorite machine learning algorithms. ❑ This is often not useful technique in practice, because by choosing the stopping point for training using the skill on the test dataset it means that the test set is no longer unseen or a standalone objective measure. ❑ Some knowledge (a lot of useful knowledge) about that data has leaked into the training procedure. ❑ There are two additional techniques you can use to help find the sweet spot in practice: resampling methods and a validation dataset. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 31
  • 32. How To Limit Overfitting ❑ Both overfitting and underfitting can lead to poor model performance. ❑ But by far the most common problem in applied machine learning is overfitting. ❑ Overfitting is such a problem because the evaluation of machine learning algorithms on training data is different from the evaluation, we care the most about, namely how well the algorithm performs on unseen data. ❑ There are two important techniques that you can use when evaluating machine learning algorithms to limit overfitting: ✓ Use a resampling technique to estimate model accuracy ✓ Hold back a validation dataset. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 32
  • 33. How To Limit Overfitting ❑ The most popular resampling technique is k-fold cross validation. ❑ It allows you to train and test your model k-times on different subsets of training data and build up an estimate of the performance of a machine learning model on unseen data. ❑ A validation dataset is simply a subset of your training data that you hold back from your machine learning algorithms until the very end of your project. ❑ After you have selected and tuned your machine learning algorithms on your training dataset you can evaluate the learned models on the validation dataset to get a final objective idea of how the models might perform on unseen data. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 33
  • 34. How To Limit Overfitting ❑ Using cross validation is a gold standard in applied machine learning for estimating model accuracy on unseen data. ❑ If you have the data, using a validation dataset is also an excellent practice. DR. ASHFAQ AHMAD, ASSOCIATE PROFESSOR, DEPARTMENT OF ELECTRONICS & ELECTRICAL SYSTEMS 34