Artificial Intelligence
Dr Bishwajeet Pandey, SMIEEE
Professor, Department of MCA, GL Bajaj College of
Technology and Management, India
PhD (Gran Sasso Science Institute, L'Aquila, Italy)
M. Tech in CSE (IIIT Gwalior, India)
Visiting Professor at
UCSI UNIVERSITY-Malaysia (QS World Rank 265)
Eurasian National University-Kazakhstan (QS World Rank 321)
• PhD from Gran Sasso Science Institute, Italy
• PhD Supervisor Prof Paolo Prinetto from Politecnico Di Torino.
• MTech from Indian Institute of Information Technology, Gwalior
• Visited 49 Countries Across The Globe
• Written 300+ research papers with 218 researchers from 93 universities
• Scopus Profile: https://www.scopus.com/authid/detail.uri?authorId=57203239026
• Google Scholar: https://scholar.google.com/citations?user=UZ_8yAMAAAAJ&hl=hi
• IBM Certified Solution Designer
• EC-Council Certified Ethical Hacker
• AWS Certified Cloud Practitioner
• Qualified GATE 4 times
• Email dr.pandey@ieee.org, bishwajeet.pandey@glbctm.ac.in
ABOUT COURSE TEACHER
PROFESSOR OF THE YEAR AWARD-2023
BY LONDON ORGANIZATION OF SKILLS DEVELOPMENT
My CERTIFICATION
Syllabus of AI: Unit 5
• Introduction and design principles
• Statistical pattern recognition
• Parameter Estimation Methods
• Principal Component Analysis
• Linear Discriminant Analysis
• Classification Techniques
• Nearest Neighbour rule and Bayes Classifier
• K-Means clustering
• Support Vector Machine
Pattern
A pattern is everything around us in this digital world. A pattern can either be seen physically or observed mathematically by applying algorithms.
Example: Angelica flowerhead, a sphere made of spheres
Pattern
Example: Animals often show mirror or
bilateral symmetry, like this tiger.
Fibonacci Spiral
Pattern
Example: Romanesco broccoli is a striking example of the Fibonacci spiral. Each nub is a Fibonacci spiral of its own.
Pattern
Example: Spiral aloe. Numerous cacti display the Fibonacci spiral. You can see how each set of leaves spirals outward.
Pattern
Example: This pepper has grown into a Fibonacci Spiral.
Pattern
Example: The Fibonacci spiral is a little more
subtle in this photo, but you can still see the spiral
in the unopened disk florets.
Pattern
Example: The tail of these creatures naturally curls into a Fibonacci
spiral.
Pattern
Example: Millipede. The Fibonacci spiral is thought to be the design of least resistance.
Pattern
Example: Fibonacci and armor = very safe. The Pangolin is able to
protect its soft underbelly by forming a Fibonacci spiral.
Pattern
Pattern
Example: Water falls into the shape of a Fibonacci spiral during numerous events.
Pattern
Example: The population density and comparison with
the Fibonacci sequence
Pattern Recognition
• Pattern recognition is the process of recognizing patterns using a machine
learning algorithm. Pattern recognition is the classification of data based on
knowledge already gained or statistical information extracted from patterns
and/or their representation. One of the important aspects of pattern
recognition is its application potential.
• Examples: Speech recognition, speaker identification, multimedia document
recognition (MDR), automatic medical diagnosis.
Pattern Recognition
• In a typical pattern recognition application, the raw data is processed and converted into a form that is amenable for a machine to use. Pattern recognition involves the classification and clustering of patterns.
• In classification, an appropriate class label is assigned to a pattern based on an abstraction that is generated using a set of training patterns or domain knowledge. Classification is used in supervised learning.
• Clustering generates a partition of the data, which helps in decision making, i.e., in the specific decision-making activity of interest to us. Clustering is used in unsupervised learning.
Pattern Recognition
• Features may be represented as continuous, discrete, or discrete binary variables. A feature is a function of one or more measurements, computed so that it quantifies some significant characteristic of the object.
• Example: consider a face; then eyes, ears, nose, etc. are features of the face. A set of features taken together forms the feature vector.
• Example: In the above example of a face, if all the features (eyes, ears, nose, etc.) are taken together, then the sequence is a feature vector ([eyes, ears, nose]). A feature vector is a sequence of features represented as a d-dimensional column vector. In the case of speech, MFCCs (Mel-Frequency Cepstral Coefficients) are the spectral features of the speech, and the sequence of the first 13 coefficients forms a feature vector.
Pattern Recognition
• Pattern recognition possesses the following features:
• A pattern recognition system should recognize familiar patterns quickly and accurately
• Recognize and classify unfamiliar objects
• Accurately recognize shapes and objects from different angles
• Identify patterns and objects even when partly hidden
• Recognize patterns quickly with ease, and with automaticity.
Pattern Recognition
• Applications:
• Image processing, segmentation, and analysis
Pattern recognition is used to give human recognition intelligence to
machines that are required in image processing.
• Computer vision
Pattern recognition is used to extract meaningful features from given
image/video samples and is used in computer vision for various applications
like biological and biomedical imaging.
Pattern Recognition
• Applications:
• Speech recognition
The greatest success in speech recognition has been obtained using pattern recognition paradigms. Pattern recognition is used in various speech recognition algorithms, which try to avoid the problems of a phoneme-level description and treat larger units, such as words, as patterns
• Fingerprint identification
Fingerprint recognition technology is a dominant technology in the biometric
market. A number of recognition methods have been used to perform
fingerprint matching out of which pattern recognition approaches are widely
used.
Statistical Pattern Recognition
Model for Statistical Pattern Recognition
Various Approaches in Statistical Pattern Recognition
Feature Extraction and Projection Methods
Parametric Estimation
Parameter and Estimate
Parametric Estimation
• Parameter estimation is all about figuring out the unknown values in a
mathematical model based on data we have collected. In parameter
estimation, we use sample data to estimate a larger population's
characteristics (parameters).
• For instance, if a factory produces thousands of electronic components,
instead of testing each item, quality control teams might randomly sample a
certain number of items (e.g., 100 components) and check how many are
defective. If they find that 4 out of 100 components are defective, they can
estimate the proportion of defective items in the entire production batch as
4%.
Parametric Estimation
• Types of Parameter Estimation
• Parameter estimation generally falls into two types:
• Point Estimation
• Interval Estimation
Point Estimation
• Point estimation provides a single best guess of a parameter's
value.
• The result is one number that is considered the best
approximation of the population parameter based on the
sample data.
• For example, if we want to estimate the average height of students in a school, the sample mean (average) we calculate from a group of students serves as the point estimate of the population mean.
Interval Estimation
• Interval estimation provides a range of values within which the
true parameter likely falls. This is more informative than point
estimation because it includes a measure of uncertainty.
• The range is known as a confidence interval, and it is
associated with a confidence level (often 95% or 99%) which
indicates the degree of certainty that the interval contains the
population parameter.
• For instance, if we estimate that the average height of students
in a school is between 150 cm and 160 cm with 95%
confidence, it means that we are 95% sure that the true
population mean lies within this interval.
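A minimal Python sketch of interval estimation (not from the original slides; the height values below are made up for illustration, and scipy's t-interval is used for the confidence interval):

# Sketch: 95% confidence interval for a mean from a small sample of heights (cm).
import numpy as np
from scipy import stats

heights = np.array([152, 158, 149, 161, 155, 157, 150, 159, 154, 156])  # illustrative data

mean = heights.mean()
sem = stats.sem(heights)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(heights) - 1, loc=mean, scale=sem)

print(f"Point estimate: {mean:.1f} cm")
print(f"95% confidence interval: ({ci_low:.1f}, {ci_high:.1f}) cm")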
Methods of Parameter Estimation
There are various methods for estimating parameters, each suited for different
types of data and situations.
• Maximum Likelihood Estimation (MLE)
• Method of Moments
• Bayesian Estimation
• Principal Component Analysis
• Linear Discriminant Analysis
Maximum Likelihood Estimation
• MLE is one of the most popular and widely used methods of parameter
estimation. The idea behind MLE is to find the parameter values that
maximize the likelihood function, which represents the probability of
observing the given sample data.
• The parameter value that maximizes the likelihood function is considered
the best estimate.
• MLE works well for large sample sizes and has desirable statistical properties,
such as being consistent (the estimate gets closer to the true value as the
sample size increases).
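A minimal Python sketch of MLE, assuming the data come from a Gaussian (synthetic data, not from the slides); the closed-form MLEs (sample mean and 1/n variance) are checked against a numerical maximisation of the log-likelihood:

# Sketch: maximum likelihood estimation for a Gaussian sample.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)  # synthetic sample

# Closed-form Gaussian MLEs: sample mean and the 1/n (biased) standard deviation.
mu_mle = data.mean()
sigma_mle = np.sqrt(np.mean((data - mu_mle) ** 2))

# Numerical check: minimise the negative log-likelihood over (mu, sigma).
def neg_log_likelihood(params):
    mu, sigma = params
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

result = optimize.minimize(neg_log_likelihood, x0=[0.0, 1.0], bounds=[(None, None), (1e-6, None)])
print("Closed-form MLE:", mu_mle, sigma_mle)
print("Numerical MLE  :", result.x)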
Method of Moments
• The method of moments is a simpler, less computationally intense approach
than MLE.
• It is based on the idea that sample moments (such as the sample mean,
variance, etc.) can be used to estimate the population moments (such as the
population mean, variance, etc.).
• While not as precise as MLE, the method of moments is often easier to apply,
especially for small datasets.
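A minimal Python sketch of the method of moments, assuming the data follow a gamma distribution (synthetic data): matching the sample mean and variance to the population mean k·θ and variance k·θ² gives closed-form estimates.

# Sketch: method-of-moments estimates for a gamma distribution.
import numpy as np

rng = np.random.default_rng(1)
data = rng.gamma(shape=3.0, scale=2.0, size=1000)  # synthetic sample

sample_mean = data.mean()
sample_var = data.var()  # second central sample moment

# For Gamma(k, theta): mean = k*theta and variance = k*theta^2.
theta_hat = sample_var / sample_mean
k_hat = sample_mean / theta_hat

print(f"Estimated shape k = {k_hat:.2f}, scale theta = {theta_hat:.2f}")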
Bayesian Estimation
• Bayesian estimation is based on Bayes' theorem, which updates prior beliefs
about a parameter using observed data.
• In this method, we start with a "prior" distribution that reflects our initial
beliefs about the parameter. Then, as we gather sample data, we use Bayes'
theorem to calculate a "posterior" distribution, which combines the prior
information with the new data.
• Bayesian estimation is highly flexible and allows for incorporating external
information or expert knowledge into the estimation process. However, it
requires more complex computations, especially for large datasets.
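A minimal Python sketch of Bayesian estimation applied to the defective-components example above, assuming a Beta prior on the defect rate so the posterior has a closed form (Beta-Binomial conjugacy):

# Sketch: Bayesian update of a defect rate with a Beta prior (conjugate to the Binomial).
from scipy import stats

alpha_prior, beta_prior = 1, 1   # Beta(1, 1): a flat, uninformative prior belief
defective, total = 4, 100        # observed sample from the example above

# Posterior is Beta(alpha + defective, beta + non-defective).
alpha_post = alpha_prior + defective
beta_post = beta_prior + (total - defective)

posterior = stats.beta(alpha_post, beta_post)
print("Posterior mean defect rate:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))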
Principal Component Analysis
• Principal Component Analysis (PCA) is an unsupervised learning algorithm that is used for dimensionality reduction in machine learning.
• It is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the Principal Components.
• It is a technique to extract strong patterns from the given dataset by reducing the number of dimensions while retaining as much of the variance as possible.
• PCA generally tries to find the lower-dimensional surface onto which to project the high-dimensional data.
Principal Component Analysis
• Some common terms used in PCA algorithm:
• Dimensionality: It is the number of features or variables present in the given
dataset. More easily, it is the number of columns present in the dataset.
• Correlation: It signifies how strongly two variables are related to each other, such that if one changes, the other also changes. The correlation value ranges from -1 to +1: -1 indicates a perfect inverse (negative) relationship, and +1 indicates a perfect direct (positive) relationship.
• Orthogonal: It defines that variables are not correlated to each other, and hence
the correlation between the pair of variables is zero.
• Eigenvectors: Given a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is a scalar multiple of v.
• Covariance Matrix: A matrix containing the covariance between the pair of
variables is called the Covariance Matrix.
Principal Component Analysis
• The number of these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal components are given below:
• The principal component must be the linear combination of the original
features.
• These components are orthogonal, i.e., the correlation between a pair of
variables is zero.
• The importance of each component decreases from component 1 to component n: the first PC has the most importance, and the nth PC has the least importance.
Steps of Principal Component Analysis
• Getting the dataset
Firstly, we need to take the input dataset and divide it into two subparts X and Y, where X is the
training set, and Y is the validation set.
• Representing data into a structure
Now we will represent our dataset in a structure, such as a two-dimensional matrix of the independent variable X. Here each row corresponds to a data item, and each column corresponds to a feature. The number of columns gives the dimensionality of the dataset.
• Standardizing the data
In this step, we will standardize our dataset. In a particular column, features with high variance would otherwise dominate features with lower variance. Since the importance of a feature should be independent of its scale, we divide each data item in a column by the standard deviation of that column. We will name the resulting matrix Z.
• Calculating the Covariance of Z
To calculate the covariance of Z, we transpose the matrix Z and multiply the transpose by Z. The output matrix is the covariance matrix of Z.
Steps of Principal Component Analysis
• Calculating the Eigen Values and Eigen Vectors
Now we need to calculate the eigenvalues and eigenvectors of the resultant covariance matrix. The eigenvectors of the covariance matrix are the directions of the axes with the most information (variance), and the corresponding eigenvalues give the amount of variance along each of those directions.
• Sorting the Eigen Vectors
In this step, we will take all the eigenvalues and sort them in decreasing order, which means from largest to smallest, and simultaneously sort the eigenvectors accordingly into a matrix P. The resulting sorted matrix will be named P*.
• Calculating the new features Or Principal Components
Here, we will calculate the new features. To do this, we multiply the matrix Z by the matrix P*. In the resultant matrix Z*, each observation is a linear combination of the original features, and the columns of Z* are uncorrelated with each other.
• Remove less important or unimportant features from the new dataset.
Now that the new feature set is available, we decide what to keep and what to remove: we keep only the relevant or important components in the new dataset and remove the unimportant ones.
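A minimal NumPy sketch of the steps above (standardise, covariance, eigendecomposition, sort, project) on a small synthetic matrix; in practice sklearn.decomposition.PCA wraps the same procedure:

# Sketch: PCA "by hand" following the steps listed above (synthetic data).
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))  # 100 observations, 5 features

# Standardise each column (zero mean, unit standard deviation) -> matrix Z.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of Z (features x features).
cov = (Z.T @ Z) / (Z.shape[0] - 1)

# Eigenvalues and eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort eigenvectors by decreasing eigenvalue -> matrix P*.
order = np.argsort(eigvals)[::-1]
eigvals, P_star = eigvals[order], eigvecs[:, order]

# Project onto the first two principal components -> Z*.
Z_star = Z @ P_star[:, :2]
print("Explained variance ratio:", eigvals[:2] / eigvals.sum())
print("Transformed shape:", Z_star.shape)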
Linear Discriminant Analysis
• Linear Discriminant Analysis (LDA), also known as Normal Discriminant Analysis or
Discriminant Function Analysis, is a dimensionality reduction technique primarily
utilized in supervised classification problems.
• It facilitates the modeling of distinctions between groups, effectively separating
two or more classes.
• LDA operates by projecting features from a higher-dimensional space into a
lower-dimensional one.
• In machine learning, LDA serves as a supervised learning algorithm specifically
designed for classification tasks, aiming to identify a linear combination of
features that optimally segregates classes within a dataset.
Linear Discriminant Analysis
• For example, we have two classes and we need to separate them
efficiently.
• Classes can have multiple features. Using only a single feature to classify
them may result in some overlapping as shown in the below figure. So, we
will keep on increasing the number of features for proper classification.
Working of LDA
• LDA works by projecting the data onto a lower-dimensional
space that maximizes the separation between the classes.
• It does this by finding a set of linear discriminants that
maximize the ratio of between-class variance to within-class
variance.
• In other words, it finds the directions in the feature space that best separate the different classes of data.
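A minimal scikit-learn sketch of LDA, used both to project the data to a lower-dimensional space and as a classifier (the iris dataset is only an illustrative choice, not the data from the slides):

# Sketch: Linear Discriminant Analysis with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=2)   # project 4 features down to 2
X_train_lda = lda.fit_transform(X_train, y_train)  # fit on the training data only

print("Projected shape:", X_train_lda.shape)
print("Test accuracy  :", lda.score(X_test, y_test))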
Mathematics of LDA
LDA versus PCA
What is Classification?
• In short, classification is a form of “pattern recognition,” with
classification algorithms applied to the training data to find the
same pattern (similar words or sentiments, number sequences,
etc.) in future sets of data.
• Using classification algorithms, text analysis software can
perform tasks like aspect-based sentiment analysis to categorize
unstructured text by topic and polarity of opinion (positive,
negative, neutral, and beyond).
What is Classification?
• Classification is the process of recognizing, understanding, and
grouping ideas and objects into preset categories or “sub-populations.”
Using pre-categorized training datasets, machine learning programs
use a variety of algorithms to classify future datasets into categories.
• Classification algorithms in machine learning use input training data to
predict the likelihood that subsequent data will fall into one of the
predetermined categories. One of the most common uses of
classification is filtering emails into “spam” or “non-spam.”
Top 3 Classification Algorithms in Machine Learning
● Naive Bayes
● K-Nearest Neighbors
● Support Vector Machines
Naive Bayes
K-nearest Neighbors
• K-nearest neighbors (k-NN) is a pattern recognition algorithm that
uses training datasets to find the k closest relatives in future
examples.
• When k-NN is used in classification, a new data point is placed in the category shared by the majority of its nearest neighbors.
• If k = 1, the point is simply assigned to the class of its single nearest neighbor; for larger k, the class is decided by a plurality vote of its k neighbors.
Support Vector Machines
• A support vector machine (SVM) uses
algorithms to train and classify data
within degrees of polarity, taking it to a
degree beyond X/Y prediction.
• For a simple visual explanation, we’ll
use two tags: red and blue, with two
data features: X and Y, then train our
classifier to output an X/Y coordinate
as either red or blue.
Support Vector Machines
• The SVM then assigns a
hyperplane that best separates the
tags. In two dimensions this is
simply a line. Anything on one side
of the line is red and anything on
the other side is blue. In sentiment
analysis, for example, this would
be positive and negative.
• To obtain the most robust classifier, the best hyperplane is the one with the largest margin, i.e., the largest distance to the nearest points of each tag:
4 Applications of Classification Algorithms
● Sentiment Analysis
● Email Spam Classification
● Document Classification
● Image Classification
Sentiment Analysis
• Sentiment analysis is a machine learning text analysis technique that assigns sentiment
(opinion, feeling, or emotion) to words within a text, or an entire text, on a polarity scale of
Positive, Negative, or Neutral.
• It can automatically read through thousands of pages in minutes or constantly monitor
social media for posts about you. The tweet below, for example, about the messaging app,
Slack, would be analyzed to pull all of the individual statements as Positive. This allows
companies to follow product releases and marketing campaigns in real-time, to see how
customers are reacting.
Email Spam Classification
• One of the most common uses of classification, working non-stop and with
little need for human interaction, email spam classification saves us from
tedious deletion tasks and sometimes even costly phishing scams.
• Email applications use the above algorithms to calculate the likelihood that an
email is either not intended for the recipient or unwanted spam. Using text
analysis classification techniques, spam emails are weeded out from the
regular inbox: perhaps a recipient’s name is spelled incorrectly, or certain
scamming keywords are used.
• Spam classifiers do still need to be trained to a degree, as we’ve all
experienced when signing up for an email list of some sort that ends up in the
spam folder.
Document Classification
• Document classification is the ordering of documents into categories
according to their content. This was previously done manually, as in the library
sciences or hand-ordered legal files. Machine learning classification
algorithms, however, allow this to be performed automatically.
• Document classification differs from text classification, in that, entire
documents, rather than just words or phrases, are classified. This is put into
practice when using search engines online, cross-referencing topics in legal
documents, and searching healthcare records by drug and diagnosis.
Image Classification
• Image classification assigns previously trained categories to a given image.
These could be the subject of the image, a numerical value, a theme, etc.
• Image classification can even use multi-label image classifiers, that work
similarly to multi-label text classifiers, to tag an image of a stream, for
example, into different labels, like “stream,” “water,” “outdoors,” etc.
• Using supervised learning algorithms, you can tag images to train your model
for appropriate categories. As with all machine learning models, the more you
train it, the better it will work.
K-nearest Neighbors
• K-nearest neighbors (k-NN) is a pattern recognition algorithm that
uses training datasets to find the k closest relatives in future
examples.
• When k-NN is used in classification, a new data point is placed in the category shared by the majority of its nearest neighbors.
• If k = 1, the point is simply assigned to the class of its single nearest neighbor; for larger k, the class is decided by a plurality vote of its k neighbors.
Explain the K Nearest Neighbor Algorithm
• K nearest neighbor algorithm is a classification algorithm
that works in a way that a new data point is assigned to a
neighboring group to which it is most similar.
• In K nearest neighbors, K can be an integer greater than 1.
• So, for every new data point we want to classify, we compute which neighboring group it is closest to.
Explain the K Nearest Neighbor Algorithm
Let us classify an object using the following example.
Consider there are three clusters:
● Football
● Basketball
● Tennis ball
Explain the K Nearest Neighbor Algorithm
● Let the new data point to be classified be a black ball. We use KNN to classify it.
● Assume K = 5 (initially). Next, we find the K
(five) nearest data points.
● Observe that all five selected points do not
belong to the same cluster.
● There are three tennis balls and one each
of basketball and football.
● When multiple classes are involved, we
prefer the majority.
● Here the majority is with the tennis ball, so the new data point is assigned to the tennis ball cluster.
Compare K-means and KNN Algorithms
KNeighborsClassifier
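The original KNeighborsClassifier slides appear as code images; a minimal sketch of the same idea, using the iris dataset as an assumed stand-in for whatever data the slides use:

# Sketch: k-NN classification with scikit-learn's KNeighborsClassifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)  # K = 5, as in the ball example above
knn.fit(X_train, y_train)

print("Test accuracy:", knn.score(X_test, y_test))
print("Prediction for the first test sample:", knn.predict(X_test[:1]))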
Naïve Bayes Classifier Algorithm
• In machine learning, Naïve Bayes classification is a straightforward and powerful
algorithm for the classification task.
• Naïve Bayes classification is based on applying Bayes’ theorem with a strong independence assumption between the features.
• Naïve Bayes classification produces good results when we use it for textual data
analysis such as Natural Language Processing.
Naïve Bayes Classifier Algorithm
• Naïve Bayes models are also known as simple Bayes or independent Bayes.
• All these names refer to the application of Bayes’ theorem in the classifier’s decision
rule.
• Naïve Bayes classifier applies the Bayes’ theorem in practice.
• This classifier brings the power of Bayes’ theorem to machine learning.
Naïve Bayes Classifier Algorithm
Naïve Bayes is one of the most straightforward and fast classification algorithms. It is very well suited for large volumes of data. It is successfully used in various applications such as
1. Spam filtering
2. Text classification
3. Sentiment analysis
4. Recommender systems
It uses Bayes’ theorem of probability to predict the class of unknown data.
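For reference, the decision rule rests on Bayes' theorem. With class C and feature vector x = (x_1, ..., x_n), the naive independence assumption factorises the likelihood, so that (in LaTeX notation):

P(C \mid x_1, \dots, x_n) = \frac{P(C) \prod_{i=1}^{n} P(x_i \mid C)}{P(x_1, \dots, x_n)},
\qquad \hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)

i.e., the predicted class is the one that maximises the prior times the product of the per-feature likelihoods.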
Naïve Bayes Classifier Algorithm
Naïve Bayes Classifier Algorithm
ABOUT DATASET
An individual’s annual income results from various factors. Intuitively, it is influenced by the individual’s education level, age, gender, occupation, etc.
Fields
The dataset contains 16 columns
Target: Income
-- The income is divided into two classes: <=50K and >50K
Number of attributes: 14
-- These are the demographics and other features to describe a person
https://www.kaggle.com/datasets/wenruliu/adult-income-dataset
Naïve Bayes Classifier Algorithm
Attribute Information:
1. age: continuous.
2. workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
3. fnlwgt: continuous.
4. education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th,
Preschool.
5. education-num: continuous.
6. marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
7. occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-
fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
8. relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
9. race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10. sex: Female, Male.
11. capital-gain: continuous.
12. capital-loss: continuous.
13. hours-per-week: continuous.
14. native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba,
Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia,
Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
class: >50K, <=50K
Naïve Bayes Classifier Algorithm (worked-example slides in the original deck)
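A minimal, hedged sketch of training a Naive Bayes classifier on the Adult Income dataset described above (the file name "adult.csv" and the target column "income" are assumptions based on the Kaggle page; adjust them to the CSV actually downloaded):

# Sketch: Naive Bayes on the Adult Income dataset (column/file names assumed).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OrdinalEncoder

df = pd.read_csv("adult.csv")        # assumed file name from the Kaggle dataset
X = df.drop(columns=["income"])      # the 14 demographic attributes
y = df["income"]                     # target: <=50K or >50K

# Simplification: ordinal-encode every column and use GaussianNB.
# A more careful pipeline would keep continuous columns as-is and handle
# categorical columns with CategoricalNB or one-hot encoding.
X_encoded = OrdinalEncoder().fit_transform(X.astype(str))

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.3, random_state=0)

model = GaussianNB()
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))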
What Is ‘naive’ in the Naive Bayes Classifier?
• The classifier is called ‘naive’ because it makes assumptions that
may or may not turn out to be correct.
• The algorithm assumes that the presence of one feature of a class
is not related to the presence of any other feature (absolute
independence of features), given the class variable.
• For instance, a fruit may be considered to be a cherry if it is red in
color and round in shape, regardless of other features. This
assumption may or may not be right (as an apple also matches the
description).
Support Vector Machines
• A support vector machine (SVM) uses
algorithms to train and classify data
within degrees of polarity, taking it to a
degree beyond X/Y prediction.
• For a simple visual explanation, we’ll
use two tags: red and blue, with two
data features: X and Y, then train our
classifier to output an X/Y coordinate
as either red or blue.
Support Vector Machines
• The SVM then assigns a
hyperplane that best separates the
tags. In two dimensions this is
simply a line. Anything on one side
of the line is red and anything on
the other side is blue. In sentiment
analysis, for example, this would
be positive and negative.
• To obtain the most robust classifier, the best hyperplane is the one with the largest margin, i.e., the largest distance to the nearest points of each tag:
What are Support Vectors in SVM?
Support Vector Machines
# Import the breast cancer dataset loader and the train_test_split function
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the dataset referenced below as `cancer` (assumed to be scikit-learn's breast cancer data)
cancer = load_breast_cancer()

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,
                                                    test_size=0.3, random_state=109)  # 70% training and 30% test
Splitting Data
To understand model performance, dividing the dataset into a training set and a test set is a good
strategy.
Split the dataset by using the function train_test_split(). You need to pass three parameters: features, target, and test_size. Additionally, you can use random_state to make the random selection of records reproducible.
Support Vector Machines
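The remaining Support Vector Machines slides appear as images in the original deck; a minimal hedged sketch of training and evaluating an SVM classifier on the breast cancer data split above:

# Sketch: train and evaluate a linear-kernel SVM on the breast cancer dataset.
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.3, random_state=109)

clf = svm.SVC(kernel="linear")   # a linear kernel; rbf and poly are common alternatives
clf.fit(X_train, y_train)        # learn the maximum-margin hyperplane

y_pred = clf.predict(X_test)
print("Accuracy :", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall   :", metrics.recall_score(y_test, y_pred))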
KNN & SVM
