P1WU
UNIT – III: CLASSIFICATION
Topic 1: A CHARACTERIZATION OF TEXT CLASSIFICATION
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III
1. A Characterization of Text Classification
2. Unsupervised Algorithms: Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or Dimensionality Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional Indexing
INTRODUCTION TO CLASSIFICATION
• Scientists became very serious about addressing the question:
• “Can we build a model that learns from available data and
automatically makes the right decisions and predictions?”
• The answer can be found in the numerous applications that are emerging from the fields of
1. pattern classification,
2. machine learning, and
3. artificial intelligence.
• Data from various sensing devices, combined with powerful learning algorithms and domain knowledge, led to many great inventions that we now take for granted in everyday life:
• Internet queries via search engines like Google,
• text recognition at the post office,
• barcode scanners at the supermarket,
• the diagnosis of diseases,
• speech recognition by Siri or Google Now on our mobile phones, just to name a few.
• Classification is the data mining process of finding a model (or function) that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.
• That is, it predicts categorical class labels (discrete or nominal).
• It constructs a model from the training set and uses that model to classify new data.
• It predicts group membership for data instances.
What is CLASSIFICATION?
• Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
• Classification and prediction help us gain a better understanding of large data sets.
• Classification predicts categorical (discrete, unordered) labels.
• Prediction models continuous-valued functions.
• How can we classify?
• The key is machine learning, which makes classifications based on past observations (the learning part).
• We give the machine a set of texts tagged with labels and let the model learn from this data; the trained model can then tell us the category of new text input we feed it.
Applications of Classification
• Classification of (potential) customers for:
• Credit approval, risk prediction, selective marketing
• Performance prediction based on
• selected indicators
• Medical diagnosis based on symptoms or reactions to therapy
• Application areas:
• Credit approval
• Target marketing
• Medical diagnosis
• Treatment effectiveness analysis
• Performance prediction
When is classification needed?
• Scenarios:
• In each of these examples, the data analysis task is classification,
• where a model or classifier is constructed to predict categorical labels, such as
• “safe” or “risky” for the loan application data;
• “yes” or “no” for the marketing data; or
• “treatment A,” “treatment B,” or “treatment C” for the medical data.
• These categories can be represented by discrete values, where the ordering among values
has no meaning.
• For example,
• the values 1, 2, and 3 may be used to represent treatments A, B, and C,
• where there is no ordering implied among this group of treatment regimes.
INTRODUCTION TO CLASSIFICATION
Aim: predict categorical class labels
for new tuples/samples
Input: a training set of tuples/samples,
each with a class label
Output: a model (a classifier) based on
the training set and the class labels
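The aim/input/output description can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the training tuples (temperature, humidity) and their labels are invented, and a 1-nearest-neighbour rule is used merely as the simplest possible "model"; the slides do not prescribe this choice.

```python
import math

def train(training_set):
    """For 1-nearest-neighbour, 'training' just stores the labelled tuples."""
    return list(training_set)

def predict(model, sample):
    """Return the class label of the stored tuple closest to `sample`."""
    _, nearest_label = min(model, key=lambda pair: math.dist(pair[0], sample))
    return nearest_label

# Hypothetical training set: (temperature, humidity) -> rain label
training_set = [
    ((30.0, 40.0), "no"),
    ((22.0, 90.0), "yes"),
    ((25.0, 85.0), "yes"),
    ((33.0, 30.0), "no"),
]

model = train(training_set)          # output: a classifier
print(predict(model, (24.0, 88.0)))  # predict the label of a new tuple
```

The input is the labelled training set, the output of `train` is the model, and `predict` assigns a categorical label to a new, unlabelled tuple.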
Why Classification?
• A classical problem extensively studied by
• statisticians and machine learning researchers
• Predicts categorical class labels.
• Produces a model (classifier).
Typical Applications of Classification
• Example:
• {credit history, salary} → credit approval (Yes/No)
• {Temp, Humidity} → Rain (Yes/No)
• A set of documents → sports, technology, etc.
• Another Example:
• If x >= 90 then grade = A.
• If 80 <= x < 90 then grade = B.
• If 70 <= x < 80 then grade = C.
• If 60 <= x < 70 then grade = D.
• If x < 60 then grade = F.
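The grading rules are a hand-written classifier: a function from a score to a categorical label. A minimal sketch follows, assuming that every score below 60 maps to F so that no score is left without a grade; the function name `grade` is ours.

```python
def grade(x):
    """Map a numeric score to a letter grade using fixed thresholds."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "F"  # assumption: everything below 60 is an F

for score in (95, 85, 75, 65, 55):
    print(score, grade(score))
```

A learned classifier plays the same role, except that the thresholds are derived from training data instead of being written by hand.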
WHAT IS TEXT CLASSIFICATION?
• Text classification is a machine learning technique that assigns a set of predefined categories to open-ended text.
• Text classifiers can be used to organize, structure, and categorize almost any kind of text, including documents, medical studies, files, and content from across the web.
What is meant by text classification?
• Text classification, or text categorization, is the activity of labeling natural language texts with relevant categories from a predefined set.
• In layman's terms, text classification is the process of extracting generic tags from unstructured text.
• These generic tags come from a set of predefined categories.
What is meant by text classification or document classification?
• Document classification, or document categorization, is a problem in library science, information science and computer science.
• The task is to assign a document to one or more classes or categories.
• This may be done manually or algorithmically.
• Source: Wikipedia
What is meant by text classification?
• Text classification, also known as text tagging or text categorization, is the process of categorizing text into organized groups.
• By using Natural Language Processing (NLP), text classifiers can automatically analyze text and then assign a set of predefined tags or categories based on its content.
Text Classification Examples
• Text classification is becoming an increasingly important part of business because it makes it easy to get insights from data and automate business processes.
• Some of the most common examples and use cases for automatic text classification include the following:
a) Sentiment Analysis
b) Topic Detection
c) Language Detection
a) Sentiment Analysis: the process of understanding if a given text is
talking positively or negatively about a given subject
(e.g. for brand monitoring purposes).
b) Topic Detection: the task of identifying the theme or topic of a piece
of text
(e.g. know if a product review is about Ease of Use, Customer Support,
or Pricing when analyzing customer feedback).
c) Language Detection: the procedure of detecting the language of a
given text
(e.g. know if an incoming support ticket is written in English or Spanish for
automatically routing tickets to the appropriate team).
A Characterization of Text Classification
• For example,
• news articles can be organized by topic;
• support tickets can be organized by urgency;
• chat conversations can be organized by language;
• brand mentions can be organized by sentiment; and so on.
• Text classification is one of the fundamental tasks in natural language processing, with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection.
• Here’s an example of how it works:
• “The user interface is quite straightforward and easy to use.”
• A text classifier can take this phrase as input, analyze its content, and then automatically assign relevant tags, such as UI and Easy To Use.
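To make the tagging example concrete, here is a minimal rule-based sketch. It is a stand-in for a trained classifier, not a learned model: the tag vocabulary `TAG_KEYWORDS` and the function `assign_tags` are hypothetical and exist only to show the input/output behaviour described above.

```python
import re

# Hypothetical tag vocabulary: a tag fires if any of its keywords occurs.
TAG_KEYWORDS = {
    "UI": ["user interface", "ui", "layout"],
    "Easy To Use": ["easy to use", "straightforward", "intuitive"],
    "Pricing": ["price", "expensive", "cheap"],
}

def assign_tags(text):
    """Return every tag whose keywords appear as whole words in `text`."""
    lowered = text.lower()
    tags = []
    for tag, keywords in TAG_KEYWORDS.items():
        # \b keeps "ui" from matching inside a word such as "quite".
        if any(re.search(r"\b" + re.escape(k) + r"\b", lowered) for k in keywords):
            tags.append(tag)
    return tags

print(assign_tags("The user interface is quite straightforward and easy to use."))
```

A machine learning classifier replaces the hand-written keyword lists with associations learned from labelled examples.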
• A first tactic for categorizing documents is to assign a label to each document,
• but this solves the problem only when users know the labels of the documents they are looking for.
• This tactic does not solve the more generic problem of finding documents on a specific topic or subject.
• For that case, a better solution is to group documents by common generic topics and label each group with a meaningful name.
• Each labeled group is called a category or class.
• Document classification is the process of categorizing documents under a given cluster or category using a fully supervised learning process.
Why is Text Classification Important?
• It’s estimated that around 80% of all information is unstructured, with text being one of the most common types of unstructured data.
• Because of the messy nature of text, analyzing, understanding, organizing, and sorting through text data is hard and time-consuming, so most companies fail to use it to its full potential.
• This is where text classification with machine learning comes in.
• Using text classifiers, companies can automatically structure all manner of relevant text, from legal documents, social media, chatbots, surveys, and more, in a fast and cost-effective way.
• This allows companies to save time analyzing text data, automate business processes, and make data-driven business decisions.
Reasons Why Text Classification Is Important
a) Scalability
• Manually analyzing and organizing text is slow and much less accurate.
• Machine learning can automatically analyze millions of surveys, comments, emails,
etc., at a fraction of the cost, often in just a few minutes.
• Text classification tools are scalable to any business needs, large or small.
b) Real-time analysis
• There are critical situations that companies need to identify as soon as possible and
take immediate action (e.g., PR crises on social media).
• Machine learning text classification can follow your brand mentions constantly and in
real time, so you'll identify critical information and be able to take action right away.
c) Consistent criteria
• Human annotators make mistakes when classifying text data due to
distractions, fatigue, and boredom, and human subjectivity creates inconsistent
criteria.
• Machine learning, on the other hand, applies the same lens and criteria to all
data and results.
• Once a text classification model is properly trained, it applies its criteria consistently and can reach high accuracy.
A Characterization of Text Classification
• Classification can be performed
1. manually by domain experts, or
2. automatically, using well-known and widely used classification algorithms such as decision trees and Naïve Bayes.
• Documents are classified according to their subjects or according to other attributes (e.g. author, document type, publishing year, etc.).
• There are two main kinds of subject classification of documents:
1. the content-based approach and
2. the request-based approach.
• In content-based classification, the weight given to particular subjects in a document decides the class to which the document is assigned.
• For example, it is a rule in some library classification schemes that at least 15% of the content of a book should be about the class to which the book is assigned.
• In automatic classification, the number of times given words appear in a document determines the class.
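The idea that word counts determine the class can be sketched as follows. The indicator words in `CLASS_WORDS` are invented for illustration, and the scoring rule (sum of indicator-word counts, highest score wins) is a simplifying assumption rather than a method stated in the slides.

```python
import re
from collections import Counter

# Hypothetical indicator words for each class.
CLASS_WORDS = {
    "sports": {"match", "team", "goal", "player"},
    "technology": {"software", "computer", "network", "data"},
}

def classify_by_counts(document):
    """Score each class by how often its indicator words occur in the document."""
    counts = Counter(re.findall(r"[a-z]+", document.lower()))
    scores = {
        label: sum(counts[word] for word in words)
        for label, words in CLASS_WORDS.items()
    }
    best = max(scores, key=scores.get)
    label = best if scores[best] > 0 else None  # None: no indicator word found
    return label, scores

label, scores = classify_by_counts(
    "The team scored a late goal and the team won the match."
)
print(label, scores)
```

Weighted variants of this idea (for example, counting some words more heavily than others) follow the same pattern.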
• In request-oriented classification, the anticipated requests from users influence how documents are classified.
• The classifier asks:
• “Under which description should this entity be found?” and
• “Think of all the possible queries and decide for which ones the entity at hand is relevant.”
Text Classification Applications
• With the help of text classification, businesses can make sense of large
amounts of data using techniques like
• aspect-based sentiment analysis to understand what people are talking about
and how they’re talking about each aspect.
• Text classification can help support teams provide a stellar experience
by
• automating tasks that are better left to computers, saving precious time that
can be spent on more important things.
• Text classification models can help you analyze survey results to discover patterns and insights like:
• What do people like about our product or service?
• What should we improve?
• What do we need to change?
• By combining both quantitative results and qualitative analyses,
• teams can make more informed decisions without having to spend hours
manually analyzing every single open-ended response.
• Text classification has thousands of use cases and is applied to a wide range
of tasks.
• In some cases, data classification tools work behind the scenes to enhance
app features we interact with on a daily basis (like email spam filtering).
• In some other cases, classifiers are used by marketers, product managers,
engineers, and salespeople to automate business processes and save
hundreds of hours of manual data processing.
• Some of the top applications and use cases of text classification include:
1. Detecting urgent issues
2. Automating customer support processes
3. Listening to the Voice of customer (VoC)
A Characterization of Text Classification
• Automatic document classification tasks can be divided into three types:
1. Unsupervised document classification (document clustering): the classification must be done entirely without reference to external information.
2. Semi-supervised document classification: only part of the documents are labeled by an external method.
3. Supervised document classification: some external method (such as human feedback) provides information on the correct classification for documents.
Computational Supervised Learning
• Computational supervised learning, also called classification, aims to:
• learn from past experience, and
• use the learned knowledge to classify new data.
• The knowledge is learned by intelligent algorithms.
• Examples:
• Clinical diagnosis for patients
• Cell type classification
Overall Picture of Supervised Learning
[Figure: application domains (Biomedical, Financial, Government, Scientific); learning methods (Decision trees, Emerging patterns, SVM, Neural networks); Classifiers (M-Doctors)]
Unsupervised Learning
• Unsupervised learning is a machine learning technique in which models are not supervised using a labeled training dataset. Instead, the models themselves find the hidden patterns and insights in the given data. It can be compared to the learning that takes place in the human brain when learning new things. It can be defined as:
• “Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.”
Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data.
The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs.
The algorithm is never trained on labeled examples from the dataset, which means it has no prior idea about the features of the dataset.
The task of the unsupervised learning algorithm is to identify the image features on its own.
• The unsupervised learning algorithm will perform this task by clustering the image dataset into groups according to similarities between images.
• Simply put, no labeled training data is provided.
• Examples:
• neural network models
• independent component analysis
• clustering
Supervised vs. Unsupervised Learning
Classification vs. Clustering
• Supervised learning (classification)
• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
• New data is classified based on the training set
• Unsupervised learning (clustering)
• The class labels of the training data are unknown
• Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
P1WU
UNIT – III: CLASSIFICATION
Topic 2: UNSUPERVISED ALGORITHMS - CLUSTERING
INTRODUCTION TO UNSUPERVISED ALGORITHMS
• Below is a list of some popular unsupervised learning algorithms:
• K-means clustering
• KNN (k-nearest neighbors); note that the k-NN classifier itself is supervised, and only the neighbor search works without labels
• Hierarchical clustering
• Anomaly detection
• Neural networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition
WHAT IS CLUSTERING?
• Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset.
• It can be defined as "A way of grouping the data points into different clusters, consisting of similar data points. The objects with the possible similarities remain in a group that has less or no similarities with another group."
• It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, or behavior, and dividing the data according to the presence or absence of those patterns.
• It is an unsupervised learning method, so no supervision is provided to the algorithm, and it deals with an unlabeled dataset.
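As a small illustration of grouping unlabeled points by similarity, the sketch below runs k-means (one of the clustering algorithms named in this unit) on one-dimensional data. The data points, the starting centroids and the fixed iteration count are assumptions chosen to keep the example short and deterministic.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on 1-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled data with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means(points, centroids=[1.0, 2.0])
print(centroids, clusters)
```

No labels are supplied anywhere; the two groups emerge only from the distances between points.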
Difference between Supervised and Unsupervised Learning
Supervised Learning | Unsupervised Learning
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
Supervised learning model takes direct feedback to check if it is predicting correct output or not. | Unsupervised learning model does not take any feedback.
Supervised learning model predicts the output. | Unsupervised learning model finds the hidden patterns in data.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning can be categorized in Classification and Regression problems. | Unsupervised learning can be classified in Clustering and Associations problems.
Supervised learning can be used for those cases where we know the input as well as corresponding outputs. | Unsupervised learning can be used for those cases where we have only input data and no corresponding output data.
Supervised learning model produces an accurate result. | Unsupervised learning model may give less accurate result as compared to supervised learning.
It includes various algorithms such | It includes various algorithms such
Advantages of Unsupervised Learning
• Unsupervised learning is used for more complex tasks than supervised learning because, in unsupervised learning, we don't have labeled input data.
• Unsupervised learning is often preferable because unlabeled data is easier to obtain than labeled data.
Disadvantages of Unsupervised Learning
• Unsupervised learning is intrinsically more difficult than supervised learning because there is no corresponding output to learn from.
• The result of an unsupervised learning algorithm might be less accurate because the input data is not labeled and the algorithm does not know the exact output in advance.
P1WU
UNIT – III: CLASSIFICATION
Topic 3: NAÏVE TEXT CLASSIFICATION
UNIT III : TEXT CLASSIFICATION AND CLUSTERING
NAÏVE TEXT CLASSIFICATION
INTRODUCTION TO NAÏVE TEXT CLASSIFICATION
• Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem.
• It is not a single algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be independent of each other, given the class.
• Naive Bayes classifiers have been heavily used for text classification and text analysis problems in machine learning.
The Naive Bayes algorithm
• The dataset is divided into two parts, namely,
feature matrix and the response/target vector.
• The feature matrix (X) contains all the vectors (rows) of the dataset, in which each vector consists of the values of the features (also called predictors). The number of features is d, i.e. X = (x1, x2, ..., xd).
• The response/target vector (y) contains the value of the class/group variable for each row of the feature matrix.
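The split into a feature matrix and a target vector can be shown on a tiny data set. The rows below (weather attributes and a play/no-play label) are hypothetical and serve only to show the shapes of X and y.

```python
# Hypothetical data set: (outlook, temperature, humidity, windy, play?)
dataset = [
    ("sunny", "hot", "high", False, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
]

X = [row[:-1] for row in dataset]  # feature matrix: one feature vector per row
y = [row[-1] for row in dataset]   # target vector: one class label per row
d = len(X[0])                      # number of features

print(d, X[0], y[0])
```

Each row of X pairs with the entry of y at the same index, which is the layout most classification libraries expect.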
The Bayes’ Theorem
Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred.
Bayes’ theorem is stated mathematically as follows:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) \neq 0$$
• where:
• A and B are called events.
• P(A | B) is the posterior probability of event A, given that event B is true (has occurred).
• Event B is also termed the evidence.
• P(A) is the prior of A (the prior probability, i.e. the probability of the event before the evidence is seen).
• P(B | A) is the likelihood, i.e. the probability of observing the evidence B given that event A is true.
Dealing with text data
• Text Analysis is a major application field for machine learning
algorithms.
• However, the raw data, a sequence of symbols (i.e. strings), cannot be fed
directly to the algorithms themselves, as most of them expect
numerical feature vectors with a fixed size rather than the raw text
documents with variable length.
• In order to address this, scikit-learn provides utilities for the most
common ways to extract numerical features from text content,
namely:
• tokenizing strings and giving an integer id to each possible token, for
instance by using white-spaces and punctuation as token separators.
• counting the occurrences of tokens in each document.
• In this scheme, features and samples are defined as follows:
• each individual token occurrence frequency is treated as a feature.
• the vector of all the token frequencies for a given document is considered a multivariate
sample.
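The two steps above (tokenizing with integer ids, then counting) can be sketched in plain Python. This is a hand-written illustration, not scikit-learn's own implementation: the function names, the lowercasing step, and the two sample documents are assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase the text and split on anything that is not a word character.
    return re.findall(r"\w+", text.lower())

def build_vocabulary(docs):
    # Give each distinct token an integer id in order of first appearance.
    vocab = {}
    for doc in docs:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab

def vectorize(doc, vocab):
    # One fixed-size vector per document: position i holds the count of token i.
    counts = Counter(tokenize(doc))
    vector = [0] * len(vocab)
    for token, count in counts.items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector

docs = ["The cat sat on the mat.", "The dog sat."]  # hypothetical documents
vocab = build_vocabulary(docs)
matrix = [vectorize(doc, vocab) for doc in docs]
print(vocab)
print(matrix)
```

Every document becomes a vector of the same length (the vocabulary size), which is the fixed-size numerical input that the learning algorithms expect.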
Example 1 : Using the Naive Bayesian Classifier
• We will consider the following training set.
• The data samples are described by attributes age, income, student,
and credit.
• The class label attribute, buy, which tells whether the person buys a
computer, has two distinct values: yes (class C1) and no (class C2).
RID Age Income student Credit Ci: buy
1 Youth High no Fair C2: no
2 Youth High no Excellent C2: no
3 middle-aged High no Fair C1: yes
4 Senior medium no Fair C1: yes
5 Senior Low yes Fair C1: yes
6 Senior Low yes Excellent C2: no
7 middle-aged Low yes Excellent C1: yes
8 Youth medium no Fair C2: no
9 Youth Low yes Fair C1: yes
10 Senior medium yes Fair C1: yes
11 Youth medium yes Excellent C1: yes
12 middle-aged medium no Excellent C1: yes
13 middle-aged High yes Fair C1: yes
14 Senior medium no Excellent C2: no
Example 1 : Using the Naive Bayesian Classifier
• The sample we wish to classify is
• X = (age = youth, income = medium, student = yes, credit = fair)
• We need to maximize P (X|Ci)P (Ci), for i = 1, 2. P (Ci), the a priori
probability of each class, can be estimated based on the training
samples:
• P(buy =yes ) = 9 /14
• P(buy =no ) = 5 /14
• To compute P (X|Ci), for i = 1, 2, we compute the following conditional
probabilities:
• P(age=youth | buy =yes ) = 2/9
• P(income =medium | buy =yes ) = 4/9
• P(student =yes | buy =yes ) = 6/9
• P(credit =fair | buy =yes ) = 6/9
• P(age=youth | buy =no ) = 3/5
• P(income =medium | buy =no ) = 2/5
• P(student =yes | buy =no ) = 1/5
• P(credit =fair | buy =no ) = 2/5
• Using the above probabilities, we obtain
• P(X | buy = yes) = 2/9 × 4/9 × 6/9 × 6/9 = 0.044
• Similarly,
• P(X | buy = no) = 3/5 × 2/5 × 1/5 × 2/5 = 0.019
• To find the class that maximizes P(X|Ci)P(Ci), we compute
• P(X | buy = yes) P(buy = yes) = 0.044 × 9/14 = 0.028
• P(X | buy = no) P(buy = no) = 0.019 × 5/14 = 0.007
• Thus the naive Bayesian classifier predicts buy = yes for sample X.
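The calculation in Example 1 can be checked with a short Python sketch. It is a minimal illustration under stated assumptions: the function names are hypothetical, the probabilities are plain relative frequencies with no smoothing (as in the slides), and the rows are copied from the training table above.

```python
from collections import Counter, defaultdict

# Training set from Example 1: (age, income, student, credit) -> buy
TRAIN = [
    (("youth", "high", "no", "fair"), "no"),
    (("youth", "high", "no", "excellent"), "no"),
    (("middle-aged", "high", "no", "fair"), "yes"),
    (("senior", "medium", "no", "fair"), "yes"),
    (("senior", "low", "yes", "fair"), "yes"),
    (("senior", "low", "yes", "excellent"), "no"),
    (("middle-aged", "low", "yes", "excellent"), "yes"),
    (("youth", "medium", "no", "fair"), "no"),
    (("youth", "low", "yes", "fair"), "yes"),
    (("senior", "medium", "yes", "fair"), "yes"),
    (("youth", "medium", "yes", "excellent"), "yes"),
    (("middle-aged", "medium", "no", "excellent"), "yes"),
    (("middle-aged", "high", "yes", "fair"), "yes"),
    (("senior", "medium", "no", "excellent"), "no"),
]

def train(rows):
    # Count how often each class occurs and how often each attribute value
    # occurs within each class.
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for features, label in rows:
        for i, value in enumerate(features):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def scores(x, class_counts, value_counts):
    # P(X|Ci) P(Ci) for each class, assuming class-conditional independence.
    total = sum(class_counts.values())
    result = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(Ci)
        for i, value in enumerate(x):
            score *= value_counts[(i, label)][value] / count  # P(x_i | Ci)
        result[label] = score
    return result

class_counts, value_counts = train(TRAIN)
x = ("youth", "medium", "yes", "fair")
result = scores(x, class_counts, value_counts)
prediction = max(result, key=result.get)
print(result, prediction)
```

The class with the largest score is the prediction; the scores are not normalized, since dividing both by P(X) would not change which one is larger.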
Example 2: Predicting a class label using naïve Bayesian classification
• The training data set is given below:
• The data tuples are described by the attributes Owns Home?, Married,
Gender and Employed.
• The class label attribute Risk Class has three distinct values.
• Let C1 correspond to the class A, C2 correspond to the class B,
and C3 correspond to the class C.
• The tuple to classify is
• X = (Owns Home = Yes, Married = No, Gender = Female, Employed = Yes)
Owns Home Married Gender Employed Risk Class
Yes Yes Male Yes B
No No Female Yes A
Yes Yes Female Yes C
Yes No Male No B
No Yes Female Yes C
No No Female Yes A
No No Male No B
Yes No Female Yes A
No Yes Female Yes C
Yes Yes Female Yes C
Example 2: Predicting a class label using naïve Bayesian classification
• Solution
• There are 10 samples and three classes.
• Risk class A = 3, Risk class B = 3, Risk class C = 4
• The prior probabilities are obtained by dividing these frequencies by
the total number in the training data:
• P(A) = 3/10 = 0.3, P(B) = 3/10 = 0.3, P(C) = 4/10 = 0.4
• To compute P(X/Ci) = P({yes, no, female, yes}/Ci) for each of the classes, we need the following conditional probabilities:
• P(Owns Home = Yes/A) = 1/3 =0.33
• P(Married = No/A) = 3/3 =1
• P(Gender = Female/A) = 3/3 = 1
• P(Employed = Yes/A) = 3/3 = 1
•
• P(Owns Home = Yes/B) = 2/3 =0.67
• P(Married = No/B) = 2/3 =0.67
• P(Gender = Female/B) = 0/3 = 0
• P(Employed = Yes/B) = 1/3 = 0.33
•
• P(Owns Home = Yes/C) = 2/4 =0.5
• P(Married = No/C) = 0/4 =0
• P(Gender = Female/C) = 4/4 = 1
• P(Employed = Yes/C) = 4/4 = 1
• Using the above probabilities, we obtain
• P(X/A) = P(Owns Home = Yes/A) × P(Married = No/A) × P(Gender = Female/A) × P(Employed = Yes/A)
= 0.33 × 1 × 1 × 1 = 0.33
• Similarly, P(X/B) = 0 and P(X/C) = 0
• To find the class Ci that maximizes P(X/Ci) P(Ci), we compute
• P(X/A) P(A) = 0.33 × 0.3 = 0.099
• P(X/B) P(B) = 0 × 0.3 = 0
• P(X/C) P(C) = 0 × 0.4 = 0
• Therefore X is assigned to class A.
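As a cross-check of Example 2, the sketch below repeats the computation with exact fractions, so no rounding is involved (the exact value for class A is 1/10, which the slide reports as 0.099 after rounding 1/3 to 0.33). The function name is hypothetical, and the rows are copied from the training table above.

```python
from fractions import Fraction

# (Owns Home, Married, Gender, Employed, Risk Class) from Example 2
ROWS = [
    ("Yes", "Yes", "Male", "Yes", "B"),
    ("No", "No", "Female", "Yes", "A"),
    ("Yes", "Yes", "Female", "Yes", "C"),
    ("Yes", "No", "Male", "No", "B"),
    ("No", "Yes", "Female", "Yes", "C"),
    ("No", "No", "Female", "Yes", "A"),
    ("No", "No", "Male", "No", "B"),
    ("Yes", "No", "Female", "Yes", "A"),
    ("No", "Yes", "Female", "Yes", "C"),
    ("Yes", "Yes", "Female", "Yes", "C"),
]
X = ("Yes", "No", "Female", "Yes")  # the tuple to classify

def class_scores(rows, x):
    # P(X/Ci) P(Ci) for each class, using relative frequencies as exact fractions.
    result = {}
    for c in sorted({row[-1] for row in rows}):
        members = [row for row in rows if row[-1] == c]
        score = Fraction(len(members), len(rows))  # prior P(Ci)
        for i, value in enumerate(x):
            matches = sum(1 for row in members if row[i] == value)
            score *= Fraction(matches, len(members))  # P(x_i / Ci)
        result[c] = score
    return result

result = class_scores(ROWS, X)
prediction = max(result, key=result.get)
print(result, prediction)
```

Classes B and C score exactly zero because one attribute value of X never occurs with them in the training data; a single zero factor cancels the whole product.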
Advantages and Disadvantages
• Advantages:
a) In theory, they have the minimum error rate in comparison to all other classifiers.
b) Easy to implement.
c) Good results are obtained in most cases.
d) They provide theoretical justification for other classifiers that do not
explicitly use Bayes’ theorem.
• Disadvantages:
a) Lack of available probability data.
b) Inaccuracies in the class-conditional independence assumption.
UNIT – III: CLASSIFICATION
Topic 4: SUPERVISED ALGORITHMS
SUPERVISED LEARNING
INTRODUCTION TO SUPERVISED LEARNING
• Supervised learning, also
known as supervised machine
learning, is a subcategory of
machine learning and artificial
intelligence.
• It is defined by its use of
labeled datasets to train
algorithms to classify data
or predict outcomes accurately.
Supervised Learning
• It is defined by its use of labeled datasets to
• train algorithms to classify data or predict outcomes accurately.
• As input data is fed into the model, it adjusts its weights until the
model has been fitted appropriately,
• which occurs as part of the cross-validation process.
• Supervised learning helps organizations solve
• a variety of real-world problems at scale, such as classifying spam into a
separate folder from your inbox.
What are the types of supervised learning?
• There are two types of supervised learning
techniques:
1. Regression and
2. Classification.
• Classification separates the data into classes; regression fits a function to the
data.
Example of Supervised Learning
• A great example of supervised learning is text classification
problems.
• In this set of problems, the goal is to
• predict the class label of a given piece of text.
• One particularly popular topic in text classification is to
• predict the sentiment of a piece of text, like a tweet or a product
review.
SUPERVISED ALGORITHMS
INTRODUCTION TO SUPERVISED ALGORITHMS
• What is a supervised algorithm?
• A supervised learning algorithm takes
• a known set of input data (the learning set) and known responses to the data
(the output), and forms a model to generate reasonable predictions for the
response to the new input data.
• Use supervised learning if you have existing data for the output
you are trying to predict.
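The idea of learning from known inputs and known responses, then predicting the response for a new input, can be sketched as follows. This is only an illustration: the learner is a 1-nearest-neighbour rule chosen for brevity, and the function names, feature values, and labels are all hypothetical.

```python
def fit(inputs, responses):
    # The "model" of a 1-nearest-neighbour learner is simply the stored learning set.
    return list(zip(inputs, responses))

def predict(model, x):
    # Return the known response of the stored input closest to x
    # (squared Euclidean distance).
    def distance(pair):
        return sum((a - b) ** 2 for a, b in zip(pair[0], x))
    return min(model, key=distance)[1]

# Hypothetical learning set: two numeric features per message, with known labels.
inputs = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (4.1, 3.9)]
responses = ["ham", "ham", "spam", "spam"]

model = fit(inputs, responses)
prediction = predict(model, (3.8, 4.0))
print(prediction)
```

The same fit-then-predict pattern applies to the other supervised algorithms; only the form of the learned model changes.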
INTRODUCTION TO SUPERVISED ALGORITHMS
• Various algorithms and computation techniques are used in
supervised machine learning processes.
• The most commonly used learning methods, typically implemented
with programs like R or Python, are:
1) Neural networks
• Primarily leveraged for deep learning algorithms, neural networks process
training data by mimicking the interconnectivity of the human brain through
layers of nodes.
2) Naive Bayes
• Naive Bayes is a classification approach that adopts the principle of class-conditional
independence together with Bayes’ Theorem.
3) Linear regression
• Linear regression is used to identify the relationship between a dependent variable and
one or more independent variables and is typically leveraged to make predictions about
future outcomes.
4) Logistic regression
• While linear regression is leveraged when the dependent variable is continuous, logistic
regression is selected when the dependent variable is categorical, for example when it has
binary outputs such as "true" and "false" or "yes" and "no."
5) Support vector machine (SVM)
• A support vector machine is a popular supervised learning model developed by
Vladimir Vapnik, used for both data classification and regression.
6) K-nearest neighbor
• K-nearest neighbor, also known as the KNN algorithm, is a non-parametric
algorithm that classifies data points based on their proximity and association to
other available data.
7) Random forest
• Random forest is another flexible supervised machine learning algorithm used
for both classification and regression purposes.
• The "forest" references a collection of uncorrelated decision trees, which are
then merged together to reduce variance and create more accurate data
predictions.
UNIT – III: CLASSIFICATION
Topic 5: DECISION TREES
DECISION TREES
INTRODUCTION TO DECISION TREES
• What is a decision tree?
• A decision tree is a structure that includes a root node,
branches, and leaf nodes.
a) Each internal node denotes a test on an attribute,
b) each branch denotes the outcome of a test, and
c) each leaf node holds a class label.
• The topmost node in the tree is the root node.
• ID3, C4.5, and CART adopt a greedy (i.e., non-backtracking)
approach in which decision trees are constructed in a top-down
recursive divide-and-conquer manner.
• Most algorithms for decision tree induction also follow such a
top-down approach, which starts with a training set of
tuples and their associated class labels.
• The training set is recursively partitioned into smaller
subsets as the tree is being built.
• Decision tree induction is the learning of decision trees from
class-labeled training tuples.
• A decision tree is a flowchart-like tree structure, where each
internal node (nonleaf node) denotes a test on an attribute,
• each branch represents an outcome of the test, and
• each leaf node (or terminal node) holds a class label.
• The topmost node in a tree is the root node.
• A decision tree is a tree where
• internal node = a test on an attribute
• tree branch = an outcome of the test
• leaf node = class label or class distribution
Benefits of Decision Trees
• The benefits of having a decision tree are as follows:
a) It does not require any domain knowledge.
b) It is easy to comprehend.
c) The learning and classification steps of a decision tree are
simple and fast.
Brief History of Decision Trees
• CLS (Hunt et al., 1966): cost driven
• ID3 (Quinlan, 1986, MLJ): information driven
• C4.5 (Quinlan, 1993): gain ratio + pruning ideas
• CART (Breiman et al., 1984): Gini index
Elegance of Decision Trees
Structure of Decision Trees
• Example rule: if x1 > a1 and x2 > a2, then the class is A
• C4.5 and CART are two of the most widely used algorithms
• Easy to interpret, but accuracy is generally unattractive
[Figure: decision tree over attributes x1, x2, x3, and x4, showing the root node, internal nodes, and leaf nodes labeled A or B; branch labels include > a1 and > a2.]
Example of Decision Tree
Another Example of Decision Tree
Decision Tree Classification Tasks
Apply Model to Test Data
[Figure: decision tree with test nodes Refund (Yes / No), MarSt (Married / Single, Divorced), and TaxInc (< 80K / > 80K), and leaves labeled YES or NO.]
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Assign Cheat to “No”
Decision Tree Classification Task
Training Set
Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No
8 No Small 85K Yes
9 No Medium 75K No
10 No Small 90K Yes
Test Set
Tid Attrib1 Attrib2 Attrib3 Class
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ?
[Figure: a tree induction algorithm learns a model (a decision tree) from the training set (induction); the model is then applied to the test set (deduction).]
Constructing a Decision Tree
• Two phases of decision tree generation:
1. tree construction
• at the start, all the training examples are at the root
• partition examples based on selected attributes
• test attributes are selected based on a heuristic or a statistical measure
2. tree pruning
• identify and remove branches that reflect noise or outliers
• Basic step:
Determination of the root node of the tree and
the root node of its sub-trees
• Most Discriminatory Feature
• Every feature can be used to partition the training data
• If the partitions contain a pure class of training instances, then this feature is
most discriminatory
Constructing a Decision Tree:- Example of Partitions
• Categorical feature
• Number of partitions of the training data is equal to the number of values of
this feature
• Numerical feature
• Two partitions
Decision Tree Induction Algorithm
• J. Ross Quinlan, a machine learning researcher, developed in 1980 a
decision tree algorithm known as ID3 (Iterative Dichotomiser).
• Later, he presented C4.5, the successor of ID3. ID3 and C4.5
adopt a greedy approach.
• In these algorithms, there is no backtracking; the trees are constructed
in a top-down recursive divide-and-conquer manner.
• Generating a decision tree from the training tuples of data partition D:
Algorithm : Generate_decision_tree
• Input:
• Data partition, D, which is a set of training tuples and their
associated class labels.
• attribute_list, the set of candidate attributes.
• attribute_selection_method, a procedure to determine the splitting
criterion that best partitions the data tuples into individual
classes. This criterion includes a splitting_attribute and either a
split point or a splitting subset.
• Output: A decision tree
• Method
1) create a node N;
2) if tuples in D are all of the same class, C, then
3)    return N as a leaf node labeled with the class C;
4) if attribute_list is empty then
5)    return N as a leaf node labeled with the majority class in D; // majority voting
6) apply attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
7) label node N with splitting_criterion;
8) if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
9)    attribute_list = attribute_list - splitting_attribute; // remove splitting_attribute
10) for each outcome j of splitting_criterion
11)    // partition the tuples and grow subtrees for each partition
12)    let Dj be the set of data tuples in D satisfying outcome j; // a partition
13)    if Dj is empty then
14)       attach a leaf labeled with the majority class in D to node N;
15)    else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
    end for
16) return N;
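The method above can be sketched in Python. This is a minimal illustration under several assumptions, not the slides' own implementation: information gain (as in ID3) stands in for attribute_selection_method, only discrete-valued attributes with multiway splits are handled, the tree is represented as nested tuples and dicts, and the training rows are the 14-row buy-computer set from Example 1 of the Naive Bayes topic. All function names are hypothetical.

```python
from collections import Counter
from math import log2

ATTRS = ["age", "income", "student", "credit"]
ROWS = [  # (age, income, student, credit, buy)
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle-aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle-aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle-aged", "medium", "no", "excellent", "yes"),
    ("middle-aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]

def entropy(rows):
    n = len(rows)
    counts = Counter(row[-1] for row in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def partition(rows, idx):
    # Group the rows by their value of attribute idx.
    parts = {}
    for row in rows:
        parts.setdefault(row[idx], []).append(row)
    return parts

def info_gain(rows, idx):
    parts = partition(rows, idx).values()
    return entropy(rows) - sum(len(p) / len(rows) * entropy(p) for p in parts)

def majority_class(rows):
    return Counter(row[-1] for row in rows).most_common(1)[0][0]

def generate_decision_tree(rows, attr_idxs):
    classes = {row[-1] for row in rows}
    if len(classes) == 1:            # steps 2-3: all tuples in one class
        return classes.pop()
    if not attr_idxs:                # steps 4-5: no attributes left, majority voting
        return majority_class(rows)
    best = max(attr_idxs, key=lambda i: info_gain(rows, i))  # step 6
    remaining = [i for i in attr_idxs if i != best]          # step 9
    # Steps 10-15. Partitions are built from observed values only, so the
    # "Dj is empty" case of step 13 cannot arise in this sketch.
    branches = {value: generate_decision_tree(subset, remaining)
                for value, subset in partition(rows, best).items()}
    return (ATTRS[best], branches)

tree = generate_decision_tree(ROWS, [0, 1, 2, 3])
print(tree)
```

A leaf is a plain class label and an internal node is a pair (attribute name, branches), so the returned structure can be read directly as a tree.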
Constructing Decision Tree Example :- Weather Forecasting
Constructing Decision Tree :- A Simple Dataset
The dataset contains 9 Play samples and 5 Don’t samples, a total of 14.
Instance # Outlook Temp Humidity Windy Class
1 Sunny 75 70 true Play
2 Sunny 80 90 true Don’t
3 Sunny 85 85 false Don’t
4 Sunny 72 95 true Don’t
5 Sunny 69 70 false Play
6 Overcast 72 90 true Play
7 Overcast 83 78 false Play
8 Overcast 64 65 true Play
9 Overcast 81 75 false Play
10 Rain 71 80 true Don’t
11 Rain 65 70 true Don’t
12 Rain 75 80 false Play
13 Rain 68 80 false Play
14 Rain 70 96 false Play
[Figure: decision tree for the simple dataset. The root node tests outlook: sunny leads to a test on humidity (<= 75: Play, 2 instances; > 75: Don’t, 3 instances), overcast leads to Play (4 instances), and rain leads to a test on windy (false: Play, 3 instances; true: Don’t, 2 instances).]
Partitioning the 14 training instances by Outlook:
• Outlook = sunny: instances 1, 2, 3, 4, 5 (P, D, D, D, P)
• Outlook = overcast: instances 6, 7, 8, 9 (P, P, P, P)
• Outlook = rain: instances 10, 11, 12, 13, 14 (D, D, P, P, P)
Partitioning the 14 training instances by Temperature:
• Temperature <= 70: instances 5, 8, 11, 13, 14 (P, P, D, P, P)
• Temperature > 70: instances 1, 2, 3, 4, 6, 7, 9, 10, 12 (P, D, D, D, P, P, P, D, P)
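One way to compare the two candidate partitions (Outlook, a categorical feature with three partitions, and Temperature with the threshold 70, a numerical feature with two partitions) is to measure how pure the resulting subsets are. The sketch below uses entropy-based information gain for this; that choice of measure and the function names are assumptions made for illustration, and the rows are taken from the weather dataset above.

```python
from collections import Counter
from math import log2

# (outlook, temperature, class) for instances 1..14 of the weather dataset
WEATHER = [
    ("sunny", 75, "Play"), ("sunny", 80, "Don't"), ("sunny", 85, "Don't"),
    ("sunny", 72, "Don't"), ("sunny", 69, "Play"),
    ("overcast", 72, "Play"), ("overcast", 83, "Play"),
    ("overcast", 64, "Play"), ("overcast", 81, "Play"),
    ("rain", 71, "Don't"), ("rain", 65, "Don't"), ("rain", 75, "Play"),
    ("rain", 68, "Play"), ("rain", 70, "Play"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, key):
    # Entropy of all class labels minus the weighted entropy of the
    # partitions obtained by grouping the rows on key(row).
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder

gain_outlook = information_gain(WEATHER, lambda row: row[0])       # three partitions
gain_temp = information_gain(WEATHER, lambda row: row[1] <= 70)    # two partitions
print(round(gain_outlook, 3), round(gain_temp, 3))
```

Under this measure Outlook separates the classes better than the Temperature threshold, which is consistent with Outlook being tested at the root of the tree for this dataset.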
Constructing Decision Tree Example :-
Decision on Buying a Computer / customer likely to
purchase a computer
The following decision tree is for the concept buys_computer; it indicates
whether a customer at a company is likely to buy a computer.
Each internal node represents a test on an attribute.
Each leaf node represents a class.
Constructing Decision Tree :- Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
This follows an example of Quinlan’s ID3 (Playing Tennis).
Constructing Decision Tree :- Output: A Decision Tree for “buys_computer”
[Figure: decision tree for buys_computer. The root node tests age?: <=30 leads to a test on student? (no: no, yes: yes), 31..40 leads to the leaf yes, and >40 leads to a test on credit rating? (excellent: no, fair: yes).]
From the training dataset, calculate the entropy value for each attribute, which indicates that the splitting attribute is age.
A Decision Tree for “buys_computer”
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
A Decision Tree for “buys_computer”
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
From the training data set , age= youth has 2 classes based on student attribute
Based on majority voting on the student attribute, RID=3 is grouped under the yes group.
From the training data set, the age = senior (>40) branch splits into 2 classes based on the credit rating attribute.
Final Decision Tree
Classification by Decision Tree
• A typical decision tree represents the concept buys_computer,
that is, it predicts whether a customer at
AllElectronics is likely to purchase a computer.
• Internal nodes are denoted by rectangles, and leaf nodes are
denoted by ovals.
• Some decision tree algorithms produce only binary trees
(where each internal node branches to exactly two other
nodes), whereas others can produce non binary trees.
• “How are decision trees used for classification?”
• Given a tuple, X, for which the associated class label is
unknown, the attribute values of the tuple are tested against
the decision tree.
• A path is traced from the root to a leaf node, which holds the
class prediction for that tuple.
• Decision trees can easily be converted to classification rules.
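Tracing a tuple from the root to a leaf can be sketched as follows. This is a minimal illustration, assuming the buys_computer tree is stored as nested dictionaries; the names TREE and classify are hypothetical, and the leaf labels follow the training data.

```python
# Internal nodes are dicts holding the tested attribute and one branch per value;
# leaves are plain class labels ("yes" / "no").
TREE = {
    "attribute": "age",
    "branches": {
        "<=30": {"attribute": "student", "branches": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40": {
            "attribute": "credit_rating",
            "branches": {"excellent": "no", "fair": "yes"},
        },
    },
}


def classify(tree, x):
    """Follow the branch matching each tested attribute of x until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][x[node["attribute"]]]
    return node


X = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(classify(TREE, X))
```

Only the attributes on the traced path are inspected; income never influences the prediction in this tree.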
Why are decision tree classifiers so popular?
• The construction of decision tree classifiers does not require any domain
knowledge or parameter setting, and therefore is appropriate for
exploratory knowledge discovery.
• Decision trees can handle high dimensional data.
• Their representation of acquired knowledge in tree form is intuitive and
generally easy to assimilate by humans.
• The learning and classification steps of decision tree induction are simple
and fast.
• In general, decision tree classifiers have good accuracy.
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
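Rule extraction is a walk over every root-to-leaf path. The sketch below assumes the tree is held as nested dictionaries (a representation chosen here, not given in the slides) with leaf labels taken from the training data; extract_rules is a hypothetical helper name.

```python
# Same buys_computer tree as nested dicts: internal nodes test an attribute,
# leaves are class labels.
TREE = {
    "attribute": "age",
    "branches": {
        "<=30": {"attribute": "student", "branches": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40": {
            "attribute": "credit_rating",
            "branches": {"excellent": "no", "fair": "yes"},
        },
    },
}


def extract_rules(node, conditions=()):
    """Return one IF-THEN rule per path from the root to a leaf."""
    if not isinstance(node, dict):  # leaf: emit the rule for this path
        lhs = " AND ".join(f'{attr} = "{value}"' for attr, value in conditions)
        return [f'IF {lhs} THEN buys_computer = "{node}"']
    rules = []
    for value, child in node["branches"].items():
        rules += extract_rules(child, conditions + ((node["attribute"], value),))
    return rules


rules = extract_rules(TREE)
for rule in rules:
    print(rule)
```

Each attribute-value pair on a path becomes one conjunct, so a tree with five leaves yields five rules.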
Training Set and Its AVC Sets
Training Examples: the 14 tuples of the buys_computer training dataset given earlier.

AVC-set on Age
Age        Buy_Computer
           yes   no
<=30       2     3
31..40     4     0
>40        3     2

AVC-set on income
income     Buy_Computer
           yes   no
high       2     2
medium     4     2
low        3     1

AVC-set on Student
student    Buy_Computer
           yes   no
yes        6     1
no         3     4

AVC-set on credit_rating
credit_rating   Buy_Computer
                yes   no
fair            6     2
excellent       3     3
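An AVC-set (Attribute-Value, Class label) is simply a table of class counts per attribute value. The following sketch recomputes the four AVC-sets from the 14 training tuples; ROWS, COLUMNS and avc_set are names assumed for this illustration.

```python
from collections import defaultdict

COLUMNS = ("age", "income", "student", "credit_rating")

# The 14 training tuples; the last field is the class label buys_computer.
ROWS = [line.split() for line in """\
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
""".splitlines()]


def avc_set(rows, col):
    """Count, for every value of the attribute in column col, how many tuples fall in each class."""
    counts = defaultdict(lambda: {"yes": 0, "no": 0})
    for row in rows:
        counts[row[col]][row[-1]] += 1
    return dict(counts)


avc_sets = {name: avc_set(ROWS, i) for i, name in enumerate(COLUMNS)}
for name, table in avc_sets.items():
    print(name, table)
```

These counts are all that an attribute-selection measure such as information gain needs, so the raw tuples do not have to be rescanned for each candidate split.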
Any Questions?
UNIT – III: CLASSIFICATION
Topic 6: K-NN CLASSIFIER
K-NN CLASSIFIER
• K-Nearest Neighbour is one of the simplest Machine Learning
algorithms, based on the Supervised Learning technique.
• The K-NN algorithm assumes that similar cases belong to the same
category, and puts a new case into the category whose available cases
it most resembles.
• The K-NN algorithm stores all the available data and classifies a new data
point based on similarity.
• This means that when new data appears, it can be easily classified into a
well-suited category by using the K-NN algorithm.
• KNN (K Nearest Neighbors) is a supervised ML classification algorithm.
• It is one of the simplest and most widely used classification algorithms:
a new data point is classified based on its similarity to a
specific group of neighboring data points.
• Despite its simplicity, it often gives competitive results.
K-NN CLASSIFIER EXAMPLE
• Example: Suppose we have an image of a creature that looks similar
to both a cat and a dog,
• but we want to know whether it is a cat or a dog. For this identification, we can
use the KNN algorithm, as it works on a similarity measure.
• Our KNN model will compare the features of the new image with those of
the cat and dog images and, based on the most similar features,
put it in either the cat or the dog category.
K Nearest Neighbor Classification
INTRODUCTION TO K-NN CLASSIFIER
• K nearest neighbors is a simple algorithm that stores all available cases
and classifies new cases based on a similarity measure (e.g.,
distance functions).
• K represents the number of nearest neighbors.
• It classifies an unknown example with the most common class
among its k closest examples.
• KNN is based on the idea:
“tell me who your neighbors are, and I’ll tell you who you are”
INTRODUCTION TO K-NN CLASSIFIER :- Example
If K = 5, then in this case the query instance xq will be classified
as negative, since three of its five nearest neighbors are classified
as negative.
Different Schemes of KNN
• 1-Nearest Neighbor
• K-Nearest Neighbor using a majority voting scheme
• K-NN using a weighted-sum voting Scheme
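The three schemes differ only in how the labels of the nearest neighbors are combined. The sketch below is illustrative: the function name vote, the toy neighbor list, and the use of inverse distance as the weight are assumptions made here, since the slides do not state a specific weighting formula.

```python
from collections import Counter, defaultdict


def vote(neighbors, scheme="majority"):
    """neighbors: list of (distance, label) pairs sorted nearest first."""
    if scheme == "1nn":
        # 1-Nearest Neighbor: copy the label of the single closest example.
        return neighbors[0][1]
    if scheme == "majority":
        # Majority voting: every neighbor counts equally.
        return Counter(label for _, label in neighbors).most_common(1)[0][0]
    if scheme == "weighted":
        # Weighted-sum voting: closer neighbors contribute more.
        # Assumed weight: inverse distance (small constant avoids division by zero).
        scores = defaultdict(float)
        for distance, label in neighbors:
            scores[label] += 1.0 / (distance + 1e-9)
        return max(scores, key=scores.get)
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical neighbors of a query point: one very close "A", two farther "B"s.
neighbors = [(0.5, "A"), (2.0, "B"), (2.5, "B")]
for scheme in ("1nn", "majority", "weighted"):
    print(scheme, vote(neighbors, scheme))
```

With these neighbors the majority vote picks B, while the weighted vote follows the much closer A, which shows how the choice of scheme can change the prediction.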
kNN: How to Choose k?
• In theory, if an infinite number of samples is available, the larger k is, the
better the classification.
• The limitation is that all k neighbors have to be close to the query point.
• This is possible when an infinite number of samples is available.
• It is impossible in practice, since the number of samples is finite.
• k = 1 is often used for efficiency, but it is sensitive to “noise”.
• Larger k gives smoother boundaries, which is better for generalization, but only
if locality is preserved. Locality is not preserved if we end up looking at
samples that are too far away and not from the same class.
• Interesting theoretical properties hold if k < sqrt(n), where n is the number of examples.
Find a heuristically optimal number k of nearest
neighbors, based on RMSE (root-mean-square error).
This is done using cross-validation.
Cross-validation is a way to retrospectively determine a good K value by using an independent
dataset to validate the K value. Historically, the optimal K for most datasets has been between 3 and 10.
That produces much better results than 1NN.
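A rough sketch of choosing k by cross-validation is given below. It uses leave-one-out validation and the misclassification rate instead of RMSE, because the example is a classification task; the 1-D toy data and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical 1-D toy data: (feature, label). The point (2.6, "B") is noise among the A's.
DATA = [(1.0, "A"), (1.5, "A"), (2.0, "A"), (2.6, "B"), (3.0, "A"),
        (6.0, "B"), (6.5, "B"), (7.0, "B"), (7.5, "B"), (8.0, "B")]


def knn_predict(train, x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_error(data, k):
    """Leave-one-out error: hold out each point in turn and predict it from the rest."""
    errors = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        errors += knn_predict(train, x, k) != y
    return errors / len(data)


errors = {k: loo_error(DATA, k) for k in (1, 3, 5)}
best_k = min(errors, key=errors.get)  # on ties, the smallest candidate k wins
print(errors, best_k)
```

On this toy data 1NN is hurt twice by the noisy point, while k = 3 and k = 5 misclassify only the noisy point itself.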
Distance Measure in KNN
• Three distance measures are valid for continuous variables.
• It should also be noted that all three distance measures are only valid for continuous variables. In the instance of categorical variables, the Hamming distance must be used.
• This also brings up the issue of standardizing the numerical variables between 0 and 1 when there is a
mixture of numerical and categorical variables in the dataset.
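The distance functions can be sketched as follows. The slide's own figure is not reproduced in this transcript, so the choice of Euclidean, Manhattan and Minkowski as the three continuous measures is an assumption based on common practice; Hamming distance is the one named in the text for categorical variables.

```python
from math import sqrt


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, q):
    """Generalization: q = 1 gives Manhattan, q = 2 gives Euclidean."""
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1 / q)


def hamming(a, b):
    """Number of positions at which two categorical tuples differ."""
    return sum(x != y for x, y in zip(a, b))


print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
print(minkowski((0, 0), (3, 4), 2))
print(hamming(("fair", "yes"), ("fair", "no")))
```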
Simple KNN - Algorithm:
• For each training example, add the example to the list training_examples.
• Given a query instance xq to be classified,
• let x1, x2, …, xk denote the k instances from training_examples that are nearest to xq.
• Return the class that is most common among these k instances.
• Steps:
1. Determine the parameter k = number of nearest neighbors.
2. Calculate the distance between the query instance and all the training samples.
3. Sort the distances and determine the nearest neighbors as the samples with the k smallest distances.
4. Gather the categories of the nearest neighbors.
5. Use the simple majority of the categories of the nearest neighbors as the prediction value of the query
instance.
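The five steps map almost line by line onto code. This is a minimal sketch assuming Euclidean distance and numeric feature tuples; the toy training_examples and the name knn_classify are made up for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def knn_classify(training_examples, xq, k):
    # Step 2: distance between the query instance and every training sample.
    distances = [(dist(x, xq), label) for x, label in training_examples]
    # Step 3: sort by distance and keep the k smallest.
    distances.sort(key=lambda pair: pair[0])
    # Step 4: gather the categories of the nearest neighbors.
    nearest_labels = [label for _, label in distances[:k]]
    # Step 5: simple majority vote.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D training set with two classes.
training_examples = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((5.0, 5.0), "B"), ((6.0, 5.5), "B"),
]

# Step 1: choose k, then classify two query instances.
print(knn_classify(training_examples, (1.2, 1.4), k=3))
print(knn_classify(training_examples, (5.5, 5.0), k=3))
```

There is no training step here: every call scans the whole training list, which is why the test stage of KNN is the expensive one.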
• The K-NN algorithm can be used for Regression as well as for Classification, but
it is mostly used for Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any
assumption about the underlying data.
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead it stores the dataset and performs the
computation on it at classification time.
• At the training phase the KNN algorithm just stores the dataset; when it gets new
data, it classifies that data into the category that is most similar to the new
data.
Simple KNN – Algorithm Example
• Example:
• Consider the following data concerning credit default. Age and Loan are two numerical variables
(predictors) and Default is the target.
• Given Training Data set :
• Data to Classify:
• Classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance.
• Step 1: Determine parameter k
• K=3
• Step 2: Calculate the distance from the unknown case to each training case, for example:
• D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 >> Default=Y
• Step 3: Sort the distances (refer to the above diagram) and mark the ranks up to the kth, i.e. 1 to 3.
• Step 4: Gather the category of the nearest neighbors
Age Loan Default Distance
33 $150000 Y 8000
35 $120000 N 22000
60 $100000 Y 42000
With K=3, there are two Default=Y and one Default=N out of three closest neighbors.
The prediction for the unknown case is Default=Y.
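The arithmetic of this example can be checked with a short script. Only the three neighbors listed in the table are used, because the full training table is not reproduced in this transcript; the variable names are illustrative.

```python
from collections import Counter
from math import sqrt

query = (48, 142_000)  # Age, Loan of the unknown case

# The three closest training cases from the table: (Age, Loan, Default).
rows = [(33, 150_000, "Y"), (35, 120_000, "N"), (60, 100_000, "Y")]

# Euclidean distance from the query to each case, sorted nearest first.
ranked = sorted(
    (sqrt((query[0] - age) ** 2 + (query[1] - loan) ** 2), default)
    for age, loan, default in rows
)

k = 3
prediction = Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

for distance, label in ranked:
    print(f"{distance:.2f} {label}")
print("Default =", prediction)
```

The age differences barely change the distances, which are dominated by the loan amounts; this is the scaling problem that feature normalization addresses.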
Standardized Distance ( Feature Normalization)
• One major drawback in calculating distance measures directly from the training set is in
the case where variables have different measurement scales or there is a mixture of
numerical and categorical variables.
• For example, if one variable is based on annual income in dollars, and the other is based
on age in years then income will have a much higher influence on the distance calculated.
One solution is to standardize the training set as shown below.
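• For example, for loan X = $40,000, with minimum loan $20,000 and maximum loan $220,000:
• Xs = (X - min) / (max - min) = (40000 - 20000) / (220000 - 20000) = 0.10
In the same way, calculate the standardized values for the age and loan attributes, then
apply the KNN algorithm.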
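Min-max standardization is a one-line formula. The sketch below uses the loan figures from the example; the helper names and the three-value loan column are illustrative assumptions.

```python
def min_max_scale(x, lo, hi):
    """Map x from the range [lo, hi] onto [0, 1]."""
    return (x - lo) / (hi - lo)


def scale_column(values):
    """Standardize a whole column using its own minimum and maximum."""
    lo, hi = min(values), max(values)
    return [min_max_scale(v, lo, hi) for v in values]


# Loan of $40,000 with minimum $20,000 and maximum $220,000.
xs = min_max_scale(40_000, 20_000, 220_000)
print(xs)

# Hypothetical loan column containing the minimum, the example value and the maximum.
print(scale_column([20_000, 40_000, 220_000]))
```

After scaling, age and loan both lie between 0 and 1, so neither attribute dominates the Euclidean distance purely because of its unit.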
Simple KNN – Algorithm
• Advantages
• Can be applied to data from any distribution
• for example, data does not have to be separable with a linear boundary
• Very simple and intuitive
• Good classification if the number of samples is large enough
• Disadvantages
• Choosing k may be tricky
• Test stage is computationally expensive
• No training stage, all the work is done during the test stage
• This is actually the opposite of what we want. Usually we can afford a training step that takes a long time, but we
want a fast test step
• Need large number of samples for accuracy
How does K-NN work?
• The working of K-NN can be explained on the basis of the below algorithm:
• Step-1: Select the number K of neighbors.
• Step-2: Calculate the Euclidean distance from the new data point to each training data point.
• Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
• Step-4: Among these K neighbors, count the number of data points in
each category.
• Step-5: Assign the new data point to the category for which the number
of neighbors is maximum.
• Step-6: Our model is ready.
• Suppose we have a new data point and we need to put it in
the required category. Consider the below image:
• Firstly, we will choose the number of neighbors; here we choose k=5.
• Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the
distance between two points, which we have already studied in geometry. For points (x1, y1) and (x2, y2) it can be calculated as: d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
• By calculating the Euclidean distance we get the nearest neighbors:
three nearest neighbors in category A and two nearest neighbors in
category B. Since the majority are in category A, the new data point is assigned to category A. Consider the below image:
Why do we need a K-NN Algorithm?
•Suppose there are two categories, i.e., Category A
and Category B, and we have a new data point x1.
In which of these categories does this data point lie?
•To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily
identify the category or class of a particular data point.
•Consider the below diagram:
Any Questions?
UNIT – III: CLASSIFICATION
Topic 7: SVM CLASSIFIER
SUPPORT VECTOR MACHINE (SVM)
INTRODUCTION TO SVM
• A new classification method for both linear and nonlinear data
• It uses a nonlinear mapping to transform the original training data into a
higher dimension
• Within the new dimension, it searches for the linear optimal separating
hyperplane (i.e., “decision boundary”)
• With an appropriate nonlinear mapping to a sufficiently high dimension, data
from two classes can always be separated by a hyperplane
• SVM finds this hyperplane using support vectors (“essential” training tuples)
and margins (defined by the support vectors)
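The effect of a nonlinear mapping can be seen on a tiny 1-D example. Everything in this sketch is an assumption made for illustration: the toy points, the mapping phi(x) = (x, x^2), and the hand-picked hyperplane. A real SVM would learn the hyperplane and would usually apply the mapping implicitly through a kernel.

```python
# Hypothetical 1-D data: class +1 lies near the origin, class -1 on both sides of it,
# so no single threshold on x separates the two classes.
points = [(-3.0, -1), (-2.5, -1), (-0.5, +1), (0.0, +1), (0.8, +1), (2.2, -1), (3.1, -1)]


def phi(x):
    """Assumed nonlinear mapping from 1-D to 2-D: x -> (x, x^2)."""
    return (x, x * x)


# In the mapped space the line w . z + b = 0 with w = (0, -1) and b = 2.5
# (that is, x^2 = 2.5) separates the classes. Values chosen by hand for this data.
w, b = (0.0, -1.0), 2.5


def predict(x):
    z = phi(x)
    score = w[0] * z[0] + w[1] * z[1] + b
    return 1 if score >= 0 else -1


for x, y in points:
    print(x, y, predict(x))
```

The data become linearly separable only after the extra dimension x^2 is added, which is the idea behind transforming the training data into a higher dimension.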
• A support vector machine (SVM) is a supervised machine learning
model used for classification.
• It is mostly preferred for classification but is sometimes very useful for
regression as well.
• Basically, SVM finds a hyper-plane that creates a boundary between the
types of data.
• In 2-dimensional space, this hyper-plane is nothing but a line.
SVM—History and Applications
• Vapnik and colleagues (1992)—groundwork from Vapnik & Chervonenkis’
statistical learning theory in 1960s
• Features: training can be slow but accuracy is high owing to their ability to model
complex nonlinear decision boundaries (margin maximization)
• Used both for classification and prediction
• Applications:
• handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests
SVM—General Philosophy
SVM—Margins and Support Vectors
• In SVM, we plot each data item in the dataset in an N-
dimensional space, where N is the number of features/attributes
in the data.
• Next, find the optimal hyperplane to separate the data.
• It follows that, inherently, SVM can
only perform binary classification (i.e., choose between two
classes).
• However, there are various techniques to use for multi-class problems.
Support Vector Machine for Multi- class Problems
• To perform SVM on multi-class problems, we can create a binary classifier for
each class of the data.
• The two results of each classifier will be:
• The data point belongs to that class OR
• The data point does not belong to that class.
• For example, to perform multi-class classification on a set of fruit classes, we can
create a binary classifier for each fruit.
• For, say, the ‘mango’ class,
there will be a binary classifier to predict if it IS a mango OR it is NOT a mango.
• The class whose classifier gives the highest score is chosen as the output of the SVM.
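This one-classifier-per-class idea (often called one-vs-rest) can be sketched with plain linear scorers. The weights, biases and the two features below are hypothetical values chosen for illustration, not trained SVM parameters.

```python
# One hypothetical linear scorer (weights, bias) per fruit over two made-up features,
# e.g. [length_cm, weight_g / 100]. A positive score means "IS this fruit".
CLASSIFIERS = {
    "mango": ((0.5, 1.0), -6.0),
    "banana": ((1.0, -1.0), -10.0),
    "grape": ((-1.0, -1.0), 4.0),
}


def score(weights, bias, x):
    """Signed score w . x + b of one binary classifier."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(x):
    """Run every binary classifier and return the class with the highest score."""
    scores = {label: score(w, b, x) for label, (w, b) in CLASSIFIERS.items()}
    return max(scores, key=scores.get), scores


label, scores = predict((10.0, 3.0))
print(label, scores)
```

Taking the highest score, rather than the first classifier that says "yes", resolves cases where several binary classifiers claim the point or none of them does.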
SVM—Linearly Separable
• A separating hyperplane can be written as
• W ● X + b = 0
• where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
• For 2-D it can be written as
• w0 + w1 x1 + w2 x2 = 0, where w0 plays the role of the bias b
• The hyperplanes defining the sides of the margin:
• H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
• H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
• Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
• This becomes a constrained (convex) quadratic optimization problem: quadratic objective
function and linear constraints → Quadratic Programming (QP) → Lagrangian multipliers
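The margin constraints can be checked numerically on a toy data set. In this sketch the four training tuples and the hyperplane coefficients are hand-picked assumptions, not the output of a QP solver; the code only verifies the H1/H2 constraints, lists the tuples lying on them, and computes the margin width 2 / ||W||.

```python
from math import sqrt

# Hypothetical 2-D training tuples (x1, x2) with class labels yi in {+1, -1}.
data = [((3.0, 3.0), +1), ((4.0, 4.0), +1), ((1.0, 1.0), -1), ((0.0, 0.5), -1)]

# Hand-picked hyperplane w0 + w1*x1 + w2*x2 = 0 for this toy data.
w0, w1, w2 = -2.0, 0.5, 0.5


def decision(x):
    return w0 + w1 * x[0] + w2 * x[1]


# H1 / H2 together: yi * (w0 + w1*x1 + w2*x2) >= 1 for every training tuple.
constraints_ok = all(y * decision(x) >= 1 for x, y in data)

# Support vectors lie exactly on H1 or H2, i.e. the constraint holds with equality.
support_vectors = [x for x, y in data if abs(y * decision(x) - 1) < 1e-9]

# Distance between H1 and H2.
margin_width = 2 / sqrt(w1 ** 2 + w2 ** 2)

print(constraints_ok, support_vectors, margin_width)
```

Only the two tuples on H1 and H2 determine the hyperplane here; moving the other two farther away would leave it unchanged.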
SVM—When Data Is Linearly Separable
• Let data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple associated with
the class label yi
• There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the
best one (the one that minimizes classification error on unseen data)
• SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal
hyperplane (MMH)
SVM for complex (Non Linearly Separable)
SVM works very well without any modifications
for linearly separable data.
Linearly separable data is any data that can be plotted in a graph and separated into
classes using a straight line.
A: Linearly Separable Data B: Non-Linearly Separable Data
SVM CLASSIFIER
• A vector space method for binary classification problems
• documents are represented in t-dimensional space
• find a decision surface (hyperplane) that best separates the
documents of the two classes
• a new document is classified by its position relative to the hyperplane
• Simple 2D example: training documents linearly separable
• Line s—The Decision Hyperplane
• maximizes distances to closest docs of each class
• it is the best separating hyperplane
• Delimiting Hyperplanes
• parallel dashed lines that delimit region where to look for a
solution
• Lines that cross the delimiting hyperplanes.
• candidates to be selected as the decision hyperplane
• lines that are parallel to delimiting hyperplanes: best candidates
• Support vectors: documents that belong to, and define, the delimiting
hyperplanes
• Our example in a 2-dimensional system of coordinates
Text Classification Techniques


Text Classification Techniques

  • 1. UNIT – III: CLASSIFICATION. Topic 1: A CHARACTERIZATION OF TEXT CLASSIFICATION. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING, DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, SEMESTER – VIII, PROFESSIONAL ELECTIVE – IV, CS8080 - INFORMATION RETRIEVAL TECHNIQUES
  • 2. UNIT III: 1. A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation Metrics 10. Accuracy and Error 11. Organizing the Classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing
  • 3. INTRODUCTION TO CLASSIFICATION
  • 4. INTRODUCTION TO CLASSIFICATION • Scientists became very serious about addressing the question: “Can we build a model that learns from available data and automatically makes the right decisions and predictions?” • The answer can be found in numerous applications emerging from the fields of (1) pattern classification, (2) machine learning, and (3) artificial intelligence.
  • 5. INTRODUCTION TO CLASSIFICATION • Data from various sensing devices, combined with powerful learning algorithms and domain knowledge, led to many great inventions that we now take for granted in everyday life: • Internet queries via search engines like Google, • text recognition at the post office, • barcode scanners at the supermarket, • the diagnosis of diseases, • speech recognition by Siri or Google Now on our mobile phones, to name just a few.
  • 6. INTRODUCTION TO CLASSIFICATION • Classification is the data mining process of finding a model (or function) that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. • That is, it predicts categorical class labels (discrete or nominal). • It constructs the model from a training set. • It predicts group membership for data instances.
  • 7. INTRODUCTION TO CLASSIFICATION: What is CLASSIFICATION? • Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. • Both help us gain a better understanding of large data sets. • Classification predicts categorical (discrete, unordered) labels. • Prediction models continuous-valued functions.
  • 8. INTRODUCTION TO CLASSIFICATION • How can we classify? • The trick here is machine learning, which makes classifications based on past observations (the learning part). • We give the machine a set of texts, each tagged with a label, and let the model learn from this data; the trained model can then tell us the category of any new text input we feed it.
  • 9. Applications of Classification • Classification of (potential) customers for credit approval, risk prediction, and selective marketing • Performance prediction based on selected indicators • Medical diagnosis based on symptoms or reactions to therapy • Application areas: • Credit approval • Target marketing • Medical diagnosis • Treatment effectiveness analysis • Performance prediction
  • 10. When is classification needed? • Scenarios: In each of these examples, the data analysis task is classification, where a model or classifier is constructed to predict categorical labels, such as • “safe” or “risky” for the loan application data; • “yes” or “no” for the marketing data; or • “treatment A,” “treatment B,” or “treatment C” for the medical data. • These categories can be represented by discrete values, where the ordering among values has no meaning. • For example, the values 1, 2, and 3 may be used to represent treatments A, B, and C, with no ordering implied among this group of treatment regimes.
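The point about discrete values without ordering can be illustrated with a minimal Python sketch. The label list comes from the medical example on the slide; the helper names `encode` and `decode` are hypothetical and introduced only for illustration. The integer codes act purely as identifiers, so comparing them with `<` or `>` would carry no meaning.

```python
# Label set taken from the slide's medical example.
TREATMENTS = ["treatment A", "treatment B", "treatment C"]

# Map each categorical label to an arbitrary integer code (1, 2, 3).
# The codes are identifiers only; their numeric order means nothing.
label_to_code = {label: code for code, label in enumerate(TREATMENTS, start=1)}
code_to_label = {code: label for label, code in label_to_code.items()}


def encode(labels):
    """Replace each class label with its integer code (hypothetical helper)."""
    return [label_to_code[label] for label in labels]


def decode(codes):
    """Map integer codes back to the original class labels (hypothetical helper)."""
    return [code_to_label[code] for code in codes]


encoded = encode(["treatment B", "treatment A", "treatment C"])
print(encoded)           # [2, 1, 3]
print(decode(encoded))   # ['treatment B', 'treatment A', 'treatment C']
```

Any other assignment of codes to labels (for example 3, 1, 2) would serve equally well, which is what distinguishes nominal class labels from ordered or continuous values.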
  • 11. INTRODUCTION TO CLASSIFICATION • Aim: predict categorical class labels for new tuples/samples. • Input: a training set of tuples/samples, each with a class label. • Output: a model (a classifier) based on the training set and the class labels.
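The aim, input, and output listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the training tuples (temperature, humidity) and their labels are invented, and a 1-nearest-neighbour rule is used as a stand-in for whatever learning algorithm is actually chosen. The function names `train` and `predict` are hypothetical.

```python
from typing import List, Sequence, Tuple

# A training sample is a feature tuple plus its class label.
Sample = Tuple[Sequence[float], str]


def train(training_set: List[Sample]) -> List[Sample]:
    """Input: labeled tuples. Output: a model.
    For 1-nearest-neighbour the 'model' is simply the stored samples."""
    return list(training_set)


def predict(model: List[Sample], x: Sequence[float]) -> str:
    """Aim: predict the categorical label of a new, unlabeled tuple."""
    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Pick the stored sample closest to x and return its label.
    _, nearest_label = min(model, key=lambda s: squared_distance(s[0], x))
    return nearest_label


# Invented (temperature, humidity) tuples with invented labels.
training_set = [
    ((30.0, 85.0), "rain"),
    ((32.0, 90.0), "rain"),
    ((35.0, 40.0), "no rain"),
    ((38.0, 30.0), "no rain"),
]

model = train(training_set)
print(predict(model, (31.0, 80.0)))  # rain
print(predict(model, (36.0, 35.0)))  # no rain
```

The same train/predict split applies to the other classifiers named in the unit outline (decision trees, k-NN, SVM); only the contents of the model and the decision rule change.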
  • 12. Why Classification? • A classical problem extensively studied by statisticians and machine learning researchers. • Predicts categorical class labels. • Produces a model (classifier).
  • 13. Typical Applications of Classification • Example: • {credit history, salary} → credit approval (Yes/No) • {Temp, Humidity} → Rain (Yes/No) • A set of documents → sports, technology, etc. • Another example: • If x >= 90 then grade = A. • If 80 <= x < 90 then grade = B. • If 70 <= x < 80 then grade = C. • If 60 <= x < 70 then grade = D. • If x < 60 then grade = F.
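The grading rules can be written as a small rule-based classifier. The Python sketch below assumes that every score below 60 maps to F, so that the rules cover all scores without a gap; the function name `grade` is chosen for illustration.

```python
def grade(x: float) -> str:
    """Map a numeric score to a letter grade using fixed threshold rules.

    Assumption: every score below 60 is an F, so no score is left unclassified.
    """
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "F"


for score in (95, 85, 72, 60, 55):
    print(score, grade(score))
# 95 A
# 85 B
# 72 C
# 60 D
# 55 F
```

Here the rules are written by hand. In the learning setting described in this unit, comparable thresholds would instead be derived from a labeled training set.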
  • 14. WHAT IS TEXT CLASSIFICATION? • Text classification is a machine learning technique that assigns a set of predefined categories to open-ended text. • Text classifiers can be used to organize, structure, and categorize almost any kind of text, including documents, medical studies, files, and content from all over the web.
  • 15. What is meant by text classification? • Text classification, or text categorization, is the activity of labeling natural language texts with relevant categories from a predefined set. • In layman's terms, text classification is a process of extracting generic tags from unstructured text. • These generic tags come from a set of predefined categories.
  • 16. What is meant by text classification or document classification? • Document classification, or document categorization, is a problem in library science, information science, and computer science. • The task is to assign a document to one or more classes or categories. • This may be done "manually" or algorithmically. (Source: Wikipedia)
  • 17. What is meant by text classification? • Text classification, also known as text tagging or text categorization, is the process of categorizing text into organized groups. • By using Natural Language Processing (NLP), text classifiers can automatically analyze text and then assign a set of predefined tags or categories based on its content.
  • 18. Text Classification Examples • Text classification is becoming an increasingly important part of business because it makes it easy to get insights from data and to automate business processes. • Some of the most common examples and use cases for automatic text classification include: a) Sentiment Analysis b) Topic Detection c) Language Detection
  • 19. Text Classification Examples a) Sentiment Analysis: the process of determining whether a given text speaks positively or negatively about a given subject (e.g. for brand monitoring purposes). b) Topic Detection: the task of identifying the theme or topic of a piece of text (e.g. knowing whether a product review is about Ease of Use, Customer Support, or Pricing when analyzing customer feedback). c) Language Detection: the procedure of detecting the language of a given text (e.g. knowing whether an incoming support ticket is written in English or Spanish so that tickets can be routed automatically to the appropriate team).
  • 20. A Characterization of Text Classification • For example, • new articles can be organized by topics; • support tickets can be organized by urgency; • chat conversations can be organized by language; • brand mentions can be organized by sentiment; and so on. • Text classification is one of the fundamental tasks in natural language processing, with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection. • Here’s an example of how it works: • “The user interface is quite straightforward and easy to use.” • A text classifier can take this phrase as input, analyze its content, and then automatically assign relevant tags, such as UI and Easy To Use.
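The tagging example on this slide can be sketched with a very small rule-based tagger. This is only an illustration and not how a trained classifier works: the tag vocabulary `TAG_KEYWORDS` and the function `assign_tags` are hypothetical names invented here, and the cue words are assumptions chosen so that the slide's sentence receives the tags UI and Easy To Use.

```python
# Hypothetical tag vocabulary: each tag maps to cue words chosen for this sketch.
TAG_KEYWORDS = {
    "UI": {"interface", "ui", "screen", "layout"},
    "Easy To Use": {"easy", "straightforward", "simple", "intuitive"},
    "Pricing": {"price", "cost", "expensive", "cheap"},
}


def assign_tags(text):
    """Return every tag whose cue words overlap the tokens of the text."""
    # Lowercase each word and strip surrounding punctuation before matching.
    tokens = {word.strip(".,!?\"'").lower() for word in text.split()}
    return [tag for tag, cues in TAG_KEYWORDS.items() if tokens & cues]


tags = assign_tags("The user interface is quite straightforward and easy to use.")
print(tags)
```

A learned classifier replaces the hand-written cue words with weights estimated from labeled examples, but the input and output have the same shape: text in, tags out.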
  • 21. A Characterization of Text Classification • A first tactic for categorizing documents is to assign a label to each document, • but this solves the problem only when users know the labels of the documents they are looking for. • This tactic does not solve the more generic problem of finding documents on a specific topic or subject.
  • 22. A Characterization of Text Classification • For that case, a better solution is to group documents by common generic topics and label each group with a meaningful name. • Each labeled group is called a category or class. • Document classification is the process of categorizing documents under a given cluster or category using a fully supervised learning process.
  • 23. Why is Text Classification Important? • It’s estimated that around 80% of all information is unstructured, with text being one of the most common types of unstructured data. • Because of the messy nature of text, analyzing, understanding, organizing, and sorting through text data is hard and time-consuming, so most companies fail to use it to its full potential. • This is where text classification with machine learning comes in. • Using text classifiers, companies can automatically structure all manner of relevant text, such as legal documents, social media, chatbots, surveys, and more, in a fast and cost-effective way. • This allows companies to save time analyzing text data, automate business processes, and make data-driven business decisions.
  • 24. Reasons Text Classification Is Important a) Scalability • Manually analyzing and organizing text is slow and much less accurate. • Machine learning can automatically analyze millions of surveys, comments, emails, etc., at a fraction of the cost, often in just a few minutes. • Text classification tools scale to business needs of any size, large or small. b) Real-time analysis • There are critical situations that companies need to identify as soon as possible so they can take immediate action (e.g., PR crises on social media). • Machine learning text classification can follow your brand mentions constantly and in real time, so you'll identify critical information and be able to take action right away.
  • 25. Reasons Text Classification Is Important c) Consistent criteria • Human annotators make mistakes when classifying text data due to distractions, fatigue, and boredom, and human subjectivity creates inconsistent criteria. • Machine learning, on the other hand, applies the same lens and criteria to all data and results. • Once a text classification model is properly trained, it applies those criteria consistently and can reach high accuracy.
  • 26. A Characterization of Text Classification • Classification can be performed 1. manually by domain experts or 2. automatically using well-known and widely used classification algorithms such as decision trees and Naïve Bayes. • Documents are classified either according to their subjects or according to other attributes (e.g. author, document type, publishing year).
  • 27. A Characterization of Text Classification • There are two main kinds of subject classification of documents: 1. the content-based approach and 2. the request-based approach. • In content-based classification, the weight given to particular subjects in a document decides the class to which the document is assigned. • For example, it is a rule in some library classification that at least 15% of the content of a book should be about the class to which the book is assigned. • In automatic classification, the number of times given words appear in a document determines the class.
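As a rough sketch of content-based classification by word counts, the snippet below scores each class by how often its terms occur and applies a minimum-share rule in the spirit of the 15% library rule quoted on the slide. The class vocabularies in `CLASS_TERMS`, the function name, and the `min_share` parameter are all assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical class vocabularies, invented for this sketch.
CLASS_TERMS = {
    "sports": {"match", "team", "score", "goal"},
    "finance": {"bank", "stock", "market", "loan"},
}


def classify_by_term_counts(document, min_share=0.15):
    """Pick the class whose terms occur most often in the document.

    Returns (label, scores); label is None when the winning class accounts
    for less than min_share of all tokens in the document.
    """
    tokens = document.lower().split()
    counts = Counter(tokens)
    # A Counter returns 0 for terms that never occur in the document.
    scores = {
        label: sum(counts[term] for term in terms)
        for label, terms in CLASS_TERMS.items()
    }
    best = max(scores, key=scores.get)
    if not tokens or scores[best] / len(tokens) < min_share:
        return None, scores
    return best, scores


label, scores = classify_by_term_counts(
    "the team won the match after a late goal while the stock market fell"
)
print(label, scores)
```

Here three of the fourteen tokens are sports terms, so the document clears the threshold and is assigned to that class; a document with only a passing mention of a class term is left unassigned.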
  • 28. A Characterization of Text Classification • In request-oriented classification, the anticipated requests from users influence how documents are classified. • The classifier asks: • “Under which description should this entity be found?” and • “think of all the possible queries and decide for which ones the entity at hand is relevant”.
  • 29. Text Classification Applications • With the help of text classification, businesses can make sense of large amounts of data using techniques like aspect-based sentiment analysis to understand what people are talking about and how they’re talking about each aspect. • Text classification can help support teams provide a stellar experience by automating tasks that are better left to computers, saving precious time that can be spent on more important things.
  • 30. Text Classification Applications • Text classification models can help you analyze survey results to discover patterns and insights like: • What do people like about our product or service? • What should we improve? • What do we need to change? • By combining quantitative results with qualitative analyses, teams can make more informed decisions without having to spend hours manually analyzing every single open-ended response.
  • 31. Text Classification Applications • Text classification has thousands of use cases and is applied to a wide range of tasks. • In some cases, data classification tools work behind the scenes to enhance app features we interact with on a daily basis (like email spam filtering). • In other cases, classifiers are used by marketers, product managers, engineers, and salespeople to automate business processes and save hundreds of hours of manual data processing. • Some of the top applications and use cases of text classification include: 1. Detecting urgent issues 2. Automating customer support processes 3. Listening to the Voice of the Customer (VoC)
  • 32. A Characterization of Text Classification • Automatic document classification tasks can be divided into three types: 1. Unsupervised document classification (document clustering): the classification must be done entirely without reference to external information. 2. Semi-supervised document classification: some of the documents are labeled by an external method. 3. Supervised document classification: some external method (such as human feedback) provides information on the correct classification for the documents.
  • 33. Computational Supervised Learning • Computational supervised learning, also called classification, aims to: • learn from past experience, and • use the learned knowledge to classify new data. • The knowledge is learned by intelligent algorithms. • Examples: • Clinical diagnosis for patients • Cell type classification
  • 34. Overall Picture of Supervised Learning • [Figure labels: Biomedical, Financial, Government, Scientific; Decision trees, Emerging patterns, SVM, Neural networks; Classifiers (M-Doctors)]
  • 35. Unsupervised Learning • Unsupervised learning is a machine learning technique in which models are not supervised using a labeled training dataset. Instead, the models themselves find hidden patterns and insights in the given data. It can be compared to the learning that takes place in the human brain when learning new things. It can be defined as: • “Unsupervised learning is a type of machine learning in which models are trained using unlabeled dataset and are allowed to act on that data without any supervision”.
  • 36. Unsupervised Learning • Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data. • The goal of unsupervised learning is to find the underlying structure of a dataset, group the data according to similarities, and represent the dataset in a compressed format.
  • 37. Unsupervised Learning • Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. • The algorithm is never trained on labeled examples, so it has no prior knowledge of the features of the dataset. • The task of the unsupervised learning algorithm is to identify the image features on its own.
  • 38. Unsupervised Learning • The unsupervised learning algorithm will perform this task by clustering the image dataset into groups according to the similarities between images. • Simply put, no labeled training data is provided. • Examples: • neural network models • independent component analysis • clustering
  • 39. Supervised vs. Unsupervised Learning (classification vs. clustering) • Supervised learning (classification) • Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations. • New data is classified based on the training set. • Unsupervised learning (clustering) • The class labels of the training data are unknown. • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
  • 40. Any Questions?
  • 41. P1WU UNIT – III: CLASSIFICATION Topic 2: UNSUPERVISED ALGORITHMS - CLUSTERING
  • 42. UNIT III 1. A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing
  • 43. INTRODUCTION TO UNSUPERVISED ALGORITHMS • Below is a list of some popular unsupervised learning algorithms: • K-means clustering • KNN (k-nearest neighbors), in its unsupervised nearest-neighbor-search form; the k-NN classifier itself is a supervised method • Hierarchical clustering • Anomaly detection • Neural Networks • Principal Component Analysis • Independent Component Analysis • Apriori algorithm • Singular value decomposition
  • 44. INTRODUCTION TO UNSUPERVISED ALGORITHMS
  • 45. WHAT IS CLUSTERING? • Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset. • It can be defined as "A way of grouping the data points into different clusters, consisting of similar data points. The objects with the possible similarities remain in a group that has less or no similarities with another group."
  • 46. WHAT IS CLUSTERING? • It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, or behavior, and dividing the data according to the presence or absence of those patterns. • It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals with an unlabeled dataset.
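To make the idea of grouping by similarity concrete, here is a minimal sketch of k-means clustering, one of the algorithms listed earlier in this topic. It is a simplified version under stated assumptions: the function name `kmeans`, the fixed iteration count, and the deterministic choice of the first k points as initial centroids are choices made for this illustration (practical implementations usually initialize randomly and stop on convergence), and the six 2-D points are invented.

```python
def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each point to
    its nearest centroid and then recomputing the centroids."""
    centroids = [points[i] for i in range(k)]  # deterministic init: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Squared Euclidean distance is enough to find the nearest centroid.
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # New centroid = per-dimension mean of the cluster; keep the old
        # centroid if a cluster ended up empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 1.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```

No labels are supplied anywhere: the two groups emerge only from the distances between the points, which is the sense in which clustering is unsupervised.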
  • 47. Difference between Supervised and Unsupervised Learning
    Supervised Learning | Unsupervised Learning
    Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
    Supervised learning model takes direct feedback to check if it is predicting correct output or not. | Unsupervised learning model does not take any feedback.
    Supervised learning model predicts the output. | Unsupervised learning model finds the hidden patterns in data.
    Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
    Supervised learning can be categorized in Classification and Regression problems. | Unsupervised learning can be classified in Clustering and Associations problems.
    Supervised learning can be used for those cases where we know the input as well as corresponding outputs. | Unsupervised learning can be used for those cases where we have only input data and no corresponding output data.
    Supervised learning model produces an accurate result. | Unsupervised learning model may give less accurate result as compared to supervised learning.
    It includes various algorithms such … | It includes various algorithms such …
  • 48. Advantages of Unsupervised Learning • Unsupervised learning is used for more complex tasks than supervised learning because it works without labeled input data. • Unsupervised learning is often preferable because unlabeled data is easier to obtain than labeled data.
  • 49. Disadvantages of Unsupervised Learning • Unsupervised learning is intrinsically more difficult than supervised learning because there is no corresponding output to learn from. • The result of an unsupervised learning algorithm might be less accurate because the input data is not labeled and the algorithm does not know the exact output in advance.
  • 50. Any Questions?
  • 51. P1WU UNIT – III: CLASSIFICATION Topic 3: NAÏVE TEXT CLASSIFICATION
  • 52. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1. A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing
  • 53. NAÏVE TEXT CLASSIFICATION
  • 54. INTRODUCTION TO NAÏVE TEXT CLASSIFICATION • Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. • Naive Bayes is not a single algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be independent of each other, given the class. • Naive Bayes classifiers have been heavily used for text classification and text analysis machine learning problems.
  • 55. INTRODUCTION TO NAÏVE TEXT CLASSIFICATION • Text analysis is a major application field for machine learning algorithms. • However, the raw data, a sequence of symbols (i.e. strings), cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.
  • 56. The Naive Bayes algorithm • Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. • Naive Bayes is not a single algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be independent of each other, given the class. • The dataset is divided into two parts, namely the feature matrix and the response/target vector.
  • 57. The Naive Bayes algorithm • The feature matrix (X) contains all the vectors (rows) of the dataset, in which each vector consists of the values of the features. The number of features is d, i.e. X = (x1, x2, ..., xd). • The response/target vector (y) contains the value of the class/group variable for each row of the feature matrix.
  • 58. The Bayes’ Theorem • Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. • Bayes’ theorem is stated mathematically as follows: P(A | B) = P(B | A) · P(A) / P(B), where P(B) ≠ 0.
  • 59. The Bayes’ Theorem • where: • A and B are called events. • P(A | B) is the posterior probability of event A, given that event B is true (has occurred). • Event B is also termed the evidence. • P(A) is the prior of A (the prior independent probability, i.e. the probability of the event before the evidence is seen). • P(B | A) is the likelihood, i.e. the probability of observing the evidence B given that event A is true.
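A small worked example may help fix the roles of prior, likelihood and posterior. In the sketch below, A stands for "the message is spam" and B for "the message contains a particular word"; the three input probabilities are made-up numbers chosen only for illustration, and the function name `posterior` is hypothetical. The evidence P(B) is expanded with the law of total probability.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Compute P(A | B) with Bayes' theorem.

    P(B) is obtained from the law of total probability:
    P(B) = P(B | A) P(A) + P(B | not A) P(not A).
    """
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence


# Assumed values: 20% of messages are spam; the word appears in 60% of spam
# messages and in 5% of non-spam messages.
p_spam_given_word = posterior(
    prior_a=0.2,
    likelihood_b_given_a=0.6,
    likelihood_b_given_not_a=0.05,
)
print(round(p_spam_given_word, 2))
```

With these numbers the evidence is 0.12 + 0.04 = 0.16, so seeing the word raises the probability of spam from the prior 0.2 to a posterior of 0.75.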
  • 60. The Bayes’ Theorem • Summary
  • 61. Dealing with text data • Text analysis is a major application field for machine learning algorithms. • However, the raw data, a sequence of symbols (i.e. strings), cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.
  • 62. Dealing with text data • In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely: • tokenizing strings and giving an integer id to each possible token, for instance by using white-spaces and punctuation as token separators. • counting the occurrences of tokens in each document. • In this scheme, features and samples are defined as follows: • each individual token occurrence frequency is treated as a feature. • the vector of all the token frequencies for a given document is considered a multivariate sample.
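The tokenize-and-count scheme described on this slide can be sketched in plain Python. This is not scikit-learn's implementation (its `CountVectorizer` is the library utility for this job); the helper names `tokenize`, `build_vocabulary` and `vectorize`, the regular expression, and the two sample documents are assumptions made for this illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on white-space and punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Give each distinct token an integer id, in order of first appearance."""
    vocabulary = {}
    for document in documents:
        for token in tokenize(document):
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def vectorize(document, vocabulary):
    """Return the fixed-size vector of token counts for one document."""
    counts = Counter(tokenize(document))
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:  # tokens outside the vocabulary are ignored
            vector[vocabulary[token]] = count
    return vector


docs = ["The cat sat.", "The dog sat on the cat!"]
vocabulary = build_vocabulary(docs)
vectors = [vectorize(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Each position in a vector is one feature (a token's frequency), and each document becomes one fixed-length sample regardless of how long its text is.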
  • 63. Example 1 : Using the Naive Bayesian Classifier • We will consider the following training set. • The data samples are described by the attributes age, income, student, and credit. • The class label attribute, buy, which tells whether the person buys a computer, has two distinct values: yes (class C1) and no (class C2).
  • 64. Example 1 : Using the Naive Bayesian Classifier
    RID | Age | Income | Student | Credit | Ci: buy
    1 | youth | high | no | fair | C2: no
    2 | youth | high | no | excellent | C2: no
    3 | middle-aged | high | no | fair | C1: yes
    4 | senior | medium | no | fair | C1: yes
    5 | senior | low | yes | fair | C1: yes
    6 | senior | low | yes | excellent | C2: no
    7 | middle-aged | low | yes | excellent | C1: yes
    8 | youth | medium | no | fair | C2: no
    9 | youth | low | yes | fair | C1: yes
    10 | senior | medium | yes | fair | C1: yes
    11 | youth | medium | yes | excellent | C1: yes
    12 | middle-aged | medium | no | excellent | C1: yes
    13 | middle-aged | high | yes | fair | C1: yes
    14 | senior | medium | no | excellent | C2: no
  • 65. Example 1 : Using the Naive Bayesian Classifier • The sample we wish to classify is • X = (age = youth, income = medium, student = yes, credit = fair) • We need to maximize P(X | Ci) P(Ci), for i = 1, 2. P(Ci), the a priori probability of each class, can be estimated based on the training samples: • P(buy = yes) = 9/14 • P(buy = no) = 5/14
  • 66. Example 1 : Using the Naive Bayesian Classifier • To compute P(X | Ci), for i = 1, 2, we compute the following conditional probabilities: • P(age = youth | buy = yes) = 2/9 • P(income = medium | buy = yes) = 4/9 • P(student = yes | buy = yes) = 6/9 • P(credit = fair | buy = yes) = 6/9 • P(age = youth | buy = no) = 3/5 • P(income = medium | buy = no) = 2/5 • P(student = yes | buy = no) = 1/5 • P(credit = fair | buy = no) = 2/5
  • 67. Example 1 : Using the Naive Bayesian Classifier • Using the above probabilities, we obtain • P(X | buy = yes) = 2/9 × 4/9 × 6/9 × 6/9 = 0.044
  • 68. Example 1 : Using the Naive Bayesian Classifier • Similarly, • P(X | buy = no) = 3/5 × 2/5 × 1/5 × 2/5 = 0.019 • To find the class that maximizes P(X | Ci) P(Ci), we compute • P(X | buy = yes) P(buy = yes) = 0.044 × 9/14 = 0.028 • P(X | buy = no) P(buy = no) = 0.019 × 5/14 = 0.007 • Thus the naive Bayesian classifier predicts buy = yes for sample X.
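The arithmetic of Example 1 can be checked with a short script. The sketch below estimates the prior and the conditional probabilities by counting rows of the training table, exactly as the slides do by hand, and multiplies them under the naive independence assumption. The function name `naive_bayes_scores` and the tuple layout of `ROWS` are choices made for this illustration; no smoothing is applied.

```python
from collections import Counter

# Training set from the slides: (age, income, student, credit, buy).
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle-aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle-aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle-aged", "medium", "no", "excellent", "yes"),
    ("middle-aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def naive_bayes_scores(rows, sample):
    """Return P(X | Ci) * P(Ci) for every class label Ci (last column)."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(Ci)
        for i, value in enumerate(sample):
            # Conditional P(attribute i = value | Ci), estimated by counting.
            matches = sum(1 for row in rows if row[-1] == label and row[i] == value)
            score *= matches / n_label
        scores[label] = score
    return scores


scores = naive_bayes_scores(ROWS, ("youth", "medium", "yes", "fair"))
prediction = max(scores, key=scores.get)
print({label: round(value, 3) for label, value in scores.items()}, prediction)
```

Rounded to three decimals the scores are 0.028 for buy = yes and 0.007 for buy = no, so the script also predicts buy = yes.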
  • 69. Example 2: Predicting a class label using naïve Bayesian classification • The training data set is given below. • The data tuples are described by the attributes Owns Home, Married, Gender and Employed. • The class label attribute Risk Class has three distinct values. • Let C1 correspond to class A, C2 to class B, and C3 to class C.
  • 70. Example 2: Predicting a class label using naïve Bayesian classification • The tuple to classify is • X = (Owns Home = Yes, Married = No, Gender = Female, Employed = Yes)
    Owns Home | Married | Gender | Employed | Risk Class
    Yes | Yes | Male | Yes | B
    No | No | Female | Yes | A
    Yes | Yes | Female | Yes | C
    Yes | No | Male | No | B
    No | Yes | Female | Yes | C
    No | No | Female | Yes | A
    No | No | Male | No | B
    Yes | No | Female | Yes | A
    No | Yes | Female | Yes | C
    Yes | Yes | Female | Yes | C
  • 71. Example 2: Predicting a class label using naïve Bayesian classification • Solution • There are 10 samples and three classes. • Risk class A = 3, Risk class B = 3, Risk class C = 4 • The prior probabilities are obtained by dividing these frequencies by the total number of samples in the training data: • P(A) = 3/10 = 0.3, P(B) = 3/10 = 0.3, P(C) = 4/10 = 0.4
  • 72. Example 2: Predicting a class label using naïve Bayesian classification • To compute P(X | Ci) = P({yes, no, female, yes} | Ci) for each of the classes, we need the following conditional probabilities: • P(Owns Home = Yes | A) = 1/3 = 0.33 • P(Married = No | A) = 3/3 = 1 • P(Gender = Female | A) = 3/3 = 1 • P(Employed = Yes | A) = 3/3 = 1 • P(Owns Home = Yes | B) = 2/3 = 0.67 • P(Married = No | B) = 2/3 = 0.67 • P(Gender = Female | B) = 0/3 = 0 • P(Employed = Yes | B) = 1/3 = 0.33 • P(Owns Home = Yes | C) = 2/4 = 0.5 • P(Married = No | C) = 0/4 = 0 • P(Gender = Female | C) = 4/4 = 1 • P(Employed = Yes | C) = 4/4 = 1
  • 73. Example 2: Predicting a class label using naïve Bayesian classification • Using the above probabilities, we obtain • P(X | A) = P(Owns Home = Yes | A) × P(Married = No | A) × P(Gender = Female | A) × P(Employed = Yes | A) = 0.33 × 1 × 1 × 1 = 0.33 • Similarly, P(X | B) = 0 because P(Gender = Female | B) = 0, and P(X | C) = 0 because P(Married = No | C) = 0. • To find the class Ci that maximizes P(X | Ci) P(Ci), we compute • P(X | A) P(A) = 0.33 × 0.3 = 0.099 • P(X | B) P(B) = 0 × 0.3 = 0 • P(X | C) P(C) = 0 × 0.4 = 0 • Therefore X is assigned to class A.
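Example 2 can be reproduced the same way. The sketch below uses exact fractions instead of rounded decimals, which shows that the slide's 0.099 comes from rounding 1/3 to 0.33 (the exact product is 1/10). The function name `class_score` and the lowercase encoding of the table values are choices made for this illustration.

```python
from fractions import Fraction

# Training set from the slides:
# (owns_home, married, gender, employed, risk_class).
ROWS = [
    ("yes", "yes", "male", "yes", "B"),
    ("no", "no", "female", "yes", "A"),
    ("yes", "yes", "female", "yes", "C"),
    ("yes", "no", "male", "no", "B"),
    ("no", "yes", "female", "yes", "C"),
    ("no", "no", "female", "yes", "A"),
    ("no", "no", "male", "no", "B"),
    ("yes", "no", "female", "yes", "A"),
    ("no", "yes", "female", "yes", "C"),
    ("yes", "yes", "female", "yes", "C"),
]


def class_score(rows, sample, label):
    """Return P(X | label) * P(label) as an exact fraction."""
    in_class = [row for row in rows if row[-1] == label]
    score = Fraction(len(in_class), len(rows))  # prior P(label)
    for i, value in enumerate(sample):
        # Conditional P(attribute i = value | label), estimated by counting.
        matches = sum(1 for row in in_class if row[i] == value)
        score *= Fraction(matches, len(in_class))
    return score


sample = ("yes", "no", "female", "yes")
scores = {label: class_score(ROWS, sample, label) for label in ("A", "B", "C")}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Classes B and C score exactly zero because a single attribute value never occurs with them in the training data, so one zero factor wipes out the whole product; class A wins with 1/10.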
  • 74. Advantages and Disadvantages • Advantages: a) In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers. b) Easy to implement. c) Good results obtained in most cases. d) They provide theoretical justification for other classifiers that do not explicitly use Bayes’ theorem. • Disadvantages: a) Lack of available probability data. b) Inaccuracies in the assumptions, such as the independence of features.
  • 75. Any Questions?
  • 76. P1WU UNIT – III: CLASSIFICATION Topic 4: SUPERVISED ALGORITHMS
  • 77. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1. A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing
  • 78. SUPERVISED LEARNING
  • 79. INTRODUCTION TO SUPERVIZED LEARNING • Supervised learning, also known as supervised machine learning, is a subcategory of machine learning and artificial intelligence. • It is defined by its use of labeled datasets to train algorithms that to classify data or predict outcomes accurately. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 80. What is supervised learning? • Supervised learning, also known as supervised machine learning, is • a subcategory of machine learning and artificial intelligence. • It is defined by • its use of labeled datasets to train algorithms that to classify data or predict outcomes accurately. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 81. Supervised Learning • It is defined by its use of labeled datasets to • train algorithms that to classify data or predict outcomes accurately. • As input data is fed into the model, it adjusts its weights until the model has been fitted appropriately, • which occurs as part of the cross validation process. • Supervised learning helps organizations solve for • a variety of real-world problems at scale, such as classifying spam in a separate folder from your inbox. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 82. What are the types of supervised learning? • There are two types of supervised learning techniques: 1. Regression and 2. Classification. • Classification separates the data; regression fits the data.
  • 83. Example of Supervised Learning • A great example of supervised learning is the text classification problem. • In this set of problems, the goal is to predict the class label of a given piece of text. • One particularly popular topic in text classification is predicting the sentiment of a piece of text, such as a tweet or a product review.
  • 84. SUPERVISED ALGORITHMS
  • 85. INTRODUCTION TO SUPERVISED ALGORITHMS • What is a supervised algorithm? • A supervised learning algorithm takes a known set of input data (the learning set) and known responses to that data (the output), and forms a model to generate reasonable predictions for the response to new input data. • Use supervised learning if you have existing data for the output you are trying to predict.
  • 86. SUPERVISED ALGORITHMS EXAMPLE
  • 87. INTRODUCTION TO SUPERVISED ALGORITHMS • Various algorithms and computation techniques are used in supervised machine learning processes. • The most commonly used learning methods, typically computed with tools such as R or Python, are: 1) Neural networks • Primarily leveraged for deep learning algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes.
  • 88. INTRODUCTION TO SUPERVISED ALGORITHMS 2) Naive Bayes • Naive Bayes is a classification approach that adopts the principle of class-conditional independence from Bayes' theorem. 3) Linear regression • Linear regression is used to identify the relationship between a dependent variable and one or more independent variables, and is typically leveraged to make predictions about future outcomes. 4) Logistic regression • While linear regression is used when the dependent variable is continuous, logistic regression is selected when the dependent variable is categorical, for example when it has binary outputs such as "true" and "false" or "yes" and "no."
  • 89. INTRODUCTION TO SUPERVISED ALGORITHMS 5) Support vector machine (SVM) • A support vector machine is a popular supervised learning model developed by Vladimir Vapnik, used for both data classification and regression. 6) K-nearest neighbor • K-nearest neighbor, also known as the KNN algorithm, is a non-parametric algorithm that classifies data points based on their proximity and association to other available data.
  • 90. INTRODUCTION TO SUPERVISED ALGORITHMS 7) Random forest • Random forest is another flexible supervised machine learning algorithm used for both classification and regression purposes. • The "forest" refers to a collection of uncorrelated decision trees, which are merged together to reduce variance and create more accurate predictions.
  • 91. Any Questions?
  • 92. P1WU UNIT – III: CLASSIFICATION Topic 5: DECISION TREES
  • 93. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1. A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing
  • 94. DECISION TREES
  • 95. INTRODUCTION TO DECISION TREES • What is a decision tree? • A decision tree is a structure that includes a root node, branches, and leaf nodes. a) Each internal node denotes a test on an attribute, b) each branch denotes the outcome of a test, and c) each leaf node holds a class label. • The topmost node in the tree is the root node.
  • 96. INTRODUCTION TO DECISION TREES • ID3, C4.5, and CART adopt a greedy (i.e., non-backtracking) approach in which decision trees are constructed in a top-down recursive divide-and-conquer manner. • Most algorithms for decision tree induction also follow such a top-down approach, which starts with a training set of tuples and their associated class labels. • The training set is recursively partitioned into smaller subsets as the tree is being built.
  • 97. INTRODUCTION TO DECISION TREES • Decision tree induction is the learning of decision trees from class-labeled training tuples. • A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, • each branch represents an outcome of the test, and • each leaf node (or terminal node) holds a class label. • The topmost node in a tree is the root node.
  • 98. INTRODUCTION TO DECISION TREES • A decision tree is a tree where • internal node = a test on an attribute • tree branch = an outcome of the test • leaf node = class label or class distribution
  • 99. Benefits of Decision Trees • The benefits of having a decision tree are as follows: a) It does not require any domain knowledge. b) It is easy to comprehend. c) The learning and classification steps of a decision tree are simple and fast.
  • 100. Brief History of Decision Trees • CLS (Hunt et al., 1966): cost driven • ID3 (Quinlan, 1986, MLJ): information driven • C4.5 (Quinlan, 1993): gain ratio + pruning ideas • CART (Breiman et al., 1984): Gini index
  • 101. Elegance of Decision Trees
  • 102. Structure of Decision Trees • If x1 > a1 & x2 > a2, then it’s class A • C4.5 and CART are two of the most widely used algorithms • Easy interpretation, but accuracy generally unattractive • [Figure: tree with a root node, internal nodes, and leaf nodes; the nodes test x1, x2, x3, and x4, the leaves are labeled A or B, and two branches are labeled > a1 and > a2]
  • 103. Example of Decision Tree
  • 104. Another Example of Decision Tree
  • 105. Decision Tree Classification Tasks
  • 106. Apply Model to Test Data (Data Mining: Concepts and Techniques) • Decision tree: Refund = Yes → NO; Refund = No → test MarSt • MarSt = Married → NO; MarSt = Single or Divorced → test TaxInc • TaxInc < 80K → NO; TaxInc > 80K → YES • Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
  • 107-110. Apply Model to Test Data • [Figure sequence: the same tree and test record, traced step by step from the root along Refund = No and then MarSt = Married]
  • 111. Apply Model to Test Data • The path ends at a NO leaf, so assign Cheat to “No”
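The trace on these slides can be written as a small function. This is a sketch, not code from the deck: the function and argument names are assumptions, and because the slide labels the income branches only as "< 80K" and "> 80K", sending exactly 80K to the "> 80K" side is also an assumption (the test record never reaches that node).

```python
def predict_cheat(refund, marital_status, taxable_income_k):
    """Trace one record from the root of the slide's tree down to a leaf."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income, given in thousands.
    # Assumption: a value of exactly 80 follows the "> 80K" branch.
    return "No" if taxable_income_k < 80 else "Yes"


# Test record from the slide: Refund = No, Married, Taxable Income = 80K.
result = predict_cheat("No", "Married", 80)
print("Cheat =", result)
```

Classification costs at most one comparison per tree level, which is why applying a decision tree is fast.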
  • 112. Decision Tree Classification Task (Data Mining: Concepts and Techniques) • Induction: a tree induction algorithm learns the model (a decision tree) from the training set • Deduction: the model is applied to the test set
    Training Set
    Tid  Attrib1  Attrib2  Attrib3  Class
    1    Yes      Large    125K     No
    2    No       Medium   100K     No
    3    No       Small    70K      No
    4    Yes      Medium   120K     No
    5    No       Large    95K      Yes
    6    No       Medium   60K      No
    7    Yes      Large    220K     No
    8    No       Small    85K      Yes
    9    No       Medium   75K      No
    10   No       Small    90K      Yes
    Test Set
    Tid  Attrib1  Attrib2  Attrib3  Class
    11   No       Small    55K      ?
    12   Yes      Medium   80K      ?
    13   Yes      Large    110K     ?
    14   No       Small    95K      ?
    15   No       Large    67K      ?
  • 113. Constructing a Decision Tree
  • 114. Constructing a Decision Tree • Two phases of decision tree generation: 1. tree construction • at start, all the training examples are at the root • partition examples based on selected attributes • test attributes are selected based on a heuristic or a statistical measure 2. tree pruning • identify and remove branches that reflect noise or outliers
  • 115. Constructing a Decision Tree • Basic step: determination of the root node of the tree and the root nodes of its sub-trees • Most Discriminatory Feature • Every feature can be used to partition the training data • If the partitions contain a pure class of training instances, then this feature is most discriminatory
  • 116. Constructing a Decision Tree :- Example of Partitions • Categorical feature • Number of partitions of the training data is equal to the number of values of this feature • Numerical feature • Two partitions
  • 117. Decision Tree Induction Algorithm • Around 1980, a machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). • Later, he presented C4.5, which was the successor of ID3. ID3 and C4.5 adopt a greedy approach. • In this algorithm, there is no backtracking; the trees are constructed in a top-down recursive divide-and-conquer manner. • Generating a decision tree from the training tuples of data partition D
  • 118. Algorithm : Generate_decision_tree • Input: • Data partition, D, which is a set of training tuples and their associated class labels. • attribute_list, the set of candidate attributes. • Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion includes a splitting_attribute and either a splitting point or a splitting subset. • Output: A decision tree
  • 119. Algorithm : Generate_decision_tree • Method
    1) create a node N;
    2) if tuples in D are all of the same class, C, then
    3)     return N as a leaf node labeled with the class C;
    4) if attribute_list is empty then
    5)     return N as a leaf node labeled with the majority class in D; // majority voting
    6) apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
    7) label node N with splitting_criterion;
    8) if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
    9)     attribute_list = attribute_list - splitting_attribute; // remove splitting_attribute
    10) for each outcome j of splitting_criterion
    11)     // partition the tuples and grow subtrees for each partition
    12)     let Dj be the set of data tuples in D satisfying outcome j; // a partition
    13)     if Dj is empty then
    14)         attach a leaf labeled with the majority class in D to node N;
            else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
        end for
    15) return N;
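The method above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the deck's own implementation: information gain (the ID3 measure) stands in for Attribute_selection_method, all attributes are treated as discrete with multiway splits, and the helper names are hypothetical. The data are the 14 buys_computer tuples from slide 129. Because the loop only visits attribute values that occur in the partition, the empty-Dj case of steps 13-14 never arises here.

```python
from collections import Counter
from math import log2


def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def info_gain(rows, i):
    parts = {}
    for r in rows:
        parts.setdefault(r[i], []).append(r)
    remainder = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    return entropy(rows) - remainder


def majority(rows):
    return Counter(r[-1] for r in rows).most_common(1)[0][0]


def generate_decision_tree(rows, attrs, names):
    classes = {r[-1] for r in rows}
    if len(classes) == 1:                 # steps 2-3: all tuples in one class
        return classes.pop()
    if not attrs:                         # steps 4-5: majority voting
        return majority(rows)
    best = max(attrs, key=lambda i: info_gain(rows, i))   # step 6
    remaining = [i for i in attrs if i != best]            # step 9
    node = {"attribute": names[best], "branches": {}}      # step 7
    for value in sorted({r[best] for r in rows}):          # step 10
        subset = [r for r in rows if r[best] == value]     # step 12
        node["branches"][value] = generate_decision_tree(subset, remaining, names)
    return node


names = ["age", "income", "student", "credit_rating"]
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

tree = generate_decision_tree(data, [0, 1, 2, 3], names)
print(tree)
```

The nested dictionary keeps the sketch short; a production version would also need numeric split points and pruning, which this example deliberately omits.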
  • 120. Constructing Decision Tree Example :- Weather Forecasting
  • 121. Constructing Decision Tree :- A Simple Dataset • 9 Play samples and 5 Don’t samples, a total of 14.
  • 122. Constructing Decision Tree :- A Simple Dataset
    Instance #  Outlook   Temp  Humidity  Windy  Class
    1           Sunny     75    70        true   Play
    2           Sunny     80    90        true   Don’t
    3           Sunny     85    85        false  Don’t
    4           Sunny     72    95        true   Don’t
    5           Sunny     69    70        false  Play
    6           Overcast  72    90        true   Play
    7           Overcast  83    78        false  Play
    8           Overcast  64    65        true   Play
    9           Overcast  81    75        false  Play
    10          Rain      71    80        true   Don’t
    11          Rain      65    70        true   Don’t
    12          Rain      75    80        false  Play
    13          Rain      68    80        false  Play
    14          Rain      70    96        false  Play
  • 123. Constructing Decision Tree :- A Simple Dataset • [Figure: decision tree] outlook = sunny → test humidity: <= 75 → Play (2 instances), > 75 → Don’t (3) • outlook = overcast → Play (4) • outlook = rain → test windy: false → Play (3), true → Don’t (2)
  • 124. Total 14 training instances, partitioned on Outlook • Outlook = sunny: instances 1,2,3,4,5 with classes P,D,D,D,P • Outlook = overcast: instances 6,7,8,9 with classes P,P,P,P • Outlook = rain: instances 10,11,12,13,14 with classes D,D,P,P,P
  • 125. Total 14 training instances, partitioned on Temperature • Temperature <= 70: instances 5,8,11,13,14 with classes P,P,D,P,P • Temperature > 70: instances 1,2,3,4,6,7,9,10,12 with classes P,D,D,D,P,P,P,D,P
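One way to compare the two partitions above is by information gain, the entropy-based measure associated with ID3. The following Python sketch is an illustration, not part of the deck; the function names are assumptions, and the label strings are copied from the two slides (P = Play, D = Don't).

```python
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))


def gain(parent, partitions):
    """Entropy of the parent minus the size-weighted entropy of its partitions."""
    n = len(parent)
    return entropy(parent) - sum(len(p) / n * entropy(p) for p in partitions)


# Class labels per partition, in the order given on the slides.
outlook = ["PDDDP", "PPPP", "DDPPP"]       # sunny, overcast, rain
temperature = ["PPDPP", "PDDDPPPDP"]       # <= 70, > 70

all_labels = list("".join(outlook))        # 9 Play and 5 Don't in total
gain_outlook = gain(all_labels, [list(p) for p in outlook])
gain_temperature = gain(all_labels, [list(p) for p in temperature])

print(round(gain_outlook, 3), round(gain_temperature, 3))
```

Outlook yields the larger gain because one of its partitions (overcast) is already pure, which matches the idea of the most discriminatory feature.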
  • 126. Constructing Decision Tree :- A Simple Dataset • [Figure: decision tree] outlook = sunny → test humidity: <= 75 → Play (2 instances), > 75 → Don’t (3) • outlook = overcast → Play (4) • outlook = rain → test windy: false → Play (3), true → Don’t (2)
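As a check on this tree, the sketch below encodes it as a Python function and applies it to the 14 training instances of slide 122. The function name and the tuple layout are assumptions made for illustration.

```python
def classify(outlook, humidity, windy):
    """Follow the tree: outlook at the root, then humidity or windy."""
    if outlook == "overcast":
        return "Play"
    if outlook == "sunny":
        return "Play" if humidity <= 75 else "Don't"
    # outlook == "rain"
    return "Don't" if windy else "Play"


# (outlook, temp, humidity, windy, class) for instances 1..14.
data = [
    ("sunny", 75, 70, True, "Play"), ("sunny", 80, 90, True, "Don't"),
    ("sunny", 85, 85, False, "Don't"), ("sunny", 72, 95, True, "Don't"),
    ("sunny", 69, 70, False, "Play"), ("overcast", 72, 90, True, "Play"),
    ("overcast", 83, 78, False, "Play"), ("overcast", 64, 65, True, "Play"),
    ("overcast", 81, 75, False, "Play"), ("rain", 71, 80, True, "Don't"),
    ("rain", 65, 70, True, "Don't"), ("rain", 75, 80, False, "Play"),
    ("rain", 68, 80, False, "Play"), ("rain", 70, 96, False, "Play"),
]

# Instance numbers that the tree misclassifies (empty if the tree fits the data).
errors = [i + 1 for i, (o, t, h, w, c) in enumerate(data) if classify(o, h, w) != c]
print("misclassified instances:", errors)
```

Temperature is never tested, which shows that a tree need not use every attribute to separate the training set.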
  • 127. Constructing Decision Tree Example :- Decision on Buying a Computer / customer likely to purchase a computer
  • 128. Constructing Decision Tree Example :- Decision on Buying a Computer • The following decision tree is for the concept buy_computer, which indicates whether a customer at a company is likely to buy a computer or not. • Each internal node represents a test on an attribute. • Each leaf node represents a class.
  • 129. Constructing Decision Tree :- Training Dataset • This follows an example of Quinlan’s ID3 (Playing Tennis)
    age     income  student  credit_rating  buys_computer
    <=30    high    no       fair           no
    <=30    high    no       excellent      no
    31…40   high    no       fair           yes
    >40     medium  no       fair           yes
    >40     low     yes      fair           yes
    >40     low     yes      excellent      no
    31…40   low     yes      excellent      yes
    <=30    medium  no       fair           no
    <=30    low     yes      fair           yes
    >40     medium  yes      fair           yes
    <=30    medium  yes      excellent      yes
    31…40   medium  no       excellent      yes
    31…40   high    yes      fair           yes
    >40     medium  no       excellent      no
  • 130. Constructing Decision Tree :- Output: A Decision Tree for “buys_computer” • From the training dataset, calculate the entropy value, which indicates that the splitting attribute is age • [Figure: decision tree] age <=30 → test student: no → no, yes → yes • age 31..40 → yes • age >40 → test credit rating: excellent → no, fair → yes
  • 131. A Decision Tree for “buys_computer”
  • 132. A Decision Tree for “buys_computer” • From the training data set, age = youth has 2 classes based on the student attribute.
  • 133. A Decision Tree for “buys_computer” • Based on majority voting in the student attribute, RID = 3 is grouped under the yes group.
  • 134. A Decision Tree for “buys_computer” • From the training data set, age = senior has 2 classes based on credit rating.
  • 135. A Decision Tree for “buys_computer” • Final Decision Tree
  • 136. Classification by Decision Tree
  • 137. Classification by Decision Tree • A typical decision tree represents the concept buys_computer; that is, it predicts whether a customer at AllElectronics is likely to purchase a computer. • Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. • Some decision tree algorithms produce only binary trees (where each internal node branches to exactly two other nodes), whereas others can produce nonbinary trees.
  • 138. Classification by Decision Tree • “How are decision trees used for classification?” • Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. • A path is traced from the root to a leaf node, which holds the class prediction for that tuple. • Decision trees can easily be converted to classification rules.
  • 139. Classification by Decision Tree • Why are decision tree classifiers so popular? • The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery. • Decision trees can handle high-dimensional data. • Their representation of acquired knowledge in tree form is intuitive and generally easy for humans to assimilate. • The learning and classification steps of decision tree induction are simple and fast. • In general, decision tree classifiers have good accuracy.
  • 140. Classification by Decision Tree • Extracting Classification Rules from Trees • Represent the knowledge in the form of IF-THEN rules • One rule is created for each path from the root to a leaf • Each attribute-value pair along a path forms a conjunction • The leaf node holds the class prediction • Rules are easier for humans to understand • Example • IF age = “<=30” AND student = “no” THEN buys_computer = “no” • IF age = “<=30” AND student = “yes” THEN buys_computer = “yes” • IF age = “31…40” THEN buys_computer = “yes” • IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no” • IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
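Rule extraction amounts to enumerating the root-to-leaf paths of the tree. The Python sketch below illustrates this on a nested-dictionary encoding of the buys_computer tree; the encoding and the function name are assumptions, and the leaves under age >40 follow the training tuples of slide 129 (fair → yes, excellent → no).

```python
# Each internal node is {attribute: {value: subtree}}; each leaf is a class label.
tree = {"age": {
    "<=30": {"student": {"no": "no", "yes": "yes"}},
    "31..40": "yes",
    ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}


def extract_rules(node, conditions=()):
    """Return one IF-THEN rule per root-to-leaf path."""
    if not isinstance(node, dict):        # leaf: the path so far becomes a rule
        lhs = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f'IF {lhs} THEN buys_computer = "{node}"']
    (attribute, branches), = node.items()
    rules = []
    for value, child in branches.items():
        rules += extract_rules(child, conditions + ((attribute, value),))
    return rules


rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

The tree has five leaves, so exactly five rules are produced, and the rules are mutually exclusive because each tuple follows exactly one path.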
  • 141. Classification by Decision Tree • Training Set and Its AVC Sets • Training examples: the 14 buys_computer tuples of slide 129 • Each AVC-set counts the class labels (Buy_Computer = yes / no) for every value of one attribute
    AVC-set on Age
    Age      yes  no
    <=30     2    3
    31..40   4    0
    >40      3    2
    AVC-set on income
    income   yes  no
    high     2    2
    medium   4    2
    low      3    1
    AVC-set on Student
    student  yes  no
    yes      6    1
    no       3    4
    AVC-set on credit_rating
    credit_rating  yes  no
    fair           6    2
    excellent      3    3
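An AVC-set (attribute-value, class label) is simply a table of counts, so it can be derived directly from the training tuples. The sketch below is an illustration using Python's `collections.Counter`; the helper name `avc_set` and the tuple layout are assumptions.

```python
from collections import Counter

names = ["age", "income", "student", "credit_rating"]
# The 14 training tuples; the last field is buys_computer.
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def avc_set(rows, i):
    """Count (attribute value, class label) pairs for attribute index i."""
    return Counter((r[i], r[-1]) for r in rows)


avc = {name: avc_set(data, i) for i, name in enumerate(names)}
for name in names:
    print(name, dict(avc[name]))
```

These counts are all that an entropy or Gini computation needs, which is why the AVC-sets can replace the full training set when choosing a split.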
  • 142. Any Questions?
  • 143. P1WU UNIT – III: CLASSIFICATION Topic 6: K-NN CLASSIFIER
  • 144. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1. A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing
  • 145. K-NN CLASSIFIER
  • 146. K-NN CLASSIFIER • K-Nearest Neighbour is one of the simplest machine learning algorithms based on the supervised learning technique. • The K-NN algorithm assumes that a new case is similar to the available cases and puts it into the category it most resembles. • The K-NN algorithm stores all the available data and classifies a new data point based on similarity. • This means that when new data appears, it can easily be classified into a well-suited category by using the K-NN algorithm.
  • 147. K-NN CLASSIFIER • KNN (K Nearest Neighbors) is a supervised ML classification algorithm. • It is one of the simplest and most widely used classification algorithms: a new data point is classified based on its similarity to a group of neighboring data points. • It often gives competitive results.
  • 148. K-NN CLASSIFIER EXAMPLE • Example: suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, as it works on a similarity measure. • Our KNN model will compare the features of the new image with those of the cat and dog images and, based on the most similar features, put it in either the cat or the dog category.
  • 149. K-NN CLASSIFIER EXAMPLE
  • 150. K Nearest Neighbor Classification
  • 151. INTRODUCTION TO K-NN CLASSIFIER • K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). • K represents the number of nearest neighbors. • It classifies an unknown example with the most common class among its k closest examples. • KNN is based on “tell me who your neighbors are, and I’ll tell you who you are”
  • 152. INTRODUCTION TO K-NN CLASSIFIER :- Example • If K = 5, then in this case the query instance xq will be classified as negative, since three of its five nearest neighbors are classified as negative.
  • 153. Different Schemes of KNN • 1-Nearest Neighbor • K-Nearest Neighbor using a majority voting scheme • K-NN using a weighted-sum voting scheme
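The three schemes differ only in how many neighbors vote and how each vote is counted. The Python sketch below illustrates this under stated assumptions: Euclidean distance via `math.dist`, inverse-distance weights for the weighted-sum scheme (the slide does not specify the weighting), and a made-up toy dataset.

```python
from collections import Counter, defaultdict
from math import dist


def knn_predict(train, query, k=1, weighted=False):
    """train: list of (point, label) pairs; query: a point given as a tuple."""
    neighbors = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    if not weighted:
        # 1-NN (k = 1) or majority voting (k > 1).
        return Counter(label for _, label in neighbors).most_common(1)[0][0]
    # Weighted-sum voting: closer neighbors contribute larger votes.
    votes = defaultdict(float)
    for point, label in neighbors:
        votes[label] += 1.0 / (dist(point, query) + 1e-9)
    return max(votes, key=votes.get)


# Hypothetical toy data: three "-" points, four "+" points.
train = [((1, 1), "-"), ((1, 2), "-"), ((2, 1), "-"),
         ((3, 3), "+"), ((5, 5), "+"), ((6, 5), "+"), ((5, 6), "+")]
query = (2.6, 2.6)

one_nn = knn_predict(train, query, k=1)
majority = knn_predict(train, query, k=5)
weighted = knn_predict(train, query, k=5, weighted=True)
print(one_nn, majority, weighted)
```

On this toy data, majority voting with k = 5 picks "-" (three of the five neighbors), while 1-NN and weighted voting pick "+" because the single closest point dominates; the schemes can therefore disagree on the same query.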
  • 154. Different Schemes of KNN
  • 155. Different Schemes of KNN
  • 156. kNN: How to Choose k? • In theory, if an infinite number of samples were available, a larger k would give better classification. • The limitation is that all k neighbors have to be close to the query point. • This is possible when an infinite number of samples is available. • It is impossible in practice, since the number of samples is finite. • k = 1 is often used for efficiency, but it is sensitive to “noise”.
  • 157. kNN: How to Choose k?
  • 158. kNN: How to Choose k? • A larger k gives smoother boundaries, which is better for generalization, but only if locality is preserved. Locality is not preserved if we end up looking at samples that are too far away and not from the same class. • Interesting theoretical properties hold if k < sqrt(n), where n is the number of examples. • A heuristically optimal number k of nearest neighbors can be found by cross-validation: each candidate K is validated on data kept independent of the training data, for example by RMSE (root-mean-square error), and the best-scoring K is retained. • Historically, the optimal K for most datasets has been between 3 and 10, which typically produces much better results than 1-NN.
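The cross-validation idea on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the method used in the course: it assumes leave-one-out cross-validation and scores each candidate k by its misclassification rate instead of RMSE (for 0/1 labels the two rank candidates identically). The names `knn_predict`, `loo_error`, `choose_k` and the data in `toy_data` are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def loo_error(data, k):
    """Leave-one-out error rate: hold out each point once, predict it from the rest."""
    mistakes = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, x, k) != y:
            mistakes += 1
    return mistakes / len(data)

def choose_k(data, candidates):
    """Return the candidate k with the lowest error; the smallest k wins ties."""
    return min(candidates, key=lambda k: (loo_error(data, k), k))

# Hypothetical toy data: two clean clusters plus one mislabeled ("noisy") point.
toy_data = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
    ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B"), ((7, 7), "B"),
    ((1.5, 1.5), "B"),  # noise sitting inside cluster A
]

if __name__ == "__main__":
    for k in (1, 3, 5):
        print(k, round(loo_error(toy_data, k), 3))
    print("chosen k:", choose_k(toy_data, [1, 3, 5]))
```

On this toy set k = 1 is misled by the single noisy point, while k = 3 and k = 5 outvote it, which is the behaviour the slide describes.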
  • 159. Distance Measure in KNN • Three distance measures are valid for continuous variables.
  • 160. Distance Measure in KNN • It should also be noted that all three distance measures are valid only for continuous variables. For categorical variables, the Hamming distance must be used. • This also brings up the issue of standardizing the numerical variables to the range 0 to 1 when there is a mixture of numerical and categorical variables in the dataset.
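The distance formulas on these slides were images and did not survive in the transcript. The sketch below assumes the three continuous measures are the ones most commonly presented with KNN (Euclidean, Manhattan and Minkowski) and adds the Hamming distance for categorical values; the function names are illustrative only.

```python
def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, q):
    """Generalization: q = 1 gives Manhattan, q = 2 gives Euclidean."""
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1 / q)

def hamming(a, b):
    """Number of positions at which two categorical records differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

if __name__ == "__main__":
    print(euclidean((0, 0), (3, 4)))
    print(manhattan((0, 0), (3, 4)))
    print(hamming(("red", "small"), ("red", "large")))
```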
  • 161. Simple KNN - Algorithm: • For each training example, add the example to the list training_examples. • Given a query instance xq to be classified, • let x1, x2, …, xk denote the k instances from training_examples that are nearest to xq. • Return the class that occurs most often among these k instances. • Steps: 1. Determine the parameter k, the number of nearest neighbors. 2. Calculate the distance between the query instance and all the training samples. 3. Sort the distances and take as nearest neighbors the samples with the k smallest distances. 4. Gather the categories of the nearest neighbors. 5. Use the simple majority of the categories of the nearest neighbors as the prediction for the query instance.
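The five steps above can be sketched directly in Python. This is a minimal illustration assuming Euclidean distance and numeric feature tuples; `knn_classify` and the small `training_examples` list are hypothetical names and data, not taken from the slides.

```python
import math
from collections import Counter

def knn_classify(training_examples, xq, k):
    """Classify query xq by majority vote among its k nearest training examples."""
    # Step 2: distance between the query instance and every training sample.
    distances = [(math.dist(x, xq), label) for x, label in training_examples]
    # Step 3: sort by distance and keep the k smallest.
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    # Step 4: gather the categories of the nearest neighbors.
    labels = [label for _, label in nearest]
    # Step 5: simple majority of those categories is the prediction.
    return Counter(labels).most_common(1)[0][0]

# Hypothetical training set: three "neg" points and two "pos" points.
training_examples = [
    ((1, 1), "neg"), ((2, 1), "neg"), ((1, 2), "neg"),
    ((5, 5), "pos"), ((6, 5), "pos"),
]

if __name__ == "__main__":
    # Step 1: choose k, then classify a query instance.
    print(knn_classify(training_examples, (2, 2), k=5))
```

With k = 5 the query (2, 2) sees three negative and two positive neighbors and is labeled negative, mirroring the K = 5 example earlier in this topic.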
  • 162. Simple KNN - Algorithm: • The K-NN algorithm can be used for regression as well as for classification, but it is mostly used for classification problems. • K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data. • It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and performs its computation on the dataset at classification time. • In the training phase the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category that is most similar to the new data.
  • 163. Simple KNN – Algorithm Example • Example: • Consider the following data concerning credit default. Age and Loan are two numerical variables (predictors) and Default is the target.
  • 164. Simple KNN – Algorithm Example • Given Training Data set :
  • 165. Simple KNN – Algorithm Example • Data to classify: an unknown case (Age = 48 and Loan = $142,000), using Euclidean distance. • Step 1: Determine the parameter k • K = 3 • Step 2: Calculate the distance to each training example, for instance to (Age = 33, Loan = $150,000, Default = Y): • D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01
  • 166. Simple KNN – Algorithm Example
  • 167. Simple KNN – Algorithm Example • Step 3: Sort the distances (refer to the above diagram) and mark up to the k-th rank, i.e., ranks 1 to 3. • Step 4: Gather the categories of the nearest neighbors: • Age 33, Loan $150,000, Default = Y, Distance 8000 • Age 35, Loan $120,000, Default = N, Distance 22000 • Age 60, Loan $100,000, Default = Y, Distance 42000 • With K = 3, there are two Default = Y and one Default = N among the three closest neighbors. The prediction for the unknown case is Default = Y.
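The arithmetic of this example can be checked with a short script. It uses only the three nearest neighbors listed on this slide, because the full training table was an image and is not part of the transcript; the variable names are illustrative.

```python
import math
from collections import Counter

# The three nearest neighbors listed on the slide: ((Age, Loan), Default).
neighbors = [
    ((33, 150000), "Y"),
    ((35, 120000), "N"),
    ((60, 100000), "Y"),
]
query = (48, 142000)  # unknown case: Age = 48, Loan = $142,000

# Step 2: Euclidean distance from the query to each listed neighbor.
distances = [round(math.dist(x, query), 2) for x, _ in neighbors]

# Steps 4-5: gather the categories and take the simple majority (K = 3).
prediction = Counter(label for _, label in neighbors).most_common(1)[0][0]

if __name__ == "__main__":
    print(distances)
    print("Default =", prediction)
```

The distances are almost exactly the Loan differences (8000, 22000, 42000): the Age term contributes next to nothing because Loan is measured on a far larger scale.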
  • 168. Standardized Distance (Feature Normalization) • One major drawback of calculating distance measures directly from the training set arises when variables have different measurement scales or when there is a mixture of numerical and categorical variables. • For example, if one variable is annual income in dollars and the other is age in years, then income will have a much higher influence on the calculated distance. One solution is to standardize the training set.
  • 169. Standardized Distance (Feature Normalization) • For example, for Loan with X = $40,000: • Xs = (X − Min) / (Max − Min) = (40000 − 20000) / (220000 − 20000) = 0.10 • In the same way, calculate the standardized values for the Age and Loan attributes, then apply the KNN algorithm.
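A minimal sketch of this standardization (min-max scaling) is shown below. The function name `min_max_scale` and the three-value `loans` list are hypothetical; the list only reuses the minimum (20000), the example value (40000) and the maximum (220000) that appear in the slide's calculation, for which 20000 / 200000 gives 0.1.

```python
def min_max_scale(values):
    """Rescale numbers to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to scale, map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical column built from the minimum, example value and maximum above.
loans = [20000, 40000, 220000]

if __name__ == "__main__":
    print(min_max_scale(loans))
```

After every numeric attribute is scaled this way, Age and Loan contribute on comparable scales to the Euclidean distance.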
  • 170. Simple KNN – Algorithm • Advantages • Can be applied to data from any distribution; for example, the data does not have to be separable with a linear boundary • Very simple and intuitive • Good classification if the number of samples is large enough • Disadvantages • Choosing k may be tricky • The test stage is computationally expensive • There is no training stage; all the work is done during the test stage • This is the opposite of what we want: usually we can afford a training step that takes a long time, but we want a fast test step • Needs a large number of samples for accuracy
  • 171. How does K-NN work? • The working of K-NN can be explained on the basis of the algorithm below: • Step-1: Select the number K of neighbors. • Step-2: Calculate the Euclidean distance from the new data point to the training data points. • Step-3: Take the K nearest neighbors according to the calculated Euclidean distance. • Step-4: Among these K neighbors, count the number of data points in each category. • Step-5: Assign the new data point to the category for which the number of neighbors is maximum. • Step-6: Our model is ready.
  • 172. How does K-NN work? • Suppose we have a new data point and we need to put it in the required category. Consider the below image:
  • 173. How does K-NN work? • First, we choose the number of neighbors; here we choose k = 5. • Next, we calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2) it can be calculated as: d = sqrt((x2 − x1)^2 + (y2 − y1)^2)
  • 174. How does K-NN work? • By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Since the majority of the five neighbors are in category A, the new data point is assigned to category A. Consider the below image:
  • 175. Why do we need a K-NN Algorithm? • Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories does this data point lie? • To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
  • 176. Why do we need a K-NN Algorithm? • Consider the below diagram:
  • 177. Any Questions?
  • 178. P1WU UNIT – III: CLASSIFICATION Topic 7: SVM CLASSIFIER
  • 179. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1. A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing
  • 180. SUPPORT VECTOR MACHINE (SVM)
  • 181. INTRODUCTION TO SVM • A classification method for both linear and nonlinear data • It uses a nonlinear mapping to transform the original training data into a higher dimension • Within the new dimension, it searches for the linear optimal separating hyperplane (i.e., “decision boundary”) • With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane • SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors)
  • 182. INTRODUCTION TO SVM • A support vector machine (SVM) is a supervised machine learning model used for classification. • It is mostly preferred for classification but is sometimes very useful for regression as well. • Basically, SVM finds a hyperplane that creates a boundary between the types of data. • In 2-dimensional space, this hyperplane is simply a line.
  • 183. SVM—History and Applications • Vapnik and colleagues (1992), building on groundwork from Vapnik & Chervonenkis’ statistical learning theory in the 1960s • Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization) • Used both for classification and prediction • Applications: • handwritten digit recognition, object recognition, speaker identification, benchmark time-series prediction tests
  • 184. SVM—General Philosophy
  • 185. SVM—Margins and Support Vectors
  • 186. INTRODUCTION TO SVM • In SVM, we plot each data item in the dataset in an N-dimensional space, where N is the number of features/attributes in the data. • Next, we find the optimal hyperplane to separate the data. • It follows that, inherently, SVM can only perform binary classification (i.e., choose between two classes). • However, there are various techniques for applying it to multi-class problems.
  • 187. Support Vector Machine for Multi-class Problems • To apply SVM to multi-class problems, we can create a binary classifier for each class of the data. • The two possible results of each classifier are: • the data point belongs to that class, OR • the data point does not belong to that class. • For example, to perform multi-class classification over a set of fruit classes, we can create a binary classifier for each fruit. • For the ‘mango’ class, say, there will be a binary classifier that predicts whether the item IS a mango or is NOT a mango. • The class whose classifier gives the highest score is chosen as the output of the SVM.
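The one-classifier-per-class scheme described above (often called one-vs-rest) can be sketched as follows. The sketch assumes each binary classifier is already trained and is represented by a hypothetical weight vector and bias; no SVM training is performed, and the fruit names and numbers are made up for illustration.

```python
def score(weights, bias, x):
    """Signed score of a linear binary classifier: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def one_vs_rest_predict(classifiers, x):
    """Return the class whose binary classifier gives x the highest score."""
    return max(classifiers, key=lambda name: score(*classifiers[name], x))

# Hypothetical, already-trained "is it this fruit or not" classifiers:
# class name -> (weight vector, bias).
classifiers = {
    "mango": ((1.0, -1.0), 0.0),
    "apple": ((-1.0, 1.0), 0.0),
    "banana": ((0.5, 0.5), -2.0),
}

if __name__ == "__main__":
    for point in [(3, 1), (1, 3), (4, 4)]:
        print(point, one_vs_rest_predict(classifiers, point))
```

Comparing raw scores across classifiers is a simplification; it presumes the per-class scores are on comparable scales.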
  • 188. SVM—Linearly Separable • A separating hyperplane can be written as • W · X + b = 0 • where W = {w1, w2, …, wn} is a weight vector and b is a scalar (bias) • For 2-D it can be written as • w0 + w1 x1 + w2 x2 = 0 • The hyperplanes defining the sides of the margin: • H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and • H2: w0 + w1 x1 + w2 x2 ≤ –1 for yi = –1 • Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors • This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints → Quadratic Programming (QP) → Lagrangian multipliers
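The hyperplane and margin conditions above can be checked numerically. The sketch below assumes a hypothetical weight vector `w`, bias `b` (the slide's w0) and four toy points; it does not solve the quadratic program, it only evaluates the constraints for a given hyperplane. The margin width 2 / ||W|| is the standard distance between H1 and H2.

```python
import math

def decision_value(w, b, x):
    """Value of W . X + b for a point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """+1 on the positive side of the hyperplane, -1 on the negative side."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def satisfies_margin(w, b, data):
    """True if every tuple meets y * (W . X + b) >= 1, i.e. the H1/H2 constraints."""
    return all(y * decision_value(w, b, x) >= 1 for x, y in data)

def support_vectors(w, b, data, tol=1e-9):
    """Tuples lying exactly on H1 or H2, i.e. y * (W . X + b) == 1."""
    return [x for x, y in data if abs(y * decision_value(w, b, x) - 1) <= tol]

def margin_width(w):
    """Distance between H1 and H2: 2 / ||W||."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical 2-D example: hyperplane 0.5*x1 + 0.5*x2 - 2 = 0.
w, b = (0.5, 0.5), -2.0
data = [((3, 3), 1), ((4, 4), 1), ((1, 1), -1), ((0, 1), -1)]

if __name__ == "__main__":
    print(satisfies_margin(w, b, data))
    print(support_vectors(w, b, data))
    print(margin_width(w))
```

Here (3, 3) and (1, 1) lie on H1 and H2 respectively, and the margin width equals the distance between those two points.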
  • 189. SVM—When Data Is Linearly Separable • Let data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple and yi is its associated class label • There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data) • SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)
  • 190. SVM for complex (Non Linearly Separable) • SVM works very well without any modifications for linearly separable data. • Linearly separable data is any data that can be plotted in a graph and separated into classes using a straight line. • Figure panels: A: Linearly Separable Data; B: Non-Linearly Separable Data
  • 191. SVM CLASSIFIER
  • 192. SVM CLASSIFIER • A vector space method for binary classification problems • documents are represented in a t-dimensional space • find a decision surface (hyperplane) that best separates the documents of the two classes • a new document is classified by its position relative to the hyperplane • Simple 2D example: training documents linearly separable
  • 193. SVM CLASSIFIER • Simple 2D example: training documents linearly separable
  • 194. SVM CLASSIFIER • Line s: the decision hyperplane • maximizes the distances to the closest documents of each class • it is the best separating hyperplane • Delimiting hyperplanes • parallel dashed lines that delimit the region in which to look for a solution
  • 195. SVM CLASSIFIER • Lines that cross the delimiting hyperplanes: • candidates to be selected as the decision hyperplane • Lines that are parallel to the delimiting hyperplanes: • the best candidates
  • 196. SVM CLASSIFIER • Support vectors: documents that belong to, and define, the delimiting hyperplanes • Figure: our example in a 2-dimensional system of coordinates
  • 197. SVM CLASSIFIER