This document discusses supervised learning algorithms. It defines supervised learning as using labeled datasets to train algorithms to classify data or predict outcomes accurately. Some commonly used supervised learning algorithms are discussed, including neural networks, naive Bayes, linear regression, logistic regression, support vector machines, k-nearest neighbors, and random forests. These algorithms are used to build models that can generate predictions for new data based on patterns learned from training data.
Discover the evolving technology of artificial intelligence and text analysis. Learn about the importance, types, applications and challenges of the industry. Visit https://www.bytesview.com/ for more information.
The document discusses text classification, which is the process of assigning predefined categories or tags to text. It provides examples of text classification like sentiment analysis and topic detection. Text classification is important because it allows large amounts of unstructured text data to be automatically analyzed and organized, enabling companies to save time, automate processes, and make data-driven decisions. The document outlines some key algorithms used for automatic text classification, including decision trees and Naive Bayes classifiers.
Machine Learning With Logistic Regression - Knoldus Inc.
Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed. Logistic regression is a classification algorithm that builds on linear regression: the linear output is passed through a sigmoid function, and the model parameters are fitted to minimize the classification error.
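That relationship between linear and logistic regression can be sketched in a few lines: compute the linear score w*x + b, squash it through a sigmoid, and fit w and b by gradient descent on the log-loss. The data and hyperparameters below are illustrative, not taken from the slides:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    # stochastic gradient descent on the log-loss
    w = b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)       # predicted probability of class 1
            w -= lr * (p - y) * x        # gradient of log-loss w.r.t. w
            b -= lr * (p - y)            # gradient of log-loss w.r.t. b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # one toy feature
ys = [0, 0, 0, 1, 1, 1]               # class labels
w, b = train_logistic(xs, ys)

def predict(x):
    return 1 if sigmoid(w * x + b) >= 0.5 else 0
```

After training, the learned decision boundary sits between the two groups, so small inputs map to class 0 and large inputs to class 1.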
Unit-3 Professional Ethics in Engineering - Nandakumar P
This document discusses safety and risk assessment in engineering. It defines safety and risk, and examines factors that influence risk perception such as voluntarism, control, and information. It also discusses techniques for assessing and reducing risk, including fault tree analysis, failure mode and effects analysis, and scenario analysis. The document concludes with case studies on the Three Mile Island and Chernobyl nuclear accidents and emphasizes the importance of disaster planning, training, and ensuring safe exits in product design.
This document discusses unsupervised learning approaches including clustering, blind signal separation, and self-organizing maps (SOM). Clustering groups unlabeled data points together based on similarities. Blind signal separation separates mixed signals into their underlying source signals without information about the mixing process. SOM is an algorithm that maps higher-dimensional data onto lower-dimensional displays to visualize relationships in the data.
The document discusses different types of machine learning including supervised learning, unsupervised learning, and reinforcement learning. It provides examples of each type, such as using labeled data to classify emails as spam or not spam for supervised learning, grouping fruits by color without labels for unsupervised learning, and using rewards to guide an agent through a maze for reinforcement learning. The document also covers applications of machine learning across different domains like banking, biomedical, computer, and environment.
Vector space model or term vector model is an algebraic model for representing text documents as vectors of identifiers, such as, for example, index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System
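The model's core operation is easy to sketch: represent each document as a vector of term counts and compare vectors with cosine similarity. The documents below are invented for illustration:

```python
import math
from collections import Counter

def term_vector(text):
    # bag-of-words term vector: term -> count
    return Counter(text.lower().split())

def cosine(a, b):
    # cosine similarity between two sparse term vectors
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

d1 = term_vector("information retrieval ranks documents")
d2 = term_vector("retrieval of relevant documents")
```

Here d1 and d2 share two of four terms each, so their cosine similarity is 0.5; identical documents score 1.0.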
Lecture #1: Introduction to machine learning (ML) - butest
1. Machine learning (ML) is a subfield of artificial intelligence concerned with building computer programs that learn from data and improve their abilities to perform tasks.
2. ML programs build models from example data to predict future examples or describe relationships in the data. For example, an ML program given patient cases could predict diseases in new patients or describe relationships between diseases and symptoms.
3. There are different types of learning including supervised learning (classification, regression), unsupervised learning (clustering), and reinforcement learning (sequential decision making). The goal is to learn patterns in data and generalize to new examples.
Deep learning uses neural networks, which are systems inspired by the human brain. Neural networks learn patterns from large amounts of data through forward and backpropagation. They are constructed of layers including an input layer, hidden layers, and an output layer. Deep learning can learn very complex patterns and has various applications including image classification, machine translation, and more. Recurrent neural networks are useful for sequential data like text and audio. Convolutional neural networks are widely used in computer vision tasks.
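A minimal forward pass through such layers can be sketched in plain Python; the weights and biases here are illustrative placeholders, not trained values:

```python
import math

def forward(x, layers):
    # one forward pass through fully connected layers with sigmoid activations
    for weights, biases in layers:
        x = [1 / (1 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
             for row, b in zip(weights, biases)]
    return x

# network shape: 2 inputs -> 2 hidden units -> 1 output
layers = [
    ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1]),   # hidden layer
    ([[1.0, -1.0]], [0.0]),                     # output layer
]
out = forward([1.0, 2.0], layers)
```

Backpropagation would then adjust these weights against a loss; that step is omitted here to keep the sketch short.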
Anomaly detection (or outlier analysis) is the identification of items, events or observations which do not conform to an expected pattern or to other items in a dataset. It is used in applications such as intrusion detection, fraud detection, fault detection and process monitoring in various domains including energy, healthcare and finance.
In this workshop, we will discuss the core techniques in anomaly detection and discuss advances in Deep Learning in this field.
Through case studies, we will discuss how anomaly detection techniques can be applied to various business problems. We will also demonstrate examples using R, Python, Keras and TensorFlow to help reinforce concepts in anomaly detection and best practices in analyzing and reviewing results.
What you will learn:
Anomaly Detection: An introduction
Graphical and Exploratory analysis techniques
Statistical techniques in Anomaly Detection
Machine learning methods for Outlier analysis
Evaluating performance in Anomaly detection techniques
Detecting anomalies in time series data
Case study 1: Anomalies in Freddie Mac mortgage data
Case study 2: Autoencoder-based Anomaly Detection for Credit risk with Keras and TensorFlow
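As a flavor of the statistical techniques listed above, here is a minimal rolling z-score detector for time series in plain Python; the window, threshold, and data are illustrative, not the workshop's code:

```python
import statistics

def zscore_anomalies(series, window=5, threshold=3.0):
    # flag points more than `threshold` standard deviations away from
    # the mean of the preceding `window` observations
    flagged = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu = statistics.mean(hist)
        sd = statistics.pstdev(hist)
        if sd > 0 and abs(series[i] - mu) / sd > threshold:
            flagged.append(i)
    return flagged

data = [10, 11, 10, 12, 11, 10, 11, 50, 10, 11]
```

On this toy series the spike at index 7 (value 50) is flagged, while the normal fluctuation around 10-12 is not.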
1. Machine learning is a branch of artificial intelligence concerned with algorithms that allow computers to learn from data without being explicitly programmed.
2. A major focus is automatically learning patterns from training data to make intelligent decisions on new data. This is challenging since the set of all possible behaviors given all inputs is too large to observe completely.
3. Machine learning is applied in areas like search engines, medical diagnosis, stock market analysis, and game playing by developing algorithms that improve automatically through experience. Decision trees, Bayesian networks, and neural networks are common algorithms.
The document discusses the vector space model for representing text documents and queries in information retrieval systems. It describes how documents and queries are represented as vectors of term weights, with each term being assigned a weight based on its frequency in the document or query. The vector space model allows documents and queries to be compared by calculating the similarity between their vector representations. Terms that are more frequent in a document and less frequent overall are given higher weights through techniques like TF-IDF weighting. This vector representation enables efficient retrieval of documents ranked by similarity to the query.
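The TF-IDF weighting described above can be written out directly. This uses one common smoothed-idf variant among several, and the tiny corpus is invented for illustration:

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

def tf_idf(term, doc, corpus):
    # term frequency: share of the document's tokens that are `term`
    tokens = doc.split()
    tf = tokens.count(term) / len(tokens)
    # inverse document frequency with add-one smoothing on document count
    df = sum(1 for d in corpus if term in d.split())
    idf = math.log(len(corpus) / (1 + df)) + 1
    return tf * idf
```

With equal in-document frequency, the rarer term ("cat", in one document) receives a higher weight than the common term ("on", in two documents), matching the intuition stated above.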
This document discusses naïve text classification. It begins with an introduction to naïve Bayes classifiers and how they are based on Bayes' theorem. It then discusses how text data is converted into numerical feature vectors to be used in machine learning algorithms. Two examples are provided to illustrate how to use a naïve Bayes classifier to predict class labels. The document concludes with discussing some advantages and disadvantages of naïve Bayes classifiers.
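A minimal multinomial naïve Bayes text classifier with add-one smoothing can be sketched as follows; the training sentences are invented, and this illustrates the technique rather than reproducing the document's own examples:

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    # samples: list of (text, label) pairs
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab, len(samples)

def predict_nb(model, text):
    class_counts, word_counts, vocab, n = model
    best, best_lp = None, -math.inf
    for label, cc in class_counts.items():
        lp = math.log(cc / n)                   # log prior P(class)
        total = sum(word_counts[label].values())
        for w in text.lower().split():
            # add-one smoothed log likelihood P(word | class)
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

train_docs = [("great movie loved it", "pos"),
              ("wonderful great acting", "pos"),
              ("terrible boring film", "neg"),
              ("awful waste boring", "neg")]
model = train_nb(train_docs)
```

New texts are assigned the class with the highest posterior log-probability.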
An introduction to machine learning and probabilistic ... - butest
This document provides an overview and introduction to machine learning and probabilistic graphical models. It discusses key topics such as supervised learning, unsupervised learning, graphical models, inference, and structure learning. The document covers techniques like decision trees, neural networks, clustering, dimensionality reduction, Bayesian networks, and learning the structure of probabilistic graphical models.
This document discusses association rule mining. Association rule mining finds frequent patterns, associations, correlations, or causal structures among items in transaction databases. The Apriori algorithm is commonly used to find frequent itemsets and generate association rules. It works by iteratively joining frequent itemsets from the previous pass to generate candidates, and then pruning the candidates that have infrequent subsets. Various techniques can improve the efficiency of Apriori, such as hashing to count itemsets and pruning transactions that don't contain frequent itemsets. Alternative approaches like FP-growth compress the database into a tree structure to avoid costly scans and candidate generation. The document also discusses mining multilevel, multidimensional, and quantitative association rules.
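The Apriori join-and-prune loop described above can be sketched compactly; the transactions are invented for illustration:

```python
from itertools import combinations

def apriori(transactions, min_support=2):
    # level-wise search: count candidates, keep frequent ones,
    # join survivors into larger candidates, prune by subsets
    transactions = [frozenset(t) for t in transactions]
    k_sets = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    while k_sets:
        counts = {c: sum(1 for t in transactions if c <= t) for c in k_sets}
        survivors = {c for c, n in counts.items() if n >= min_support}
        frequent.update({c: counts[c] for c in survivors})
        # join step: union pairs of surviving k-itemsets into (k+1)-candidates
        k_sets = {a | b for a in survivors for b in survivors
                  if len(a | b) == len(a) + 1}
        # prune step: drop candidates that have an infrequent subset
        k_sets = {c for c in k_sets
                  if all(frozenset(s) in frequent
                         for s in combinations(c, len(c) - 1))}
    return frequent

tx = [["bread", "milk"], ["bread", "butter"],
      ["bread", "milk", "butter"], ["milk"]]
freq = apriori(tx, min_support=2)
```

The prune step is what makes Apriori efficient: {milk, butter} appears only once, so it never reaches the counting phase for larger itemsets.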
This document discusses various machine learning techniques for classification and prediction. It covers decision tree induction, tree pruning, Bayesian classification, Bayesian belief networks, backpropagation, association rule mining, and ensemble methods like bagging and boosting. Classification involves predicting categorical labels while prediction predicts continuous values. Key steps for preparing data include cleaning, transformation, and comparing different methods based on accuracy, speed, robustness, scalability, and interpretability.
Ensemble Learning is a technique that creates multiple models and then combines them to produce improved results.
Ensemble learning usually produces more accurate solutions than a single model would.
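One simple way to combine models is majority voting; the predictions below stand in for the outputs of three hypothetical classifiers:

```python
from collections import Counter

def majority_vote(predictions):
    # return the label predicted by the most models for one input
    return Counter(predictions).most_common(1)[0][0]

# per-input predictions from three hypothetical models
model_a = ["spam", "ham", "spam", "ham", "spam"]
model_b = ["spam", "spam", "spam", "ham", "ham"]
model_c = ["ham", "ham", "spam", "ham", "spam"]
ensemble = [majority_vote(p) for p in zip(model_a, model_b, model_c)]
```

Where any single model errs, the other two usually outvote it, which is the intuition behind the accuracy gain claimed above; bagging and boosting refine this idea with resampling and weighting.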
B.E. / B.TECH. DEGREE
SEMESTER - VIII
PROFESSIONAL ELECTIVE - V
CS8080 INFORMATION RETRIEVAL TECHNIQUES
UNIT - III - TEXT CLASSIFICATION AND CLUSTERING
This document summarizes a seminar presentation on machine learning. It defines machine learning as applications of artificial intelligence that allow computers to learn automatically from data without being explicitly programmed. It discusses three main approaches to machine learning: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning uses labelled training data, unsupervised learning finds patterns in unlabelled data, and reinforcement learning involves learning through rewards and punishments. Example applications discussed include data mining, natural language processing, image recognition, and expert systems.
Smart Data Slides: Machine Learning - Case Studies - DATAVERSITY
The state of the art and practice for machine learning (ML) has matured rapidly in the past 3 years, making it an ideal time to take a look at what works and what doesn’t.
In this webinar, we will review case studies from 3 industries:
-Insurance
-Healthcare
-Pharma
Participants will learn to look for characteristics of business processes and of data that make them well- or ill-suited to augmentation or automation with ML.
This document discusses machine learning concepts including supervised vs. unsupervised learning, clustering algorithms, and specific clustering methods like k-means and k-nearest neighbors. It provides examples of how clustering can be used for applications such as market segmentation and astronomical data analysis. Key clustering algorithms covered are hierarchy methods, partitioning methods, k-means which groups data by assigning objects to the closest cluster center, and k-nearest neighbors which classifies new data based on its closest training examples.
The document discusses feature selection and dimensionality reduction techniques for text classification. It describes how these techniques aim to minimize the number of features in a dataset by selecting only the most important ones, to reduce overfitting and improve model performance. Various feature selection methods are covered, including filter methods that score features based on statistical tests, wrapper methods that evaluate feature subsets with a predictive model, and embedded methods that perform feature selection during model training.
This document discusses evaluation metrics for text classification. It introduces confusion matrices, which contain true positives, false positives, true negatives, and false negatives based on comparing predicted and known labels. Accuracy measures are calculated using these counts from the confusion matrix, allowing evaluation of a classifier's performance. Common measures include precision, recall, and F1 score. The document provides examples of using confusion matrices and contingency tables to evaluate predictive models in fields like bioinformatics.
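The measures named above follow directly from the confusion-matrix counts; a quick sketch with illustrative counts:

```python
def prf(tp, fp, fn):
    # precision, recall, and F1 from confusion-matrix counts
    precision = tp / (tp + fp)            # of predicted positives, how many are right
    recall = tp / (tp + fn)               # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

p, r, f = prf(tp=8, fp=2, fn=4)
```

With 8 true positives, 2 false positives, and 4 false negatives, precision is 0.8, recall is 2/3, and F1 (their harmonic mean) is 8/11.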
This document discusses accuracy and error in text classification. It defines accuracy as the proportion of correct predictions and discusses different types of errors. It also describes several metrics for evaluating classification models, including mean squared error, mean absolute error, mean absolute percent error, and metrics derived from a confusion matrix like recall and precision. Cross-validation and bootstrapping techniques for estimating error rates in classification models are also covered.
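The error metrics mentioned can be written out in a few lines; the values below are illustrative:

```python
def mse(y, yhat):
    # mean squared error
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def mae(y, yhat):
    # mean absolute error
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def mape(y, yhat):
    # mean absolute percent error (undefined when a true value is zero)
    return 100 * sum(abs((a - b) / a) for a, b in zip(y, yhat)) / len(y)

y, yhat = [100, 200, 400], [110, 190, 400]
```

MSE penalizes large errors more heavily than MAE, while MAPE expresses error relative to the true values, which is why the choice of metric matters when comparing models.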
This document discusses unsupervised learning algorithms, specifically clustering. It defines clustering as a machine learning technique that groups unlabeled datasets into clusters of similar data points without supervision. Popular clustering algorithms mentioned include K-means clustering and hierarchical clustering. The key advantages of unsupervised learning are that it can handle more complex tasks since the data is unlabeled, and unlabeled data is easier to obtain than labeled data. However, the results may be less accurate since the algorithms do not know the exact outputs.
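K-means is easy to sketch on one-dimensional data: alternately assign each point to its nearest center, then move each center to the mean of its assigned points (Lloyd's algorithm). The points and initial centers below are illustrative:

```python
def kmeans_1d(points, centers, iters=10):
    # Lloyd's algorithm on 1-D data
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # move each center to the mean of its cluster (keep it if empty)
        centers = [sum(ps) / len(ps) if ps else c for c, ps in clusters.items()]
    return sorted(centers)

pts = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
```

Starting from centers at 0.0 and 5.0, the algorithm converges to centers near 1.0 and 9.0, one per natural group.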
The document discusses decision trees, which are a classification technique where an internal node represents a test on an attribute, tree branches represent outcomes of the test, and leaf nodes represent class labels. It describes how decision trees are constructed in a top-down manner by selecting the optimal attribute to split the data on at each node, recursively partitioning the data until reaching leaf nodes of single class labels. The document provides examples of decision tree construction and classification using a weather dataset.
The document discusses support vector machines (SVM) classifiers. It begins with an introduction to SVM, explaining that it is a supervised machine learning model that finds a hyperplane to classify data. It then covers SVM history and applications, the general philosophy of maximizing margins between classes, and how SVM handles both linearly separable and non-linearly separable data. Finally, it provides an example of how SVM works for text classification by finding the optimal hyperplane to separate documents into classes.
The document discusses multi-dimensional indexing and searching. It is part of a course on information retrieval techniques covering text classification, clustering, naive classification, supervised algorithms like decision trees and SVMs, and dimensionality reduction. Multi-dimensional indexing allows indexing and searching based on multiple fields to support queries with criteria on different fields.
This document provides an overview of data mining and the CRISP-DM methodology. It discusses key terminology, potential applications, and a Venn diagram comparing data mining, knowledge discovery, big data analytics, statistics, and data science. The CRISP-DM methodology is explained in six steps: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Various data exploration, cleaning, transformation, and dimensionality reduction techniques are covered. Common machine learning algorithms, model selection factors, and assessment metrics are also summarized.
The document discusses the topic of indexing and searching in a course on information retrieval techniques. It covers various techniques for text classification including supervised algorithms like decision trees, k-NN classifiers, and SVM classifiers. It also discusses feature selection, evaluation metrics, accuracy, organizing classes, inverted indexes, sequential searching, and multi-dimensional indexing. The document appears to be from a course at Aalim Muhammed Salgh College of Engineering on professional elective CS8080.
The document discusses the K-nearest neighbors (K-NN) classification algorithm. It explains that K-NN is a simple supervised machine learning algorithm that stores all available training data and classifies new data based on similarity. It finds the K closest training examples to a new data point and assigns the most common class among those K examples to the new data point. The document provides examples of how K-NN works and discusses factors like choosing K, distance measures, and normalization.
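The K-NN procedure described can be sketched directly: compute distances from the query to every stored training point, take the K closest, and vote on the label. The coordinates are illustrative:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of ((x, y), label) pairs; Euclidean distance, majority label
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
```

Note the "lazy" character mentioned above: there is no training step at all, so all the cost falls on prediction, and feature scales should be normalized before distances are compared.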
Recent trends discussed include digital transformation, COVID-19 impact, remote working, and disruptive technologies like quantum physics and driverless vehicles. Machine learning techniques can help analyze large, complex datasets and make predictions. Unsupervised machine learning models can find hidden patterns in unlabeled data and group objects based on similarities. Supervised learning predicts target variables using labeled examples to train algorithms like decision trees and random forests. The machine learning process involves data preparation, algorithm selection, model training, prediction, and evaluation.
The document discusses organizing classes for text classification. It covers taxonomies for organizing classes in a hierarchical structure and relationships between them. The document is about a course on information retrieval techniques, specifically discussing organizing classification classes through taxonomies.
The document discusses organizing classes for text classification. It covers taxonomies for organizing classes in a hierarchical structure and relationships between them. The document is about a course on information retrieval techniques, including topics like text classification, clustering, supervised and unsupervised algorithms, evaluation metrics and indexing methods.
Module Overview Careers in Analytics In this module, we .docx - audeleypearl
Module Overview | Careers in Analytics

In this module, we will evaluate the various quantitative data collection and analysis methods in standard industry practice. These methods are what will be used throughout this program, so you should become familiar with the terminology.

The second part of this module presents a variety of career paths for data analysts and an overview of how several industries are currently using data analytics. Pay special attention to the intersection of skills necessary for a data analyst to possess, and think of the steps you can take to gain or improve on these in your own skill set. This may give you an idea of the career path and industry you would like to pursue, or enhance your understanding of a career path and industry you have already chosen.
Industry Practice
Learning Objectives
Explain the technical elements and steps associated with analytics practices and processes
Explore industry practice of data analytics
Typical Quantitative Techniques Used in Advanced Analytics
Several quantitative techniques apply to analytics projects, including:
Type: Description
• Simulation: randomized repetitions of a set of discrete events in order to model real-world systems and phenomena (e.g., queues)
• Optimization: an algorithm selects the best possible outcome, subject to satisfying constraints
• Matrix Algebra: calculations involving matrices solve multidimensional problems
• Fitting Functions to Data: also called “curve fitting”; uses numerical methods to interpolate data
• Survival Analysis: originally used by life scientists, but adopted by marketers and actuaries
• Time Series: used when data are “auto-correlated,” such as time-dependent data (also called “Box-Jenkins”)

Predictive Analytics and Machine Learning
• Classical Statistics: descriptive statistics calculate metrics to characterize the distribution of values in the data (mean, standard deviation, range, etc.); predictive statistics estimate parameters from historical data and make predictions of future outcomes (multivariate regression, generalized linear regression, etc.)
• Learning: unsupervised learning characterizes the data to establish classes without using explicit metrics (e.g., k-means clustering); supervised learning classifies and describes the data with pre-defined ‘labels’ (e.g., decision trees)
• Bayesian: used to augment classical analysis when there is prior knowledge about how the data was generated
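The techniques above give k-means clustering as the canonical unsupervised method. A minimal one-dimensional sketch (illustrative only; the data and seed are invented, and real implementations handle multiple dimensions and convergence checks):

```python
import random

def kmeans_1d(points, k=2, iters=20, seed=0):
    """Minimal 1-D k-means: alternate cluster assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            # Assign each point to its nearest centroid
            i = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[i].append(x)
        # Move each centroid to the mean of its cluster (keep it if empty)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
print(kmeans_1d(data))  # two centroids, one near each cluster
```

No labels are involved: the algorithm discovers the two groups purely from the distances between the values.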
Typical Challenges and Pitfalls in an Analytics Project
1. Poorly defined problem
• Unclear goal of problem-solving
• Scope is unclear, e.g., how many SKUs to analyze
• Mixed objectives, e.g., an economic analysis of a product category promotion that mixes the retailer and CPG perspectives
2. Limited IT resources
• Cloud data can’t be acquired off-line within a reasonable time
• Can’t run the complete model due to computation limitation
• Too slow to generate results in real time
• Can’t share.
This document provides an introduction to machine learning concepts including definitions of machine learning, training and test data, and different machine learning techniques. It defines machine learning as a field that allows machines to learn from data without being explicitly programmed. It describes how training data is used to teach a machine and test data is used to evaluate how well a machine has learned. The document outlines common machine learning techniques including supervised learning techniques like classification and regression as well as unsupervised learning techniques like clustering. It provides examples of different algorithms for each technique.
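The split between training and test data mentioned here is easy to illustrate. A hedged sketch (the ratio and seed are arbitrary choices, not from the document):

```python
import random

def train_test_split(data, test_ratio=0.25, seed=42):
    """Shuffle labeled examples and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

# Toy labeled examples: (value, label) pairs
data = [(i, "even" if i % 2 == 0 else "odd") for i in range(12)]
train, test = train_test_split(data)
print(len(train), len(test))  # → 9 3
```

The model is fit only on `train`; `test` is held back so the evaluation measures how well the machine generalizes rather than how well it memorized.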
machine learning workflow with data input.pptx (jasontseng19)
This document provides an overview of machine learning, including definitions of key concepts like tasks, experience, and performance in machine learning. It also discusses common machine learning workflows like data loading, understanding data through statistics and visualization, preparing data through techniques like scaling and normalization, selecting features, and discussing applications and types of machine learning. It provides examples of challenges in machine learning and techniques for data preparation and feature selection.
This document discusses machine learning techniques and concepts. It introduces topics like supervised learning, unsupervised learning, reinforcement learning, and neural networks. It defines key machine learning terms and describes applications. The objectives are to understand fundamental machine learning concepts and the need for machine learning to solve various problems. Examples of motivating problems discussed include handwritten character recognition, fingerprint recognition, and face recognition.
The document discusses the topic of sequential searching as part of a course on information retrieval techniques. It covers text classification, clustering algorithms, naive classification, supervised algorithms like decision trees and SVMs, feature selection, evaluation metrics, and indexing and searching techniques including inverted indexes and sequential searching. The document appears to be from a college course that focuses on classification, clustering, and searching methods for information retrieval.
CS8080_IRT_UNIT - III T1 A CHARACTERIZATION OF TEXT CLASSIFICATION.pdf
1. P1WU
UNIT – III: CLASSIFICATION
Topic 1: A CHARACTERIZATION OF TEXT
CLASSIFICATION
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
2. UNIT III
1.A Characterization of
Text Classification
2. Unsupervised
Algorithms: Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or
Dimensionality Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional
Indexing
3. INTRODUCTION TO CLASSIFICATION
4. INTRODUCTION TO CLASSIFICATION
• Scientists became very serious about addressing the question:
• “Can we build a model that learns from available data and
automatically makes the right decisions and predictions?”
• Answer can be found in numerous applications that are emerging
from the fields of
1. pattern classification,
2. machine learning, and
3. artificial intelligence.
5. INTRODUCTION TO CLASSIFICATION
• Data from various sensing devices, combined with powerful
learning algorithms and domain knowledge, led to
• many great inventions that we now take for granted in our
everyday life:
• Internet queries via search engines like Google,
• text recognition at the post office,
• barcode scanners at the supermarket, the diagnosis of diseases,
• speech recognition by Siri or
• Google Now on our mobile phone, just to name a few.
6. INTRODUCTION TO CLASSIFICATION
• Classification is:
• the data mining process of
• finding a model (or function) that
• describes and distinguishes data classes or concepts,
• for the purpose of being able to use the model to predict the class of objects
whose class label is unknown.
• That is, predicts categorical class labels (discrete or nominal).
• Classifies the data (constructs a model) based on the training set.
• It predicts group membership for data instances.
7. INTRODUCTION TO CLASSIFICATION
What is CLASSIFICATION?
• Classification and prediction are:
• two forms of data analysis that can be used to extract models describing
important data classes or to predict future data trends.
• Together they help us gain a better understanding of large data sets.
• Classification predicts categorical (discrete, unordered) labels.
• Prediction models continuous valued functions.
8. INTRODUCTION TO CLASSIFICATION
• How can we classify?
• The trick here is machine learning, which makes classifications based on past
observations (the learning part).
• We give the machine a set of texts tagged with labels and let the model learn
from all this data; the trained model can later tell us the category of new text
input we feed it.
9. Applications of Classification
• Classification of (potential) customers for:
• Credit approval, risk prediction, selective marketing
• Performance prediction based on
• selected indicators
• Medical diagnosis based on symptoms or reactions to Therapy
• Application areas:
• Credit approval
• Target marketing
• Medical diagnosis
• Treatment effectiveness analysis
• Performance prediction
10. When is classification needed?
• Scenarios:
• In each of these examples, the data analysis task is classification,
• where a model or classifier is constructed to predict categorical labels, such as
• “safe” or “risky” for the loan application data;
• “yes” or “no” for the marketing data; or
• “treatment A,” “treatment B,” or “treatment C” for the medical data.
• These categories can be represented by discrete values, where the ordering among values
has no meaning.
• For example,
• the values 1, 2, and 3 may be used to represent treatments A, B, and C,
• where there is no ordering implied among this group of treatment regimes.
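Mapping such unordered categories to integer codes (label encoding) can be sketched as follows; the codes are arbitrary and imply no ordering:

```python
# Toy label data; dict.fromkeys keeps the first-seen order of distinct labels
treatments = ["treatment A", "treatment B", "treatment C", "treatment A"]
codes = {label: i + 1 for i, label in enumerate(dict.fromkeys(treatments))}
encoded = [codes[t] for t in treatments]
print(codes)    # → {'treatment A': 1, 'treatment B': 2, 'treatment C': 3}
print(encoded)  # → [1, 2, 3, 1]
```

As the slide notes, 1 < 2 < 3 here is meaningless: the integers are identifiers, not ranks.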
11. INTRODUCTION TO CLASSIFICATION
Aim: predict categorical class labels
for new tuples/samples
Input: a training set of tuples/samples,
each with a class label
Output: a model (a classifier) based on
the training set and the class labels
12. Why Classification?
• A classical problem extensively studied by
• statisticians and machine learning researchers
• Predicts categorical class labels.
• Produces a model (classifier).
13. Typical Applications of Classification
• Example:
• {credit history, salary} → credit approval (Yes/No)
• {Temp, Humidity} → Rain (Yes/No)
• A set of documents → sports, technology, etc.
• Another Example:
• If x >= 90 then grade = A.
• If 80 <= x < 90 then grade = B.
• If 70 <= x < 80 then grade = C.
• If 60 <= x < 70 then grade = D.
• If x < 60 then grade = F.
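Expressed as code, these threshold rules form a small hand-written classifier (treating every score below 60 as F):

```python
def grade(x):
    """Assign a letter grade from a numeric score using threshold rules."""
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    return "F"

print([grade(s) for s in (95, 85, 72, 65, 40)])  # → ['A', 'B', 'C', 'D', 'F']
```

This is classification without learning: the decision boundaries are fixed by hand rather than induced from training data.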
14. WHAT ARE TEXT CLASSIFICATION?
• Text classification is a machine
learning technique that assigns a
set of predefined categories
to open-ended text.
• Text classifiers can be used to
organize, structure, and categorize
pretty much any kind of text:
documents, medical studies, files,
and content from all over the web.
15. What is meant by text classification?
• Text classification or Text Categorization
is the activity of labeling natural
language texts with relevant categories
from a predefined set.
• In layman's terms, text classification is
the process of extracting generic tags
from unstructured text.
• These generic tags come from a set of
pre-defined categories.
16. What is meant by text classification or document classification?
• Document classification or document categorization is
• a problem in library science, information science and
computer science.
• The task is to assign a document to one or more classes or
categories.
• This may be done "manually" or algorithmically.
•Wikipedia
17. What is meant by text classification?
• Text classification also known as text tagging or text
categorization is the process of categorizing text into
organized groups.
• By using Natural Language Processing (NLP), text
classifiers can automatically analyze text and then
assign a set of pre-defined tags or categories based on
its content.
18. Text Classification Examples
• Text classification is becoming an increasingly important part of
business, as it allows companies to easily get insights from data
and automate business processes.
• Some of the most common examples and use cases for
automatic text classification include the following:
a) Sentiment Analysis
b) Topic Detection
c) Language Detection
19. Text Classification Examples
a) Sentiment Analysis: the process of determining whether a given text
talks positively or negatively about a given subject
(e.g. for brand monitoring purposes).
b) Topic Detection: the task of identifying the theme or topic of a piece
of text
(e.g. know if a product review is about Ease of Use, Customer Support,
or Pricing when analyzing customer feedback).
c) Language Detection: the procedure of detecting the language of a
given text
(e.g. know if an incoming support ticket is written in English or Spanish for
automatically routing tickets to the appropriate team).
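The language-detection example above can be sketched with a toy stop-word-overlap heuristic; real detectors typically use character n-gram statistics, and the word lists below are small illustrative assumptions:

```python
# Toy language detector: pick the language whose stop-word list
# overlaps the text the most. Word lists are illustrative, not exhaustive.
STOPWORDS = {
    "english": {"the", "is", "and", "to", "of", "in", "my", "not"},
    "spanish": {"el", "es", "y", "de", "en", "la", "mi", "no"},
}

def detect_language(text: str) -> str:
    """Return the language with the largest stop-word overlap."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(detect_language("the product is great and easy to use"))  # english
print(detect_language("el producto es bueno y fácil de usar"))  # spanish
```

A support desk could run such a check on each incoming ticket and route it to the English- or Spanish-speaking team accordingly.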
20. A Characterization of Text Classification
• Text classification is
• one of the fundamental tasks in natural language processing, with broad applications such
as sentiment analysis, topic labeling, spam detection, and intent detection.
• For example,
• news articles can be organized by topic;
• support tickets can be organized by urgency;
• chat conversations can be organized by language;
• brand mentions can be organized by sentiment; and so on.
• Here’s an example of how it works:
• “The user interface is quite straightforward and easy to use.”
• A text classifier can take this phrase as an input, analyze its content, and then automatically
assign relevant tags, such as UI and Easy To Use.
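The tag-assignment step above can be mimicked with a minimal rule-based tagger. The keyword-to-tag lexicon below is invented for illustration; a trained classifier would learn these associations from labeled data rather than from hand-written rules:

```python
# Minimal rule-based tagger for the slide's example phrase.
# The lexicon below is a made-up illustration, not a real taxonomy.
TAG_KEYWORDS = {
    "UI": ["user interface", "interface", "screen"],
    "Easy To Use": ["easy to use", "straightforward", "intuitive"],
}

def assign_tags(text: str) -> list:
    """Return every tag whose keywords appear in the text."""
    text = text.lower()
    return [tag for tag, kws in TAG_KEYWORDS.items()
            if any(kw in text for kw in kws)]

phrase = "The user interface is quite straightforward and easy to use."
print(assign_tags(phrase))  # ['UI', 'Easy To Use']
```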
21. A Characterization of Text Classification
• A first tactic for categorizing documents is to assign a
label to each document,
• but this solves the problem only when users know the
labels of the documents they are looking for.
• This tactic does not solve the more generic problem of
finding documents on a specific topic or subject.
22. A Characterization of Text Classification
• In that case, a better solution is to
• group documents by common generic topics and label each group
with a meaningful name.
• Each labeled group is called a category or class.
• Document classification is
• the process of categorizing documents under a given cluster or
category using a fully supervised learning process.
23. Why is Text Classification Important?
• It’s estimated that around 80% of all information is unstructured, with text
being one of the most common types of unstructured data.
• Because of the messy nature of text,
• analyzing, understanding, organizing, and sorting through text data is hard and time-consuming, so
most companies fail to use it to its full potential.
• This is where text classification with machine learning comes in.
• Using text classifiers, companies can automatically structure all manner of
relevant text, from
• legal documents, social media, chatbots, surveys, and more, in a fast and cost-effective way.
• This allows companies to
• save time analyzing text data, automate business processes, and make data-driven business
decisions.
24. Reasons for: Text Classification Important
a) Scalability
• Manually analyzing and organizing text is slow and much less accurate.
• Machine learning can automatically analyze millions of surveys, comments, emails,
etc., at a fraction of the cost, often in just a few minutes.
• Text classification tools are scalable to any business needs, large or small.
b) Real-time analysis
• There are critical situations that companies need to identify as soon as possible and
take immediate action (e.g., PR crises on social media).
• Machine learning text classification can follow your brand mentions constantly and in
real time, so you'll identify critical information and be able to take action right away.
25. Reasons for: Text Classification Important
c) Consistent criteria
• Human annotators make mistakes when classifying text data due to
distractions, fatigue, and boredom, and human subjectivity creates inconsistent
criteria.
• Machine learning, on the other hand, applies the same lens and criteria to all
data and results.
• Once a text classification model is properly trained, it classifies
new data with consistent, repeatable accuracy.
26. A Characterization of Text Classification
• Classification can be performed
1. manually by domain experts, or
2. automatically using well-known and
• widely used classification algorithms such as decision trees and
Naïve Bayes.
• Documents are classified according to
• their subjects, or according to other attributes (e.g. author, document type,
publishing year, etc.).
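As a sketch of the Naïve Bayes approach mentioned above, here is a tiny multinomial Naïve Bayes classifier with Laplace smoothing, trained on a made-up spam/ham corpus. In practice one would use a library implementation (e.g. scikit-learn's MultinomialNB) on a real dataset:

```python
import math
from collections import Counter, defaultdict

# Made-up two-class training corpus for illustration only.
train = [
    ("win cash prize now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

class_docs = defaultdict(list)
for text, label in train:
    class_docs[label].extend(text.split())

vocab = {w for words in class_docs.values() for w in words}
priors = {c: math.log(sum(1 for _, l in train if l == c) / len(train))
          for c in class_docs}
counts = {c: Counter(words) for c, words in class_docs.items()}

def classify(text: str) -> str:
    """Pick the class maximizing log P(c) + sum over words of log P(w|c)."""
    scores = {}
    for c in class_docs:
        total = sum(counts[c].values()) + len(vocab)
        scores[c] = priors[c] + sum(
            math.log((counts[c][w] + 1) / total)  # Laplace smoothing
            for w in text.split() if w in vocab)
    return max(scores, key=scores.get)

print(classify("claim your free prize"))   # spam
print(classify("agenda for the meeting"))  # ham
```

Working in log-space avoids floating-point underflow when many word probabilities are multiplied.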
27. A Characterization of Text Classification
• There are two main kinds of subject classification of documents:
1. the content-based approach and
2. the request-based approach.
• In content-based classification,
• the weight given to subjects in a document decides the class to which the document is assigned.
• For example, it is a rule in some library classification schemes that at least 15% of the content of a
book should be about the class to which the book is assigned.
• In automatic classification, the number of times given words appear in a document determines the
class.
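The word-frequency idea can be sketched as follows: the class whose indicator terms account for the largest share of a document's words wins, subject to a minimum share loosely mirroring the 15% library rule. The term lists below are invented for illustration:

```python
# Content-based sketch: classify by the share of class indicator terms.
# Term lists are invented; real systems learn term weights from data.
CLASS_TERMS = {
    "sports": {"match", "goal", "team", "score"},
    "finance": {"stock", "market", "profit", "shares"},
}

def classify_by_content(text: str, min_share: float = 0.15) -> str:
    """Assign the class with the largest indicator-term share,
    or 'unclassified' if no class reaches the minimum share."""
    words = text.lower().split()
    shares = {c: sum(w in terms for w in words) / len(words)
              for c, terms in CLASS_TERMS.items()}
    best = max(shares, key=shares.get)
    return best if shares[best] >= min_share else "unclassified"

doc = "the team scored a late goal to win the match"
print(classify_by_content(doc))  # sports
```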
28. A Characterization of Text Classification
• In request-oriented
classification, the anticipated
requests from users influence
how documents are
classified.
• The classifier asks:
• "Under which description should this
entity be found?" and
• "think of all the possible queries and
decide for which ones the entity at
hand is relevant".
29. Text Classification Applications
• With the help of text classification, businesses can make sense of large
amounts of data using techniques like
• aspect-based sentiment analysis to understand what people are talking about
and how they’re talking about each aspect.
• Text classification can help support teams provide a stellar experience
by
• automating tasks that are better left to computers, saving precious time that
can be spent on more important things.
30. Text Classification Applications
• Text classification models can help you analyze survey results to discover patterns
and insights like:
• What do people like about our product or service?
• What should we improve?
• What do we need to change?
• By combining both quantitative results and qualitative analyses,
• teams can make more informed decisions without having to spend hours
manually analyzing every single open-ended response.
31. Text Classification Applications
• Text classification has thousands of use cases and is applied to a wide range
of tasks.
• In some cases, data classification tools work behind the scenes to enhance
app features we interact with on a daily basis (like email spam filtering).
• In some other cases, classifiers are used by marketers, product managers,
engineers, and salespeople to automate business processes and save
hundreds of hours of manual data processing.
• Some of the top applications and use cases of text classification include:
1. Detecting urgent issues
2. Automating customer support processes
3. Listening to the Voice of the Customer (VoC)
32. A Characterization of Text Classification
• Automatic document classification tasks can be divided into three
types
1. Unsupervised document classification (document clustering): the
classification must be done entirely without reference to external information.
2. Semi-supervised document classification: only part of the documents are
labeled by an external method.
3. Supervised document classification: some external method (such as
human feedback) provides information on the correct classification for
documents.
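The semi-supervised case can be illustrated with a crude self-training sketch: each unlabeled document receives the label of the most word-overlapping labeled document, then joins the training set. A real system would use a probabilistic classifier with confidence thresholds; the corpus here is made up:

```python
# Crude self-training sketch: pseudo-label unlabeled documents using
# word overlap with labeled ones, then grow the labeled set.
labeled = [("goal scored in the match", "sports"),
           ("stock market rises", "finance")]
unlabeled = ["late goal wins the match", "market shares fall"]

def overlap(a: str, b: str) -> int:
    """Number of words the two texts share."""
    return len(set(a.split()) & set(b.split()))

for doc in unlabeled:
    # Copy the label of the most similar labeled document.
    _, label = max(labeled, key=lambda lt: overlap(doc, lt[0]))
    labeled.append((doc, label))

print(labeled[-2:])  # the two pseudo-labeled documents
```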
33. Computational Supervised Learning
• Computational supervised learning, also called classification, aims to:
• Learn from past experience, and
• use the learned knowledge to classify new data
• The knowledge is learned by intelligent algorithms
• Examples:
• Clinical diagnosis for patients
• Cell type classification
34. Overall Picture of Supervised Learning
[Figure: documents from domains such as Biomedical, Financial, Government, and Scientific feed into classifiers ("M-Doctors") such as decision trees, emerging patterns, SVM, and neural networks.]
35. Unsupervised Learning
• Unsupervised learning is a machine learning technique in which
models are not supervised using a labeled training dataset. Instead, the
model itself finds hidden patterns and insights in the given data. It can
be compared to the learning that takes place in the human brain while
learning new things. It can be defined as:
• "Unsupervised learning is a type of machine learning in which models
are trained using an unlabeled dataset and are allowed to act on that
data without any supervision".
36. Unsupervised Learning
Unsupervised learning cannot be directly applied to a regression or
classification problem because, unlike supervised learning, we have the
input data but no corresponding output data.
The goal of unsupervised learning is to
find the underlying structure of the dataset, group the data according to
similarities, and represent the dataset in a compressed format.
37. Unsupervised Learning
Example: Suppose the unsupervised learning algorithm is given an input
dataset containing images of different types of cats and dogs.
The algorithm is never trained on the given dataset, which means it
has no prior knowledge of the dataset's features.
The task of the unsupervised learning algorithm is to identify the image
features on their own.
38. Unsupervised Learning
• An unsupervised learning algorithm will
• perform this task by clustering the image dataset into groups according to
similarities between images.
• Simply put,
• no labeled training data is provided. Examples:
• neural network models
• independent component analysis
• clustering
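The clustering idea above can be sketched with a bare-bones k-means on one-dimensional points; a real pipeline would cluster multi-dimensional feature vectors (e.g. with scikit-learn's KMeans), and the data below is made up:

```python
import random

# Bare-bones k-means: alternate between assigning points to their
# nearest center and recomputing each center as its cluster's mean.
def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to the nearest center
            clusters[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        # recompute centers; keep the old one if a cluster went empty
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))  # two well-separated cluster means
```

No label ever tells the algorithm which group a point belongs to; the two clusters emerge purely from the similarity structure of the data.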
39. Supervised vs. Unsupervised Learning
Classification vs. Clustering
• Supervised learning (classification)
• Supervision: the training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations.
• New data is classified based on the training set.
• Unsupervised learning (clustering)
• The class labels of the training data are unknown.
• Given a set of measurements, observations, etc., the aim is to establish
the existence of classes or clusters in the data.
40. Any Questions?