SlideShare a Scribd company logo
AUTOMATIC
CLASSIFICATION
Outline
Introduction
Classification for Information Retrieval
Classification methods
Measures of association
Clustering
The use of clustering in information retrieval
graph theoretic
 'Single-Pass Algorithm'
Single-link
Introduction
What is classification?
A formal definition of classification will not be attempted.
The word 'classification' is used to describe the result of
such a process.
Classification is one of the core areas in machine learning.
Classification for Information
Retrieval
In the context of information retrieval, a classification is
required for a purpose.
The purpose may be to group the documents in such a
way that retrieval will be faster or alternatively it may be to
construct a thesaurus automatically.
classification is used in various subtasks of Preprocessing,
content filtering, sorting, ranking, et.
There are two main areas of application of classification
methods in IR:
• keyword clustering;
• document clustering.
In the main, people have achieved the 'logical
organization' in two different ways.
1.direct classification of the documents
2.The intermediate calculation of a measure of closeness
between documents.
-The first approach has proved theoretically to be intractable
so that any experimental test results cannot be considered
to be reliable.
-The second approach to classification is fairly well
documented now.
Classification methods
The data consists of objects and their corresponding descriptions.
The objects may be documents, keywords, hand written characters,
or species (in the last case the objects themselves are classes as
opposed to individuals)
The descriptors come under various names depending on their
structure:
•(1) multi-state attributes (e.g. color)
•(2) binary-state (e.g. keywords)
•(3) numerical (e.g. hardness scale, or weighted keywords)
•(4) probability distributions.
•The fourth category of descriptors is applicable when the objects are
classes.
Measures of association
Some classification methods are based on a binary relationship
between objects, basis of this relationship a classification method can
construct a system of clusters.
The relationship is described variously as 'similarity', 'association' and
'dissimilarity'.
There are five commonly used measures of association in
information retrieval. Since in information retrieval documents and
requests are most commonly represented by term or keyword lists, an
object is represented by a set of keywords and that the counting
measure | . | gives the size of the set.
|X υ Y | Simple matching coefficient
Which is the number of shared index terms.
Dice’s Coefficient
Jaccard’s Coefficient
Cosine’s
Coefficient
Overlap’s
Coefficient
Clustering in information retrieval
 Clustering is used in information retrieval systems to 
enhance the efficiency and effectiveness of the retrieval 
process.
Clustering is achieved by partitioning the documents in a 
collection into classes such that documents that are 
associated with each other are assigned to the same 
cluster.
In order to cluster the items in a data set, some means of 
quantifying the degree of association between them is 
required.
A cluster method depending only on the rank-ordering of 
the association values would given identical clusterings for 
Cluster hypothesis
In information retrieval, it states that documents that are 
clustered together "behave similarly with respect to 
relevance to information needs".
In terms of classification, it states that if points are in the 
same cluster.
This hypothesis may be simply stated as follows: closely 
associated documents tend to be relevant to the same 
requests.
The use of clustering in information retrieval 
            “ theoretical soundness of the method”
•method should satisfy certain criteria of adequacy.  To list some 
of the more important of these: 
1.the method produces a clustering which is unlikely to be altered drastically
when further objects are incorporated, i.e. it is stable under growth.
2.the method is stable in the sense that small errors in the description of the
objects lead to small changes in the clustering.
3.the method is independent of the initial ordering of the objects
“the efficiency”
• We know much about the behavior of clustered files in terms of the effectiveness of
retrieval (i.e. the ability to retrieve wanted and hold back unwanted documents.)
•Efficiency is really a property of the algorithm implementing the cluster method
•It is sometimes useful to distinguish the cluster method from its algorithm
•the context of IR this distinction becomes slightly less than useful since many cluster
methods are defined by their algorithm.
•two distinct approaches to clustering can be identified:
1.the clustering is based on a measure of similarity between the objects to be clustered;
2.the cluster method proceeds directly from the object descriptions
graph theoretic
• define clusters in terms of a graph derived from the measure of similarity
• A string is a connected sequence of objects from some starting point
• A connected component is a set of objects such that each object is 
connected to at least one other member of the set and the set is maximal 
with respect to this property.
• A maximal complete subgraph is a subgraph such that each node is 
connected to every other node in the subgraph and the set is maximal 
with respect to this property
A large class of hierarchic cluster methods
• is based on the initial measurement of similarity.
The most important of these is single-link which is
the only one to have extensively used in document
retrieval. It satisfies all the criteria of adequacy
mentioned.
• A further class of cluster methods based on
measurement of similarity is the class of so called
'clump' methods.
• The algorithms also use a number of empirically determined
parameters such as:
1. The number of clusters desired;
2. A minimum and maximum size for each cluster;
3. A threshold value on the matching function, below which an
object will not be included in a cluster;
4. the control of overlap between clusters;
5. An arbitrarily chosen objective function which is optimized.
'Single-Pass Algorithm'
(1) The object descriptions are processed serially;
(2) The first object becomes the cluster representative of the first
cluster;
(3) Each subsequent object is matched against all cluster representatives
existing at its processing time;
(4) A given object is assigned to one cluster (or more if overlap is
allowed) according to some condition on the matching function;
(5) When an object is assigned to a cluster the representative for that
cluster is recomputed;
(6) If an object fails a certain test it becomes the cluster representative
of a new cluster.
Single-link
• The output is a hierarchy with associated numerical levels called a dendrogram.
• Frequently the hierarchy is represented by a tree structure such that each node
represents a cluster.
The appropriateness of stratified hierarchic cluster methods
The appropriateness of stratified hierarchic cluster methods
Single-link and the minimum spanning tree
• The single-link tree is closely related to another kind of tree: the minimum
spanning tree, or MST, also derived from a dissimilarity coefficient .
• This second tree is quite different from the first, the nodes instead of
representing clusters represent the individual objects to be clustered.
• The MST is the tree of minimum length connecting the objects, where by 'length'
I mean the sum of the weights of the connecting links in the tree.
• we can define a maximum spanning tree as one of maximum length.. maximum
spanning tree based on the expected mutual information measure.
• Given the minimum spanning tree then the single-link clusters are obtained by
deleting links from the MST in order of decreasing length;
• The connected sets after each deletion are the single-link clusters.
• The order of deletion and the structure of the MST ensure that the clusters will
be nested into a hierarchy.
Implication of classification methods
• classification process can usually be speeded up by using extra
storage.
• In experiments classification structure is keot in fast store but
it’s impossible in operational system where the document
collections are so much bigger.
• In experiment we want to vary cluster representatives at search
time but in operational classification cluster representative
would be constructed once and for all cluster time.
• In IR classification file structure is:
– Easily updated - Easily search - Reasonably compact
Reference
Classification for Information Retrieval
http://bit.ly/2zff6cA
Conceptual clustering in information retrieval
https://ieeexplore.ieee.org/document/678640
CLUSTERING ALGORITHMS
http://orion.lcg.ufrj.br/Dr.Dobbs/books/book5/chap16.htm

More Related Content

What's hot

Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Boolean,vector space retrieval Models
Boolean,vector space retrieval Models
Primya Tamil
 
Temporal databases
Temporal databasesTemporal databases
Temporal databases
Dabbal Singh Mahara
 
Distributed DBMS - Unit 6 - Query Processing
Distributed DBMS - Unit 6 - Query ProcessingDistributed DBMS - Unit 6 - Query Processing
Distributed DBMS - Unit 6 - Query Processing
Gyanmanjari Institute Of Technology
 
2.3 bayesian classification
2.3 bayesian classification2.3 bayesian classification
2.3 bayesian classification
Krish_ver2
 
Distributed System ppt
Distributed System pptDistributed System ppt
Structure of shared memory space
Structure of shared memory spaceStructure of shared memory space
Structure of shared memory space
Coder Tech
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
Anandh Arumugakan
 
Cs6703 grid and cloud computing unit 3
Cs6703 grid and cloud computing unit 3Cs6703 grid and cloud computing unit 3
Cs6703 grid and cloud computing unit 3
RMK ENGINEERING COLLEGE, CHENNAI
 
Consistency protocols
Consistency protocolsConsistency protocols
Consistency protocols
ZongYing Lyu
 
Architecture of data mining system
Architecture of data mining systemArchitecture of data mining system
Architecture of data mining system
ramya marichamy
 
Congestion control
Congestion controlCongestion control
Congestion control
Aman Jaiswal
 
CLOUD COMPUTING UNIT-1
CLOUD COMPUTING UNIT-1CLOUD COMPUTING UNIT-1
CLOUD COMPUTING UNIT-1
Dr K V Subba Reddy
 
Distributed File Systems
Distributed File Systems Distributed File Systems
Distributed File Systems
Maurvi04
 
Information retrieval 9 tf idf weights
Information retrieval 9 tf idf weightsInformation retrieval 9 tf idf weights
Information retrieval 9 tf idf weights
Vaibhav Khanna
 
Common Standards in Cloud Computing
Common Standards in Cloud ComputingCommon Standards in Cloud Computing
Common Standards in Cloud Computing
mrzahidfaiz.blogspot.com
 
Data Mining: Data cube computation and data generalization
Data Mining: Data cube computation and data generalizationData Mining: Data cube computation and data generalization
Data Mining: Data cube computation and data generalization
DataminingTools Inc
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsSelman Bozkır
 
Information retrieval-systems notes
Information retrieval-systems notesInformation retrieval-systems notes
Information retrieval-systems notes
BAIRAVI T
 
CS6010 Social Network Analysis Unit III
CS6010 Social Network Analysis   Unit IIICS6010 Social Network Analysis   Unit III
CS6010 Social Network Analysis Unit III
pkaviya
 

What's hot (20)

Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Boolean,vector space retrieval Models
Boolean,vector space retrieval Models
 
Temporal databases
Temporal databasesTemporal databases
Temporal databases
 
Distributed DBMS - Unit 6 - Query Processing
Distributed DBMS - Unit 6 - Query ProcessingDistributed DBMS - Unit 6 - Query Processing
Distributed DBMS - Unit 6 - Query Processing
 
2.3 bayesian classification
2.3 bayesian classification2.3 bayesian classification
2.3 bayesian classification
 
Cloud Reference Model
Cloud Reference ModelCloud Reference Model
Cloud Reference Model
 
Distributed System ppt
Distributed System pptDistributed System ppt
Distributed System ppt
 
Structure of shared memory space
Structure of shared memory spaceStructure of shared memory space
Structure of shared memory space
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
Cs6703 grid and cloud computing unit 3
Cs6703 grid and cloud computing unit 3Cs6703 grid and cloud computing unit 3
Cs6703 grid and cloud computing unit 3
 
Consistency protocols
Consistency protocolsConsistency protocols
Consistency protocols
 
Architecture of data mining system
Architecture of data mining systemArchitecture of data mining system
Architecture of data mining system
 
Congestion control
Congestion controlCongestion control
Congestion control
 
CLOUD COMPUTING UNIT-1
CLOUD COMPUTING UNIT-1CLOUD COMPUTING UNIT-1
CLOUD COMPUTING UNIT-1
 
Distributed File Systems
Distributed File Systems Distributed File Systems
Distributed File Systems
 
Information retrieval 9 tf idf weights
Information retrieval 9 tf idf weightsInformation retrieval 9 tf idf weights
Information retrieval 9 tf idf weights
 
Common Standards in Cloud Computing
Common Standards in Cloud ComputingCommon Standards in Cloud Computing
Common Standards in Cloud Computing
 
Data Mining: Data cube computation and data generalization
Data Mining: Data cube computation and data generalizationData Mining: Data cube computation and data generalization
Data Mining: Data cube computation and data generalization
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systems
 
Information retrieval-systems notes
Information retrieval-systems notesInformation retrieval-systems notes
Information retrieval-systems notes
 
CS6010 Social Network Analysis Unit III
CS6010 Social Network Analysis   Unit IIICS6010 Social Network Analysis   Unit III
CS6010 Social Network Analysis Unit III
 

Similar to automatic classification in information retrieval

Literature Survey On Clustering Techniques
Literature Survey On Clustering TechniquesLiterature Survey On Clustering Techniques
Literature Survey On Clustering Techniques
IOSR Journals
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
IOSR Journals
 
UNIT - 4: Data Warehousing and Data Mining
UNIT - 4: Data Warehousing and Data MiningUNIT - 4: Data Warehousing and Data Mining
UNIT - 4: Data Warehousing and Data Mining
Nandakumar P
 
A0310112
A0310112A0310112
A0310112
iosrjournals
 
A Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed ClusteringA Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed Clustering
IRJET Journal
 
Du35687693
Du35687693Du35687693
Du35687693
IJERA Editor
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Editor IJARCET
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Editor IJARCET
 
Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...
IRJET Journal
 
Literature Survey: Clustering Technique
Literature Survey: Clustering TechniqueLiterature Survey: Clustering Technique
Literature Survey: Clustering Technique
Editor IJCATR
 
F04463437
F04463437F04463437
F04463437
IOSR-JEN
 
Bs31267274
Bs31267274Bs31267274
Bs31267274IJMER
 
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
IRJET Journal
 
Paper id 26201478
Paper id 26201478Paper id 26201478
Paper id 26201478
IJRAT
 
Cluster analysis foundations.docx
Cluster analysis foundations.docxCluster analysis foundations.docx
Cluster analysis foundations.docx
YaseenRashid4
 
A Novel Clustering Method for Similarity Measuring in Text Documents
A Novel Clustering Method for Similarity Measuring in Text DocumentsA Novel Clustering Method for Similarity Measuring in Text Documents
A Novel Clustering Method for Similarity Measuring in Text Documents
IJMER
 
Dp33701704
Dp33701704Dp33701704
Dp33701704
IJERA Editor
 
Dp33701704
Dp33701704Dp33701704
Dp33701704
IJERA Editor
 
A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...
IJERA Editor
 

Similar to automatic classification in information retrieval (20)

Literature Survey On Clustering Techniques
Literature Survey On Clustering TechniquesLiterature Survey On Clustering Techniques
Literature Survey On Clustering Techniques
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
 
UNIT - 4: Data Warehousing and Data Mining
UNIT - 4: Data Warehousing and Data MiningUNIT - 4: Data Warehousing and Data Mining
UNIT - 4: Data Warehousing and Data Mining
 
A0310112
A0310112A0310112
A0310112
 
A Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed ClusteringA Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed Clustering
 
Du35687693
Du35687693Du35687693
Du35687693
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
 
Ir3116271633
Ir3116271633Ir3116271633
Ir3116271633
 
Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...
 
Literature Survey: Clustering Technique
Literature Survey: Clustering TechniqueLiterature Survey: Clustering Technique
Literature Survey: Clustering Technique
 
F04463437
F04463437F04463437
F04463437
 
Bs31267274
Bs31267274Bs31267274
Bs31267274
 
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
 
Paper id 26201478
Paper id 26201478Paper id 26201478
Paper id 26201478
 
Cluster analysis foundations.docx
Cluster analysis foundations.docxCluster analysis foundations.docx
Cluster analysis foundations.docx
 
A Novel Clustering Method for Similarity Measuring in Text Documents
A Novel Clustering Method for Similarity Measuring in Text DocumentsA Novel Clustering Method for Similarity Measuring in Text Documents
A Novel Clustering Method for Similarity Measuring in Text Documents
 
Dp33701704
Dp33701704Dp33701704
Dp33701704
 
Dp33701704
Dp33701704Dp33701704
Dp33701704
 
A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...
 

Recently uploaded

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 

Recently uploaded (20)

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 

automatic classification in information retrieval

  • 2. Outline Introduction Classification for Information Retrieval Classification methods Measures of association Clustering The use of clustering in information retrieval graph theoretic  'Single-Pass Algorithm' Single-link
  • 3. Introduction What is classification? A formal definition of classification will not be attempted. The word 'classification' is used to describe the result of such a process. Classification is one of the core areas in machine learning.
  • 4. Classification for Information Retrieval In the context of information retrieval, a classification is required for a purpose. The purpose may be to group the documents in such a way that retrieval will be faster or alternatively it may be to construct a thesaurus automatically. classification is used in various subtasks of Preprocessing, content filtering, sorting, ranking, et. There are two main areas of application of classification methods in IR: • keyword clustering; • document clustering.
  • 5. In the main, people have achieved the 'logical organization' in two different ways. 1.direct classification of the documents 2.The intermediate calculation of a measure of closeness between documents. -The first approach has proved theoretically to be intractable so that any experimental test results cannot be considered to be reliable. -The second approach to classification is fairly well documented now.
  • 6. Classification methods The data consists of objects and their corresponding descriptions. The objects may be documents, keywords, hand written characters, or species (in the last case the objects themselves are classes as opposed to individuals) The descriptors come under various names depending on their structure: •(1) multi-state attributes (e.g. color) •(2) binary-state (e.g. keywords) •(3) numerical (e.g. hardness scale, or weighted keywords) •(4) probability distributions. •The fourth category of descriptors is applicable when the objects are classes.
  • 7. Measures of association Some classification methods are based on a binary relationship between objects, basis of this relationship a classification method can construct a system of clusters. The relationship is described variously as 'similarity', 'association' and 'dissimilarity'. There are five commonly used measures of association in information retrieval. Since in information retrieval documents and requests are most commonly represented by term or keyword lists, an object is represented by a set of keywords and that the counting measure | . | gives the size of the set. |X υ Y | Simple matching coefficient Which is the number of shared index terms.
  • 9. Clustering in information retrieval  Clustering is used in information retrieval systems to  enhance the efficiency and effectiveness of the retrieval  process. Clustering is achieved by partitioning the documents in a  collection into classes such that documents that are  associated with each other are assigned to the same  cluster. In order to cluster the items in a data set, some means of  quantifying the degree of association between them is  required. A cluster method depending only on the rank-ordering of  the association values would given identical clusterings for 
  • 11. The use of clustering in information retrieval              “ theoretical soundness of the method” •method should satisfy certain criteria of adequacy.  To list some  of the more important of these:  1.the method produces a clustering which is unlikely to be altered drastically when further objects are incorporated, i.e. it is stable under growth. 2.the method is stable in the sense that small errors in the description of the objects lead to small changes in the clustering. 3.the method is independent of the initial ordering of the objects
  • 12. “the efficiency” • We know much about the behavior of clustered files in terms of the effectiveness of retrieval (i.e. the ability to retrieve wanted and hold back unwanted documents.) •Efficiency is really a property of the algorithm implementing the cluster method •It is sometimes useful to distinguish the cluster method from its algorithm •the context of IR this distinction becomes slightly less than useful since many cluster methods are defined by their algorithm. •two distinct approaches to clustering can be identified: 1.the clustering is based on a measure of similarity between the objects to be clustered; 2.the cluster method proceeds directly from the object descriptions
  • 13. graph theoretic • define clusters in terms of a graph derived from the measure of similarity • A string is a connected sequence of objects from some starting point • A connected component is a set of objects such that each object is  connected to at least one other member of the set and the set is maximal  with respect to this property. • A maximal complete subgraph is a subgraph such that each node is  connected to every other node in the subgraph and the set is maximal  with respect to this property
  • 14.
  • 15. A large class of hierarchic cluster methods • is based on the initial measurement of similarity. The most important of these is single-link which is the only one to have extensively used in document retrieval. It satisfies all the criteria of adequacy mentioned. • A further class of cluster methods based on measurement of similarity is the class of so called 'clump' methods.
  • 16.
  • 17.
  • 18. • The algorithms also use a number of empirically determined parameters such as: 1. The number of clusters desired; 2. A minimum and maximum size for each cluster; 3. A threshold value on the matching function, below which an object will not be included in a cluster; 4. the control of overlap between clusters; 5. An arbitrarily chosen objective function which is optimized.
  • 19. 'Single-Pass Algorithm' (1) The object descriptions are processed serially; (2) The first object becomes the cluster representative of the first cluster; (3) Each subsequent object is matched against all cluster representatives existing at its processing time; (4) A given object is assigned to one cluster (or more if overlap is allowed) according to some condition on the matching function; (5) When an object is assigned to a cluster the representative for that cluster is recomputed; (6) If an object fails a certain test it becomes the cluster representative of a new cluster.
  • 20. Single-link • The output is a hierarchy with associated numerical levels called a dendrogram. • Frequently the hierarchy is represented by a tree structure such that each node represents a cluster.
  • 21. The appropriateness of stratified hierarchic cluster methods
  • 22. The appropriateness of stratified hierarchic cluster methods
  • 23. Single-link and the minimum spanning tree • The single-link tree is closely related to another kind of tree: the minimum spanning tree, or MST, also derived from a dissimilarity coefficient . • This second tree is quite different from the first, the nodes instead of representing clusters represent the individual objects to be clustered. • The MST is the tree of minimum length connecting the objects, where by 'length' I mean the sum of the weights of the connecting links in the tree. • we can define a maximum spanning tree as one of maximum length.. maximum spanning tree based on the expected mutual information measure.
  • 24. • Given the minimum spanning tree then the single-link clusters are obtained by deleting links from the MST in order of decreasing length; • The connected sets after each deletion are the single-link clusters. • The order of deletion and the structure of the MST ensure that the clusters will be nested into a hierarchy.
  • 25. Implication of classification methods • classification process can usually be speeded up by using extra storage. • In experiments classification structure is keot in fast store but it’s impossible in operational system where the document collections are so much bigger. • In experiment we want to vary cluster representatives at search time but in operational classification cluster representative would be constructed once and for all cluster time. • In IR classification file structure is: – Easily updated - Easily search - Reasonably compact
  • 26. Reference Classification for Information Retrieval http://bit.ly/2zff6cA Conceptual clustering in information retrieval https://ieeexplore.ieee.org/document/678640 CLUSTERING ALGORITHMS http://orion.lcg.ufrj.br/Dr.Dobbs/books/book5/chap16.htm

Editor's Notes

  1. Core لب
  2. thesaurus ذاكرة/قاموس
  3. fairly مقبول/لائق intractable غير مقبول /عسير closeness قرب reliable فعال/جاد/ثقه
  4. coefficient معامل/درجة
  5. Quantify يحدد المقدار/الكمية association ترابط
  6. Hypothesis فرضية assumption افتراض/ادعاء behave سلك/تصرف state يعرض/يوضح respect تصرف/اعتبار behave سلك/تصرف
  7. Empirically تجريبى threshold بداية , below دون overlap تداخل – تراكب arbitrarily استبعاد-تحكم
  8. Serially متسلسل subsequent لاحق – تابع against ضد fails قصر عن وظيفتة
  9. Appropriateness ملائم – مناسب stratified قسم الى طبقات – تراصف