Personalized classifiers
Motivation
• With the proliferation of digital documents it is
important to have sound organization – i.e.
Categorization
– Faceted Search, Exploratory Search, Navigational
Search, Diversifying Search Results, Ranking, etc.
• Yahoo! employs 200 (?) people for manual
labeling of Web pages and managing a hierarchy
of 500,000+ categories*
• MEDLINE (National Library of Medicine) spends
$2 million/year for manual indexing of journal
articles and evolving Medical Subject Headings
(18,000+ categories)*
* Source: www.cs.purdue.edu/homes/lsi/CS547_2013_Spring/.../IR_TC_I_Big.pdf
(Department of Computer Science, Purdue University)
Challenges
• What categories to choose?
• Predefined?
– Reuters, DMOZ, Yahoo categories
• Relevant to organization?
– Personalized categories
Assumptions
• We assume that a knowledge graph exists
with all possible categories
– that can cover the terminology of nearly any
document collection;
– for example, Wikipedia
• Nodes are categories
• Edges are relationships between them
– Association (related)
• Organization receives documents in batches
– Monthly, Weekly, etc.
A part of Knowledge Graph (KnG)
Problem Definition
• Learning a personalized model for the
association of the categories in KnG to a
document collection through active learning
and feature design
• Building an evolving multi-label categorization
system to categorize documents into
Categories Specific to an Organization
– Personalization of categories
Scope of Work
Overall Architecture
• We evolve the personalized classifier based on
– Documents seen so far
– Categories referenced from Knowledge Graph
– Feedback provided by the user
Step 1: Spotting
• Spot the keywords in documents
– Key phrase identification techniques
– NLP (noun phrases)
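As a minimal sketch of what this spotting step might look like (assuming a noun-phrase-based spotter built on spaCy and its small English model; the actual keyphrase identification techniques used in the paper may differ):

```python
# Hypothetical sketch of Step 1 (spotting). Assumes spaCy and the
# en_core_web_sm model are installed; the paper's techniques may differ.
import spacy

nlp = spacy.load("en_core_web_sm")

def spot_keywords(text: str) -> list[str]:
    """Return candidate key phrases (noun chunks) found in the document text."""
    doc = nlp(text)
    seen, phrases = set(), []
    for chunk in doc.noun_chunks:
        phrase = chunk.text.lower().strip()
        if phrase and phrase not in seen:   # deduplicate, preserve order
            seen.add(phrase)
            phrases.append(phrase)
    return phrases

print(spot_keywords("We train a linear classifier with a hinge loss function."))
# e.g. ['we', 'a linear classifier', 'a hinge loss function']
```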
Step 2: Candidate Categories
• Keywords are indicative of document topics
• Identify the categories from the KnG based on
keyword lookups
– Title Match, Gloss match with Wikipedia categories
• Add categories in the Markov blanket of the spotted categories
– Observe that categories that get assigned to a
document exhibit semantic relations such as
“associations”
– E.g.: category “Linear Classifier” is related to
categories such as “Kernel Methods in Classifiers,”
“Machine Learning,” and the like
– Refer to our paper for more details
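The sketch below illustrates how candidate categories might be gathered: keyword lookups against category titles and glosses, followed by Markov-blanket (neighbor) expansion in the KnG. The tiny graph and glosses are toy stand-ins, not the actual Wikipedia-derived KnG.

```python
# Hypothetical sketch of Step 2: keyword lookup plus Markov-blanket expansion.
import networkx as nx

kng = nx.Graph()
kng.add_edge("Linear classifier", "Machine learning", weight=0.8)
kng.add_edge("Linear classifier", "Kernel methods", weight=0.6)
kng.add_edge("Kernel methods", "Support vector machine", weight=0.9)

gloss = {
    "Linear classifier": "a classifier that decides using a linear combination of features",
    "Machine learning": "the study of algorithms that improve through experience",
    "Kernel methods": "algorithms that use kernel functions, e.g. support vector machines",
    "Support vector machine": "a supervised max-margin model for classification",
}

def candidate_categories(keywords):
    spotted = set()
    for kw in keywords:
        for cat in kng.nodes:
            # Title match or (crude) gloss match.
            if kw in cat.lower() or kw in gloss.get(cat, "").lower():
                spotted.add(cat)
    # Expand with neighbors in the KnG (the "Markov blanket" of spotted categories).
    expanded = set(spotted)
    for cat in spotted:
        expanded.update(kng.neighbors(cat))
    return expanded

print(candidate_categories(["linear classifier"]))
# {'Linear classifier', 'Machine learning', 'Kernel methods'}
```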
Candidate Categories
• Not all candidate categories are relevant to
the document
– The document is not about that category
– The category is not of interest to the user
• We need to select only the most appropriate
categories from these candidate categories
Step 3: Associative Markov Network
formation
• Two types of informative features are available
– a feature that is a function of the document and a
category, such as the category-specific classifier
scoring function evaluated on a document
– a feature that is a function of two categories, such
as their co-occurrence frequency or textual
overlap between their descriptions.
• An Associative Markov Network (AMN) is a very
natural way of modeling these two types of
features
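A rough sketch of how these two feature types could be attached to the candidate category subgraph is given below; the feature computations are crude token-overlap stand-ins (the concrete features used are listed on the next slides).

```python
# Illustrative sketch only: attach node features (document x category) and
# edge features (category x category) to the candidate-category subgraph.
import networkx as nx

def build_amn(doc_text, candidates, gloss, related_pairs):
    doc_tokens = set(doc_text.lower().split())
    g = nx.Graph()
    for cat in candidates:
        # Node feature: a function of the document and the category.
        overlap = len(doc_tokens & set(gloss[cat].lower().split()))
        g.add_node(cat, node_features=[overlap])
    for a, b in related_pairs:
        if a in g and b in g:
            # Edge feature: a function of the two categories.
            shared = len(set(gloss[a].lower().split()) & set(gloss[b].lower().split()))
            g.add_edge(a, b, edge_features=[shared])
    return g
```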
Associative Markov Network
• The candidate categories for a journal article
taken from arXiv.org
• Only some are actually relevant, depending on
– Relevance to the document
– User preferences
Step 4: Collectively Inferring
Categories of a Document
• Node features
– Capture the similarity of a node (category) to the
document
– E.g.: kernels, SVM / Naïve Bayes classifier scores
• Edge features
– Capture the similarity between nodes
– E.g.: title match, gloss match, etc.
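As minimal stand-ins for the features named above (a per-category classifier score as a node feature, and TF-IDF cosine similarity between category glosses as an edge feature), assuming scikit-learn; the actual kernels and similarity measures in the paper may differ:

```python
# Illustrative node/edge feature functions (scikit-learn based stand-ins).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import LinearSVC

def node_feature(doc_text, category_clf, vectorizer):
    """Category-specific classifier score evaluated on the document."""
    return float(category_clf.decision_function(vectorizer.transform([doc_text]))[0])

def edge_feature(gloss_a, gloss_b):
    """Textual overlap (cosine similarity) between two category glosses."""
    tfidf = TfidfVectorizer().fit_transform([gloss_a, gloss_b])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

# Toy per-category classifier for "Support vector machine".
texts = ["support vector machines maximize the margin between classes",
         "the tumor responded well to hormone therapy"]
vec = TfidfVectorizer().fit(texts)
clf = LinearSVC().fit(vec.transform(texts), [1, 0])
print(node_feature("we train an svm classifier with a linear kernel", clf, vec))
print(edge_feature("a supervised max-margin classifier",
                   "algorithms that rely on kernel functions"))
```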
Collectively Inferring Categories of a
Document
• This is the MAP inference in a standard Markov Network with
only node and edge potentials
• Using the indicator variables y_i^k ∈ {0, 1}, we can express the
log-potentials as log φ_i(k) = w_n^k · x_i and log φ_ij(k, k) = w_e^k · x_ij
• Note, we have separate feature weights for the 0 and 1 labels
[Figure: candidate category graph with nodes x0–x9, each assigned
label 1 or 0 by the MAP inference]
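To make the objective concrete, here is a toy MAP inference over a small candidate-category graph by exhaustive enumeration. It is only a sketch: the weight and feature values are made up, and real AMN inference relies on LP relaxation or graph-cut style algorithms rather than brute force.

```python
# Toy MAP inference for a small AMN: maximize the sum of label-specific node
# potentials plus associative edge potentials that fire only when the two
# endpoints take the same label. Brute force is for illustration only.
from itertools import product
import numpy as np

def map_inference(node_feats, edges, edge_feats, w_node, w_edge):
    nodes = list(node_feats)
    best_score, best_labels = -np.inf, None
    for labels in product([0, 1], repeat=len(nodes)):
        y = dict(zip(nodes, labels))
        score = sum(w_node[y[i]] @ node_feats[i] for i in nodes)
        score += sum(w_edge[y[i]] @ edge_feats[(i, j)]
                     for (i, j) in edges if y[i] == y[j])
        if score > best_score:
            best_score, best_labels = score, y
    return best_labels

x = {i: np.random.rand(3) for i in range(4)}   # node feature vectors x_i
e = [(0, 1), (1, 2), (2, 3)]
xe = {ij: np.random.rand(2) for ij in e}        # edge feature vectors x_ij
w_n = [np.zeros(3), np.ones(3)]                  # w_n^0, w_n^1
w_e = [np.ones(2), np.ones(2)]                   # w_e^0, w_e^1
print(map_inference(x, e, xe, w_n, w_e))
```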
Training
• The training process involves learning
– The AMN feature weights (Wn and We)
• Node specific classifier (SVM, Naïve Bayes, etc)
weights
• Training is done as part of personalization,
explained in the coming slides
Personalization
• Process of learning to categorize with
categories that are of interest to an
organization
• We achieve this by soliciting feedback from a
human oracle on the system-suggested
categories and using it to retrain the system
parameters.
• The feedback is solicited as “correct”,
“incorrect” or “never again” for the categories
assigned to a document by the system.
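A hedged sketch of how this feedback loop might be wired up is shown below; the data structures and function are illustrative, not the system's actual API.

```python
# Illustrative bookkeeping for the three feedback types. After each feedback
# round, per-category classifiers (SVM, Naive Bayes, ...) would be retrained on
# positive vs. negative examples, and the AMN weights re-estimated.
from collections import defaultdict

positive = defaultdict(list)   # category -> documents confirmed "correct"
negative = defaultdict(list)   # category -> documents marked "incorrect"
never_again = set()            # categories the organization never wants

def record_feedback(document, category, verdict):
    if verdict == "correct":
        positive[category].append(document)
    elif verdict == "incorrect":
        negative[category].append(document)
    elif verdict == "never again":
        never_again.add(category)   # later applied as a hard constraint in inference
```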
Personalization: Constraints
• Users can indicate (via feedback) that a
category suggested by the system should
never reappear in future categorization
– E.g. a Computer Science department may not be
interested in a detailed categorization of documents
based on types of viruses
• The system remembers this feedback as hard
constraints, which are applied during the
inference process
Personalization: Constraints
• Due to the AMN’s associative property, the
constraints naturally propagate
– Users do not have to apply constraints to every
unwanted category in the KnG
By applying a “never again” constraint on node N, the label of Node N is
forced to 0. This forces labels of strongly associated neighbors (O,P,Q,R) to 0.
This is due to the AMN MAP inference, which attains its maximum when these
strongly connected neighbors (with high edge potentials) are also assigned label 0.
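The tiny demonstration below shows this propagation on a star graph N with neighbors O, P, Q, R and strong associative edge potentials; all numbers are made up for illustration.

```python
# Toy demonstration of "never again" constraint propagation on a star graph.
from itertools import product

neighbors = ["O", "P", "Q", "R"]
node_potential = {n: {0: 0.0, 1: 0.4} for n in neighbors}  # each mildly prefers 1
node_potential["N"] = {0: 0.0, 1: 0.3}
edge_potential = 1.0   # reward when N and a neighbor share the same label

def best_assignment(clamp_n_to_zero):
    best, best_labels = float("-inf"), None
    for n_label in ([0] if clamp_n_to_zero else [0, 1]):
        for labels in product([0, 1], repeat=len(neighbors)):
            score = node_potential["N"][n_label]
            score += sum(node_potential[v][l] for v, l in zip(neighbors, labels))
            score += sum(edge_potential for l in labels if l == n_label)
            if score > best:
                best, best_labels = score, (n_label, dict(zip(neighbors, labels)))
    return best_labels

print(best_assignment(clamp_n_to_zero=False))  # N and all neighbors choose label 1
print(best_assignment(clamp_n_to_zero=True))   # with N forced to 0, neighbors follow
```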
Personalization: Active Feedback
• To improve the categorization accuracy, users
can train the system by providing feedback
(“correct”, “incorrect”) on select categories of
select documents.
• The system uses this feedback to retrain the AMN and
SVM (and other classifiers – Naïve Bayes, etc.)
• The system chooses the documents and categories
for feedback that help it learn the best
parameters with as little feedback as
possible
Active Learning
• We prove a claim “There exists a feature space
and a hyperplane in the feature space that
separates AMN nodes with label 1 from the
nodes that have label 0 and that passes
through the origin”
• This claim lets us transform the AMN
model into a hyperplane-based two-class
classification problem and apply uncertainty-based
principles to determine the most
uncertain categories for a document
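The sketch below shows the generic uncertainty-sampling idea this enables: categories whose points lie closest to the separating hyperplane (smallest absolute decision value) are the most uncertain. The feature space and hyperplane in the paper come from the AMN transformation; the decision values here are hypothetical.

```python
# Uncertainty sampling sketch: pick the categories closest to the hyperplane.
import numpy as np

def most_uncertain(categories, decision_values, k=1):
    """decision_values: signed distances of each category's point to the hyperplane."""
    order = np.argsort(np.abs(np.asarray(decision_values)))
    return [categories[i] for i in order[:k]]

cats = ["Support vector machine", "Oncology", "Kernel methods"]
dist = [0.05, -1.7, 0.6]               # hypothetical values for one document
print(most_uncertain(cats, dist, k=1))  # -> ['Support vector machine']
```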
Active Learning
• ai : gain in selecting category i, based on its distance from the
hyperplane
• bj : gain in selecting document j, based on the categories it
contains
• Feedback is sought from the user for the documents with zj
= 1 and only for those categories that are identified as the
most uncertain for that document (yi = 1).
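A simplified greedy version of this joint selection is sketched below: bj is approximated as the sum of the uncertainty gains ai of a document's most uncertain categories, and the top P documents are chosen. The paper formulates this as a joint optimization over the bipartite document-category graph, so this is only an approximation for illustration.

```python
# Greedy sketch of joint document/category selection for feedback.
def select_for_feedback(doc_categories, category_gain, P=3, top_c=2):
    """doc_categories: doc -> candidate categories; category_gain: category -> a_i."""
    b = {d: sum(sorted((category_gain.get(c, 0.0) for c in cats), reverse=True)[:top_c])
         for d, cats in doc_categories.items()}
    chosen = sorted(b, key=b.get, reverse=True)[:P]           # documents with z_j = 1
    return {d: sorted(doc_categories[d], key=lambda c: category_gain.get(c, 0.0),
                      reverse=True)[:top_c]                    # their y_i = 1 categories
            for d in chosen}

docs = {"d1": ["SVM", "Oncology"], "d2": ["Kernel methods", "Ranking"]}
gain = {"SVM": 0.9, "Oncology": 0.1, "Kernel methods": 0.5, "Ranking": 0.4}
print(select_for_feedback(docs, gain, P=1))   # -> {'d1': ['SVM', 'Oncology']}
```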
Evaluation
• Warm Start
• RCV1-v2 categories and documents
• Demonstrates our system on a standard dataset
• 5000 documents in batches of 50 docs
• 2000 held-out test documents for F1 score
• Compared against
• SVM
• HICLASS from Shantanu et al.
• Cold Start
• User Evaluation using Wikipedia categories and
arXiv articles
• Compared against
• WikipediaMiner
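For the warm-start F1 measurement on the held-out documents, a minimal multi-label F1 computation with scikit-learn might look as follows (micro-averaging is assumed here; the paper may average differently):

```python
# Sketch of the held-out multi-label F1 computation.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0]])   # gold category indicators per test doc
y_pred = np.array([[1, 0, 0], [0, 1, 1]])   # system-assigned category indicators
print(f1_score(y_true, y_pred, average="micro"))
```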
Warm Start Results
Comparison with SVM
Active Learning with different algorithms
Cold Start Experiments and Results
• 263 arXiv docs
• Annotated by 8 human annotators using
Wikipedia titles
• 5-fold cross-validation
– Trained AMN and SVM weights in each fold
To be addressed…
• Each document is assigned categories separately. This
leads to many accumulated categories at the
organization level
– More categories than the organization needs
• AMN inference over thousands of candidate categories
is time consuming. Hence we cannot use this system in
real time
• KnG evolving over time
– Documents that have already been assigned categories
need to be updated judiciously
Questions?
Thank you
Editor's Notes
  1. Hi, my name is Ramakrishna. First of all, I apologize for not being present at the conference at this time. Due to some unfortunate medical conditions, I had to cancel my travel at the last minute. So, I will try to present our work on Evolving Personalized Classifier through this audio/video recording. Please send all your questions to my email and I will ensure that all your queries are answered.
2. Let me start with the motivation for our work. The need for automatic evolution of a categorization system lies in the enormous number of digital documents that we accumulate in our organizations every day. Unless we categorize them effectively, it becomes hard to find the required piece of information. Apart from this, categorization information is also helpful for various other tasks, such as Faceted Search, Exploratory Search, Navigational Search, Diversifying Search Results and Ranking search results, to name a few. Other interesting facts show that many companies spend an enormous amount of human effort and money to build effective document categorization systems. So, it is an important area of research.
3. What are the challenges in building an automated categorization system? The first question is: what should the representative categories be? Too-generic categories like News, Entertainment, Technical, Politics, Sports, etc. might not help. Each of these categories accumulates thousands of articles, and searching for the required piece of information continues to be challenging. On the other hand, fine-grained category creation needs domain experts and is a laborious task. What's the next option? We look for predefined category systems. For example, categories from the Reuters RCV1-v2 Text Categorization Test Collection, Yahoo Directories or DMOZ [5] can be adopted, in the hope that those categories fit the existing documents in the library. How good is such an adoption? We conducted a small experiment on the Reuters-21578 dataset. We selected some of the Reuters documents and detected topics for them using Wikipedia-Miner, which detects topics for the input text using Wikipedia titles. The results are shown in this table. Interestingly, in some cases the categories are very distinct, and in other cases the Wikipedia-Miner categories are very fine-grained. This indicates that adoption of predefined categories may not always work well. Apart from this, an organization might have its own preferences in building the categories. For example, a computer science department may not like to create a detailed categorization of bacteria or genes even if the documents describe the application of some clustering or classification methods to bacteria or genes. So we want to have a personalized categorization system. (We would also like the categorization system to detect the emergence of new categories as new documents are seen. Currently, we do not address this in our work directly. As will be evident later, some part of it gets addressed by the choice of a very large knowledge graph, Wikipedia in our experiments, that keeps growing over time.) In the next few slides, we briefly explain our technique for achieving these goals.
4. We assume that there exists a catalog with a very large number of categories. This is not a bad assumption; for example, 5M Wikipedia titles can describe the vocabulary of almost any document collection. There are many such collaboratively built ontologies, which can serve as a category catalog. Often, these sources have much more information, such as descriptions, association relations between various titles, etc. Hence, we treat them as a Knowledge Graph and make use of all that additional information in deriving personalized categories. The nodes in this knowledge graph are the categories and the relationships between them are the edges. Furthermore, we assume that the organization receives documents in batches. We think this is a fairly reasonable assumption; for example, the library of a university periodically receives hundreds of student theses at the end of a semester or year.
5. This picture shows a part of the knowledge graph, with various category nodes and relationship edges. In our case, we only consider the association/related relation between the categories. For example, "Loss Function" is related to "Support Vector Machine", since SVM uses a loss function. The number on an edge shows the strength of the relationship.
6. OK… Here is the precise statement of our problem and contribution. We try to learn a model that identifies a subset of categories from the Knowledge Graph that are relevant to the documents while respecting the organization's preferences, that is, personalized to the organization. We do this with the help of a new active learning technique that we propose. We explain this in detail next.
  7. This figure pictorially represents our goal of evolving a personalized classifier, by making use of knowledge graph and user interests.
8. Let's have a quick look at the overall architecture, and then we will get into the specifics of our model, personalization techniques and active learning. The main component, the "Personalized Classifier", attaches categories to the documents using the Knowledge Graph and the learnt model parameters. The "Active Learner" component chooses specific documents and specific categories in those documents and asks the user what she thinks: is it right, wrong, or of no interest to the organization? This feedback helps the system learn organization-specific categorization. The choice of documents and categories for user feedback is made carefully to maximize the learning rate and minimize the user's cognitive load, as we see later.
9. The very first step in identifying the categories for a document is to spot important keywords. We use some of the known techniques from the literature for key phrase extraction. You can refer to our paper for more details on this.
10. Once we identify the keywords, we can retrieve the subset of categories from the Knowledge Graph that match these keywords. We used the category names and the gloss (i.e., the first few sentences of the category description in Wikipedia) to get these categories. We then expand the spotted category list by adding the categories in the Markov blanket of these categories in the Knowledge Graph. We call the resulting list the candidate categories. The reason for this expansion is that the categories that get assigned to a document usually exhibit semantic relations such as "associations". For example, consider a document about linear classifiers. The category "Linear Classifier" is related to categories such as "Kernel Methods in Classifiers" and "Machine Learning". Our observation is that all these categories usually appear together in a document talking about, say, linear classifiers. You may refer to our paper for more examples and details. We next select the subset of edges from the KnG such that the nodes they connect are in the candidate category list. This gives us a small subgraph of the knowledge graph, which we call the candidate category subgraph and use for further processing.
11. Note that not all the categories in the candidate list are relevant to the document, because the document may not be talking about a topic that is spotted, or the spotted category may be of no interest to the organization. We need to select only the most appropriate categories from these candidate categories and assign them to the document. We use an Associative Markov Network for this.
12. The reason for choosing an AMN is our observation that two classes of informative features are present in the candidate category subgraph: firstly, a feature that is a function of the document and a category, such as the category-specific classifier scoring function evaluated on a document; and secondly, a feature that is a function of two categories, such as their co-occurrence frequency or the textual overlap between their descriptions. This naturally makes the Associative Markov Network (AMN) a very suitable model for inferring the categories from the candidate category subgraph. The next slide illustrates this with an example.
13. All the categories shown here are the candidate categories for a journal article taken from arXiv.org about the classification of cancers. But only a few of them, such as Binary Classifier and Support Vector Machine, are relevant, given the content of the document and the user preferences.
14. The next step is to choose the relevant categories using AMN inference. The AMN inference is based on the node and edge features. Node features capture the similarity of a node, that is, a category, to the input document through a combination of different kernels such as bag-of-words kernels, n-gram kernels, relational kernels and, most importantly, node-specific classifiers such as SVM or Naïve Bayes. Edge features capture the similarity between the nodes using cosine or Jaccard similarity between their titles, descriptions, etc.
15. The goal of AMN inference is to assign labels 0 or 1 to the nodes of the candidate category graph so as to maximize the joint probability of the assignment. If a node is labeled 1, that category is attached to the document; if 0, it is discarded. Hence, the set of categories assigned to the document is the set of nodes that are inferred as 1. In this optimization problem, the variables x_i stand for the node feature vector of the i-th node. Similarly, x_ij is the edge feature vector for the edge connecting nodes i and j. The node feature weights w_n combine the node feature values, and the edge feature weights w_e combine the edge feature values. Note that the superscript k on the weights indicates that we have separate feature weights for the features corresponding to label 0 and label 1. The constraint Σ_k y_i^k = 1 ensures that only label 0 or 1 is assigned to a node, not both. The next constraint forces two adjacent nodes with high edge potential (that is, similar nodes) to assume similar labels. The constraint y_i^0 = 1 for the hard-constrained nodes forces the node label to 0. These hard constraints are part of our personalization task, which we explain later. Note that this is the MAP inference in a Markov Network where only cliques of size 1 or 2 are considered. Again, we ask our audience to refer to our paper for further details.
16. One of the important aspects of this model is learning the node feature weights w_n, the edge feature weights w_e and the node-specific SVM classifier weights. Our system learns these weights as it evolves, based on the feedback and training provided by the user as part of the personalization described in the next slide.
17. Now we need to train our model so that it can categorize the documents as per the organization's interests. This involves training node-specific classifiers such as SVM and learning the AMN feature weights based on the training examples. The training examples are provided by the user in the form of feedback. That is, the system first chooses a few documents and categories that it thinks are most "confusing" and asks the user whether that category assignment for the document is "correct", "incorrect", or whether that category is not of interest to the organization. Based on these three kinds of feedback, the system trains its SVM classifiers and AMN weights.
18. If the user indicates that the category is not of interest to the organization, the system applies a constraint that forces label 0 for the category during the AMN inference process that we saw earlier.
19. Note that the user need not apply this kind of constraint to all the categories. Due to the AMN's associative property, the constraint automatically propagates to nodes similar to this node. For example, if the organization tells the system that the category "Cancer" is of no interest to them, the system automatically suppresses related categories such as Malignancy, Hormone Therapy, etc. during AMN inference. This happens naturally because the joint probability is maximized by assigning similar labels to highly related categories. This is precisely the reason why we chose the AMN to model our problem.
20. The "correct" or "incorrect" feedback given by the user is used by the system to generate positive and negative examples for the node-specific classifiers such as SVM and to learn their parameters. It also retrains the AMN parameters whenever the SVM parameters are updated, so that the feature weights are recalibrated.
21. As said earlier, the system chooses documents and categories on which to seek user feedback. To choose specific documents and categories as part of active learning, we first transform the AMN model into a hyperplane-based two-class classification problem. The proof and the details of the transformation are given in our paper. Using the uncertainty-based approach, we choose the categories that are close to the hyperplane as the most confusing categories for the system for that document.
22. We now have the most uncertain categories for a given document. But we have multiple documents in a batch, and we cannot seek feedback for all the documents, because that would put a lot of cognitive load on the users. We need to choose a set of documents that maximizes the learning on uncertain categories. Hence we propose a joint active learning model. Here we view the documents and categories as a bipartite graph with documents on one side and categories on the other side. The optimization problem chooses a set of documents with z_j = 1 and a set of categories with y_i = 1 that are most uncertain and for which seeking feedback from the user is most beneficial. The weight a_i tells us how uncertain category i is. The weight b_j tells us the contribution of document j in terms of providing uncertain categories for feedback. Our paper gives a couple of ways in which this contribution can be calculated. The value P here is the number of documents for which the user is willing to give feedback, say 3 or 5 or 10; it depends on when the user gets bored. Based on the outcome of this optimization, we present the documents and their most uncertain categories to the user for feedback.
23. Coming to the evaluation, we report our results in two different settings: (i) warm start and (ii) cold start. In the warm start setting, we use the Reuters RCV1-v2 text collection, use only the categories present in RCV1-v2, and create a Knowledge Graph of those categories. Such a setting helps us demonstrate how, on a standard classification dataset, the Markov network helps propagate learnings from a category to other related categories. We picked 5000 training documents and 2000 test documents using a clustered sampling procedure. We further divided the training set into 100 batches of 50 documents each. We iterated through the batches, and in the kth iteration we trained our model (SVMs, AMN feature weights) using training documents from all batches up to the kth batch. For each iteration, we performed AMN inference on the sample of 2000 test documents. We also compared our proposed joint active learning technique with other techniques from the literature, namely HIClass and Tong's procedure adapted to multi-class classification.
24. Here are the results of the warm start tests. The x-axis shows the growing number of documents due to incoming batches. The y-axis is the F1 score for categorizing the 2000 held-out Reuters documents. We observe that our system learns faster and provides a better F1 score compared to SVM. Similarly, joint active learning is also observed to learn faster than the other methods.
25. In the cold start setting, using the Amazon S3 service, we downloaded 263 technical documents from arXiv under different streams of Computer Science. With the help of eight human annotators, we assigned categories to each document using Wikipedia article names. We carried out five-fold cross-validation with each fold containing 210 training documents and 53 test documents. In each fold, we trained our model (SVMs, AMN feature weights) using the training set and evaluated Consistency, Precision and Recall on the test set. Note that, during the training phase, we also applied localization techniques in which we recorded feedback for the system-suggested categories in three forms, "correct", "incorrect" and "never again", as recorded by the human annotators. We find that the consistency of our system with the human annotators is better than that of Wikipedia Miner by about 12%.
26. There are a few issues that our system fails to handle currently. Since it generates categories for each document separately, it may end up generating too many categories. AMN inference over thousands of candidate categories is time-consuming, hence we cannot use this system in real time. When the KnG gets updated, documents that have already been assigned categories need to be updated judiciously. We plan to take these up as future improvements.
  27. Thanks everyone for listening to my presentation. Please send all your questions to my email address shown here and I will answer them promptly. Once again, thank you for listening to me and have a good day.