SlideShare a Scribd company logo
1 of 36
Concept-based Short Text Classification
and Ranking
Date:2015/05/21
Author:Fang Wang, Zhongyuan Wang, Zhoujun Li, Ji-Rong Wen
Source:CIKM '14
Advisor:Jia-ling Koh
Spearker:LIN,CI-JIE
1
Outline
Introduction
Method
Experiment
Conclusion
2
Outline
Introduction
Method
Experiment
Conclusion
3
Introduction
 Most existing approaches for text classification represent texts
as vectors of words, namely “Bag-of-Words”
 This text representation results in a very high dimensionality
of feature space and frequently suffers from surface
mismatching
4
Jeep、Honda Car
Introduction
 Goal:
1. using “Bag-of-Concepts” in short text representation, aiming to avoid the
surface mismatching and handle the synonym and polysemy problem
5
Bag of words Bag of concepts
Introduction
 Goal:
2. Short text classification is based on “Bag-of-Concepts”
6
Beyonce named People’s most beautiful woman
Lady Gaga Responds to Concert Band
Classify Music
Outline
Introduction
Method
Experiment
Conclusion
7
Framework
8
Framework
9
Entity Recognition
1. Documents are first split to sentences
2. Use all instances in Probase as the matching dictionary for detecting the entities
from each sentence
3. Stemming is performed to assist in the matching process
4. Extracted entities are merged together and weighted by idf based on different
classes
10
Beyonce named People’s most beautiful woman
Beyonce named People’s most beautiful woman
Set={beyonce}, Idf(Beyonce)=2
Candidates Generation
 Given entity 𝑒𝑗 , we select its top 𝑁𝑡 concepts ranked by the its typical concept P(c|e)
 Merge all the typical concepts as the primary candidate set
 Computing the idf value for each concept in the class level
 Removing stop concepts , which tend to be too general to represent a class
11
c1,c2,...c
20
𝑒𝑗
c1,c2,...
cn
U 𝑒 𝑗
c1,c2,...
cn
Idf(c1,c3,...
cn)
Merge Removing stop conceptsComputing idf
Concept Weighting
 The top 𝑁𝑡 concepts still contain noise
 Weight the candidates to measure their representative strengths for each
class
12
Given entity “python” in class Technique, mapping method
will result in its top 𝑁𝑡 concepts list including animal
Typicality
 Use a probabilistic way to measure the Is-A relations
 given an instance e, which has Is-A relationship with concept c
 penguin is-a bird
 Take Probase as a Knowledge database in this paper
 terms in Probase are connected by a variety of relationships
 <concept>t<entity>t<frequency>t<popularity>t<ConceptFrequency>t<ConceptSize>
t<ConceptVagueness>t<Zipf_Slope>t<Zipf_Pearson_Coefficient>t<EntityFrequency>
t<EntitySize>
13
Typicality
14
1. n(e, c) denotes the co-occur frequency of e and c
2. n(e) is the frequency of e
penguin is-a bird
<concept>t<entity>t<frequency>t<EntityFrequency>
<bird>t<penguin>t<50>t<100>
𝑃 𝑏𝑖𝑟𝑑 𝑝𝑒𝑛𝑔𝑢𝑖𝑛 =
𝑛(𝑝𝑒𝑛𝑔𝑢𝑖𝑛, 𝑏𝑖𝑟𝑑)
𝑛(𝑝𝑒𝑛𝑔𝑢𝑖𝑛)
Framework
15
Short Text Conceptualization
 Short Text Conceptualization aims to abstract a set of most
representative concepts that can best describe the short text
16
apple ipad
?
Short Text Conceptualization
1. detect all possible entities and then remove those contained by others
 given the short text “windows phone app,” the recognized entity set will be {“windows
phone,” “phone app”}, while “windows,” “phone,” and “app” are removed
 the entity list 𝐸𝑠𝑡 𝑖
= {𝑒𝑗 , j = 1, 2, ..., M} for a short text 𝑠𝑡𝑖
2. Sense Detection
 detect different senses for each entity in 𝐸𝑠𝑡 𝑖
, so as to determine whether the entity is
ambiguous
3. Disambiguation
 disambiguate vague entity by leveraging its unambiguous context entities
17
Sense Detection
 Denote 𝐶𝑒 𝑗
= {𝑐 𝑘, k = 1, 2, ..., 𝑁𝑡} is 𝑒𝑗′s typical concept list
 Denote 𝐶𝐶𝑙 𝑒 𝑗
= {𝑐𝑐𝑙 𝑚 , m = 1, 2, ...} is 𝑒𝑗′s concept cluster set
18
Beyonce
歌手
作詞人
模特兒
時裝設計師
演藝
𝑒𝑗
𝑐 𝑘
𝑐𝑐𝑙 𝑚
設計
Sense Detection
19
Entropy越高,𝑒𝑗的意義越模糊
Entropy越低,𝑒𝑗的意義越明確
Beyonce
歌手
作詞人
模特兒
時裝設計師
演藝
設計
𝑒𝑗
𝑐 𝑘
𝑐𝑐𝑙 𝑚
𝑃 演藝 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 = 0.3 + 0.3 + 0.3
0.3
0.3
0.3
0.1
Disambiguation
• Denote the vague entity as 𝑒𝑖
𝑣
, and unambiguous entity 𝑒𝑗
𝑢
20
Disambiguation
• Denote the vague entity as 𝑒𝑖
𝑣
, and unambiguous entity 𝑒𝑗
𝑢
21
Beyonce music and songs
音樂學演藝
P(演藝|Beyonce)=0.5 P(音樂學|music)=1
P(音樂學|songs)=1
設計
P(設計|Beyonce)=0.5
𝑃′
演藝 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 = 𝑃 演藝 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 ∗ 𝑃 音樂學 𝑚𝑢𝑠𝑖𝑐 ∗ 𝐶𝑆(演藝, 音樂學)
+ 𝑃 演藝 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 ∗ 𝑃 音樂學 songs ∗ 𝐶𝑆 演藝, 音樂學
= 0.2*0.9*0.9 + 0.2*0.9*0.9 = 0.324
𝑐𝑐𝑙 𝑛 = {音樂學}𝑐𝑐𝑙 𝑚 = {設計, 演藝}
𝑃′
設計 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 = 𝑃 設計 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 ∗ 𝑃 音樂學 𝑚𝑢𝑠𝑖𝑐 ∗ 𝐶𝑆(設計, 音樂學)
+ 𝑃 設計 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 ∗ 𝑃 音樂學 songs ∗ 𝐶𝑆 設計, 音樂學
= 0.2*0.9*0.1 + 0.2*0.9*0.1 = 0.036
Disambiguation
• Denote the vague entity as 𝑒𝑖
𝑣
, and unambiguous entity 𝑒𝑗
𝑢
22
Beyonce music and songs
音樂學演藝
P(演藝|Beyonce)=0.5 P(音樂學|music)=1
P(音樂學|songs)=1
設計
P(設計|Beyonce)=0.5
𝑃′
演藝 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 = 𝑃 演藝 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 ∗ 𝑃 音樂學 𝑚𝑢𝑠𝑖𝑐 ∗ 𝐶𝑆(演藝, 音樂學)
+ 𝑃 演藝 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 ∗ 𝑃 音樂學 songs ∗ 𝐶𝑆 演藝, 音樂學
= 0.2*0.9*0.9 + 0.2*0.9*0.9 = 0.324
𝑐𝑐𝑙 𝑛 = {音樂學}𝑐𝑐𝑙 𝑚 = {設計, 演藝}
𝑃′
設計 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 = 𝑃 設計 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 ∗ 𝑃 音樂學 𝑚𝑢𝑠𝑖𝑐 ∗ 𝐶𝑆(設計, 音樂學)
+ 𝑃 設計 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 ∗ 𝑃 音樂學 songs ∗ 𝐶𝑆 設計, 音樂學
= 0.2*0.9*0.1 + 0.2*0.9*0.1 = 0.036
P(演藝|Beyonce)=0.5 𝑃′
演藝 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 = 0.324
P(設計|Beyonce)=0.5 𝑃′ 設計 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 = 0.036
Disambiguation
 CS(𝑐𝑐𝑙 𝑚, 𝑐𝑐𝑙 𝑛) denotes the concept cluster similarity
23
演藝音樂學
民族音樂學
系統音樂學
歷史音樂學
民族歌手
鄉村歌手
民族歌手
𝑒𝑖 𝑒𝑖+1 ... 𝑒 𝑘
民族音樂學
𝑒+1 ... 𝑒𝑙𝑒𝑗
Framework
24
Classification
 classify the short 𝑡𝑒𝑥𝑡 𝑠𝑡𝑖 to the class 𝐶𝐿𝑙 that is most similar with 𝑠𝑡𝑖
 𝑠𝑡𝑖’s concept expression 𝐶𝑠𝑡 𝑖
= {Cj , j = 1, 2,...,M}
25
Beyonce music and songs
音樂學演藝
演藝
C1
C2
C3
𝐶𝑀𝑙
C2
C3
C4
𝐶𝑠𝑡𝑖
= {演藝、音樂學}
C 𝑘
Ranking
 Ranking by Similarity
 each short text 𝑠𝑡𝑖 assigned to 𝐶𝐿𝑙 has a similarity score, we can rank
them directly by their scores
 Ranking with Diversity
 diversify the short texts by subtopic Proportionality(PM-2) [12]
26
Outline
Introduction
Method
Experiment
Conclusion
27
Experiment
 evaluate the performance of BocSTC(Bag-of-Concepts - Short Text
Classification) on the real application - Channel-based query
recommendation
28
Query recommendation for Channel Living
Experiment
 Four commonly used channels are selected as targeted channels
 Money, Movie, Music and TV
 Training dataset
 randomly select 6,000 documents for each channel
 The titles are used as training data for BocSTC
29
Experiment
 Test dataset
 841 labeled queries, from which, 200 are selected randomly for verification and
600 for testing
30
Experiment
31
Performance on query classification
Experiment
32
Precision performance on each channel
Experiment
 manually annotate top 20 queries with the guidelines
 Unrelated、Related but Uninteresting、Related and Interesting
33
Diversity performance on each channel
Outline
Introduction
Method
Experiment
Conclusion
34
Conclusion
 propose a novel framework for short text classification and
ranking applications
 It measures the semantic similarities between short texts from
the angle of concepts, so as to avoid surface mismatch
35
Thanks for listening.
36

More Related Content

What's hot

Integrals - definite integral and fundamental theorem
Integrals - definite integral and fundamental theoremIntegrals - definite integral and fundamental theorem
Integrals - definite integral and fundamental theorem
LiveOnlineClassesInd
 
Jarrar.lecture notes.aai.2011s.descriptionlogic
Jarrar.lecture notes.aai.2011s.descriptionlogicJarrar.lecture notes.aai.2011s.descriptionlogic
Jarrar.lecture notes.aai.2011s.descriptionlogic
PalGov
 

What's hot (7)

Integrals - definite integral and fundamental theorem
Integrals - definite integral and fundamental theoremIntegrals - definite integral and fundamental theorem
Integrals - definite integral and fundamental theorem
 
Bayesnetwork
BayesnetworkBayesnetwork
Bayesnetwork
 
Slides Workshopon Explainable Logic-Based Knowledge Representation (XLoKR 2020)
Slides Workshopon Explainable Logic-Based Knowledge Representation (XLoKR 2020)Slides Workshopon Explainable Logic-Based Knowledge Representation (XLoKR 2020)
Slides Workshopon Explainable Logic-Based Knowledge Representation (XLoKR 2020)
 
Jarrar.lecture notes.aai.2011s.descriptionlogic
Jarrar.lecture notes.aai.2011s.descriptionlogicJarrar.lecture notes.aai.2011s.descriptionlogic
Jarrar.lecture notes.aai.2011s.descriptionlogic
 
ON SEMI-  -CONTINUITY WHERE   {L, M, R, S}
ON SEMI-  -CONTINUITY WHERE   {L, M, R, S}ON SEMI-  -CONTINUITY WHERE   {L, M, R, S}
ON SEMI-  -CONTINUITY WHERE   {L, M, R, S}
 
Formal Concept Analysis
Formal Concept AnalysisFormal Concept Analysis
Formal Concept Analysis
 
Jarrar: First Order Logic- Inference Methods
Jarrar: First Order Logic- Inference MethodsJarrar: First Order Logic- Inference Methods
Jarrar: First Order Logic- Inference Methods
 

Viewers also liked

Effective string processing and matching for author entity
Effective string processing and matching for author entityEffective string processing and matching for author entity
Effective string processing and matching for author entity
祺傑 林
 
Text classification
Text classificationText classification
Text classification
James Wong
 
Language Detection Library for Java
Language Detection Library for Java Language Detection Library for Java
Language Detection Library for Java
Shuyo Nakatani
 
Short Text Language Detection with Infinity-Gram
Short Text Language Detection with Infinity-GramShort Text Language Detection with Infinity-Gram
Short Text Language Detection with Infinity-Gram
Shuyo Nakatani
 

Viewers also liked (17)

A coverage criterion for spaced seeds and its applications to SVM string-ker...
A coverage criterion for spaced seeds and its applications to SVM string-ker...A coverage criterion for spaced seeds and its applications to SVM string-ker...
A coverage criterion for spaced seeds and its applications to SVM string-ker...
 
Generic Short Text Classifier
Generic Short Text ClassifierGeneric Short Text Classifier
Generic Short Text Classifier
 
Using Linked Data to Mine RDF from Wikipedia's Tables
Using Linked Data to Mine RDF from Wikipedia's TablesUsing Linked Data to Mine RDF from Wikipedia's Tables
Using Linked Data to Mine RDF from Wikipedia's Tables
 
How Many Folders Do You Really Need? Classifying Email into a Handful of Cate...
How Many Folders Do You Really Need?Classifying Email into a Handful of Cate...How Many Folders Do You Really Need?Classifying Email into a Handful of Cate...
How Many Folders Do You Really Need? Classifying Email into a Handful of Cate...
 
Using Java reflection to break  encapsulation
Using Java reflection to break  encapsulationUsing Java reflection to break  encapsulation
Using Java reflection to break  encapsulation
 
Untitled
UntitledUntitled
Untitled
 
Effective string processing and matching for author entity
Effective string processing and matching for author entityEffective string processing and matching for author entity
Effective string processing and matching for author entity
 
Fine-Grained Location Extraction from Tweets with Temporal Awareness
Fine-Grained Location Extraction from Tweets with Temporal AwarenessFine-Grained Location Extraction from Tweets with Temporal Awareness
Fine-Grained Location Extraction from Tweets with Temporal Awareness
 
Mobile query reformulations
Mobile query reformulationsMobile query reformulations
Mobile query reformulations
 
Forecasting fine grained air quality based on big data
Forecasting fine grained air quality based on big dataForecasting fine grained air quality based on big data
Forecasting fine grained air quality based on big data
 
Leveraging Knowledge Bases for Contextual Entity Exploration Categories
Leveraging Knowledge Basesfor Contextual Entity Exploration CategoriesLeveraging Knowledge Basesfor Contextual Entity Exploration Categories
Leveraging Knowledge Bases for Contextual Entity Exploration Categories
 
Extending facet search to the general web
Extending facet search to the general webExtending facet search to the general web
Extending facet search to the general web
 
Text classification
Text classificationText classification
Text classification
 
Language Detection Library for Java
Language Detection Library for Java Language Detection Library for Java
Language Detection Library for Java
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learn
 
Short Text Language Detection with Infinity-Gram
Short Text Language Detection with Infinity-GramShort Text Language Detection with Infinity-Gram
Short Text Language Detection with Infinity-Gram
 
Introduction to text classification using naive bayes
Introduction to text classification using naive bayesIntroduction to text classification using naive bayes
Introduction to text classification using naive bayes
 

Similar to Concept based short text classification and ranking

ch03-2018.02.02.pptx
ch03-2018.02.02.pptxch03-2018.02.02.pptx
ch03-2018.02.02.pptx
SYAMDAVULURI
 
Tensor-based Models of Natural Language Semantics
Tensor-based Models of Natural Language SemanticsTensor-based Models of Natural Language Semantics
Tensor-based Models of Natural Language Semantics
Dimitrios Kartsaklis
 
Jarrar.lecture notes.aai.2011s.ch7.p logic
Jarrar.lecture notes.aai.2011s.ch7.p logicJarrar.lecture notes.aai.2011s.ch7.p logic
Jarrar.lecture notes.aai.2011s.ch7.p logic
PalGov
 
Moshe Guttmann's slides on eigenface
Moshe Guttmann's slides on eigenfaceMoshe Guttmann's slides on eigenface
Moshe Guttmann's slides on eigenface
wolf
 

Similar to Concept based short text classification and ranking (20)

ch03-2018.02.02.pptx
ch03-2018.02.02.pptxch03-2018.02.02.pptx
ch03-2018.02.02.pptx
 
Designing, Visualizing and Understanding Deep Neural Networks
Designing, Visualizing and Understanding Deep Neural NetworksDesigning, Visualizing and Understanding Deep Neural Networks
Designing, Visualizing and Understanding Deep Neural Networks
 
Lesson 26
Lesson 26Lesson 26
Lesson 26
 
AI Lesson 26
AI Lesson 26AI Lesson 26
AI Lesson 26
 
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
 
NLP in Practice - Part II
NLP in Practice - Part IINLP in Practice - Part II
NLP in Practice - Part II
 
Information retrieval to recommender systems
Information retrieval to recommender systemsInformation retrieval to recommender systems
Information retrieval to recommender systems
 
Set Theory
Set TheorySet Theory
Set Theory
 
Sentence generation
Sentence generationSentence generation
Sentence generation
 
Link Discovery Tutorial Part II: Accuracy
Link Discovery Tutorial Part II: AccuracyLink Discovery Tutorial Part II: Accuracy
Link Discovery Tutorial Part II: Accuracy
 
Bootstrap Regularization by Stefan Wager, Stanford
Bootstrap Regularization by Stefan Wager, StanfordBootstrap Regularization by Stefan Wager, Stanford
Bootstrap Regularization by Stefan Wager, Stanford
 
Tensor-based Models of Natural Language Semantics
Tensor-based Models of Natural Language SemanticsTensor-based Models of Natural Language Semantics
Tensor-based Models of Natural Language Semantics
 
Jarrar.lecture notes.aai.2011s.ch7.p logic
Jarrar.lecture notes.aai.2011s.ch7.p logicJarrar.lecture notes.aai.2011s.ch7.p logic
Jarrar.lecture notes.aai.2011s.ch7.p logic
 
Musical Information Retrieval Take 2: Interval Hashing Based Ranking
Musical Information Retrieval Take 2: Interval Hashing Based RankingMusical Information Retrieval Take 2: Interval Hashing Based Ranking
Musical Information Retrieval Take 2: Interval Hashing Based Ranking
 
Interval Hashing Based Ranking
Interval Hashing Based RankingInterval Hashing Based Ranking
Interval Hashing Based Ranking
 
Pertemuan 5_Relation Matriks_01 (17)
Pertemuan 5_Relation Matriks_01 (17)Pertemuan 5_Relation Matriks_01 (17)
Pertemuan 5_Relation Matriks_01 (17)
 
L03 ai - knowledge representation using logic
L03 ai - knowledge representation using logicL03 ai - knowledge representation using logic
L03 ai - knowledge representation using logic
 
Comparing Index Structures for Completeness Reasoning
Comparing Index Structures for Completeness ReasoningComparing Index Structures for Completeness Reasoning
Comparing Index Structures for Completeness Reasoning
 
Moshe Guttmann's slides on eigenface
Moshe Guttmann's slides on eigenfaceMoshe Guttmann's slides on eigenface
Moshe Guttmann's slides on eigenface
 
深層意味表現学習 (Deep Semantic Representations)
深層意味表現学習 (Deep Semantic Representations)深層意味表現学習 (Deep Semantic Representations)
深層意味表現学習 (Deep Semantic Representations)
 

Recently uploaded

DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
LeenakshiTyagi
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
gindu3009
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 

Recently uploaded (20)

Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomology
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 

Concept based short text classification and ranking

  • 1. Concept-based Short Text Classification and Ranking Date:2015/05/21 Author:Fang Wang, Zhongyuan Wang, Zhoujun Li, Ji-Rong Wen Source:CIKM '14 Advisor:Jia-ling Koh Spearker:LIN,CI-JIE 1
  • 4. Introduction  Most existing approaches for text classification represent texts as vectors of words, namely “Bag-of-Words”  This text representation results in a very high dimensionality of feature space and frequently suffers from surface mismatching 4
  • 5. Jeep、Honda Car Introduction  Goal: 1. using “Bag-of-Concepts” in short text representation, aiming to avoid the surface mismatching and handle the synonym and polysemy problem 5 Bag of words Bag of concepts
  • 6. Introduction  Goal: 2. Short text classification is based on “Bag-of-Concepts” 6 Beyonce named People’s most beautiful woman Lady Gaga Responds to Concert Band Classify Music
  • 10. Entity Recognition 1. Documents are first split to sentences 2. Use all instances in Probase as the matching dictionary for detecting the entities from each sentence 3. Stemming is performed to assist in the matching process 4. Extracted entities are merged together and weighted by idf based on different classes 10 Beyonce named People’s most beautiful woman Beyonce named People’s most beautiful woman Set={beyonce}, Idf(Beyonce)=2
  • 11. Candidates Generation  Given entity 𝑒𝑗 , we select its top 𝑁𝑡 concepts ranked by the its typical concept P(c|e)  Merge all the typical concepts as the primary candidate set  Computing the idf value for each concept in the class level  Removing stop concepts , which tend to be too general to represent a class 11 c1,c2,...c 20 𝑒𝑗 c1,c2,... cn U 𝑒 𝑗 c1,c2,... cn Idf(c1,c3,... cn) Merge Removing stop conceptsComputing idf
  • 12. Concept Weighting  The top 𝑁𝑡 concepts still contain noise  Weight the candidates to measure their representative strengths for each class 12 Given entity “python” in class Technique, mapping method will result in its top 𝑁𝑡 concepts list including animal
  • 13. Typicality  Use a probabilistic way to measure the Is-A relations  given an instance e, which has Is-A relationship with concept c  penguin is-a bird  Take Probase as a Knowledge database in this paper  terms in Probase are connected by a variety of relationships  <concept>t<entity>t<frequency>t<popularity>t<ConceptFrequency>t<ConceptSize> t<ConceptVagueness>t<Zipf_Slope>t<Zipf_Pearson_Coefficient>t<EntityFrequency> t<EntitySize> 13
  • 14. Typicality 14 1. n(e, c) denotes the co-occur frequency of e and c 2. n(e) is the frequency of e penguin is-a bird <concept>t<entity>t<frequency>t<EntityFrequency> <bird>t<penguin>t<50>t<100> 𝑃 𝑏𝑖𝑟𝑑 𝑝𝑒𝑛𝑔𝑢𝑖𝑛 = 𝑛(𝑝𝑒𝑛𝑔𝑢𝑖𝑛, 𝑏𝑖𝑟𝑑) 𝑛(𝑝𝑒𝑛𝑔𝑢𝑖𝑛)
  • 16. Short Text Conceptualization  Short Text Conceptualization aims to abstract a set of most representative concepts that can best describe the short text 16 apple ipad ?
  • 17. Short Text Conceptualization 1. detect all possible entities and then remove those contained by others  given the short text “windows phone app,” the recognized entity set will be {“windows phone,” “phone app”}, while “windows,” “phone,” and “app” are removed  the entity list 𝐸𝑠𝑡 𝑖 = {𝑒𝑗 , j = 1, 2, ..., M} for a short text 𝑠𝑡𝑖 2. Sense Detection  detect different senses for each entity in 𝐸𝑠𝑡 𝑖 , so as to determine whether the entity is ambiguous 3. Disambiguation  disambiguate vague entity by leveraging its unambiguous context entities 17
  • 18. Sense Detection  Denote 𝐶𝑒 𝑗 = {𝑐 𝑘, k = 1, 2, ..., 𝑁𝑡} is 𝑒𝑗′s typical concept list  Denote 𝐶𝐶𝑙 𝑒 𝑗 = {𝑐𝑐𝑙 𝑚 , m = 1, 2, ...} is 𝑒𝑗′s concept cluster set 18 Beyonce 歌手 作詞人 模特兒 時裝設計師 演藝 𝑒𝑗 𝑐 𝑘 𝑐𝑐𝑙 𝑚 設計
  • 20. Disambiguation • Denote the vague entity as 𝑒𝑖 𝑣 , and unambiguous entity 𝑒𝑗 𝑢 20
  • 21. Disambiguation • Denote the vague entity as 𝑒𝑖 𝑣 , and unambiguous entity 𝑒𝑗 𝑢 21 Beyonce music and songs 音樂學演藝 P(演藝|Beyonce)=0.5 P(音樂學|music)=1 P(音樂學|songs)=1 設計 P(設計|Beyonce)=0.5 𝑃′ 演藝 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 = 𝑃 演藝 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 ∗ 𝑃 音樂學 𝑚𝑢𝑠𝑖𝑐 ∗ 𝐶𝑆(演藝, 音樂學) + 𝑃 演藝 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 ∗ 𝑃 音樂學 songs ∗ 𝐶𝑆 演藝, 音樂學 = 0.2*0.9*0.9 + 0.2*0.9*0.9 = 0.324 𝑐𝑐𝑙 𝑛 = {音樂學}𝑐𝑐𝑙 𝑚 = {設計, 演藝} 𝑃′ 設計 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 = 𝑃 設計 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 ∗ 𝑃 音樂學 𝑚𝑢𝑠𝑖𝑐 ∗ 𝐶𝑆(設計, 音樂學) + 𝑃 設計 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 ∗ 𝑃 音樂學 songs ∗ 𝐶𝑆 設計, 音樂學 = 0.2*0.9*0.1 + 0.2*0.9*0.1 = 0.036
  • 22. Disambiguation • Denote the vague entity as 𝑒𝑖 𝑣 , and unambiguous entity 𝑒𝑗 𝑢 22 Beyonce music and songs 音樂學演藝 P(演藝|Beyonce)=0.5 P(音樂學|music)=1 P(音樂學|songs)=1 設計 P(設計|Beyonce)=0.5 𝑃′ 演藝 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 = 𝑃 演藝 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 ∗ 𝑃 音樂學 𝑚𝑢𝑠𝑖𝑐 ∗ 𝐶𝑆(演藝, 音樂學) + 𝑃 演藝 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 ∗ 𝑃 音樂學 songs ∗ 𝐶𝑆 演藝, 音樂學 = 0.2*0.9*0.9 + 0.2*0.9*0.9 = 0.324 𝑐𝑐𝑙 𝑛 = {音樂學}𝑐𝑐𝑙 𝑚 = {設計, 演藝} 𝑃′ 設計 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 = 𝑃 設計 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 ∗ 𝑃 音樂學 𝑚𝑢𝑠𝑖𝑐 ∗ 𝐶𝑆(設計, 音樂學) + 𝑃 設計 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 ∗ 𝑃 音樂學 songs ∗ 𝐶𝑆 設計, 音樂學 = 0.2*0.9*0.1 + 0.2*0.9*0.1 = 0.036 P(演藝|Beyonce)=0.5 𝑃′ 演藝 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 = 0.324 P(設計|Beyonce)=0.5 𝑃′ 設計 𝐵𝑒𝑦𝑜𝑛𝑐𝑒 = 0.036
  • 23. Disambiguation  CS(𝑐𝑐𝑙 𝑚, 𝑐𝑐𝑙 𝑛) denotes the concept cluster similarity 23 演藝音樂學 民族音樂學 系統音樂學 歷史音樂學 民族歌手 鄉村歌手 民族歌手 𝑒𝑖 𝑒𝑖+1 ... 𝑒 𝑘 民族音樂學 𝑒+1 ... 𝑒𝑙𝑒𝑗
  • 25. Classification  classify the short 𝑡𝑒𝑥𝑡 𝑠𝑡𝑖 to the class 𝐶𝐿𝑙 that is most similar with 𝑠𝑡𝑖  𝑠𝑡𝑖’s concept expression 𝐶𝑠𝑡 𝑖 = {Cj , j = 1, 2,...,M} 25 Beyonce music and songs 音樂學演藝 演藝 C1 C2 C3 𝐶𝑀𝑙 C2 C3 C4 𝐶𝑠𝑡𝑖 = {演藝、音樂學} C 𝑘
  • 26. Ranking  Ranking by Similarity  each short text 𝑠𝑡𝑖 assigned to 𝐶𝐿𝑙 has a similarity score, we can rank them directly by their scores  Ranking with Diversity  diversify the short texts by subtopic Proportionality(PM-2) [12] 26
  • 28. Experiment  evaluate the performance of BocSTC(Bag-of-Concepts - Short Text Classification) on the real application - Channel-based query recommendation 28 Query recommendation for Channel Living
  • 29. Experiment  Four commonly used channels are selected as targeted channels  Money, Movie, Music and TV  Training dataset  randomly select 6,000 documents for each channel  The titles are used as training data for BocSTC 29
  • 30. Experiment  Test dataset  841 labeled queries, from which, 200 are selected randomly for verification and 600 for testing 30
  • 33. Experiment  manually annotate top 20 queries with the guidelines  Unrelated、Related but Uninteresting、Related and Interesting 33 Diversity performance on each channel
  • 35. Conclusion  propose a novel framework for short text classification and ranking applications  It measures the semantic similarities between short texts from the angle of concepts, so as to avoid surface mismatch 35