1.
Context Based Search
By Shatabdi Kundu (2010EET2553), M.Tech, Computer Technology, IIT Delhi
Email: shatabdikundu@live.com
Project Guide: Prof. Santanu Chaudhury, Electrical Engineering Department, IIT Delhi
Email: santanuc@ee.iitd.ac.in
Shatabdi Kundu :: 2010EET2553 :: Prof. Santanu Chaudhury :: 09 May 2011
2.
Outline
- Introduction to Topic Models: Probabilistic Modelling
- Latent Dirichlet Allocation
- Topic Discovery using WordNet
- Work Done
- Results
- Conclusion and Future Work
- References
3.
Probabilistic Modelling
- Treat data as observations that arise from a generative probabilistic process that includes hidden variables.
  - For documents, the hidden variables reflect the thematic structure of the collection.
- Infer the hidden structure using posterior inference.
  - What are the topics that describe this collection?
- Situate new data into the estimated model.
  - How does this query or new document fit into the estimated topic structure?
5.
Generative Process
- Cast these intuitions into a generative probabilistic process.
- Each document is a random mixture of corpus-wide topics.
- Each word is drawn from one of those topics.
6.
Graphical Models
- Nodes are random variables.
- Edges denote possible dependence.
- Observed variables are shaded.
- Plates denote replicated structure.
7.
Graphical Models
- The structure of the graph defines the pattern of conditional dependence among the ensemble of random variables.
- For example, the graph in which y is the only parent of each x_n corresponds to

$$p(y, x_1, \ldots, x_N) = p(y) \prod_{n=1}^{N} p(x_n \mid y) \tag{1}$$
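As a small illustration of the factorization in equation (1), the joint probability can be evaluated by multiplying the prior on y by one conditional term per observation. This is only a sketch: the probability tables `p_y` and `p_x_given_y` and the function name are hypothetical and not taken from the slides.

```python
def joint_probability(p_y, p_x_given_y, y, xs):
    """Evaluate p(y, x_1..x_N) = p(y) * prod_n p(x_n | y)."""
    prob = p_y[y]
    for x in xs:
        # Each x_n depends only on y, so it contributes one factor.
        prob *= p_x_given_y[y][x]
    return prob


# Hypothetical toy tables: y has two values, each x_n is binary.
p_y = {"a": 0.6, "b": 0.4}
p_x_given_y = {
    "a": {0: 0.9, 1: 0.1},
    "b": {0: 0.2, 1: 0.8},
}

print(joint_probability(p_y, p_x_given_y, "a", [0, 0, 1]))
```

Summing this quantity over every value of y and every configuration of the x_n gives 1, which is a quick sanity check on the tables.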
8.
Latent Dirichlet Allocation
  1. Draw each topic $\beta_k \sim \mathrm{Dir}(\eta)$, for $k \in \{1, \ldots, K\}$.
  2. For each document $d$:
     1. Draw topic proportions $\theta_d \sim \mathrm{Dir}(\alpha)$.
     2. For each word $n$:
        1. Draw $Z_{d,n} \sim \mathrm{Mult}(\theta_d)$.
        2. Draw $W_{d,n} \sim \mathrm{Mult}(\beta_{Z_{d,n}})$.
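The generative process above can be simulated directly. The following is a minimal sketch using only the Python standard library; it assumes symmetric Dirichlet priors (a single scalar for each of alpha and eta) and a fixed document length, neither of which is stated on the slide. All function names are illustrative.

```python
import random


def sample_dirichlet(alpha, rng):
    """Sample from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw one index from a discrete distribution (a single-trial multinomial)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_length, vocab, num_topics, alpha, eta, seed=0):
    rng = random.Random(seed)
    # Draw each topic beta_k ~ Dir(eta): a distribution over the vocabulary.
    beta = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # Draw the per-document topic proportions theta_d ~ Dir(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_length):
            z = sample_categorical(theta, rng)    # topic assignment Z_{d,n}
            w = sample_categorical(beta[z], rng)  # word W_{d,n} from topic z
            words.append(vocab[w])
        docs.append(words)
    return beta, docs


beta, docs = generate_corpus(
    num_docs=3, doc_length=8, vocab=["tree", "maple", "river", "bank"],
    num_topics=2, alpha=0.5, eta=0.5,
)
for doc in docs:
    print(" ".join(doc))
```

Inference runs this process in reverse: given only the words, it estimates the topics, proportions and assignments that most plausibly produced them.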
9.
Latent Dirichlet Allocation
- From a collection of documents, infer:
  - the per-word topic assignments $Z_{d,n}$
  - the per-document topic proportions $\theta_d$
  - the per-corpus topic distributions $\beta_k$
- Use posterior expectations to perform the task at hand, e.g. information retrieval, document similarity, etc.
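One way the inferred topic proportions might be used for document similarity is to compare the theta vectors of two documents. The sketch below uses the Hellinger distance, which is a common choice for comparing discrete distributions but is an assumption here, not something the slides specify. The theta values are made up for illustration.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def rank_by_similarity(query_theta, doc_thetas):
    """Return document ids sorted from most to least similar to the query."""
    return sorted(doc_thetas, key=lambda d: hellinger(query_theta, doc_thetas[d]))


# Hypothetical posterior topic proportions for three documents and a query.
doc_thetas = {
    "doc1": [0.8, 0.1, 0.1],
    "doc2": [0.1, 0.8, 0.1],
    "doc3": [0.6, 0.3, 0.1],
}
query_theta = [0.7, 0.2, 0.1]

print(rank_by_similarity(query_theta, doc_thetas))
```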
10.
Topic Discovery using WordNet
- Lexical relations are used to find the latent topics.
- Synsets (synonym sets) are the basic units.
- Hyponymy: a semantic relation between word meanings. E.g. {maple} is a hyponym of {tree}.
- Hypernymy: the inverse of hyponymy. E.g. {tree} is a hypernym of {maple}.
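The two relations can be illustrated with a tiny hand-written hierarchy. This sketch does not query WordNet itself: the `HYPERNYM` table is a hypothetical fragment, and it simplifies each word to a single parent, whereas real WordNet synsets can have several hypernyms.

```python
from collections import defaultdict

# Hypothetical fragment of a hypernym hierarchy: child -> parent.
HYPERNYM = {
    "maple": "tree",
    "oak": "tree",
    "tree": "plant",
    "rose": "flower",
    "flower": "plant",
}


def hypernym_chain(word, hypernym=HYPERNYM):
    """Follow hypernym links upward and return the successively more general terms."""
    chain = []
    while word in hypernym:
        word = hypernym[word]
        chain.append(word)
    return chain


def hyponym_index(hypernym=HYPERNYM):
    """Invert the hypernym table: parent -> list of direct hyponyms."""
    index = defaultdict(list)
    for child, parent in hypernym.items():
        index[parent].append(child)
    return dict(index)


print(hypernym_chain("maple"))   # terms above "maple"
print(hyponym_index()["tree"])   # terms directly below "tree"
```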
11.
Work Done
- I took a collection of 10 documents containing around 28K words in total.
- I removed stop words and rare words, along with punctuation marks and numbers.
- I then fitted a 7-topic LDA model to this corpus.
- This gave 7 topics, each represented by its 5 most probable words.
- I then used the lexical relations of WordNet to identify the hidden topics, using the common parents of all the words in each topic.
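The preprocessing step described above could look roughly like the following sketch. The stop-word list and the rare-word threshold (`min_count`) are placeholders chosen for illustration; the slides do not say which list or cutoff was actually used.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase and keep alphabetic runs only, which drops punctuation and numbers."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(documents, min_count=2):
    """Remove stop words, then remove words seen fewer than min_count times in the corpus."""
    tokenized = [[t for t in tokenize(doc) if t not in STOP_WORDS] for doc in documents]
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]


corpus = ["The maple tree, 2 trees!", "A tree in the park."]
print(preprocess(corpus))
```

The cleaned token lists are what would then be passed to an LDA implementation.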
12.
Results after training LDA model
- The LDA model only groups related words within each topic; it does not name the topic.
- The topic name is discovered using WordNet.
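Reading off the most probable words of a trained topic amounts to sorting the topic's word distribution. A minimal sketch, with a made-up vocabulary and made-up probabilities rather than the actual results:

```python
def top_words(topic, vocab, n=5):
    """Return the n vocabulary words with the highest probability under one topic."""
    ranked = sorted(range(len(vocab)), key=lambda i: topic[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]


# Hypothetical topic: one probability per vocabulary word.
vocab = ["tree", "maple", "river", "bank", "money", "loan", "leaf"]
topic = [0.30, 0.25, 0.05, 0.03, 0.01, 0.01, 0.35]

print(top_words(topic, vocab))
```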
13.
Results after applying WordNet
- The above result gives the hidden topic names for the words that make up the documents.
- This kind of model can be used to identify a topic when given only a word.
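Naming a topic by the common parent of its words can be sketched as finding the most specific term shared by every word's hypernym chain. As before, the hierarchy below is a hypothetical hand-written fragment with one parent per word, not real WordNet data, and the function names are illustrative.

```python
# Hypothetical fragment of a hypernym hierarchy: child -> parent.
HYPERNYM = {
    "maple": "tree", "oak": "tree", "tree": "plant",
    "rose": "flower", "flower": "plant", "plant": "organism",
    "dog": "animal", "animal": "organism",
}


def ancestors(word):
    """Return the word followed by its hypernyms, most specific first."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def common_parent(words):
    """Most specific term found on every word's chain, or None if there is none."""
    if not words:
        return None
    chains = [ancestors(w) for w in words]
    shared = set(chains[0]).intersection(*chains[1:])
    # Walk the first chain upward so the first shared term is the most specific one.
    for node in chains[0]:
        if node in shared:
            return node
    return None


print(common_parent(["maple", "oak", "rose"]))
print(common_parent(["maple", "dog"]))
```

Note that a word counts as its own ancestor here, so a topic containing both "maple" and "tree" would be named "tree".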
14.
Conclusion and Future Work
- Next, we will work on topic-based (context-based) search using this model.
- In particular, we will handle the geo-intent of queries and determine the topic to which each query belongs, for better information retrieval.
15.
References
- D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
- Jun Fu Cai, Wee Sun Lee, and Yee Whye Teh. NUS-ML: Improving Word Sense Disambiguation Using Topic Features. SEMEVAL, 2007.
- David M. Blei and Jon D. McAuliffe. Supervised Topic Models. NIPS, 2007.
- WordNet. http://www.shiffman.net/teaching/a2z/wordnet