Minor Project

Latent Dirichlet Allocation for Topic Modelling.

Published in: Education, Technology

Transcript

  • 1. Context Based Search
      By Shatabdi Kundu (2010EET2553), Computer Technology, M.Tech, IIT Delhi. Email ID: shatabdikundu@live.com
      Project Guide: Prof. Santanu Chaudhury, Electrical Engineering Department, IIT Delhi. Email ID: santanuc@ee.iitd.ac.in
      Shatabdi Kundu :: 2010EET2553 Prof. Santanu Chaudhury 09 MAY 2011
  • 2. Outline
      • Introduction to Topic Models: Probabilistic Modelling
      • Latent Dirichlet Allocation
      • Topic Discovery using WordNet
      • Work Done
      • Results
      • Conclusion and Future Work
      • References
  • 3. Probabilistic Modelling
      • Treat data as observations that arise from a generative probabilistic process that includes hidden variables.
          • For documents, the hidden variables reflect the thematic structure of the collection.
      • Infer the hidden structure using posterior inference.
          • What are the topics that describe this collection?
      • Situate new data into the estimated model.
          • How does this query or new document fit into the estimated topic structure?
  • 4. Intuition behind LDA
  • 5. Generative Process
      • Cast these intuitions into a generative probabilistic process.
      • Each document is a random mixture of corpus-wide topics.
      • Each word is drawn from one of those topics.
  • 6. Graphical Models
      • Nodes are random variables.
      • Edges denote possible dependence.
      • Observed variables are shaded.
      • Plates denote replicated structure.
  • 7. Graphical Models
      The structure of the graph defines the pattern of conditional dependence among the ensemble of random variables. E.g. this graph corresponds to

      $$p(y, x_1, \ldots, x_N) = p(y) \prod_{n=1}^{N} p(x_n \mid y) \qquad (1)$$
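
As a rough illustration of the factorisation in equation (1), the sketch below evaluates the joint for binary variables. The probability tables are hypothetical numbers chosen only for this example; nothing in the slides fixes them.

```python
from itertools import product
from math import prod

# Hypothetical probability tables, chosen only for illustration.
p_y = {0: 0.6, 1: 0.4}                      # p(y)
p_x_given_y = {0: {0: 0.9, 1: 0.1},         # p(x_n | y = 0)
               1: {0: 0.3, 1: 0.7}}         # p(x_n | y = 1)

def joint(y, xs):
    """Joint p(y, x_1..x_N) = p(y) * prod_n p(x_n | y), as in equation (1)."""
    return p_y[y] * prod(p_x_given_y[y][x] for x in xs)

# Sanity check: the joint sums to 1 over every configuration of (y, x_1..x_N).
N = 3
total = sum(joint(y, xs) for y in p_y for xs in product((0, 1), repeat=N))
print(round(total, 12))
```

Because every x_n depends only on y, the joint needs one table for p(y) and one shared table for p(x_n | y), rather than a table over all 2^(N+1) configurations.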
  • 8. Latent Dirichlet Allocation
      1. Draw each topic β_k ∼ Dir(η), for k ∈ {1, ..., K}
      2. For each document d:
          1. Draw topic proportions θ_d ∼ Dir(α)
          2. For each word n:
              1. Draw Z_{d,n} ∼ Mult(θ_d)
              2. Draw W_{d,n} ∼ Mult(β_{Z_{d,n}})
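
The generative steps above can be sketched in a few lines of standard-library Python. This is only a simulation of the process, not the inference code used in the project. It assumes symmetric Dirichlet priors (scalar `eta` and `alpha`) and a fixed document length `N`; the function names and the tiny vocabulary are hypothetical.

```python
import random

def dirichlet(params, rng):
    """Sample from Dir(params) by normalising independent Gamma(a, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, K, D, N, eta, alpha, seed=0):
    """Simulate the LDA generative process for D documents of N words each."""
    rng = random.Random(seed)
    V = len(vocab)
    # Step 1: one word distribution beta_k ~ Dir(eta) per topic.
    beta = [dirichlet([eta] * V, rng) for _ in range(K)]
    thetas, docs = [], []
    for _ in range(D):
        # Step 2.1: topic proportions theta_d ~ Dir(alpha).
        theta = dirichlet([alpha] * K, rng)
        words = []
        for _ in range(N):
            # Step 2.2.1: topic assignment Z_{d,n} ~ Mult(theta_d).
            z = rng.choices(range(K), weights=theta)[0]
            # Step 2.2.2: word W_{d,n} ~ Mult(beta_{Z_{d,n}}).
            words.append(rng.choices(vocab, weights=beta[z])[0])
        thetas.append(theta)
        docs.append(words)
    return beta, thetas, docs

beta, thetas, docs = generate_corpus(
    ["maple", "oak", "car", "bus"], K=2, D=3, N=5, eta=0.5, alpha=0.5, seed=1
)
for doc in docs:
    print(" ".join(doc))
```

Only the words in `docs` would be observed in practice; `beta`, `thetas`, and the per-word assignments are the hidden variables that inference has to recover.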
  • 9. Latent Dirichlet Allocation
      • From a collection of documents, infer:
          • Per-word topic assignment Z_{d,n}
          • Per-document topic proportions θ_d
          • Per-corpus topic distributions β_k
      • Use posterior expectations to perform the task at hand, e.g. information retrieval, document similarity, etc.
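
For reference, the quantity being inferred can be written out from the generative process on the previous slide. The shorthand β_{1:K}, θ_{1:D} for the full collections, and N_d for the length of document d, is introduced here for compactness and does not appear in the slides.

```latex
% Joint distribution implied by the generative process
p(\beta_{1:K}, \theta_{1:D}, Z, W \mid \eta, \alpha)
  = \prod_{k=1}^{K} p(\beta_k \mid \eta)
    \prod_{d=1}^{D} \left[ p(\theta_d \mid \alpha)
      \prod_{n=1}^{N_d} p(Z_{d,n} \mid \theta_d)\,
                        p(W_{d,n} \mid \beta_{Z_{d,n}}) \right]

% Posterior over the hidden variables given the observed words
p(\beta_{1:K}, \theta_{1:D}, Z \mid W, \eta, \alpha)
  = \frac{p(\beta_{1:K}, \theta_{1:D}, Z, W \mid \eta, \alpha)}
         {p(W \mid \eta, \alpha)}
```

The denominator requires summing over every possible topic assignment, which is intractable to compute exactly, so the posterior is approximated in practice (for example with variational inference or Gibbs sampling).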
  • 10. Topic Discovery using WordNet
      • Lexical relations are used to find the latent topics.
      • Synsets (synonym sets) are the basic units.
      • Hyponymy: a semantic relation between word meanings, e.g. {maple} is a hyponym of {tree}.
      • Hypernymy: the inverse of hyponymy, e.g. {tree} is a hypernym of {maple}.
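
To make the two relations concrete, here is a minimal sketch over a hypothetical toy taxonomy. It is not the WordNet database or its API: real WordNet relates synsets rather than bare words, and a synset can have more than one hypernym. The dictionary and function names are assumptions made for illustration.

```python
# Hypothetical toy taxonomy: each word maps to its direct hypernym (parent).
HYPERNYM = {
    "maple": "tree",
    "oak": "tree",
    "tree": "plant",
    "rose": "flower",
    "flower": "plant",
    "plant": "organism",
}

def hypernym_path(word):
    """Return [word, parent, grandparent, ...] up to the root of the toy taxonomy."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def hyponyms(word):
    """Return the direct hyponyms (children) of a word, sorted alphabetically."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == word)

print(hypernym_path("maple"))
print(hyponyms("tree"))
```

Hypernymy walks up the taxonomy and hyponymy walks down it, which is why one is described as the inverse of the other.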
  • 11. Work Done
      • I took a collection of 10 documents with a total of around 28K words.
      • I removed the stop words and rare words, along with punctuation marks and numbers.
      • I then trained a 7-topic LDA model on this corpus.
      • This gave 7 topics, each represented by its 5 most probable words.
      • I then used the lexical relations of WordNet to identify the hidden topics, using the common parents of all the words in each topic.
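
The cleaning step described above might look roughly like the following. This is a sketch under assumptions, not the project's code: the stop-word list is a deliberately tiny hypothetical one, and "rare" is taken to mean a corpus-wide count below a hypothetical `min_count` threshold, since the slides do not state the cutoff.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on"}

def preprocess(docs, min_count=2):
    """Lowercase, drop punctuation and numbers, then remove stop words and rare words."""
    # Keeping only alphabetic runs discards punctuation marks and numbers.
    tokenised = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    # Count each word across the whole corpus to decide which words are rare.
    counts = Counter(tok for doc in tokenised for tok in doc)
    return [
        [tok for tok in doc if tok not in STOP_WORDS and counts[tok] >= min_count]
        for doc in tokenised
    ]

cleaned = preprocess([
    "The maple tree, planted in 2010, is tall.",
    "A tree and an oak tree!",
])
print(cleaned)
```

The cleaned token lists are the input a topic model would be trained on.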
  • 12. Results after training LDA model
      • The model only selects the words that belong to each topic; it does not name the topic.
      • The topic name is discovered using WordNet.
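
One way to read "naming a topic by the common parent of its words" is to pick the most specific ancestor shared by every word in the topic. The sketch below does this over a hypothetical toy taxonomy in which each word has a single parent. It is not the WordNet API, and the slides do not confirm that the project used exactly this rule.

```python
# Hypothetical toy taxonomy: each word maps to its direct hypernym (parent).
HYPERNYM = {
    "maple": "tree",
    "oak": "tree",
    "tree": "plant",
    "rose": "flower",
    "flower": "plant",
    "plant": "organism",
}

def ancestors(word):
    """Return the word followed by its hypernyms, nearest first."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def common_parent(words):
    """Return the most specific node shared by all words' hypernym paths, or None."""
    if not words:
        return None
    paths = [ancestors(w) for w in words]
    shared = set(paths[0]).intersection(*paths[1:])
    # paths[0] is ordered from specific to general, so the first shared node
    # encountered is the lowest common parent.
    for node in paths[0]:
        if node in shared:
            return node
    return None

print(common_parent(["maple", "oak"]))
print(common_parent(["maple", "oak", "rose"]))
```

Adding a less closely related word to a topic pushes the shared parent higher up the taxonomy, so the resulting label becomes more general.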
  • 13. Results after applying WordNet
      • The above result gives the hidden topic names for the words that made up the documents.
      • This kind of model can be used to identify a topic when given only a word.
  • 14. Conclusion and Future Work
      • The next step is to use this model for search based on topics (context).
      • In particular, we will work with the geo-intent of queries and decide which topic each query belongs to, for better information retrieval.
  • 15. References
      • D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
      • Jun Fu Cai, Wee Sun Lee, and Yee Whye Teh. NUS-ML: Improving Word Sense Disambiguation Using Topic Features. SemEval (2007).
      • David M. Blei and Jon D. McAuliffe. Supervised Topic Models. NIPS (2007).
      • WordNet. http://www.shiffman.net/teaching/a2z/wordnet
  • 16. Thank You
