Movie topics: Efficient features for movie recommendation systems

User-written movie reviews carry a substantial number of movie-related features, such as descriptions of location, time period, genre, and characters. Using natural language processing and topic modeling techniques, it is possible to extract these features from movie reviews and find movies with similar features.

  1. Efficient Features for Movie Recommendation Systems. Project presentation by Suvir Bhargav
  2. Outline
     ● Motivation and why movie reviews
     ● Problem statement
     ● How? The overall system
     ● Text preprocessing approaches
     ● Postprocessing: movie topics from a reviews corpus
     ● Similarity
     ● Experimental setup and results
  3. Motivation (image: thanks to Sean Lind)
  4. Motivation
     ● Movie genres are not enough.
     ● Classify movies by
       ○ keywords
       ○ moods
       ○ IMDb ratings
       ○ micro-genres
  5. Micro-genres
  6. Why movie reviews? Source: a sample user-written movie review from IMDb
  7. Problem statement
     ● Extract features from user reviews of movies.
     ● Use the extracted features to find similar movies.
  8. The overall system: movie reviews corpus
     ● Preprocessing
       ○ tokenization, stopword removal, lemmatization
     ● Post processing
       ○ topic modeling: movie topics from a reviews corpus
     ● Similarity measure
       ○ return movies with similar topic distributions
  9. Text preprocessing: tokenization, stopword removal, lemmatization. Simple information extraction (figure credit: NLTK book).
  10. Post processing: document representation with the Vector Space Model (VSM). Picture credit: pyevolve
  11. Post processing: generative model. Source: David Blei's slide
  12. Post processing: LDA. For each document in the collection, the words can be generated in a two-stage process:
      1) Randomly choose a distribution over topics.
      2) For each word in the document:
         a) Randomly choose a topic from the distribution over topics chosen in step 1.
         b) Randomly choose a word from the corresponding distribution over the vocabulary.
      Documents exhibit multiple topics.
  13. Movie topics from a reviews corpus
  14. Similarity measure
      ● Cosine similarity
      ● KL divergence
      ● Hellinger distance
  15. Similarity measure: cosine similarity
  16. Similarity measure: Hellinger distance
  17. The overall system: implementation. Movie reviews corpus
      ● Preprocessing
        ○ NLTK and gensim's simple preprocessing
      ● Post processing
        ○ gensim Python wrapper for MALLET
        ○ index the topic distributions of the query movies, q, and the 1k-movie corpus, C
      ● Similarity measure
        ○ Python NumPy implementation
        ○ apply the distance metric to the indexed q and C
        ○ sort and pick the top 5 movies
  18. Experimental setup: movie reviews corpus of 1k movies. Reviews data source: IMDb
  19. Experimental setup: evaluation criteria
  20. Conclusion
      ● Movie topics as efficient features for recommendation systems (RS)
        ○ represent movies by underlying semantic patterns
        ○ useful for capturing movie genre and mood
        ○ but work less well for plot
        ○ user-written movie reviews are useful movie metadata
      ● The developed prototype
        ○ easy to add more movie metadata
        ○ Python allows scalability
        ○ using topics as an explanation needs further tuning
  21. Future directions
      ● Movie review preprocessing
        ○ bigrams, trigrams
        ○ create multi-word movie keywords or language constructions
      ● Building complex topic models
        ○ hierarchical LDA
        ○ author-topic model
          ■ include authorship information
          ■ similarity between authors
  22. Thank you. Questions?
  23. Extra slides: list of extra slides and notes
      ● Original LDA paper
      ● Introduction to probabilistic topic modeling
      ● A. Huang's "Similarity measures for text document clustering"
      ● Another good LDA description
      ● Integrating out multinomial parameters in LDA
      ● Language construction in micro-genres
  24. LDA
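
The two-stage generative process described on the LDA slide (slide 12) can be sketched in a few lines of Python. This is a minimal illustration, not the presentation's implementation (the slides use gensim's wrapper for MALLET). The toy vocabulary, the two topic-word distributions, the Dirichlet prior `alpha`, and the helper names `sample_dirichlet` and `generate_document` are all assumptions made for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, doc_length, alpha, rng):
    """Generate one document following the two-stage process of LDA."""
    # Stage 1: randomly choose a distribution over topics for this document.
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(topic_word)))
    words = []
    for _ in range(doc_length):
        # Stage 2a: randomly choose a topic from the document's topic distribution.
        z = rng.choices(topic_ids, weights=theta, k=1)[0]
        # Stage 2b: randomly choose a word from that topic's distribution
        # over the vocabulary.
        words.append(rng.choices(vocab, weights=topic_word[z], k=1)[0])
    return theta, words


# Hypothetical toy setup: two topics over a four-word vocabulary.
vocab = ["space", "alien", "love", "wedding"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # a "sci-fi"-like topic
    [0.0, 0.0, 0.5, 0.5],  # a "romance"-like topic
]
rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, doc_length=10,
                                 alpha=[0.5, 0.5], rng=rng)
print(theta, words)
```

Because `theta` is drawn per document, a single document can mix words from several topics, which is the sense in which documents exhibit multiple topics. Fitting LDA runs this story in reverse: it infers the topic distributions from observed words.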