Topic Extraction using Machine Learning
Sanjib Basak
Director of Data Science, Digital River
January 2016
Twin Cities Big Data Analytics and Apache Spark User Group meetup
Agenda
• History of Topic Models
• A Use Case
• Demo using R
• Demo using Spark
• Conclusion
History of Topic Modeling
TF-IDF
• TF-IDF model (Salton and McGill, 1983)
• A basic vocabulary of “words” or “terms” is chosen, and, for each
document in the corpus, a count is formed of the number of
occurrences of each word (term frequency, TF).
• After suitable normalization, this term frequency count is
compared to an inverse document frequency (IDF) count, which
measures the number of occurrences of a word in the entire
corpus (generally on a log scale).
• Not a generative model
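A minimal sketch of the TF-IDF weighting described above, in plain Python. The normalization used here (term count divided by document length, and IDF as the log of total documents over documents containing the term) is one common variant and an assumption of this example; the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy corpus (hypothetical), already tokenized.
docs = [
    ["spark", "topic", "model"],
    ["topic", "model", "lda"],
    ["kmeans", "cluster", "topic"],
]
scores = tf_idf(docs)
```

A term that appears in every document ("topic" here) gets weight zero, while a term unique to one document gets the largest weight in that document.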
History of Topic Modeling
LSI
• To address the shortcomings of TF-IDF, Deerwester et al. (1990)
proposed the LSI (Latent Semantic Indexing) model.
• LSI uses a singular value decomposition of the term-document matrix
to identify a linear subspace in the space of TF-IDF features that
captures most of the variance in the collection.
• The authors claim that the model can capture some aspects of basic
linguistic notions such as synonymy and polysemy.
• Still not a probabilistic model of the distribution of words
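The decomposition behind LSI can be written as a truncated SVD. The symbols are notation introduced for this note, not taken from the slides: X is the m × n term-document matrix of TF-IDF features (m terms, n documents), and k is the chosen number of latent dimensions.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k \in \mathbb{R}^{n \times k}
```

Here σ_1 ≥ … ≥ σ_k are the k largest singular values of X. Each row of V_k gives a document's coordinates in the k-dimensional subspace.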
History of Topic Modeling
pLSI
• Hofmann (1999) presented the probabilistic Latent Semantic
Indexing (pLSI) model, also known as the aspect model, as an
alternative to LSI.
• Models each word in a document as a sample from a mixture
model, where the mixture components are multinomial random
variables that can be viewed as representations of “topics.”
• The model is still incomplete
• Not a probabilistic model at the level of documents
• Each document is represented as a list of numbers (the mixing
proportions for topics), with no generative model for those numbers
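In the notation of Blei, Ng, and Jordan (2003), the pLSI mixture can be written as below, where d is a document label, w_n a word in that document, and z the unobserved topic. The symbols are theirs rather than the slides'.

```latex
p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)
```

The mixing proportions p(z | d) are the "list of numbers" attached to each training document. Because d is only an index into the training set, the model does not say how to assign probability to a previously unseen document.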
History of Topic Modeling
LDA
• De Finetti (1990) establishes that any
collection of exchangeable random variables
has a representation as a mixture
distribution, in general an infinite mixture.
This line of thinking leads to the latent
Dirichlet allocation (LDA) model.
• Blei, Ng, and Jordan (2003) introduced LDA
• Hierarchical Bayesian model: each item of a
collection (here, each document) is modeled as a
finite mixture over an underlying set of topics.
Each topic is, in turn, modeled as an infinite
mixture over an underlying set of topic probabilities.
(Figure taken from Wikipedia)
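The generative story behind LDA can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not code from the talk: the two toy topics, the Dirichlet parameter `alpha`, and the document length are all made up, and the topics are fixed by hand rather than drawn from a prior.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document: draw topic proportions, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    words = []
    for _ in range(n_words):
        # Choose a topic for this word position according to theta.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Choose a word from that topic's word distribution.
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics: each maps words to probabilities.
topics = [
    {"spark": 0.5, "scala": 0.3, "cluster": 0.2},
    {"topic": 0.4, "model": 0.4, "lda": 0.2},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
```

Fitting LDA runs this story in reverse: given only the words, infer the topic proportions and the topics.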
LDA
• The original paper used a variational Bayes approximation of
the posterior distribution
• Alternative inference techniques include Gibbs sampling,
expectation maximization (EM), online variational inference, and
many more.
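As an illustration of one of the alternatives, here is a compact sketch of collapsed Gibbs sampling for LDA. It is not the variational method of the original paper and not the code used in the demos. The hyperparameters `alpha` and `beta`, the iteration count, and the toy documents (lists of integer word ids) are assumptions of this example.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA over documents given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # tokens per (document, topic)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                              # tokens per topic
    assignments = []

    # Randomly initialize the topic assignment of every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimate of each document's topic proportions.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word

# Toy corpus (hypothetical): word ids 0-1 and 2-3 tend to co-occur.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
theta, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=4, n_iters=50)
```

Each row of `theta` is a document's estimated mixture over topics, and `topic_word` holds the final word counts per topic.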
Model Workflow
• Step 1: Preprocessing
• Step 2: Create Document Term Matrix
• Step 3: Apply Models
• Review Results
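Steps 1 and 2 of the workflow can be sketched as follows. The tokenization rule, the tiny stopword list, and the sample texts are assumptions made for illustration; a real pipeline would likely use a fuller stopword list and possibly stemming.

```python
import re
from collections import Counter

# Illustrative stopword list only.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def document_term_matrix(texts):
    """Return (vocabulary, matrix) with one row of term counts per document."""
    tokenized = [preprocess(t) for t in texts]
    vocab = sorted({term for doc in tokenized for term in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

texts = ["The topic of the talk is topic models.", "Spark and R demos."]
vocab, dtm = document_term_matrix(texts)
```

The resulting matrix has one row per document and one column per vocabulary term, which is the input that both K-Means and LDA expect in step 3.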
K-Means
• Choose the number of clusters (K)
• Initialize the clusters: pick one observation as the initial
centroid of each cluster
• Assign each observation to the cluster whose centroid is
closest
• Recompute each cluster center as the mean of its
assigned observations
• Repeat the assignment and update steps until convergence
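The steps above can be sketched in standard-library Python. The squared Euclidean distance, the random choice of initial observations, and the toy points are assumptions of this example, not details from the demo.

```python
import random

def kmeans(points, k, n_iters=100, seed=0):
    """Cluster points (tuples of floats) into k groups; return (centroids, labels)."""
    rng = random.Random(seed)
    # Initialize: pick k distinct observations as the starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Toy data (hypothetical): two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, labels = kmeans(points, k=2)
```

Each point ends up with exactly one label, which is the "distinct" assignment that the conclusion contrasts with LDA's mixtures.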
Demo in R
• Use Case
• Model with K-Means
• Model with LDA and visualization
• GitHub code location - https://github.com/sanjibb/R-Code
K-Means Result
Experimentation with Spark
MLlib
• Work with the dataset in Scala
• 2 variations of the LDA optimizer:
• EM (expectation-maximization) optimizer
• Online variational inference optimizer -
http://www.cs.columbia.edu/~blei/papers/WangPaisleyBlei2011.pdf
GitHub code location
• https://github.com/sanjibb/spark_example
Conclusion
1. LDA represents each document as a mixture of topics, whereas K-Means
assigns each document to one distinct cluster
   • In real life, topics may not be distinctly separated
2. An unsupervised LDA model may require working with SMEs to get
a better representation of topics
   • There is also a supervised LDA model (sLDA), which I have not covered in this presentation
Bibliography
https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf
https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
http://vis.stanford.edu/files/2012-Termite-AVI.pdf
http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf



Editor's Notes

  1. Thus, if we wish to consider exchangeable representations for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents.