SlideShare a Scribd company logo
©2013 LinkedIn Corporation. All Rights Reserved.
Latent Dirichlet Allocation (LDA)
- for ML-IR Discussion Group
1
Prepared by Wayne Tai Lee, Satpreet Singh
©2013 LinkedIn Corporation. All Rights Reserved.
Latent Dirichlet Allocation:
A Bayesian Unsupervised Learning Model
Roadmap
2
• Unsupervised learning
• Bayesian Statistics
• Mixture Models
• LDA – theory and intuition
• LDA – practice and applications
©2013 LinkedIn Corporation. All Rights Reserved.
Unsupervised Learning
Learning patterns with no labels
3
• Clustering is a form of “Unsupervised learning”
• Classification is known as supervised learning
• Validation is difficult
©2013 LinkedIn Corporation. All Rights Reserved. 4
How would you cluster?
©2013 LinkedIn Corporation. All Rights Reserved. 5
Documents of wikipedia
Now try these ones!
©2013 LinkedIn Corporation. All Rights Reserved.
Bayesian Statistics
A framework to update your beliefs
6
• Probabilities as beliefs
• Updates your belief as data is observed
• Requires a model that describes the data generation
©2013 LinkedIn Corporation. All Rights Reserved. 7
Candidate potential
Example: Evaluating Candidates
©2013 LinkedIn Corporation. All Rights Reserved. 8
Candidate potential
Example: Evaluating Candidates
Schooling
Experience
Interview
Internship
©2013 LinkedIn Corporation. All Rights Reserved. 9
Candidate potential
Example: Evaluating Candidates
Schooling
Experience
Interview
Internship
How to update?!
©2013 LinkedIn Corporation. All Rights Reserved. 10
©2013 LinkedIn Corporation. All Rights Reserved. 11
Model for candidates Model for data generation
©2013 LinkedIn Corporation. All Rights Reserved.
Mixture Models
A popular statistical model
12
• An easy way to build hierarchical relationships
©2013 LinkedIn Corporation. All Rights Reserved.
Mixture models visualized
13
Candidate Quality
High
Low
©2013 LinkedIn Corporation. All Rights Reserved. 14
Marginal Distribution of Candidate Performance: ignore quality
©2013 LinkedIn Corporation. All Rights Reserved. 15
Distribution of Candidate Performance:
©2013 LinkedIn Corporation. All Rights Reserved. 16
Distribution of Candidate Performance:
Mixture Weights
©2013 LinkedIn Corporation. All Rights Reserved. 17
Mixture Weights
Distribution of Candidate Performance:
©2013 LinkedIn Corporation. All Rights Reserved. 18
Distribution of Candidate Performance:
?
? ?
?
©2013 LinkedIn Corporation. All Rights Reserved.
How are words in a document generated?
19
©2013 LinkedIn Corporation. All Rights Reserved.
One possibility:
20
Each word comes from different topics (bag of words: ignore order)
©2013 LinkedIn Corporation. All Rights Reserved.
How are words in a document generated?
21
Each word comes from different topics
Mixture Weight
for Topic k
Multinomial Distribution
over ALL words based
on topic k
©2013 LinkedIn Corporation. All Rights Reserved.
Just a mixture model
22
Word
Topic 1
Topic K
Leadership
Big Data
Machine Learning
©2013 LinkedIn Corporation. All Rights Reserved.
Just a mixture model
23
Word
Topic 1
Topic K
Leadership
Big Data
Machine Learning
1) Pick a topic
2) Pick a word
©2013 LinkedIn Corporation. All Rights Reserved.
Just a mixture model
24
Word
Topic 1
Topic K
Leadership
Big Data
Machine Learning
The chosen
Topic: Z
©2013 LinkedIn Corporation. All Rights Reserved.
Just a mixture model
25
Word
Topic 1
Topic K
Leadership
Big Data
Machine Learning
So we really want to know
1) Z
2) _
3) _
The chosen
Topic: Z
©2013 LinkedIn Corporation. All Rights Reserved.
Just a mixture model
26
Word
Topic 1
Topic K
Leadership
Big Data
Machine Learning
So we really want to know
1) Z (cluster for the word)
2) (document composition)
3) (key words)
The chosen
Topic: Z
©2013 LinkedIn Corporation. All Rights Reserved.
Review!
27
Z W
©2013 LinkedIn Corporation. All Rights Reserved. 28
Zd,n
k=1…K
Wd,n
n=1,…,Nd
d=1,…,D
K: number of topics
Nd: number of words
D: number of documents
©2013 LinkedIn Corporation. All Rights Reserved. 29
Zd,n
k=1…K
Wd,n
n=1,…,Nd
d=1,…,D
K: number of topics
Nd: number of words
D: number of documents
Bayesian: But what about the distribution for and ??
©2013 LinkedIn Corporation. All Rights Reserved. 30
Zd,n
k=1…K
Wd,n
n=1,…,Nd
d=1,…,D
K: number of topics
Nd: number of words
D: number of documents
Bayesian: But what about the distribution for and ??
©2013 LinkedIn Corporation. All Rights Reserved. 31
and control the “sparsity” of the weights for the multinomial.
Implications: a priori we assume
- Topics have few key words
- Documents only have a small subset of topics
©2013 LinkedIn Corporation. All Rights Reserved.
Dirichlet Distribution with Different Sparsity Parameters
32
©2013 LinkedIn Corporation. All Rights Reserved. 33
Latent Dirichlet Allocation!!!
Zd,n
k=1…K
Wd,n
n=1,…,Nd
©2013 LinkedIn Corporation. All Rights Reserved. 34
How do we fit this model?
Want the posterior:
Worst part of Bayesian Analysis…..personally speaking~
©2013 LinkedIn Corporation. All Rights Reserved. 35
Two main ways to get posterior:
- Sampling methods
- Asymtotically correct
- Time consuming
- Lots of black magic in sampling tricks
- Variational methods (practical solution!)
- An approximation with no guarantees
- Faster
- Need math skills
©2013 LinkedIn Corporation. All Rights Reserved. 36
Variational Bayes (specifically mean field variational bayes):
What’s crazy?
- Assumes all the latent variables are independent
What’s not crazy?
- Finds the “best” model within this crazy class.
- Best under KL divergence
Empirically have shown promising results!
For “sufficient” details:
“Explaining Variational Approximations ” by Ormerod and Wand
©2013 LinkedIn Corporation. All Rights Reserved.
LDA Take Home
37
- An intuitively appealing Bayesian unsupervised learning model
- Training is difficult
- Lots of packages exist, main issue is scalability
- Validation is difficult
- Usually cast into a supervised learning framework
- Presentation is difficult
- Visualization for the Bayesian model is hard.

More Related Content

What's hot

Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
Marco Righini
 
Classification Techniques
Classification TechniquesClassification Techniques
Classification Techniques
Kiran Bhowmick
 
Sequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learningSequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learning
Roberto Pereira Silveira
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
Nik Spirin
 
Text clustering
Text clusteringText clustering
Text clusteringKU Leuven
 
Text summarization
Text summarizationText summarization
Text summarization
Akash Karwande
 
Dimensionality reduction: SVD and its applications
Dimensionality reduction: SVD and its applicationsDimensionality reduction: SVD and its applications
Dimensionality reduction: SVD and its applications
Viet-Trung TRAN
 
Text Classification
Text ClassificationText Classification
Text Classification
RAX Automation Suite
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLP
Rupak Roy
 
NAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERNAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIER
Knoldus Inc.
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
Md. Main Uddin Rony
 
5.5 graph mining
5.5 graph mining5.5 graph mining
5.5 graph mining
Krish_ver2
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsSelman Bozkır
 
Linear and Logistics Regression
Linear and Logistics RegressionLinear and Logistics Regression
Linear and Logistics Regression
Mukul Kumar Singh Chauhan
 
Abstractive Text Summarization
Abstractive Text SummarizationAbstractive Text Summarization
Abstractive Text Summarization
Tho Phan
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
Bhaskar Mitra
 
Lecture_16_Self-supervised_Learning.pptx
Lecture_16_Self-supervised_Learning.pptxLecture_16_Self-supervised_Learning.pptx
Lecture_16_Self-supervised_Learning.pptx
Karimdabbabi
 
What is the Expectation Maximization (EM) Algorithm?
What is the Expectation Maximization (EM) Algorithm?What is the Expectation Maximization (EM) Algorithm?
What is the Expectation Maximization (EM) Algorithm?
Kazuki Yoshida
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualization
Dr. Hamdan Al-Sabri
 

What's hot (20)

Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
 
Classification Techniques
Classification TechniquesClassification Techniques
Classification Techniques
 
Topics Modeling
Topics ModelingTopics Modeling
Topics Modeling
 
Sequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learningSequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learning
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
Text clustering
Text clusteringText clustering
Text clustering
 
Text summarization
Text summarizationText summarization
Text summarization
 
Dimensionality reduction: SVD and its applications
Dimensionality reduction: SVD and its applicationsDimensionality reduction: SVD and its applications
Dimensionality reduction: SVD and its applications
 
Text Classification
Text ClassificationText Classification
Text Classification
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLP
 
NAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERNAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIER
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 
5.5 graph mining
5.5 graph mining5.5 graph mining
5.5 graph mining
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systems
 
Linear and Logistics Regression
Linear and Logistics RegressionLinear and Logistics Regression
Linear and Logistics Regression
 
Abstractive Text Summarization
Abstractive Text SummarizationAbstractive Text Summarization
Abstractive Text Summarization
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
 
Lecture_16_Self-supervised_Learning.pptx
Lecture_16_Self-supervised_Learning.pptxLecture_16_Self-supervised_Learning.pptx
Lecture_16_Self-supervised_Learning.pptx
 
What is the Expectation Maximization (EM) Algorithm?
What is the Expectation Maximization (EM) Algorithm?What is the Expectation Maximization (EM) Algorithm?
What is the Expectation Maximization (EM) Algorithm?
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualization
 

Similar to LDA Beginner's Tutorial

Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips.
Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips. Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips.
Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips.
Hakka Labs
 
Computing Professional Identity for the Economic Graph
Computing Professional Identity for the Economic GraphComputing Professional Identity for the Economic Graph
Computing Professional Identity for the Economic Graph
Vitaly Gordon
 
Big Data World 2013 - How LinkedIn leveraged its data to become the world's l...
Big Data World 2013 - How LinkedIn leveraged its data to become the world's l...Big Data World 2013 - How LinkedIn leveraged its data to become the world's l...
Big Data World 2013 - How LinkedIn leveraged its data to become the world's l...Vitaly Gordon
 
SF Data Science: Developing Data Products
SF Data Science: Developing Data ProductsSF Data Science: Developing Data Products
SF Data Science: Developing Data Products
Peter Skomoroch
 
Workshop - Neo4j Graph Data Science
Workshop - Neo4j Graph Data ScienceWorkshop - Neo4j Graph Data Science
Workshop - Neo4j Graph Data Science
Neo4j
 
Developing Data Products
Developing Data ProductsDeveloping Data Products
Developing Data Products
Peter Skomoroch
 
MIT Sloan: Intro to Machine Learning
MIT Sloan: Intro to Machine LearningMIT Sloan: Intro to Machine Learning
MIT Sloan: Intro to Machine Learning
Lex Fridman
 
Mathematicians, Social Scientists, or Engineers? The Split Minds of Software ...
Mathematicians, Social Scientists, or Engineers? The Split Minds of Software ...Mathematicians, Social Scientists, or Engineers? The Split Minds of Software ...
Mathematicians, Social Scientists, or Engineers? The Split Minds of Software ...
Lionel Briand
 
Getstarteddssd12717sd
Getstarteddssd12717sdGetstarteddssd12717sd
Getstarteddssd12717sd
Thinkful
 
Relationships Matter: Using Connected Data for Better Machine Learning
Relationships Matter: Using Connected Data for Better Machine LearningRelationships Matter: Using Connected Data for Better Machine Learning
Relationships Matter: Using Connected Data for Better Machine Learning
Neo4j
 
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bhaskar Ghosh
 
Big Data and HR - Talk @SwissHR Congress
Big Data and HR - Talk @SwissHR CongressBig Data and HR - Talk @SwissHR Congress
Big Data and HR - Talk @SwissHR Congress
Marcel Blattner, PhD
 
Open Source Data Visualization for Resource Sharing: An Ivy Plus Libraries Pr...
Open Source Data Visualization for Resource Sharing: An Ivy Plus Libraries Pr...Open Source Data Visualization for Resource Sharing: An Ivy Plus Libraries Pr...
Open Source Data Visualization for Resource Sharing: An Ivy Plus Libraries Pr...
Heidi Nance
 
Social Search in a Professional Context
Social Search in a Professional ContextSocial Search in a Professional Context
Social Search in a Professional Context
Daniel Tunkelang
 
7 Badass SlideShare Tactics - Jason Miller (Social Fresh WEST 2013)
7 Badass SlideShare Tactics - Jason Miller (Social Fresh WEST 2013)7 Badass SlideShare Tactics - Jason Miller (Social Fresh WEST 2013)
7 Badass SlideShare Tactics - Jason Miller (Social Fresh WEST 2013)Social Fresh Conference
 
Building Enterprise Knowledge Using Semantic Encyclopedias
Building Enterprise Knowledge Using Semantic EncyclopediasBuilding Enterprise Knowledge Using Semantic Encyclopedias
Building Enterprise Knowledge Using Semantic Encyclopedias
Bernadette Clemente
 
Knowledge Graphs and Generative AI
Knowledge Graphs and Generative AIKnowledge Graphs and Generative AI
Knowledge Graphs and Generative AI
Neo4j
 
Data-X-v3.1
Data-X-v3.1Data-X-v3.1
Data-X-v3.1
Ikhlaq Sidhu
 
Data-X-Sparse-v2
Data-X-Sparse-v2Data-X-Sparse-v2
Data-X-Sparse-v2
Ikhlaq Sidhu
 
Applied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelApplied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML model
Dataiku
 

Similar to LDA Beginner's Tutorial (20)

Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips.
Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips. Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips.
Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips.
 
Computing Professional Identity for the Economic Graph
Computing Professional Identity for the Economic GraphComputing Professional Identity for the Economic Graph
Computing Professional Identity for the Economic Graph
 
Big Data World 2013 - How LinkedIn leveraged its data to become the world's l...
Big Data World 2013 - How LinkedIn leveraged its data to become the world's l...Big Data World 2013 - How LinkedIn leveraged its data to become the world's l...
Big Data World 2013 - How LinkedIn leveraged its data to become the world's l...
 
SF Data Science: Developing Data Products
SF Data Science: Developing Data ProductsSF Data Science: Developing Data Products
SF Data Science: Developing Data Products
 
Workshop - Neo4j Graph Data Science
Workshop - Neo4j Graph Data ScienceWorkshop - Neo4j Graph Data Science
Workshop - Neo4j Graph Data Science
 
Developing Data Products
Developing Data ProductsDeveloping Data Products
Developing Data Products
 
MIT Sloan: Intro to Machine Learning
MIT Sloan: Intro to Machine LearningMIT Sloan: Intro to Machine Learning
MIT Sloan: Intro to Machine Learning
 
Mathematicians, Social Scientists, or Engineers? The Split Minds of Software ...
Mathematicians, Social Scientists, or Engineers? The Split Minds of Software ...Mathematicians, Social Scientists, or Engineers? The Split Minds of Software ...
Mathematicians, Social Scientists, or Engineers? The Split Minds of Software ...
 
Getstarteddssd12717sd
Getstarteddssd12717sdGetstarteddssd12717sd
Getstarteddssd12717sd
 
Relationships Matter: Using Connected Data for Better Machine Learning
Relationships Matter: Using Connected Data for Better Machine LearningRelationships Matter: Using Connected Data for Better Machine Learning
Relationships Matter: Using Connected Data for Better Machine Learning
 
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
 
Big Data and HR - Talk @SwissHR Congress
Big Data and HR - Talk @SwissHR CongressBig Data and HR - Talk @SwissHR Congress
Big Data and HR - Talk @SwissHR Congress
 
Open Source Data Visualization for Resource Sharing: An Ivy Plus Libraries Pr...
Open Source Data Visualization for Resource Sharing: An Ivy Plus Libraries Pr...Open Source Data Visualization for Resource Sharing: An Ivy Plus Libraries Pr...
Open Source Data Visualization for Resource Sharing: An Ivy Plus Libraries Pr...
 
Social Search in a Professional Context
Social Search in a Professional ContextSocial Search in a Professional Context
Social Search in a Professional Context
 
7 Badass SlideShare Tactics - Jason Miller (Social Fresh WEST 2013)
7 Badass SlideShare Tactics - Jason Miller (Social Fresh WEST 2013)7 Badass SlideShare Tactics - Jason Miller (Social Fresh WEST 2013)
7 Badass SlideShare Tactics - Jason Miller (Social Fresh WEST 2013)
 
Building Enterprise Knowledge Using Semantic Encyclopedias
Building Enterprise Knowledge Using Semantic EncyclopediasBuilding Enterprise Knowledge Using Semantic Encyclopedias
Building Enterprise Knowledge Using Semantic Encyclopedias
 
Knowledge Graphs and Generative AI
Knowledge Graphs and Generative AIKnowledge Graphs and Generative AI
Knowledge Graphs and Generative AI
 
Data-X-v3.1
Data-X-v3.1Data-X-v3.1
Data-X-v3.1
 
Data-X-Sparse-v2
Data-X-Sparse-v2Data-X-Sparse-v2
Data-X-Sparse-v2
 
Applied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelApplied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML model
 

More from Wayne Lee

Feature selection can hurt model inference
Feature selection can hurt model inferenceFeature selection can hurt model inference
Feature selection can hurt model inference
Wayne Lee
 
Explaining the Basics of Mean Field Variational Approximation for Statisticians
Explaining the Basics of Mean Field Variational Approximation for StatisticiansExplaining the Basics of Mean Field Variational Approximation for Statisticians
Explaining the Basics of Mean Field Variational Approximation for Statisticians
Wayne Lee
 
What is bayesian statistics and how is it different?
What is bayesian statistics and how is it different?What is bayesian statistics and how is it different?
What is bayesian statistics and how is it different?
Wayne Lee
 
R merge-tutorial
R merge-tutorialR merge-tutorial
R merge-tutorial
Wayne Lee
 
The Key to Blind Dates - Data Snooping
The Key to Blind Dates - Data SnoopingThe Key to Blind Dates - Data Snooping
The Key to Blind Dates - Data Snooping
Wayne Lee
 
Crash Course in A/B testing
Crash Course in A/B testingCrash Course in A/B testing
Crash Course in A/B testing
Wayne Lee
 
Introduction to Bag of Little Bootstrap
Introduction to Bag of Little Bootstrap Introduction to Bag of Little Bootstrap
Introduction to Bag of Little Bootstrap
Wayne Lee
 

More from Wayne Lee (7)

Feature selection can hurt model inference
Feature selection can hurt model inferenceFeature selection can hurt model inference
Feature selection can hurt model inference
 
Explaining the Basics of Mean Field Variational Approximation for Statisticians
Explaining the Basics of Mean Field Variational Approximation for StatisticiansExplaining the Basics of Mean Field Variational Approximation for Statisticians
Explaining the Basics of Mean Field Variational Approximation for Statisticians
 
What is bayesian statistics and how is it different?
What is bayesian statistics and how is it different?What is bayesian statistics and how is it different?
What is bayesian statistics and how is it different?
 
R merge-tutorial
R merge-tutorialR merge-tutorial
R merge-tutorial
 
The Key to Blind Dates - Data Snooping
The Key to Blind Dates - Data SnoopingThe Key to Blind Dates - Data Snooping
The Key to Blind Dates - Data Snooping
 
Crash Course in A/B testing
Crash Course in A/B testingCrash Course in A/B testing
Crash Course in A/B testing
 
Introduction to Bag of Little Bootstrap
Introduction to Bag of Little Bootstrap Introduction to Bag of Little Bootstrap
Introduction to Bag of Little Bootstrap
 

Recently uploaded

The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
DhatriParmar
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
camakaiclarkmusic
 
Digital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion DesignsDigital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion Designs
chanes7
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
Multithreading_in_C++ - std::thread, race condition
Multithreading_in_C++ - std::thread, race conditionMultithreading_in_C++ - std::thread, race condition
Multithreading_in_C++ - std::thread, race condition
Mohammed Sikander
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
DeeptiGupta154
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
tarandeep35
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
Vivekanand Anglo Vedic Academy
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
Atul Kumar Singh
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
timhan337
 
Chapter -12, Antibiotics (One Page Notes).pdf
Chapter -12, Antibiotics (One Page Notes).pdfChapter -12, Antibiotics (One Page Notes).pdf
Chapter -12, Antibiotics (One Page Notes).pdf
Kartik Tiwari
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
Peter Windle
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
vaibhavrinwa19
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
Jisc
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
MysoreMuleSoftMeetup
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 

Recently uploaded (20)

The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
 
Digital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion DesignsDigital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion Designs
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 
Multithreading_in_C++ - std::thread, race condition
Multithreading_in_C++ - std::thread, race conditionMultithreading_in_C++ - std::thread, race condition
Multithreading_in_C++ - std::thread, race condition
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
 
Chapter -12, Antibiotics (One Page Notes).pdf
Chapter -12, Antibiotics (One Page Notes).pdfChapter -12, Antibiotics (One Page Notes).pdf
Chapter -12, Antibiotics (One Page Notes).pdf
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
 

LDA Beginner's Tutorial

  • 1. ©2013 LinkedIn Corporation. All Rights Reserved. Latent Dirichlet Allocation (LDA) - for ML-IR Discussion Group 1 Prepared by Wayne Tai Lee, Satpreet Singh
  • 2. ©2013 LinkedIn Corporation. All Rights Reserved. Latent Dirichlet Allocation: A Bayesian Unsupervised Learning Model Roadmap 2 • Unsupervised learning • Bayesian Statistics • Mixture Models • LDA – theory and intuition • LDA – practice and applications
  • 3. ©2013 LinkedIn Corporation. All Rights Reserved. Unsupervised Learning Learning patterns with no labels 3 • Clustering is a form of “Unsupervised learning” • Classification is known as supervised learning • Validation is difficult
  • 4. ©2013 LinkedIn Corporation. All Rights Reserved. 4 How would you cluster?
  • 5. ©2013 LinkedIn Corporation. All Rights Reserved. 5 Documents of wikipedia Now try these ones!
  • 6. ©2013 LinkedIn Corporation. All Rights Reserved. Bayesian Statistics A framework to update your beliefs 6 • Probabilities as beliefs • Updates your belief as data is observed • Requires a model that describes the data generation
  • 7. ©2013 LinkedIn Corporation. All Rights Reserved. 7 Candidate potential Example: Evaluating Candidates
  • 8. ©2013 LinkedIn Corporation. All Rights Reserved. 8 Candidate potential Example: Evaluating Candidates Schooling Experience Interview Internship
  • 9. ©2013 LinkedIn Corporation. All Rights Reserved. 9 Candidate potential Example: Evaluating Candidates Schooling Experience Interview Internship How to update?!
  • 10. ©2013 LinkedIn Corporation. All Rights Reserved. 10
  • 11. ©2013 LinkedIn Corporation. All Rights Reserved. 11 Model for candidates Model for data generation
  • 12. ©2013 LinkedIn Corporation. All Rights Reserved. Mixture Models A popular statistical model 12 • An easy way to build hierarchical relationships
  • 13. ©2013 LinkedIn Corporation. All Rights Reserved. Mixture models visualized 13 Candidate Quality High Low
  • 14. ©2013 LinkedIn Corporation. All Rights Reserved. 14 Marginal Distribution of Candidate Performance: ignore quality
  • 15. ©2013 LinkedIn Corporation. All Rights Reserved. 15 Distribution of Candidate Performance:
  • 16. ©2013 LinkedIn Corporation. All Rights Reserved. 16 Distribution of Candidate Performance: Mixture Weights
  • 17. ©2013 LinkedIn Corporation. All Rights Reserved. 17 Mixture Weights Distribution of Candidate Performance:
  • 18. ©2013 LinkedIn Corporation. All Rights Reserved. 18 Distribution of Candidate Performance: ? ? ? ?
  • 19. ©2013 LinkedIn Corporation. All Rights Reserved. How are words in a document generated? 19
  • 20. ©2013 LinkedIn Corporation. All Rights Reserved. One possibility: 20 Each word comes from different topics (bag of words: ignore order)
  • 21. ©2013 LinkedIn Corporation. All Rights Reserved. How are words in a document generated? 21 Each word comes from different topics Mixture Weight for Topic k Multinomial Distribution over ALL words based on topic k
  • 22. ©2013 LinkedIn Corporation. All Rights Reserved. Just a mixture model 22 Word Topic 1 Topic K Leadership Big Data Machine Learning
  • 23. ©2013 LinkedIn Corporation. All Rights Reserved. Just a mixture model 23 Word Topic 1 Topic K Leadership Big Data Machine Learning 1) Pick a topic 2) Pick a word
  • 24. ©2013 LinkedIn Corporation. All Rights Reserved. Just a mixture model 24 Word Topic 1 Topic K Leadership Big Data Machine Learning The chosen Topic: Z
  • 25. ©2013 LinkedIn Corporation. All Rights Reserved. Just a mixture model 25 Word Topic 1 Topic K Leadership Big Data Machine Learning So we really want to know 1) Z 2) _ 3) _ The chosen Topic: Z
  • 26. ©2013 LinkedIn Corporation. All Rights Reserved. Just a mixture model 26 Word Topic 1 Topic K Leadership Big Data Machine Learning So we really want to know 1) Z (cluster for the word) 2) (document composition) 3) (key words) The chosen Topic: Z
  • 27. ©2013 LinkedIn Corporation. All Rights Reserved. Review! 27 Z W
  • 28. ©2013 LinkedIn Corporation. All Rights Reserved. 28 Zd,n k=1…K Wd,n n=1,…,Nd d=1,…,D K: number of topics Nd: number of words D: number of documents
  • 29. ©2013 LinkedIn Corporation. All Rights Reserved. 29 Zd,n k=1…K Wd,n n=1,…,Nd d=1,…,D K: number of topics Nd: number of words D: number of documents Bayesian: But what about the distribution for and ??
  • 30. ©2013 LinkedIn Corporation. All Rights Reserved. 30 Zd,n k=1…K Wd,n n=1,…,Nd d=1,…,D K: number of topics Nd: number of words D: number of documents Bayesian: But what about the distribution for and ??
  • 31. ©2013 LinkedIn Corporation. All Rights Reserved. 31 and control the “sparsity” of the weights for the multinomial. Implications: a priori we assume - Topics have few key words - Documents only have a small subset of topics
  • 32. ©2013 LinkedIn Corporation. All Rights Reserved. Dirichlet Distribution with Different Sparsity Parameters 32
  • 33. ©2013 LinkedIn Corporation. All Rights Reserved. 33 Latent Dirichlet Allocation!!! Zd,n k=1…K Wd,n n=1,…,Nd
  • 34. ©2013 LinkedIn Corporation. All Rights Reserved. 34 How do we fit this model? Want the posterior: Worst part of Bayesian Analysis…..personally speaking~
  • 35. ©2013 LinkedIn Corporation. All Rights Reserved. 35 Two main ways to get posterior: - Sampling methods - Asymtotically correct - Time consuming - Lots of black magic in sampling tricks - Variational methods (practical solution!) - An approximation with no guarantees - Faster - Need math skills
  • 36. ©2013 LinkedIn Corporation. All Rights Reserved. 36 Variational Bayes (specifically mean field variational bayes): What’s crazy? - Assumes all the latent variables are independent What’s not crazy? - Finds the “best” model within this crazy class. - Best under KL divergence Empirically have shown promising results! For “sufficient” details: “Explaining Variational Approximations ” by Ormerod and Wand
  • 37. ©2013 LinkedIn Corporation. All Rights Reserved. LDA Take Home 37 - An intuitively appealing Bayesian unsupervised learning model - Training is difficult - Lots of packages exist, main issue is scalability - Validation is difficult - Usually cast into a supervised learning framework - Presentation is difficult - Visualization for the Bayesian model is hard.

Editor's Notes

  1. Take home: validation is difficult….no true answer here.
  2. Clustering documents is difficult because many repeated words are used. Some documents may be similar to one another on different topics. So we might want to cluster allowing membership.
  3. 2 stage process
  4. Example: the word usage of “professional” is probably higher in the topic of professional network than a social network.
  5. 2 stage process
  6. 2 stage process
  7. 2 stage process
  8. 2 stage process
  9. 2 stage process
  10. Example: the word usage of “professional” is probably higher in the topic of professional network than a social network.
  11. Example: the word usage of “professional” is probably higher in the topic of professional network than a social network.
  12. Example: the word usage of “professional” is probably higher in the topic of professional network than a social network.
  13. Example: the word usage of “professional” is probably higher in the topic of professional network than a social network.
  14. Example: the word usage of “professional” is probably higher in the topic of professional network than a social network.
  15. Example: the word usage of “professional” is probably higher in the topic of professional network than a social network.
  16. Example: the word usage of “professional” is probably higher in the topic of professional network than a social network.
  17. Example: the word usage of “professional” is probably higher in the topic of professional network than a social network.
  18. Example: the word usage of “professional” is probably higher in the topic of professional network than a social network.