Copyright 2017 Aaron Li (aaron@potatos.io)
Topic Modelling
Aaron Li
aaron@potatos.io
for news recommendation, user behaviour modelling, and many more
About me
• Working on a stealth startup
• Former lead inference engineer at Scaled Inference
• Did AI / Machine Learning at Google Research,
NICTA, CMU, ANU, etc.
• https://www.linkedin.com/in/aaronqli/
Overview
• Theory (2 classes, 2h each)
• work out the problem & solutions & why
• discuss the math & models & NLP fundamentals
• Industry use cases & systems & applications
• Practice (2 classes, 2h each)
• live demo + coding + debugging
• data sets, open source tools, Q & A
Overview
• Background Knowledge
• Linear Algebra
• Probability Theory
• Calculus
• Scala / Go / Node / C++ (please vote)
Theory 1
What is news recommendation?
What is topic modeling? Why?
Basic architecture
NLP fundamentals
Basic model: LDA
Practice 1
LDA live demo
NLP tools introduction
Preprocessed Datasets
Code LDA + Experiments
Open source tools for industry
Theory 2
LDA Inference
Gibbs sampling
SparseLDA, AliasLDA, LightLDA
Applications & Industrial use cases
Practice 2
Set up NLP pipeline
SparseLDA, AliasLDA, LightLDA
Train & use the model

News recommendation demo
Schedule
News Recommendation
• A lot of people read news every day
• Flipboard, CNN, Facebook, WeChat …

• How do we make people more engaged?
• Personalise & Recommendation
• learn preference and show relevant content
• recommend articles based on the current one
News Recommendation
• Top websites / apps already doing this
News Recommendation
News Recommendation
Flipboard
Yahoo! News (now “Oath” News)
News Recommendation
• Many websites don’t do it (e.g. CNN)
• Why not? It’s not an easy problem
• Challenges
• News article vocabulary is large (100k ~ 1M words)
• Documents are represented as high-dimensional
vectors of word counts over the vocabulary
• Traditional similarity measures don’t work well on such vectors
News Recommendation
Example
In 1996 Linus Torvalds, the Finnish creator of the Open Source
operating system Linux, visited the National Zoo and Aquarium with
members of the Canberra Linux Users Group, and was captivated
by one of the Zoo's little Penguins. Legend has it that Linus was
infected with a mythical disease called Penguinitis. Penguinitis
makes you stay awake at night thinking about Penguins and feeling
great love towards them.
Not long after this event the Open Source Software community
decided they needed a logo for Linux. They were looking for
something fun and after Linus mentioned his fondness of penguins,
a slightly overweighted penguin sitting down after having a great
meal seemed to fit the bill perfectly. Hence, Tux the penguin was
created and now when people think of Linux they think of Tux.
Example
• Word count = 132, unique words = 91
• Very hard to measure its distance to other articles in our
database talking about Linux, Linus Torvalds, and the
creation of Tux
• Distance measures that work in low-dimensional spaces aren’t effective here
• e.g. cosine similarity on raw word counts is not very informative
• Need to represent documents as low-dimensional vectors
• Capture semantics / topics efficiently
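To make the problem concrete, here is a minimal sketch of cosine similarity on raw word counts. The three short texts and the helper names (`bag_of_words`, `cosine_similarity`) are made up for illustration and are not from the slides. Two texts that are both about Linux share no words and score zero, while a zoo text scores higher simply because it repeats "the" and "penguin".

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lower-case, split on whitespace, and count each word."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical mini-documents (not from any real dataset).
doc_a = bag_of_words("tux the penguin is the linux mascot")
doc_b = bag_of_words("torvalds created an open source operating system")
doc_c = bag_of_words("the penguin exhibit at the zoo")

sim_ab = cosine_similarity(doc_a, doc_b)  # same subject, no shared words
sim_ac = cosine_similarity(doc_a, doc_c)  # different subject, shared words
print(sim_ab, sim_ac)
```

Word overlap, not meaning, drives the score here, which is the motivation for moving to a low-dimensional topic representation.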
Solutions
Step 1. Get text data (i.e. documents)
• News articles
• Emails
• Legal docs
• Resumes
• …
Step 2. ??? (Machines can’t read raw text)
Step 3. Model & Train
Step 4. Deploy & Predict
Step 2: NLP Preprocessing - common pipeline
Solutions
Sentence splitting
Tokenisation
Stop words removal
Stemming (optional)
POS Tagging
Lemmatisation
Form bag of words
There are a lot more…
(used in advanced NLP tasks)
Chunking
Named Entity Recognition
Sentiment Analysis
Syntactic Analysis
Dependency Parsing
Coreference Resolution
Entity Relationship Extraction
Semantic Analysis
…
NLP Preprocessing
Sentence splitting
• Mostly rule-based (regex or FST)
• Look for sentence delimiters
• For English: . ! ? etc.
• Check out the Wikipedia article
• Open source code is good
• Also check out this article
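A rule-based splitter can be sketched with a single regular expression. This is a minimal illustration, not a production splitter: the rule (split after `.`, `!` or `?` when followed by whitespace and an upper-case letter) and the name `split_sentences` are assumptions made here, and the rule will wrongly split after abbreviations such as "Dr. Smith".

```python
import re

# Boundary rule: end punctuation, then whitespace, then an upper-case letter.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Split text into sentences using the simple boundary rule above."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]

sentences = split_sentences(
    "Linus visited the zoo. He liked the penguins! Did Tux come later?"
)
for sentence in sentences:
    print(sentence)
```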
NLP Preprocessing
Tokenisation
• Find word boundaries
• Easy for English (look for spaces)
• Hard for Chinese etc.
• Solutions: FST, CRF, etc.
• Difficulties: see Wikipedia article
• Try making one yourself using
FST! (CMU 11711 homework)
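For English, a regex tokeniser is often enough to get started. The sketch below is an assumption-laden simplification (ASCII letters and digits only, one optional apostrophe suffix, everything lower-cased); the name `tokenise` is made up for illustration. It does not address languages without spaces, where FST or CRF approaches are used instead.

```python
import re

# A token is a run of letters/digits, optionally followed by 's, 't, etc.
_TOKEN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")

def tokenise(sentence):
    """Return lower-cased word tokens, dropping punctuation."""
    return [token.lower() for token in _TOKEN.findall(sentence)]

tokens = tokenise(
    "Legend has it that Linus was captivated by one of the Zoo's penguins."
)
print(tokens)
```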
NLP Preprocessing
Stop words removal
• Stop words:
• occur frequently
• carry little semantic meaning
• e.g. am, is, who, what, etc.
• Small set of words
• Easy to implement
• e.g. in-memory hash set
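The in-memory hash set approach can be sketched in a few lines. The stop word list below is a tiny hypothetical sample, not a standard list, and `remove_stop_words` is a name chosen for this example.

```python
# A deliberately small, illustrative stop word set (real lists are longer).
STOP_WORDS = frozenset(
    {"a", "an", "the", "am", "is", "are", "was", "who", "what", "of", "by", "it", "that"}
)

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop word set."""
    return [token for token in tokens if token not in STOP_WORDS]

filtered = remove_stop_words(["linus", "was", "captivated", "by", "the", "penguins"])
print(filtered)
```

Set membership is a constant-time lookup on average, so this step adds little cost even on large corpora.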
NLP Preprocessing
Stemming (optional)
• Reduce each word to its root (stem)
• Commonly used in IR systems
• The root can be a non-word, e.g.
• fishing, fished, fisher => fish
• cats, catty => cat
• argument, arguing => argu
• Rule-based implementation
• e.g. Porter’s Snowball stemmer
Also see Wikipedia article
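To show what "rule based" means here, the sketch below strips a few suffixes in a fixed order. It is a toy written for this example and is not the Porter or Snowball algorithm: the suffix list, the minimum stem length, and the name `toy_stem` are all assumptions, and it does not handle cases such as "catty".

```python
# Suffixes are tried in order; the first one that matches is removed.
SUFFIXES = ("ment", "ing", "ed", "er", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["fishing", "fished", "fisher", "cats", "argument", "arguing"]
stems = [toy_stem(word) for word in words]
print(stems)
```

Note that "argu" is not a word, which is acceptable for stemming because only consistency of the mapping matters.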
NLP Preprocessing
POS Tagging
• POS = Part of Speech
• Find the grammatical role of each word, e.g.

I    ate  a   fish
PRP  VBD  DT  NN

• Disambiguate the same word used
in different contexts, e.g.
• “Train” as in “train a model”
• “Train” as in “catch a train”
• Techniques: HMM, CRF, etc.
• See this article for more details
NLP Preprocessing
Lemmatisation
• Find the base form (lemma) of a word
• More complex than stemming
• Uses POS tag information
• Different rules for different POS
• The base form is a valid word, e.g.
• walks, walking, walked => walk
• am, are, is => be
• argument (NN) => argument
• arguing (VBG) => argue
• See Wikipedia article for details
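The difference from stemming can be sketched as a lookup for irregular forms plus POS-specific suffix rules. Everything in this example is hypothetical: the tiny exception table, the rule sets, and the name `toy_lemmatise` are made up to reproduce the examples above, and real lemmatisers rely on much larger dictionaries.

```python
# Irregular forms, keyed by (word, coarse POS). "V" = verb, "N" = noun.
IRREGULAR = {
    ("am", "V"): "be",
    ("are", "V"): "be",
    ("is", "V"): "be",
    ("arguing", "V"): "argue",
}

# Different suffix rules for different parts of speech.
SUFFIX_RULES = {"V": ("ing", "ed", "s"), "N": ("s",)}

def toy_lemmatise(word, pos_tag):
    """Map a word to its base form using its Penn-style POS tag."""
    coarse = pos_tag[0]  # "VBG" -> "V", "NN" -> "N"
    if (word, coarse) in IRREGULAR:
        return IRREGULAR[(word, coarse)]
    for suffix in SUFFIX_RULES.get(coarse, ()):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tagged = [("walks", "VBZ"), ("walking", "VBG"), ("walked", "VBD"),
          ("is", "VBZ"), ("argument", "NN"), ("arguing", "VBG")]
lemmas = [toy_lemmatise(word, tag) for word, tag in tagged]
print(lemmas)
```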
NLP Preprocessing
Form bag of words
• Index pre-processed documents
and words with id and frequency

• e.g:
• id:1 word:(train, VBG) freq: 5
• id:2 word:(model, NN) freq: 2
• id:3 word:(train, NN) freq: 3
• …
See UCI Bag of Words dataset
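A minimal sketch of this indexing step, assuming the input is a list of (lemma, POS tag) pairs for one document. The function name `build_bag_of_words` and the choice to assign ids in first-seen order starting from 1 are assumptions made for this example.

```python
from collections import Counter

def build_bag_of_words(tagged_tokens):
    """Return (vocabulary, frequencies): term -> id, and id -> count."""
    counts = Counter(tagged_tokens)
    vocabulary = {}
    for term in tagged_tokens:  # ids follow first-seen order, starting at 1
        if term not in vocabulary:
            vocabulary[term] = len(vocabulary) + 1
    frequencies = {vocabulary[term]: count for term, count in counts.items()}
    return vocabulary, frequencies

# Hypothetical pre-processed document matching the example above.
document = ([("train", "VBG")] * 5) + ([("model", "NN")] * 2) + ([("train", "NN")] * 3)
vocabulary, frequencies = build_bag_of_words(document)
for term, term_id in vocabulary.items():
    print(f"id:{term_id} word:{term} freq: {frequencies[term_id]}")
```

Keeping the POS tag in the key is what separates "train" the verb from "train" the noun.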
Solutions
• Modelling & Training
• Naive Bayes
• Latent Semantic Analysis
• word2vec, doc2vec, …
• Topic Modelling
• Naive Bayes (very old technique)
• Uses only keywords to get probabilities for K labels
• Good for spam detection
• Poor performance for news recommendation
• Does not capture semantics / topics
• https://web.stanford.edu/class/cs124/lec/naivebayes.pdf
Solutions
• Latent Semantic Analysis (~1990 - 2000)
• SVD on a TF-IDF matrix with documents as columns and
words as rows
• Gives a low-rank approximation of the matrix and represents documents
as low-dimensional vectors
• Problems: vectors are hard to interpret, and the implied
probability distribution (Gaussian) is a poor fit for word counts
• https://nlp.stanford.edu/IR-book/html/htmledition/latent-semantic-indexing-1.html
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Kathryn B.
Laskey and Henri Prade, editors, UAI,
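The low-rank step can be written out as a truncated SVD. The symbols below are assumptions chosen for this note, not notation from the slides: $X$ is the $V \times D$ TF-IDF matrix (words as rows, documents as columns) and $k$ is the number of dimensions kept.

```latex
X \;\approx\; X_k \;=\; U_k \, \Sigma_k \, V_k^{\top},
\qquad
U_k \in \mathbb{R}^{V \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{D \times k}

% Document j is represented by the j-th column of \Sigma_k V_k^{\top},
% a vector in \mathbb{R}^{k} with k much smaller than V.
```

Keeping the $k$ largest singular values gives the best rank-$k$ approximation of $X$ in the least-squares (Frobenius norm) sense, which is where the implicit Gaussian assumption comes from.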
Solutions
• word2vec, doc2vec (2013~)
• Convert words to dense, low-dimensional, compositional
vectors (e.g. king - man + woman ≈ queen)
• Good for classification problems
• Slow to train, hard to interpret (because of neural network),
yet to be tested in industrial use cases
• Mikolov, Tomas; et al. "Efficient Estimation of Word
Representations in Vector Space" ICLR 2013.
• Getting started with word2vec
Solutions
• Topic Models (LDA etc., 2003~)
• Define a generative structure involving latent variables (e.g. topics)
using well-structured distributions and infer the parameters
• Represent documents / words using low-dimensional, highly
interpretable distributions
• Extensively used in industry. Many open source tools
• Extensive research on speeding up / scaling up
• D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation.
Journal of Machine Learning Research, 3:993–1022, 2003
• Tutorial: Parameter estimation for text analysis, Gregor Heinrich 2008
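As a rough illustration of what "generative structure" means for LDA, the sketch below generates a synthetic corpus: each topic is a distribution over words, each document is a distribution over topics, and each word is drawn by first picking a topic and then picking a word from it. The function names, the hyperparameter values, and the use of normalised Gamma draws to sample from a symmetric Dirichlet are choices made for this example only.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalising Gamma samples."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [draw / total for draw in draws]

def generate_corpus(num_docs, doc_length, num_topics, vocab_size,
                    alpha=0.5, beta=0.1, seed=0):
    """Generate documents as lists of word ids following the LDA story."""
    rng = random.Random(seed)
    # One word distribution per topic.
    topics = [sample_dirichlet(beta, vocab_size, rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # One topic distribution per document.
        theta = sample_dirichlet(alpha, num_topics, rng)
        doc = []
        for _ in range(doc_length):
            z = rng.choices(range(num_topics), weights=theta)[0]      # latent topic
            w = rng.choices(range(vocab_size), weights=topics[z])[0]  # observed word
            doc.append(w)
        corpus.append(doc)
    return corpus

corpus = generate_corpus(num_docs=3, doc_length=20, num_topics=2, vocab_size=10)
print(corpus[0])
```

Inference runs this story in reverse: only the word ids are observed, and the topic assignments and distributions have to be recovered.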
Solutions
Topic Models
Latent Dirichlet Allocation (LDA)
Image from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]
• LDA (Latent Dirichlet Allocation)
• Arguably the most popular topic model since 2003
• Created by David Blei, Andrew Ng, Michael Jordan
• To be practical, we use this topic model in class
Topic Models
LDA
Extracted from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]
LDA
Extracted from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]
LDA
Extracted from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]
Example
Example
Extracted from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]
Example
Extracted from [BleiNgJordan2003, Latent Dirichlet Allocation]
LDA
• Task: infer parameters
• each document’s representation as a topic vector
• with this we can compute document similarity!
• each topic’s representation as a distribution over words (counts)
• with this we can inspect each topic manually and
interpret its meaning!
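Once every document is a short topic vector, similarity becomes cheap to compute. The sketch below uses the Hellinger distance, one common choice for comparing probability distributions; the slides do not say which measure is used, so this is an assumption, and the three topic vectors are invented for illustration.

```python
import math

def hellinger_distance(p, q):
    """Distance between two discrete distributions: 0 = identical, 1 = disjoint."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

# Hypothetical topic vectors over 3 topics: (software, animals, sport).
linux_article = [0.70, 0.20, 0.10]
tux_article = [0.60, 0.30, 0.10]
football_article = [0.05, 0.05, 0.90]

near = hellinger_distance(linux_article, tux_article)
far = hellinger_distance(linux_article, football_article)
print(round(near, 3), round(far, 3))
```

Recommending "articles like the current one" then reduces to a nearest-neighbour search over these vectors.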
Theory 1
End of class
Questions?
Industrial Applications &
Use cases
• Yi Wang et al. Peacock: Learning Long-Tail Topic Features for
Industrial Applications (TIST 2014)
• Advertising system in production
• Aaron Li et al. High Performance Latent Variable Models (arXiv,
2014)
• User preference learning from search data
• Arnab Bhadury, Clustering Similar Stories Using LDA
• News Recommendation
• And many more… search “AliasLDA” or “LightLDA” on Google
LDA Inference
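Bayes rule: $p(z \mid w) = \dfrac{p(w \mid z)\, p(z)}{\sum_{z'} p(w \mid z')\, p(z')}$
where $z$ denotes all latent variables and $w$ the observed words
In LDA, the topic assignment for each word is latent
LDA Inference
Intractable: $K^L$ terms in the denominator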
What can we do to address intractability?
• Gibbs sampling
• Variational inference (not discussed in class)
LDA Inference
Estimate by sampling:
LDA Inference
Gibbs sampling:
Sample $z_i$ (the topic of word $i$) if $z_{-i}$ (all other topic assignments) is known
LDA Inference
We can compute $p(z_i = k \mid z_{-i}, w)$ using Bayes rule
This quantity is called the “predictive probability”. It
can be applied to the latent variable which assigns a
topic to each word
i.e. compute the probability that a word is assigned
a particular topic, given the other topic assignments
and the data (docs, words)
LDA Inference
Derive predictive probability
LDA Inference
Putting everything together
LDA Inference
LDA Inference
Terms on the right are known all the time!
We can compute the predictive probability (left
term) by normalising over all k’s
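For reference, a commonly used form of this predictive probability for LDA with symmetric priors is sketched below. The notation is an assumption made for this note and may differ from the slides: $n_{d,k}$ counts words in document $d$ assigned to topic $k$, $n_{k,w}$ counts how often word $w$ is assigned to topic $k$, $n_k$ is the total count for topic $k$, $V$ is the vocabulary size, $\alpha$ and $\beta$ are the Dirichlet hyperparameters, and the superscript $-i$ means the current word is excluded from the counts.

```latex
p(z_i = k \mid z_{-i}, w)
\;\propto\;
\left( n_{d,k}^{-i} + \alpha \right)
\frac{n_{k,w_i}^{-i} + \beta}{n_{k}^{-i} + V\beta}

% Normalising over all K topics gives a proper distribution:
p(z_i = k \mid z_{-i}, w)
=
\frac{\left( n_{d,k}^{-i} + \alpha \right)
      \dfrac{n_{k,w_i}^{-i} + \beta}{n_{k}^{-i} + V\beta}}
     {\sum_{k'=1}^{K} \left( n_{d,k'}^{-i} + \alpha \right)
      \dfrac{n_{k',w_i}^{-i} + \beta}{n_{k'}^{-i} + V\beta}}
```

Every term on the right is a count or a fixed hyperparameter, so the sampler only needs to maintain three count tables.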
LDA Inference
• Algorithm (Gibbs sampling):
  • Randomly assign a topic to each word in each doc
  • For T iterations (a large number to ensure convergence)
    • For each doc
      • For each word
        • For each topic, compute the predictive prob
        • Sample a new topic from the normalised predictive probs
  • Repeat for T’ iterations (a small number) and compute topic counts per
  word and per doc. Use them to estimate the topic-word and
  document-topic distributions
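The algorithm above can be sketched as a small collapsed Gibbs sampler. This is an illustrative sketch under stated assumptions, not the implementation used in class: documents are lists of integer word ids, priors are symmetric with made-up default values, the function name `gibbs_lda` and the names `theta` (document-topic) and `phi` (topic-word) are choices made here, and the estimates are read off the final sample only instead of being averaged over T’ extra iterations.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (theta, phi)."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # counts per doc and topic
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # counts per topic and word
    topic_total = [0] * num_topics                               # counts per topic
    assignments = []

    # Randomly assign a topic to each word.
    for d, doc in enumerate(docs):
        z_doc = [rng.randrange(num_topics) for _ in doc]
        for w, k in zip(doc, z_doc):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current word from the counts.
                k_old = assignments[d][i]
                doc_topic[d][k_old] -= 1
                topic_word[k_old][w] -= 1
                topic_total[k_old] -= 1
                # Unnormalised predictive probability for each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                # rng.choices normalises the weights internally.
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][w] += 1
                topic_total[k_new] += 1

    theta = [[(n + alpha) / (len(doc) + num_topics * alpha) for n in doc_topic[d]]
             for d, doc in enumerate(docs)]
    phi = [[(n + beta) / (topic_total[k] + vocab_size * beta) for n in topic_word[k]]
           for k in range(num_topics)]
    return theta, phi

# Tiny hypothetical corpus over a 4-word vocabulary.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
theta, phi = gibbs_lda(docs, num_topics=2, vocab_size=4, iterations=50)
print(theta)
```

Each inner step costs time proportional to the number of topics, since a weight is computed for every topic before one is sampled.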
LDA Inference
Speed up LDA
(Switch to my KDD 2014 slides)
https://www.slideshare.net/AaronLi11/kdd-2014-presentation-best-research-paper-award-alias-topic-modelling-reducing-the-sampling-complexity-of-topic-models
Theory 2
End of class
Questions?

GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 

Topic Modelling: for news recommendation, user behaviour modelling, and many more

  • 1. Topic Modelling for news recommendation, user behaviour modelling, and many more. Aaron Li (aaron@potatos.io). Copyright 2017 Aaron Li (aaron@potatos.io)
  • 2. About me • Working on a stealth startup • Former lead inference engineer at Scaled Inference • Did AI / Machine Learning at Google Research, NICTA, CMU, ANU, etc. • https://www.linkedin.com/in/aaronqli/
  • 3. Overview • Theory (2 classes, 2h each) • work out the problem & solutions & why • discuss the math & models & NLP fundamentals • Industry use cases & systems & applications • Practice (2 classes, 2h each) • live demo + coding + debugging • data sets, open source tools, Q & A
  • 4. Overview • Background Knowledge • Linear Algebra • Probability Theory • Calculus • Scala / Go / Node / C++ (please vote)
  • 5. Schedule • Theory 1: What is news recommendation? What is topic modelling? Why? Basic architecture. NLP fundamentals. Basic model: LDA • Practice 1: LDA live demo. NLP tools introduction. Preprocessed datasets. Code LDA + experiments. Open source tools for industry • Theory 2: LDA inference. Gibbs sampling. SparseLDA, AliasLDA, LightLDA. Applications & industrial use cases • Practice 2: Set up NLP pipeline. SparseLDA, AliasLDA, LightLDA. Train & use the model. News recommendation demo
  • 6. News Recommendation
  • 7. News Recommendation • A lot of people read news every day • Flipboard, CNN, Facebook, WeChat … • How do we make people more engaged? • Personalisation & recommendation • learn preferences and show relevant content • recommend articles based on the current one
  • 8. News Recommendation • Top websites / apps are already doing this
  • 9. News Recommendation: Flipboard
  • 10. News Recommendation: Yahoo! News (now “Oath” News)
  • 11. News Recommendation • Many websites don’t do it (e.g. CNN) • Why not? It’s not an easy problem • Challenges • News article vocabulary is large (100k ~ 1M words) • Documents are represented by high-dimensional vectors of vocabulary counts • Traditional similarity measures don’t work
  • 12. Example: In 1996 Linus Torvalds, the Finnish creator of the Open Source operating system Linux, visited the National Zoo and Aquarium with members of the Canberra Linux Users Group, and was captivated by one of the Zoo's little Penguins. Legend has it that Linus was infected with a mythical disease called Penguinitis. Penguinitis makes you stay awake at night thinking about Penguins and feeling great love towards them. Not long after this event the Open Source Software community decided they needed a logo for Linux. They were looking for something fun and after Linus mentioned his fondness of penguins, a slightly overweighted penguin sitting down after having a great meal seemed to fit the bill perfectly. Hence, Tux the penguin was created and now when people think of Linux they think of Tux.
  • 13. Example • Word count = 132, unique words = 91 • Very hard to measure its distance to other articles in our database talking about Linux, Linus Torvalds, and the creation of Tux • Distance measures designed for low-dimensional spaces aren’t effective here • e.g. cosine similarity won’t make sense • Need to represent things in low-dimensional vectors • Capture semantics / topics efficiently
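To make the point about raw word-count vectors concrete, here is a minimal sketch in Python. The three snippets are hypothetical (they are not taken from the slides), and the whitespace tokeniser is a deliberate simplification. It shows that two texts on the same subject can score zero cosine similarity when they share no words, while an unrelated text scores higher because of one shared word.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-case, split on whitespace, and count each word."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical snippets: doc_a and doc_b are about the same subject
# but have no word in common.
doc_a = bag_of_words("tux penguin mascot linux")
doc_b = bag_of_words("torvalds kernel logo bird")
# An unrelated snippet that happens to share one word with doc_a.
doc_c = bag_of_words("penguin colony antarctica ice")

sim_ab = cosine_similarity(doc_a, doc_b)  # related pair
sim_ac = cosine_similarity(doc_a, doc_c)  # unrelated pair
```

Here the related pair scores lower than the unrelated pair, which is the failure mode that motivates a low-dimensional topic representation.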
  • 14. Solutions • Step 1. Get text data: news articles, emails, legal docs, resumes, … (i.e. documents) • Step 2. ??? (machines can’t read text) • Step 3. Model & train • Step 4. Deploy & predict
  • 15. Solutions • Step 2: NLP preprocessing, common pipeline: Sentence splitting, Tokenisation, Stop words removal, Stemming (optional), POS tagging, Lemmatisation, Form bag of words • There are a lot more (used in advanced NLP tasks): Chunking, Named Entity Recognition, Sentiment Analysis, Syntactic Analysis, Dependency Parsing, Coreference Resolution, Entity Relationship Extraction, Semantic Analysis, …
  • 16. NLP Preprocessing: Sentence splitting • Mostly rules (by regex or FST) • Look for sentence splitters • For English: , . ! ? etc. • Check out the Wikipedia article • Open source code is good • Also check out this article
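A rule-based splitter can be sketched with a single regular expression. This is a minimal illustration, assuming English text where a sentence ends with `.`, `!` or `?` followed by whitespace and a capital letter; the function name `split_sentences` is made up for this example. It will wrongly split after abbreviations such as "Dr.", which is why production splitters carry many more rules.

```python
import re

# Split on whitespace that follows ., ! or ? and precedes an upper-case letter.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Return the sentences of `text`, using one punctuation rule."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]


sentences = split_sentences(
    "Linus visited the zoo. He liked the penguins! Did Tux come later?"
)
```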
  • 17. NLP Preprocessing: Tokenisation • Find boundaries for words • Easy for English (look for spaces) • Hard for Chinese etc. • Solution: FST, CRF, etc. • Difficulties: see the Wikipedia article • Try making one yourself using FST! (CMU 11711 homework)
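For English, a first-pass tokeniser can be a regular expression over letters and digits. The sketch below is an assumption-laden simplification (ASCII only, apostrophes kept inside words, everything lower-cased) and does not address languages without spaces, where the FST or CRF approaches mentioned above are needed.

```python
import re

# A token is a run of letters/digits, optionally followed by 's, n't, etc.
_TOKEN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")


def tokenise(sentence):
    """Return lower-cased word tokens, dropping punctuation."""
    return [t.lower() for t in _TOKEN.findall(sentence)]


tokens = tokenise("Linus's penguin isn't overweight, is it?")
```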
  • 18. NLP Preprocessing: Stop words removal • Stop words: • occur frequently • semantically not meaningful • e.g. am, is, who, what, etc. • Small set of words • Easy to implement • e.g. in-memory hashset
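The in-memory hash set approach takes only a few lines. The stop word list below is a tiny hypothetical sample chosen for this example; real lists usually contain a few hundred words.

```python
# A small, illustrative stop word set (real lists are much longer).
STOP_WORDS = frozenset({"a", "am", "an", "are", "is", "of", "the", "what", "who"})


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t not in STOP_WORDS]


kept = remove_stop_words(["who", "is", "the", "creator", "of", "linux"])
```

Membership tests on a `frozenset` take constant time on average, so this step costs little even on large corpora.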
  • 19. NLP Preprocessing: Stemming (optional) • Reduce a word to its root (stem) • Usually used in IR systems • The root can be a non-word, e.g. • fishing, fished, fisher => fish • cats, catty => cat • argument, arguing => argu • Rule-based implementation • e.g. Porter’s Snowball stemmer • Also see the Wikipedia article
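To show what "rule-based" means here, the following toy suffix stripper reproduces the examples on this slide. It is not the Porter or Snowball algorithm: the suffix list and the minimum stem length are assumptions picked for this illustration, and it will mis-stem many real words.

```python
# Suffixes tried in order; longer ones come first so "ment" wins over "s".
_SUFFIXES = ("ment", "ing", "ed", "er", "ty", "s")
_MIN_STEM_LENGTH = 3  # never strip a word down to fewer than 3 characters


def toy_stem(word):
    """Strip the first matching suffix, if enough of the word remains."""
    for suffix in _SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= _MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


stems = [toy_stem(w) for w in ["fishing", "fished", "fisher", "cats", "catty"]]
```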
  • 20. NLP Preprocessing: POS Tagging • POS = Part of Speech • Find the grammatical role of each word, e.g. I/PRP ate/VBD a/DT fish/NN • Disambiguate the same word used in different contexts, e.g. • “train” as in “train a model” • “train” as in “catch a train” • Techniques: HMM, CRF, etc. • See this article for more details
  • 21. NLP Preprocessing: Lemmatisation • Find the base form of a word • More complex than stemming • Uses POS tag information • Different rules for different POS • The base form is a valid word, e.g. • walks, walking, walked => walk • am, are, is => be • argument (NN) => argument • arguing (VBG) => argue • See the Wikipedia article for details
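The sketch below illustrates how the POS tag selects the rule. It is a toy, not a real lemmatiser: the exception table and the suffix rules are hypothetical and cover only the examples on this slide, whereas practical lemmatisers rely on a full dictionary.

```python
# Hypothetical exception table: (word, Penn Treebank tag) -> lemma.
_EXCEPTIONS = {
    ("am", "VBP"): "be",
    ("are", "VBP"): "be",
    ("is", "VBZ"): "be",
    ("arguing", "VBG"): "argue",
}
# Suffix to strip for each regular inflection, chosen by POS tag.
_SUFFIX_BY_POS = {"VBZ": "s", "VBG": "ing", "VBD": "ed", "NNS": "s"}


def toy_lemmatise(word, pos):
    """Return the base form of `word` given its POS tag."""
    word = word.lower()
    if (word, pos) in _EXCEPTIONS:
        return _EXCEPTIONS[(word, pos)]
    suffix = _SUFFIX_BY_POS.get(pos)
    if suffix and word.endswith(suffix) and len(word) > len(suffix):
        return word[: -len(suffix)]
    return word  # e.g. singular nouns are already in base form


lemmas = [
    toy_lemmatise("walks", "VBZ"),
    toy_lemmatise("walking", "VBG"),
    toy_lemmatise("walked", "VBD"),
]
```

Unlike the stemming examples, `argument` tagged as a noun is left intact while `arguing` tagged as a verb maps to the valid word `argue`.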
  • 22. NLP Preprocessing: Form bag of words • Index pre-processed documents and words with id and frequency • e.g. • id:1 word:(train, VBG) freq: 5 • id:2 word:(model, NN) freq: 2 • id:3 word:(train, NN) freq: 3 • … • See the UCI Bag of Words dataset
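A bag-of-words index of this shape can be built with a counter keyed by (word, POS) pairs. This is a minimal sketch; the function name `build_index` and the choice to assign ids in order of first appearance are assumptions made for this example.

```python
from collections import Counter


def build_index(tagged_tokens):
    """Map each (word, POS) pair to (id, frequency).

    Ids start at 1 and follow the order of first appearance.
    """
    counts = Counter(tagged_tokens)  # keeps first-seen order of keys
    return {
        term: (term_id, freq)
        for term_id, (term, freq) in enumerate(counts.items(), start=1)
    }


# Token stream arranged to reproduce the frequencies on the slide.
tagged = [("train", "VBG")] * 5 + [("model", "NN")] * 2 + [("train", "NN")] * 3
index = build_index(tagged)
```

Keeping the POS tag in the key is what lets the verb "train" and the noun "train" receive separate ids.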
  • 23. Solutions • Modelling & Training • Naive Bayes • Latent Semantic Analysis • word2vec, doc2vec, … • Topic Modelling
  • 24. Solutions • Naive Bayes (very old technique) • Use only key words to get probability for K labels • Good for spam detection • Poor performance for news recommendation • Does not capture semantics / topics • https://web.stanford.edu/class/cs124/lec/naivebayes.pdf
  • 25. Solutions • Latent Semantic Analysis (~1990 - 2000) • SVD on a TF-IDF frequency matrix with documents as columns and words as rows • Gives a low-rank approximation of the matrix and represents documents as low-dimensional vectors • Problem: hard to interpret vectors / documents, probability distribution is wrong (Gaussian) • https://nlp.stanford.edu/IR-book/html/htmledition/latent-semantic-indexing-1.html • Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Kathryn B. Laskey and Henri Prade, editors, UAI,
  • 26. Solutions • word2vec, doc2vec (2013~) • Convert words to dense, low-dimensional, compositional vectors (e.g. king - man + woman = queen) • Good for classification problems • Slow to train, hard to interpret (because of neural network), yet to be tested in industrial use cases • Mikolov, Tomas; et al. "Efficient Estimation of Word Representations in Vector Space" ICLR 2013. • Getting started with word2vec
  • 27. Solutions • Topic Models (LDA etc., 2003~) • Define a generative structure involving latent variables (e.g. topics) using well-structured distributions and infer the parameters • Represent documents / words using low-dimensional, highly interpretable distributions • Extensively used in industry. Many open source tools • Extensive research on speeding up / scaling up • D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003 • Tutorial: Parameter estimation for text analysis, Gregor Heinrich 2008
  • 29. Topic Models: Latent Dirichlet Allocation (LDA). Image from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]
  • 30. Topic Models • LDA (Latent Dirichlet Allocation) • Arguably the most popular topic model since 2003 • Created by David Blei, Andrew Ng, Michael Jordan • To be practical we use this topic model in class
  • 31. LDA. Extracted from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]
  • 32. LDA. Extracted from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]
  • 33. LDA. Extracted from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]
  • 34. Example
  • 35. Example. Extracted from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]
  • 36. Example. Extracted from [BleiNgJordan2003, Latent Dirichlet Allocation]
  • 37. LDA • Task: infer parameters • each document’s representation as a topic vector • with this we can compute document similarity! • each topic’s representation by words (counts) • with this we can look at each topic manually and interpret its meaning!
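Once each document is a topic vector, similarity becomes a comparison of short probability distributions. The sketch below uses the Hellinger distance, which is one common choice for distributions and an assumption of this example (the slides do not name a measure). The three topic vectors are made-up values over three hypothetical topics.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


# Hypothetical topic vectors over (software, animals, sport).
doc_linux = [0.70, 0.20, 0.10]
doc_tux = [0.60, 0.30, 0.10]
doc_football = [0.05, 0.05, 0.90]

near = hellinger_distance(doc_linux, doc_tux)
far = hellinger_distance(doc_linux, doc_football)
```

With vectors of a few dozen or a few hundred topics instead of 100k or more vocabulary counts, nearest-neighbour search over articles becomes cheap and the distances track subject matter instead of exact word overlap.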
  • 38. Theory 1: End of class. Questions?
  • 39. Industrial Applications & Use cases • Yi Wang et al. Peacock: Learning Long-Tail Topic Features for Industrial Applications (TIST 2014) • Advertising system in production • Aaron Li et al. High Performance Latent Variable Models (arXiv, 2014) • User preference learning from search data • Arnab Bhadury, Clustering Similar Stories Using LDA • News Recommendation • And many more… search “AliasLDA” or “LightLDA” on Google
  • 40. LDA Inference • Bayes rule: $p(z \mid w) = \frac{p(w \mid z)\,p(z)}{p(w)}$, where $z$ denotes all latent variables and $w$ the observed data
  • 41. LDA Inference • In LDA, the topic assignment for each word is latent • Intractable: $K^L$ terms in the denominator
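The source of the intractability can be written out explicitly. The notation here is an assumption for illustration: $w$ is the observed words, $z = (z_1, \dots, z_L)$ the topic assignments of $L$ word tokens, and $K$ the number of topics. The normaliser sums over every joint assignment, so it has $K^L$ terms.

```latex
p(z \mid w)
  = \frac{p(w \mid z)\, p(z)}
         {\sum_{z' \in \{1,\dots,K\}^{L}} p(w \mid z')\, p(z')}
```

Even a small case such as $K = 10$ topics and $L = 100$ tokens gives $10^{100}$ terms, which is why exact evaluation is ruled out.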
  • 42. LDA Inference • What can we do to address intractability? • Gibbs sampling • Variational inference (not discussed in class)
  • 43. LDA Inference • Estimate the posterior by sampling • Gibbs sampling: sample each latent variable in turn, treating all the other latent variables as known
  • 44. LDA Inference • We can compute this conditional distribution using Bayes rule • The resulting equation is called the “predictive probability” • It can be applied to the latent variable that assigns a topic to each word, i.e. compute the probability that a word is assigned a particular topic, given the other topic assignments and the data (docs, words)
  • 45. LDA Inference: Derive predictive probability
  • 46. LDA Inference
  • 47. LDA Inference: Putting everything together
  • 48. LDA Inference
  • 49. LDA Inference • Terms on the right are known all the time! • We can compute the predictive probability (left term) by normalising over all k’s
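For reference, the predictive probability of collapsed Gibbs sampling for LDA is usually written as below, for example in the Heinrich tutorial cited earlier. The notation is an assumption and may differ from the slides: symmetric Dirichlet priors $\alpha$ and $\beta$, vocabulary size $V$, $n_{d,k}$ the number of words in document $d$ assigned to topic $k$, $n_{k,w}$ the number of times word $w$ is assigned to topic $k$, $n_k = \sum_w n_{k,w}$, and the superscript $-i$ meaning the count excludes the current token $i$.

```latex
p(z_i = k \mid z_{-i}, w) \;\propto\;
  \left(n_{d,k}^{-i} + \alpha\right)
  \frac{n_{k,w_i}^{-i} + \beta}{n_{k}^{-i} + V\beta}
```

Every term on the right is a count that the sampler already maintains, so the probability for each $k$ is obtained by evaluating the expression for all $K$ topics and dividing by their sum.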
  • 50. LDA Inference • Algorithm (Gibbs sampling): • Randomly assign a topic to each word in each doc • For T iterations (a large number to ensure convergence) • For each doc • For each word • For each topic, compute the predictive probability • Sample a topic by normalising over all predictive probabilities • Repeat for T’ iterations (a small number) and compute topic counts per word and per doc. Use them to estimate the document-topic and topic-word distributions
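The algorithm above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in class: documents are lists of integer word ids, the priors `alpha` and `beta` are symmetric with made-up default values, and the final estimates come from the last sample only instead of averaging over T’ extra iterations.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (doc-topic, topic-word) estimates."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_{k,w}
    topic_total = [0] * num_topics                              # n_k
    assignments = []

    # Randomly assign a topic to every word token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised predictive probability for each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # rng.choices normalises the weights before sampling.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed point estimates from the final counts.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return theta, phi


# Hypothetical toy corpus: word ids 0-1 and 2-3 never co-occur.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0]]
theta, phi = lda_gibbs(toy_docs, num_topics=2, vocab_size=4)
```

Each inner step costs time proportional to the number of topics, which is the cost that SparseLDA, AliasLDA and LightLDA set out to reduce.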
  • 51. Speed up LDA (switch to my KDD 2014 slides): https://www.slideshare.net/AaronLi11/kdd-2014-presentation-best-research-paper-award-alias-topic-modelling-reducing-the-sampling-complexity-of-topic-models
  • 52. Theory 2: End of class. Questions?