Fabrikatyr Analytics
Uncover tangible truths amidst the noise of modern media
@conr
#genism
conor@fabrikatyr.com
Agenda
• Explanation of Topic Modelling
• Application using Gensim
• Sample Results
08 Apr 2015
TOPIC MODELLING FOR HUMANS - SUBJECTIVE MODELLING OF ONLINE
DISCUSSION
Explanation of Topic Modelling
A BRIEF INTRODUCTION TO THE SEMANTIC WEB
Why is it Important?
• Discover topics in large groups of documents
• Use these labels to understand the body of text and documents more effectively
What is Semantic Analysis?
Some use cases:
• Consumer Insight
• Recommender
• Social Media Monitoring
What is Topic Modelling?
Grouping documents based on the probability of words occurring in each document
http://people.cs.umass.edu/~wallach/talks/priors.pdf
Transforming raw data to insight for a particular audience is not about algorithms alone
Data
Insight
Good Data Science makes ‘The Gap’ as small as possible
Finding the most suitable application of Topic modelling for ‘discussion’ is critical
Topic modelling
• Semantic: Subject matter corpus, General Corpus
• Statistical: Word probability, Paragraph structure, Word distance
• Mixture of all?
Analysing political debate discourse has the following issues
• Few / little ‘training’ texts
• Highly variable sentence length
• Distinct word distributions
Statistical word probability has readily available implementations and can resolve these challenges
What is Gensim?
Gensim is a free Python library designed to automatically extract semantic topics from
documents, as efficiently (computer-wise) and painlessly (human-wise) as possible. Gensim aims
at processing raw, unstructured digital texts (“plain text”).
• Offers more precise modelling options than ‘topicmodels’ in R or MALLET
• Wider function set
• Somewhat complex to optimise
• Dependencies: numpy and scipy
Radim Rehurek
Application using Gensim
HOW TO USE GENSIM TO UNDERSTAND LARGE VOLUMES OF TEXT EFFECTIVELY
Preparing the data
• No data set is ever ready to operate on ‘out-of-the-box’
• Challenges included:
  • Character encoding
  • Multiple fields in a column
  • Timestamps
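The three challenges above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used in the talk: the row layout (`user|device;timestamp;text`), the field names, and the `clean_row` helper are all assumptions made up for the example.

```python
from datetime import datetime

def clean_row(raw_bytes):
    """Decode one raw export line and split it into typed fields."""
    # Character encoding: try UTF-8 first, then fall back to Latin-1,
    # which can decode any byte sequence.
    try:
        line = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        line = raw_bytes.decode("latin-1")
    # Hypothetical layout: "user|device;2015-04-08 10:15:00;post text"
    user_device, stamp, text = line.strip().split(";", 2)
    # Multiple fields in a column: the first column holds user and device.
    user, device = user_device.split("|", 1)
    return {
        "user": user,
        "device": device,
        # Timestamps: parse the string into a datetime object.
        "posted": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
        "text": text,
    }

# A Latin-1 encoded line (0xe9 is 'é'), which is not valid UTF-8.
row = clean_row(b"anna|iPhone;2015-04-08 10:15:00;Roaming in the caf\xe9 fails")
```

Real exports usually need a proper CSV reader and a more careful encoding check; the fallback here only shows the idea.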
What is a Text Corpus and a ‘Bag-of-Words’?
Bag-of-words (BOW) converts each response into an unordered set of single words
This Method:
• does not parse sentences,
• does not care about word order, and
• does not "understand" grammar or syntax
The optimum number of topics can be selected by calculating the model with the smallest measure of Chaos / Entropy
• Harmonic Mean: “Sum of Lowest average probability” for each topic distribution
• AIC: Balance of “Harmonic mean” against model complexity
• Entropy: Least amount of disorder in the topics
Using Kullback–Leibler divergence we can spot local minima and pick the optimum number based on how many topics we want to ‘name’
Local minima provide a chance to explore the trade-off between granularity and consistency
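The selection step can be sketched as follows. The slides do not say how the divergence score per topic count was computed, so the `scores` values below are made up; only the Kullback–Leibler formula and the search for local minima are shown. `kl_divergence` assumes `q` is non-zero wherever `p` is non-zero.

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()   # normalise so each sums to 1
    mask = p > 0                      # terms with p == 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def local_minima(scores):
    """Return the topic counts whose score is lower than both neighbours."""
    keys = sorted(scores)
    return [
        k for prev, k, nxt in zip(keys, keys[1:], keys[2:])
        if scores[k] < scores[prev] and scores[k] < scores[nxt]
    ]

# Made-up divergence scores for candidate topic counts (illustration only).
scores = {5: 0.90, 10: 0.62, 15: 0.70, 20: 0.55, 25: 0.58, 30: 0.61}
candidates = local_minima(scores)
```

Each candidate is a topic count worth inspecting by hand: a smaller count gives broader, easier-to-name topics, a larger one gives finer granularity.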
Latent Dirichlet Allocation
LDA repeatedly examines the probability of the words in each response and establishes ‘common sets’ (topics)
The words associated with each topic can be extracted
Each comment is then assigned to a single topic
lda.print_topic extracts the words in each topic
np.max gets the most likely topic for each comment
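The assignment step can be sketched with NumPy on a hypothetical document-topic matrix (the numbers are invented). In this sketch `argmax` returns which topic is most likely, while `max` returns how probable it is; in Gensim, `print_topic(topicno, topn=...)` on a trained `LdaModel` returns the topic's top words as a formatted string.

```python
import numpy as np

# Hypothetical document-topic probabilities:
# one row per comment, one column per topic.
doc_topic = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.20, 0.70],
    [0.30, 0.60, 0.10],
])

best_topic = doc_topic.argmax(axis=1)   # index of the most likely topic per comment
best_prob = doc_topic.max(axis=1)       # probability of that topic
```

Keeping `best_prob` alongside the label makes it easy to flag comments that do not clearly belong to any single topic.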
Sample Results
INTERROGATING A COMMUNITY FORUM DISCUSSION
How to use it?
There are 7 key stages to model topics effectively
1. Collate text
2. Create Corpus
3. Create ‘bag of words’
4. Optimum topics
5. Establish keyword groupings
6. Name Topics
7. Visualise

1. Get Data
2. Create Corpus
3. Feature review
4. Optimum topics
5. Review
6. Name & Visualise
7. Deliver insight
Sample set: 11.3K posts to a telco help forum
Corpus
• 5,000 Questions
• 3,000 Users
• 3 years of data
Data Features
• Kudos
• Device
• Thread size
• User Age
• Views
• Maximum user posts
Classifying users will help identify admin versus users
We then use ‘Regression Forest’ to further identify
post features which drive ‘Views’
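A hedged sketch of the idea using scikit-learn's `RandomForestRegressor`. The data below is synthetic and deliberately constructed so that the first feature dominates; it is not the forum data, and the feature names are borrowed from the slide only as labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_posts = 500
features = ["kudos", "thread_size", "user_age"]

# Synthetic data (not the forum data): views are driven mostly by kudos.
X = rng.random((n_posts, len(features)))
views = 100 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 1, n_posts)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, views)

# Impurity-based importances sum to 1; sort from most to least important.
ranking = sorted(
    zip(features, forest.feature_importances_),
    key=lambda pair: -pair[1],
)
```

On real data an outlier group such as admin accounts can dominate the importances, so it is worth comparing the ranking with and without it.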
With the ‘Admin’ outlier removed, ‘Kudos’ seems to be the driving feature
[Charts: Kudos, Response no, User Age, Thread Size]
Optimum topic number across the different user segments ensures our grouping assumptions are reasonable
Using Kullback–Leibler divergence we can spot local minima and pick the optimum number based on how many topics we want to ‘name’
Local minima provide a chance to explore the trade-off between granularity and consistency
[Chart: amount of posts in each topic and length of post, topics T1 to T18]
We examine the structure of the corpus and the lengths of the posts to validate our model
[Charts: Response count; Length of Post]
Word probability distributions, corpus and domain knowledge allow for topics to be named
Topic | Topic Name | Word tokens | Probability
11 | Internet setting | internet, setting, data, phone, work | 30%, 27%, 31%, 12%, 5%
12 | Number Transfer | number, 48, sim, support, old | 36%, 35%, 14%, 6%, 3%
13 | General new account query | phone, sim, go, solution, solved | 26%, 29%, 24%, 9%, 6%
14 | Roaming | text, roaming, call, send, eu | 23%, 25%, 22%, 8%, 5%
15 | General chat | im, like, think, good, dont | 13%, 12%, 10%, 4%, 2%
16 | Referral Bonus | press, key, navi, highlight, select | 27%, 32%, 31%, 15%, 9%
17 | Network Issues | network, phone, problem, im, internet | 12%, 11%, 13%, 5%, 3%
18 | Blackberry Problems | problem, blackberry, mine, get, thanks | 11%, 11%, 13%, 5%, 2%
Posts get ‘views’ for any number of reasons; we need to identify which topics are important
Using Random Forest to predict ‘Views’
Topic ‘name’ | Topic number
Internet setting | 11
Number Transfer | 12
General new account query | 13
Referral Bonus | 16
Network Issues | 17
Blackberry Problems | 18
Only 5 Topics which drive views
This suggests these topics get ‘repeat’ visits
These are NOT the most ‘viewed’ topics, but the ones which people refer to
[Chart: topics in the order 16, 18, 13, 12, 17, 11]
We then compare key topics posts over time to understand the patterns
Using ‘Named Entity Recognition’ (NER), Topic Modelling can be used to understand how consumers are interacting with brands
Brand mentions occur in only 2% of the entire corpus, making any assignment of topics trivial
Conclusion
THINGS TO THINK ABOUT
2nd Generation of ‘Listening’ tools will be less metric and more Qualitative
Context is Key
Blind application of complex modelling will yield results which deliver incorrect classification
The final deliverable and key features must be defined before embarking on the analysis
There is an infinite amount of data; harvesting it is the key
Appendix
GENSIM
Comparison of LDA implementations
• Gensim in Python currently has the most extensive set of parameters; however, topicmodels in R has some good visualisation examples
• ‘Online’ LDA implementations are crucial for ‘social listening’ for evolving political commentary
‘Honourable mention’ implementations
• Vowpal Wabbit – machine learning
• Mallet – Focus on text modelling
• Stanford - great resource
The ‘Number of Topics’ is the key parameter; however, there are a few other parameters which are important.
• Learning rate (decay): to ‘bootstrap’ small bodies of text
• Priors matter: function of document count and length
• ‘Passes’ of the Bayesian sampling function can also affect the model
The Model still needs to be visualised
Again we use Kullback-Leibler divergence to map the topics against each other. Each word has a measure of Saliency.
Saliency is a compromise between a word's overall frequency and its distinctiveness. A word's distinctiveness is a measure of that word's distribution over topics.
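One common way to make this concrete is the definition from Chuang et al.'s Termite work: distinctiveness is the Kullback-Leibler divergence between P(topic | word) and P(topic), and saliency is P(word) times distinctiveness. The slides do not state which formula was used, so the sketch below is an assumption. The two-topic model is invented, and the function assumes every word has non-zero overall probability.

```python
import numpy as np

def saliency(topic_word, topic_weight):
    """Saliency per word from P(word | topic) rows and P(topic) weights."""
    topic_word = np.asarray(topic_word, dtype=float)       # shape (topics, words)
    topic_weight = np.asarray(topic_weight, dtype=float)   # shape (topics,)
    joint = topic_word * topic_weight[:, None]             # P(word, topic)
    p_word = joint.sum(axis=0)                             # P(word)
    p_topic_given_word = joint / p_word                    # P(topic | word)
    # Distinctiveness: KL divergence between P(topic | word) and P(topic).
    # Terms where P(topic | word) == 0 evaluate to NaN and are set to 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p_topic_given_word * np.log(
            p_topic_given_word / topic_weight[:, None]
        )
    distinctiveness = np.nan_to_num(terms).sum(axis=0)
    return p_word * distinctiveness

# Hypothetical model: 2 equally weighted topics over ["phone", "roaming", "sim"].
scores = saliency([[0.5, 0.0, 0.5], [0.5, 0.5, 0.0]], [0.5, 0.5])
```

Here "phone" is the most frequent word but appears equally in both topics, so its saliency is zero, while "roaming" and "sim" each identify a single topic.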
By visualising the word distributions in each topic we understand them better
Why Priors Matter!
Careful thinking about priors can yield new insights
– e.g., priors and STOPWORD handling are related
For LDA the choice of prior is surprisingly important:
– Asymmetric prior for document-specific topic distributions
– Symmetric prior for topic-specific word distributions
Almost all work on LDA uses symmetric Dirichlet priors
– Two scalar concentration parameters: α and β
● Concentration parameters are usually set heuristically
● Some recent work on inferring optimal concentration parameter values from data (Asuncion et al., 2009)
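To see what symmetric versus asymmetric means in practice, the sketch below draws document-topic mixtures from two Dirichlet priors with NumPy. The concentration values are arbitrary choices for illustration and do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
num_topics = 4

# Symmetric prior: one scalar concentration repeated for every topic.
symmetric_alpha = np.full(num_topics, 0.5)
# Asymmetric prior: some topics are expected to appear more often.
asymmetric_alpha = np.array([2.0, 0.5, 0.25, 0.25])

# Each draw is one document's topic mixture under the given prior.
sym_docs = rng.dirichlet(symmetric_alpha, size=5000)
asym_docs = rng.dirichlet(asymmetric_alpha, size=5000)

sym_mean = sym_docs.mean(axis=0)     # close to 0.25 for every topic
asym_mean = asym_docs.mean(axis=0)   # close to alpha / alpha.sum()
```

Under the asymmetric prior the first topic absorbs most of the mass on average, which is one way very common words can be soaked up by a few frequent topics.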

一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 

Topic Modelling to identify behavioral trends in online communities

  • 1. Fabrikatyr Analytics Uncover tangible truths amidst the noise of modern media
  • 2. Agenda: Explanation of Topic Modelling; Application using Gensim; Sample Results. Contact: @conr, #genism, conor@fabrikatyr.com. 08 Apr 2015, TOPIC MODELLING FOR HUMANS - SUBJECTIVE MODELLING OF ONLINE DISCUSSION
  • 3. Explanation of Topic Modelling: A BRIEF INTRODUCTION TO THE SEMANTIC WEB
  • 4. Why is it Important? • Discover topics in large groups of documents • Use these labels to understand the body of text and documents more effectively. What is Semantic Analysis? Some use cases: • Consumer Insight • Recommender • Social Media Monitoring
  • 5. What is Topic Modelling? Grouping documents based on the probability of words occurring in each document. Source: http://people.cs.umass.edu/~wallach/talks/priors.pdf
  • 6. Transforming raw data to insight for a particular audience is not about algorithms alone. Diagram labels: Data, Insight. Good Data Science makes ‘The Gap’ as small as possible
  • 7. Finding the most suitable application of Topic modelling for ‘discussion’ is critical. Topic modelling approaches: Semantic (Subject matter corpus, General Corpus); Statistical (Word probability, Paragraph structure, Word distance); Mixture of all? Analysing political debate discourse has the following issues: • Few / little ‘training’ texts • Highly variable sentence length • Distinct word distributions. Statistical word probability has readily available implementations and can resolve these challenges
  • 8. What is Gensim? “Gensim is a free Python library designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible. Gensim aims at processing raw, unstructured digital texts (“plain text”).” (Radim Rehurek) • Offers more precise modelling options than ‘topicModels’ in R or MALLET • Wider function set • Somewhat complex to optimise • Dependencies: numpy and scipy
  • 9. Application using Gensim: HOW TO USE GENSIM TO UNDERSTAND LARGE VOLUMES OF TEXT EFFECTIVELY
  • 10. Preparing the data • No data set is ever ready to operate on ‘out-of-the-box’ • Challenges included: character encoding, multiple fields in a column, timestamps
  • 11. What is a Text Corpus and a ‘Bag-of-Words’? Bag-of-words (BOW) converts each response into a set of unordered single words. This method: • does not parse sentences, • does not care about word order, and • does not "understand" grammar or syntax
  • 12. The optimum number of topics can be selected by calculating the model with the smallest measure of Chaos / Entropy. Candidate measures: • Harmonic Mean: “Sum of Lowest average probability” for each topic distribution • AIC: balance of “Harmonic mean” against model complexity • Entropy: least amount of disorder in the topics. Using Kullback–Leibler divergence we can spot local minima and pick the optimum number based on how many topics we want to ‘name’. Local minima provide a chance to explore the trade-off between granularity and consistency
  • 13. Latent Dirichlet Allocation: LDA repeatedly examines the probability of the words in each response and establishes ‘common sets’ (topics)
  • 14. The associated topic words can be extracted. Each comment is assigned to a single topic. LDA.print_topic extracts the words in each topic; NP.Max gets the most likely topic for each comment
  • 15. Sample Results: INTERROGATING A COMMUNITY FORUM DISCUSSION
  • 16. How to use it? There are 7 key stages to model topics effectively: 1. Collate text; 2. Create corpus; 3. Create ‘bag of words’; 4. Optimum topics; 5. Establish keyword groupings; 6. Name topics; 7. Visualise. Second diagram: 1. Get data; 2. Create corpus; 3. Feature review; 4. Optimum topics; 5. Review; 6. Name & visualise; 7. Deliver insight
  • 17. Sample set: 11.3K posts to a Telco help forum. Corpus: 5,000 Questions, 3,000 Users, 3 years of data. Data Features: Kudos, Device, Thread size, User Age, Views, Maximum user posts
  • 18. Classifying users will help identify admin versus users
  • 19. We then use ‘Regression Forest’ to further identify post features which drive ‘Views’
  • 20. Removing the ‘Admin’ outlier, ‘Kudos’ seems to be the driving feature. Features shown: Kudos, Response no, User Age, Thread Size
  • 21. Optimum topic number across the different user segments ensures our grouping assumptions are reasonable. Using Kullback–Leibler divergence we can spot local minima and pick the optimum number based on how many topics we want to ‘name’. Local minima provide a chance to explore the trade-off between granularity and consistency
  • 22. Amount of posts in each topic and length of post. We examine the structure of the corpus and the lengths of the posts to validate our model. Chart: topics T1 to T18; series: Response count, Length of Post
  • 23. Word probability distributions, corpus and domain knowledge allow for topics to be named. Table columns: Topic, Topic Name, Word tokens and probability
      11, Internet setting: internet 30%, setting 27%, data 31%, phone 12%, work 5%
      12, Number Transfer: number 36%, 48 35%, sim 14%, support 6%, old 3%
      13, General new account query: phone 26%, sim 29%, go 24%, solution 9%, solved 6%
      14, Roaming: text 23%, roaming 25%, call 22%, send 8%, eu 5%
      15, General chat: im 13%, like 12%, think 10%, good 4%, dont 2%
      16, Referral Bonus: press 27%, key 32%, navi 31%, highlight 15%, select 9%
      17, Network Issues: network 12%, phone 11%, problem 13%, im 5%, internet 3%
      18, Blackberry Problems: problem 11%, blackberry 11%, mine 13%, get 5%, thanks 2%
  • 24. Posts get ‘views’ for any number of reasons; we need to identify which topics are important. Using Random Forest for predicting ‘Views’. Topic ‘name’ (topic number): Internet setting (11), Number Transfer (12), General new account query (13), Referral Bonus (16), Network Issues (17), Blackberry Problems (18). Only 5 Topics which drive views. This suggests these topics get ‘repeat’ visits. These are NOT the most ‘viewed’ topics, but the ones which people refer to. Chart order: 16, 18, 13, 12, 17, 11
  • 25. We then compare posts in key topics over time to understand the patterns
  • 26. Using ‘Named Entity Recognition’ (NER), Topic Modelling can be used to understand how consumers are interacting with brands. Brand mentions only occur in 2% of the entire corpus, making any assignment of topics trivial
  • 27. Conclusion: THINGS TO THINK ABOUT
  • 28. 2nd Generation of ‘Listening’ tools will be less metric and more Qualitative
  • 29. Context is Key. Blind application of complex modelling will yield results which deliver incorrect classification. The final deliverable and key features must be defined before embarking on the analysis
  • 30. There is an infinite amount of data; harvesting it is the key
  • 31. Appendix: GENSIM
  • 32. Comparison of LDA implementations. The ‘Number of Topics’ is the key parameter; however, there are a few other parameters which are important: • Priors matter: a function of document count and length • Learning rate (decay): to ‘bootstrap’ small bodies of text • ‘Passes’ of the Bayesian sampling function can also affect the model. • Gensim in Python currently has the most extensive set of parameters; however, topicmodels in R has some good visualisation examples • ‘Online’ LDA implementations are crucial for ‘social listening’ for evolving political commentary. ‘Honourable mention’ implementations: • Vowpal Wabbit – machine learning • Mallet – focus on text modelling • Stanford – great resource
  • 33. The Model still needs to be visualised. Again we use Kullback–Leibler divergence to map the topics against each other. Each word has a measure of Saliency. Saliency is a compromise between a word's overall frequency and its distinctiveness. A word's distinctiveness is a measure of that word's distribution over topics. By visualising the word distributions in each topic we understand them better
  • 34. Why Priors Matter! Careful thinking about priors can yield new insights – e.g., priors and STOPWORD handling are related. For LDA the choice of prior is surprisingly important: – Asymmetric prior for document-specific topic distributions – Symmetric prior for topic-specific word distributions. Almost all work on LDA uses symmetric Dirichlet priors – Two scalar concentration parameters: α and β ● Concentration parameters are usually set heuristically ● Some recent work on inferring optimal concentration parameter values from data (Asuncion et al., 2009)
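
The distinction on slide 34 between symmetric and asymmetric priors can be illustrated with a minimal sketch. It uses only the Python standard library rather than any LDA package, and the concentration values below are illustrative assumptions, not figures from the slides. It draws document-topic distributions from a Dirichlet prior and compares the average topic weights.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def mean_topic_weights(alpha, n_docs=2000, seed=0):
    """Average topic weight over n_docs simulated documents."""
    rng = random.Random(seed)
    sums = [0.0] * len(alpha)
    for _ in range(n_docs):
        for k, p in enumerate(sample_dirichlet(alpha, rng)):
            sums[k] += p
    return [s / n_docs for s in sums]


num_topics = 4
# Symmetric prior: every topic is equally likely a priori.
symmetric_alpha = [0.5] * num_topics
# Asymmetric prior (hypothetical values): topic 0 is expected to be
# more common across documents than topics 2 and 3.
asymmetric_alpha = [1.0, 0.5, 0.25, 0.25]

sym_mean = mean_topic_weights(symmetric_alpha)
asym_mean = mean_topic_weights(asymmetric_alpha)

print("symmetric :", [round(m, 2) for m in sym_mean])
print("asymmetric:", [round(m, 2) for m in asym_mean])
```

Under the symmetric prior the average weights come out roughly equal, while the asymmetric prior lets a few topics absorb more of each document. In gensim, the corresponding knobs are the `alpha` and `eta` arguments of `LdaModel`.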

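Slide 33 describes saliency as a compromise between a word's overall frequency and its distinctiveness. The sketch below assumes one common formulation, in which distinctiveness is the Kullback–Leibler divergence between P(topic | word) and P(topic), and saliency is P(word) times that divergence; the slides do not state which formula was used. The toy topic-word probabilities are made up for illustration.

```python
import math


def saliency(topic_word, topic_prior):
    """Saliency per word.

    topic_word[t][w] = P(w | t); topic_prior[t] = P(t).
    distinctiveness(w) = sum_t P(t | w) * log(P(t | w) / P(t))
    saliency(w)        = P(w) * distinctiveness(w)
    """
    vocab = sorted({w for dist in topic_word for w in dist})
    scores = {}
    for w in vocab:
        # Joint P(w, t) for each topic, and the marginal P(w).
        joint = [dist.get(w, 0.0) * p_t
                 for dist, p_t in zip(topic_word, topic_prior)]
        p_w = sum(joint)
        if p_w == 0.0:
            scores[w] = 0.0
            continue
        distinctiveness = 0.0
        for p_wt, p_t in zip(joint, topic_prior):
            p_t_given_w = p_wt / p_w
            if p_t_given_w > 0.0:
                distinctiveness += p_t_given_w * math.log(p_t_given_w / p_t)
        scores[w] = p_w * distinctiveness
    return scores


# Hypothetical two-topic model over a tiny vocabulary.
topics = [
    {"roaming": 0.5, "phone": 0.3, "text": 0.2},
    {"network": 0.6, "phone": 0.3, "problem": 0.1},
]
prior = [0.5, 0.5]
scores = saliency(topics, prior)

for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {score:.4f}")
```

In this toy model "phone" is frequent but equally likely under both topics, so it carries no information about the topic and its saliency is zero, whereas "network" and "roaming" each point to a single topic.
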
Editor's Notes

  1. In gensim, a corpus is an iterable that returns its documents as sparse vectors. (A sparse vector is just a compact way of storing large vectors that are mostly zeroes.)
  2. The course on Day 1 & 2 is set to weed out those who haven't trained sufficiently or aren’t properly prepared. After day 2 the nerves go away and you actually start to sleep properly, and the 5 degrees of cold doesn’t bother you so much. By day 3 you get accustomed to the terrain and the heat, your training kicks in, and each day seems to get easier. Every day you cross the finish line you think of the run tomorrow as impossible; you can’t do it. Then you get the feet up and have an 800 calorie meal and a pop tart, and you think, eh, maybe I will just walk the 50KM tomorrow. By the time the morning rolls around you rock up to the finish line raring to go, excited to get out into the breath-taking scenery again.
  3. For LDA the choice of prior is surprisingly important: – Asymmetric prior for document-specific topic distributions – Symmetric prior for topic-specific word distributions
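
Note 1 describes a corpus as an iterable that returns its documents as sparse vectors. A minimal standard-library sketch of that idea follows; the class name `StreamingCorpus`, the whitespace tokenisation and the sample posts are assumptions made for illustration, not gensim code.

```python
from collections import Counter


class StreamingCorpus:
    """Iterable that yields one sparse bag-of-words vector per document.

    Each vector is a list of (token_id, count) pairs; tokens that do not
    occur in a document are simply absent, which is what makes it sparse.
    """

    def __init__(self, documents):
        self.documents = documents
        self.token2id = {}
        # Assign an integer id to every distinct token, in first-seen order.
        for doc in documents:
            for token in doc.lower().split():
                self.token2id.setdefault(token, len(self.token2id))

    def __iter__(self):
        for doc in self.documents:
            counts = Counter(self.token2id[t] for t in doc.lower().split())
            yield sorted(counts.items())


# Hypothetical forum posts.
posts = [
    "my phone has no internet",
    "roaming text not sent",
    "phone phone problem",
]
corpus = StreamingCorpus(posts)
vectors = list(corpus)

for vector in vectors:
    print(vector)
```

Word order is discarded and only counts survive, which is the bag-of-words behaviour described on slide 11. In gensim itself, `corpora.Dictionary` and its `doc2bow` method produce the same kind of (id, count) pairs.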