Topic Modelling to identify behavioral trends in online communities

Fabrikatyr Analytics
Uncover tangible truths amidst the noise of modern media

Agenda
• Explanation of Topic Modelling
• Application using Gensim
• Sample Results
@conr
#genism
conor@fabrikatyr.com
08 Apr 2015
TOPIC MODELLING FOR HUMANS - SUBJECTIVE MODELLING OF ONLINE DISCUSSION

Explanation of Topic Modelling
A BRIEF INTRODUCTION TO THE SEMANTIC WEB

Why is it Important?
• Discover topics in large groups of documents
• Use these labels to understand the body of text and documents more effectively
What is Semantic Analysis?
Some use cases:
• Consumer Insight
• Recommender
• Social Media Monitoring

What is Topic Modelling?
Grouping documents based on the probability of words occurring in each document
http://people.cs.umass.edu/~wallach/talks/priors.pdf

Transforming raw data to insight for a particular audience is not about algorithms alone
[Figure: ‘The Gap’ between Data and Insight]
Good Data Science makes ‘The Gap’ as small as possible

Finding the most suitable application of Topic Modelling for ‘discussion’ is critical
Topic modelling
• Semantic
  • Subject matter corpus
  • General corpus
• Statistical
  • Word probability
  • Paragraph structure
  • Word distance
• Mixture of all?
Analysing political debate discourse has the following issues:
• Few ‘training’ texts
• Highly variable sentence length
• Distinct word distributions
Statistical word probability has readily available implementations and can resolve these challenges

What is Gensim?
Gensim is a free Python library designed to automatically extract semantic topics from
documents, as efficiently (computer-wise) and painlessly (human-wise) as possible. Gensim aims
at processing raw, unstructured digital texts (“plain text”).
• Offers more precise modelling options than ‘topicmodels’ in R or MALLET
• Wider function set
• Somewhat complex to optimise
• Dependencies: numpy and scipy
Radim Rehurek

Application using Gensim
HOW TO USE GENSIM TO UNDERSTAND LARGE VOLUMES OF TEXT EFFECTIVELY

Preparing the data
• No data set is ever ready to operate on ‘out-of-the-box’
• Challenges included:
  • Character encoding
  • Multiple fields in a column
  • Timestamps

What is a Text Corpus and a ‘Bag-of-Words’?
Bag-of-words (BOW) converts each response into an unordered collection of single words
This method:
• does not parse sentences,
• does not care about word order, and
• does not "understand" grammar or syntax
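
As a rough illustration of the bag-of-words idea above, the sketch below uses only the Python standard library. The two sample posts and the helper name `bag_of_words` are invented for this example and are not taken from the deck. Gensim offers an equivalent step through `corpora.Dictionary.doc2bow`, which is not shown here.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Return word counts for one response, ignoring order and grammar."""
    # Lower-case the text and keep runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


# Hypothetical forum posts, invented for this example.
post_a = "My phone cannot send a text"
post_b = "Text my phone, a phone cannot send"

bow_a = bag_of_words(post_a)
bow_b = bag_of_words(post_b)

# Same vocabulary, different counts: word order plays no part.
print(bow_a)
print(bow_b)
```

Because only counts survive, two posts that use the same words in a different order end up with the same vocabulary, which is exactly the information a word-probability model works from.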

The optimum number of topics can be selected by calculating the model with the smallest measure of Chaos / Entropy
• Harmonic Mean: “Sum of lowest average probability” for each topic distribution
• AIC: balance of “Harmonic mean” against model complexity
• Entropy: least amount of disorder in the topics
Using Kullback–Leibler divergence we can spot local minima and pick the optimum number based on how many topics we want to ‘name’
Local minima provide a chance to explore the trade-off between granularity and consistency
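
The measures named above can be sketched with the standard library alone. The deck does not say how a divergence score is produced for each candidate model, so the scores below are placeholders rather than results, and the helper names (`entropy`, `kl_divergence`, `local_minima`) are assumptions made for this illustration.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def local_minima(scores):
    """Return the keys whose score is lower than both neighbouring keys."""
    keys = sorted(scores)
    return [
        k
        for prev, k, nxt in zip(keys, keys[1:], keys[2:])
        if scores[k] < scores[prev] and scores[k] < scores[nxt]
    ]


# A peaked topic distribution has less 'disorder' than a flat one.
peaked = [0.85, 0.05, 0.05, 0.05]
flat = [0.25, 0.25, 0.25, 0.25]
print(entropy(peaked), entropy(flat))
print(kl_divergence(peaked, flat))

# Placeholder scores keyed by number of topics (NOT results from the forum
# data); in practice each score would come from a fitted model.
placeholder_scores = {5: 0.90, 10: 0.62, 15: 0.70, 20: 0.55, 25: 0.58, 30: 0.66}
print(local_minima(placeholder_scores))
```

Each local minimum is a candidate topic count; choosing between them is the granularity versus consistency trade-off the slide describes.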

Latent Dirichlet Allocation
LDA repeatedly examines the probability of the words in each response and establishes ‘common sets’ (topics)

The words associated with each topic can be extracted
Each comment is assigned to a single topic
lda.print_topic extracts the words in each topic
np.max gets the most likely topic for each comment

Sample Results
INTERROGATING A COMMUNITY FORUM DISCUSSION

How to use it?
There are 7 key stages to model topics effectively
1. Collate text / Get data
2. Create corpus
3. Create ‘bag of words’ / Feature review
4. Optimum topics
5. Establish keyword groupings / Review
6. Name topics / Name & visualise
7. Visualise / Deliver insight

Sample set: 11.3K posts to a Telco help forum
Corpus
• 5,000 Questions
• 3,000 Users
• 3 years of data
Data Features
• Kudos
• Device
• Thread size
• User Age
• Views
• Maximum user posts

Classifying users will help identify admin versus users

We then use ‘Regression Forest’ to further identify post features which drive ‘Views’

After removing the ‘Admin’ outlier, ‘Kudos’ seems to be the driving feature
[Chart labels: Kudos, Response no, User Age, Thread Size]

Optimum topic number across the different user segments ensures our grouping assumptions are reasonable
Using Kullback–Leibler divergence we can spot local minima and pick the optimum number based on how many topics we want to ‘name’
Local minima provide a chance to explore the trade-off between granularity and consistency

[Chart: number of posts in each topic and length of post, topics T1 to T18]
We examine the structure of the corpus and the lengths of the posts to validate our model
[Chart axes: Response count, Length of Post]

Word probability distributions, corpus and domain knowledge allow for topics to be named

Topic  Topic Name                 Word tokens and probability
11     Internet setting           internet    setting     data        phone       work
                                  30%         27%         31%         12%         5%
12     Number Transfer            number      48          sim         support     old
                                  36%         35%         14%         6%          3%
13     General new account query  phone       sim         go          solution    solved
                                  26%         29%         24%         9%          6%
14     Roaming                    text        roaming     call        send        eu
                                  23%         25%         22%         8%          5%
15     General chat               im          like        think       good        dont
                                  13%         12%         10%         4%          2%
16     Referral Bonus             press       key         navi        highlight   select
                                  27%         32%         31%         15%         9%
17     Network Issues             network     phone       problem     im          internet
                                  12%         11%         13%         5%          3%
18     Blackberry Problems        problem     blackberry  mine        get         thanks
                                  11%         11%         13%         5%          2%

Posts get ‘views’ for any number of reasons; we need to identify which topics are important
Using Random Forest for predicting ‘Views’

Topic ‘name’                 Topic number
Internet setting             11
Number Transfer              12
General new account query    13
Referral Bonus               16
Network Issues               17
Blackberry Problems          18

Only 5 topics drive views
This suggests these topics get ‘repeat’ visits
These are NOT the most ‘viewed’ topics, but the ones which people refer to
[Chart axis: topics 16, 18, 13, 12, 17, 11]

We then compare posts in key topics over time to understand the patterns

Using ‘Named Entity Recognition’ (NER), Topic Modelling can be used to understand how consumers are interacting with brands
Brand mentions occur in only 2% of the entire corpus, making any assignment of topics trivial

Conclusion
THINGS TO THINK ABOUT

2nd Generation of ‘Listening’ tools will be less metric and more Qualitative

Context is Key
• Blind application of complex modelling will yield results which deliver incorrect classification
• The final deliverable and key features must be defined before embarking on the analysis

There is an infinite amount of data; harvesting it is the key

Appendix
GENSIM

Comparison of LDA implementations
The ‘Number of Topics’ is the key parameter; however, there are a few other parameters which are important:
• Learning rate (decay)
• To ‘bootstrap’ small bodies of text
• ‘Passes’ of the Bayesian sampling function can also affect the model
• Priors matter
• Function of document count and length
Implementations:
• Gensim in Python currently has the most extensive set of parameters; however, topicmodels in R has some good visualisation examples
• ‘Online’ LDA implementations are crucial for ‘social listening’ to evolving political commentary
‘Honourable mention’ implementations
• Vowpal Wabbit – machine learning
• MALLET – focus on text modelling
• Stanford – great resource

The Model still needs to be visualised
Again we use Kullback–Leibler divergence to map the topics against each other. Each word has a measure of Saliency.
Saliency is a compromise between a word's overall frequency and its distinctiveness. A word's distinctiveness is a measure of that word's distribution over topics.
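
One common formalisation of these two quantities comes from Chuang, Manning and Heer's Termite work: distinctiveness(w) = Σ_T P(T|w) · log(P(T|w) / P(T)), and saliency(w) = P(w) · distinctiveness(w). The deck does not state which formula it uses, so treating this definition as the intended one is an assumption. The toy distributions and the function name `saliency` are invented for the sketch.

```python
import math


def saliency(word_given_topic, topic_prob):
    """Return {word: saliency} from P(word | topic) and P(topic).

    word_given_topic: one dict per topic mapping word -> P(word | topic)
    topic_prob:       list of P(topic), in the same order
    """
    vocab = set().union(*word_given_topic)
    scores = {}
    for w in vocab:
        # Joint P(word, topic) per topic and the overall P(word).
        joint = [
            p_t * dist.get(w, 0.0)
            for dist, p_t in zip(word_given_topic, topic_prob)
        ]
        p_w = sum(joint)
        if p_w == 0:
            scores[w] = 0.0
            continue
        # Distinctiveness: KL divergence between P(topic | word) and P(topic).
        distinct = sum(
            (j / p_w) * math.log((j / p_w) / p_t)
            for j, p_t in zip(joint, topic_prob)
            if j > 0
        )
        scores[w] = p_w * distinct
    return scores


# Toy distributions, invented for illustration only.
topics = [
    {"roaming": 0.5, "phone": 0.5},
    {"sim": 0.5, "phone": 0.5},
]
scores = saliency(topics, [0.5, 0.5])
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

In this toy case "phone" is frequent but spread evenly over both topics, so it scores zero, while the rarer "roaming" and "sim" each belong to one topic and score higher.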
By visualising the word distributions in each topic we understand them better

Why Priors Matter!
Careful thinking about priors can yield new insights
– e.g., priors and STOPWORD handling are related
For LDA the choice of prior is surprisingly important:
– Asymmetric prior for document-specific topic distributions
– Symmetric prior for topic-specific word distributions
Almost all work on LDA uses symmetric Dirichlet priors
– Two scalar concentration parameters: α and β
● Concentration parameters are usually set heuristically
● Some recent work on inferring optimal concentration parameter values from data (Asuncion et al., 2009)
