1. Topic modeling is an unsupervised machine learning technique that analyzes documents to discover hidden topics.
2. It outputs a set of topics, where each topic is represented as a collection of related words.
3. It also assigns probabilities to topics for each document, indicating how strongly the document relates to different topics.
2. BigML, Inc X#MLSEV: Association Discovery
Association Discovery
Rule-Based Machine Learning
Charles Parker
VP Machine Learning Algorithms
3. BigML, Inc X#MLSEV: Association Discovery
Association Discovery
An unsupervised learning technique
• No labels necessary
• Useful for data discovery
Finds "significant" correlations/associations/relations
• Shopping cart: Coffee and sugar
• Medical: High plasma glucose and diabetes
Expresses them as "if-then" rules
• If "antecedent" then "consequent"
4. BigML, Inc X#MLSEV: Association Discovery
Review of methods: clustering
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thr   Sally     6788     sign  food     26339  51
(The next build of this slide highlights rows that are "similar".)
6. BigML, Inc X#MLSEV: Association Discovery
Review: anomaly detection
(Same transaction table as in the clustering review. The next build of this slide highlights an "anomaly".)
8. BigML, Inc X#MLSEV: Association Discovery
Association Discovery
(Same transaction table as in the clustering review. Successive builds of this slide highlight values that occur together:)
{customer = Bob, account = 3421}
zip = 46140
{class = gas}
amount < 100
13. BigML, Inc X#MLSEV: Association Discovery
Association Discovery
(Same transaction table as in the clustering review.)
Rules:
Antecedent                        Consequent
{customer = Bob, account = 3421}  zip = 46140
{class = gas}                     amount < 100
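As a rough illustration, the two rules can be checked against the transaction table in Python. This is a minimal sketch, not BigML's implementation: it assumes the rules pair up as "{customer = Bob, account = 3421} then zip = 46140" and "{class = gas} then amount < 100", and the helper name `rule_counts` is made up for this example.

```python
# Transactions copied from the slide's table.
COLUMNS = ["date", "customer", "account", "auth", "class", "zip", "amount"]
ROWS = [
    ("Mon", "Bob", 3421, "pin", "clothes", 46140, 135),
    ("Tue", "Bob", 3421, "sign", "food", 46140, 401),
    ("Tue", "Alice", 2456, "pin", "food", 12222, 234),
    ("Wed", "Sally", 6788, "pin", "gas", 26339, 94),
    ("Wed", "Bob", 3421, "pin", "tech", 21350, 2459),
    ("Wed", "Bob", 3421, "pin", "gas", 46140, 83),
    ("Thr", "Sally", 6788, "sign", "food", 26339, 51),
]
TRANSACTIONS = [dict(zip(COLUMNS, row)) for row in ROWS]


def rule_counts(rows, antecedent, consequent):
    """Return (rows matching the antecedent, rows matching both sides)."""
    matched = [r for r in rows if antecedent(r)]
    both = [r for r in matched if consequent(r)]
    return len(matched), len(both)


# Assumed pairing: {customer = Bob, account = 3421} -> zip = 46140
bob_rule = rule_counts(
    TRANSACTIONS,
    lambda r: r["customer"] == "Bob" and r["account"] == 3421,
    lambda r: r["zip"] == 46140,
)
# Assumed pairing: {class = gas} -> amount < 100
gas_rule = rule_counts(
    TRANSACTIONS,
    lambda r: r["class"] == "gas",
    lambda r: r["amount"] < 100,
)
print(bob_rule)  # (4, 3)
print(gas_rule)  # (2, 2)
```

The first rule holds in 3 of the 4 rows where its antecedent matches, and the second in both rows where it matches, which is why neither needs to be a strict "always" to be worth reporting.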
14. BigML, Inc X#MLSEV: Association Discovery
Use Cases
• Data Discovery: how do instances relate?
• Market Basket Analysis: Items that go together
• Behaviors that occur together
• Web usage patterns
• Intrusion detection
• Fraud detection
• Medical risk factors
15. BigML, Inc X#MLSEV: Association Discovery
Association Metrics
• Coverage
• Support
• Confidence
• Lift
• Leverage
Associations between grocery items
16. BigML, Inc X#MLSEV: Association Discovery
Association Metrics: coverage
Coverage
Percentage of instances which match antecedent “A”
(Venn diagram: Instances, A, C)
17. BigML, Inc X#MLSEV: Association Discovery
Association Metrics: support
Support
Percentage of instances which match antecedent “A” and consequent “C”
(Venn diagram: Instances, A, C)
18. BigML, Inc X#MLSEV: Association Discovery
Association Metrics: confidence
Confidence
Percentage of instances in the antecedent which also contain the consequent, i.e. Support / Coverage.
(Venn diagram: Instances, A, C, with Coverage and Support marked)
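The definitions of coverage, support and confidence can be sketched in a few lines of Python. This is only an illustration: the baskets are invented, and the function names are hypothetical rather than taken from any BigML API.

```python
def coverage(transactions, antecedent):
    """Fraction of transactions that contain every item of the antecedent."""
    return sum(antecedent <= t for t in transactions) / len(transactions)


def support(transactions, antecedent, consequent):
    """Fraction of transactions that contain antecedent and consequent."""
    both = antecedent | consequent
    return sum(both <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support divided by coverage (0.0 if the antecedent never matches)."""
    cov = coverage(transactions, antecedent)
    return support(transactions, antecedent, consequent) / cov if cov else 0.0


# Made-up shopping baskets, each a set of items.
BASKETS = [
    {"coffee", "sugar"},
    {"coffee", "sugar", "milk"},
    {"coffee", "bread"},
    {"milk", "bread"},
    {"sugar", "honey"},
]

A, C = {"coffee"}, {"sugar"}
print(coverage(BASKETS, A))       # 3 of 5 baskets contain coffee
print(support(BASKETS, A, C))     # 2 of 5 contain coffee and sugar
print(confidence(BASKETS, A, C))  # 2 of the 3 coffee baskets contain sugar
```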
19. BigML, Inc X#MLSEV: Association Discovery
Association Metrics: confidence
Confidence scale, from 0% to 100%:
• 0%: A never implies C
• In between: A sometimes implies C
• 100%: A always implies C
(Venn diagram panels: A >> C, A = C, A << C)
20. BigML, Inc X#MLSEV: Association Discovery
Association Metrics: lift
Lift
Ratio of observed support to support if A and C were statistically independent.
Lift = Support / [ p(A) * p(C) ] = Confidence / p(C)
(Venn diagrams: Independent vs. Observed overlap of A and C)
Problem: if p(C) is "small" then lift may be large.
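A short Python sketch makes the lift formula and the "small p(C)" problem concrete. The probabilities are invented for illustration, and exact fractions are used so the two forms of the formula can be compared without rounding.

```python
from fractions import Fraction


def lift(p_a, p_c, p_ac):
    """Observed support p(A and C) over the support expected if independent."""
    return p_ac / (p_a * p_c)


def lift_from_confidence(p_a, p_c, p_ac):
    """Same quantity written as confidence / p(C)."""
    confidence = p_ac / p_a
    return confidence / p_c


# A common pair: A and C each appear in 60% of instances, together in 40%.
common = lift(Fraction(3, 5), Fraction(3, 5), Fraction(2, 5))

# A rare pair: A and C each appear in 1% of instances, always together.
rare = lift(Fraction(1, 100), Fraction(1, 100), Fraction(1, 100))

print(common)  # 10/9
print(rare)    # 100
```

The rare pair gets a lift of 100 even though it covers only 1% of the data, so lift is usually read alongside support rather than on its own.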
21. BigML, Inc X#MLSEV: Association Discovery
Association Metrics: lift
• Lift < 1: Negative Correlation
• Lift = 1: No Correlation
• Lift > 1: Positive Correlation
(Each case compares Venn diagrams of A and C, Independent vs. Observed)
22. BigML, Inc X#MLSEV: Association Discovery
Association Metrics: leverage
Leverage
Difference of observed support and support if A and C were statistically independent.
Leverage = Support - [ p(A) * p(C) ]
(Venn diagrams: Independent vs. Observed overlap of A and C)
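The leverage formula can be tried out the same way. The sketch below uses invented probabilities and exact fractions, and shows how the sign of the result reads as negative, zero or positive correlation.

```python
from fractions import Fraction


def leverage(p_a, p_c, p_ac):
    """Observed support p(A and C) minus the support expected if independent."""
    return p_ac - p_a * p_c


half = Fraction(1, 2)

independent = leverage(half, half, Fraction(1, 4))              # observed == expected
positive = leverage(Fraction(3, 5), Fraction(3, 5), Fraction(2, 5))
negative = leverage(half, half, Fraction(1, 10))                # rarely together
rare = leverage(Fraction(1, 100), Fraction(1, 100), Fraction(1, 100))

print(independent)  # 0
print(positive)     # 1/25
print(negative)     # -3/20
print(rare)         # 99/10000
```

Because leverage is a difference rather than a ratio, a rare pair that always occurs together still scores below 0.01 here, while a frequent pair with a modest association scores higher.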
23. BigML, Inc X#MLSEV: Association Discovery
Association Metrics: leverage
C
Observed
A
Observed
A
C
< 0 > 0
Independent
A
C
Leverage = 0
Negative
Correlation
No Correlation
Positive
Correlation
Independent
A
C
Independent
A
C
Observed
A
C
-1…
24. BigML, Inc X#MLSEV: Association Discovery
Items Type
items: coffee, sugar, milk, honey, dish soap, bread
• Canonical example: shopping cart contents
• Single feature describing a list of items
• Each item separated by a comma (default)
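A field of this kind can be turned into a set of items with a few lines of Python. This is a sketch of the idea only; `parse_items` is a hypothetical helper, not a BigML function.

```python
def parse_items(field, separator=","):
    """Split one items field into a set of trimmed, non-empty item names."""
    return {item.strip() for item in field.split(separator) if item.strip()}


cart = parse_items("coffee, sugar, milk, honey, dish soap, bread")
print(sorted(cart))
# ['bread', 'coffee', 'dish soap', 'honey', 'milk', 'sugar']
```

Note that "dish soap" stays a single item because only the separator splits the field, not whitespace.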
25. BigML, Inc X#MLSEV: Association Discovery
Use Cases
GOAL: Discover “interesting” rules about what store items
are typically purchased together.
• Dataset of 9,834 grocery cart transactions
• Each row is a list of all items in a cart at checkout
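To give a feel for what rule discovery over carts involves, here is a brute-force Python sketch. It only considers single-item antecedents and consequents, and the six carts and both thresholds are invented. It is not the algorithm BigML uses, and it would not scale to the 9,834-transaction dataset without a smarter search.

```python
from itertools import permutations


def discover_rules(carts, min_support=0.3, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(carts)
    items = sorted(set().union(*carts))
    rules = []
    for a, c in permutations(items, 2):
        n_a = sum(a in cart for cart in carts)
        n_ac = sum(a in cart and c in cart for cart in carts)
        support = n_ac / n
        confidence = n_ac / n_a if n_a else 0.0
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, c, support, confidence))
    # Strongest rules first; ties broken alphabetically for stable output.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Made-up carts, one set of items per checkout.
CARTS = [
    {"coffee", "sugar", "milk"},
    {"coffee", "sugar"},
    {"bread", "milk"},
    {"coffee", "sugar", "bread"},
    {"milk", "honey"},
    {"bread", "milk", "coffee"},
]

for a, c, sup, conf in discover_rules(CARTS):
    print(f"if {a} then {c}  (support {sup:.2f}, confidence {conf:.2f})")
```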
27. BigML, Inc X#MLSEV: Association Discovery
Summary
• Unsupervised learning technique for discovering
interesting associations
• Outputs antecedent/consequent rules
• Metrics: Support / Coverage / Confidence / Lift / Leverage
• Useful for “items” type and market basket analysis
• Applicable to understanding clusters and anomaly detectors
28. BigML, Inc X#MLSEV: Topic Models
Topic Models
“One of these things is not like the other things . . .”
Charles Parker
VP Machine Learning Algorithms
29. BigML, Inc X#MLSEV: Topic Models
What is Topic Modeling?
• Unsupervised algorithm
• Learns only from text fields
• Finds hidden topics that model the text
Questions:
• How is this different from the Text Analysis that BigML already offers?
• What does it output and how do we use it?
30. BigML, Inc X#MLSEV: Topic Models
What is Topic Modeling?
• Finds topics in your text fields
• A topic is a distribution over terms
• Terms with high probability in the same topic often occur
together in the same document
• Topics often correspond to real-world things that the
document may be “about” (e.g., sports, cooking,
technology)
• Each document is “about” one or more topics
• Usually each document is only about one or two topics
• But in practice we assign a probability to every topic for
every document
31. BigML, Inc X#MLSEV: Topic Models
Text Analysis
Be not afraid of greatness:
some are born great, some
achieve greatness, and
some have greatness
thrust upon 'em.
great: appears 4 times
1. Stem Words -> Tokens
2. Remove tokens that
occur too often
3. Remove tokens that do
not occur often enough
4. Count occurrences of
remaining “interesting”
tokens
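The four steps can be sketched in Python as follows. Several things here are assumptions made for illustration: the `stem` function is a deliberately naive stand-in for a real stemmer, the two extra sentences in the corpus are invented, and the frequency thresholds are arbitrary.

```python
import re
from collections import Counter

QUOTE = ("Be not afraid of greatness: some are born great, some achieve "
         "greatness, and some have greatness thrust upon 'em.")


def stem(word):
    """Naive stand-in for a real stemmer: only strips a trailing 'ness'."""
    return word[:-4] if word.endswith("ness") and len(word) > 4 else word


def tokenize(text):
    """Step 1: lowercase, split into words, and stem each word into a token."""
    return [stem(w) for w in re.findall(r"[a-z']+", text.lower())]


def interesting_counts(docs, min_docs=2, max_doc_fraction=0.9):
    """Steps 2-4: drop tokens found in too many or too few documents,
    then count the remaining tokens per document."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    keep = {tok for tok, df in doc_freq.items()
            if df >= min_docs and df / len(docs) <= max_doc_fraction}
    return [Counter(tok for tok in toks if tok in keep) for toks in tokenized]


# The second and third documents are made up to give the filters a corpus.
CORPUS = [
    QUOTE,
    "Some cats are born afraid of water.",
    "Some dogs achieve great things and are not shy.",
]

counts = interesting_counts(CORPUS)
print(counts[0]["great"])  # 4
```

With these thresholds, "some" is dropped because it appears in every document, and "thrust" is dropped because it appears in only one.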
32. BigML, Inc X#MLSEV: Topic Models
Text Analysis
Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.
…  great  afraid  born  achieve  …  …
…  4      1       1     1        …  …
…  …      …       …     …        …  …
Model
• The token “great” occurs more than 3 times
• The token “afraid” occurs no more than once
35. BigML, Inc X#MLSEV: Topic Models
Text Analysis vs. Topic Modeling
Text Analysis:
• Creates thousands of hidden token counts
• Token counts are independently uninteresting
• No semantic importance
• Co-occurrence limited to consecutive n-grams
Topic Model:
• Creates tens of topics that model the text
• Topics are independently interesting
• Semantic meaning extracted
• Topics indicate broader co-occurrences
36. BigML, Inc X#MLSEV: Topic Models
Generating Documents
• "Machine" that generates a random word with equal probability with each pull.
• Pull random number of times to generate a document.
• All documents can be generated, but most are nonsense.
Machine vocabulary: cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…
Generated documents:
• shoe asteroid flashlight pizza…
• plate giraffe purple jump…
• Be not afraid of greatness: some are born great, some achieve greatness…
word        probability
shoe        ϵ
asteroid    ϵ
flashlight  ϵ
pizza       ϵ
…           ϵ
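The uniform word "machine" is easy to imitate in Python. This is a toy sketch: the vocabulary is limited to the words visible on the slide, and the maximum document length is an arbitrary assumption.

```python
import random

# Words shown on the slide; the real machine would hold a full vocabulary.
VOCABULARY = [
    "cat", "shoe", "zebra", "ball", "tree", "jump", "pen", "asteroid",
    "cable", "box", "step", "cabinet", "yellow", "plate", "flashlight",
]


def generate_document(vocabulary, rng, max_words=10):
    """Pull the machine a random number of times; every word is equally likely."""
    length = rng.randint(1, max_words)
    return [rng.choice(vocabulary) for _ in range(length)]


rng = random.Random(0)  # fixed seed so the run is repeatable
document = generate_document(VOCABULARY, rng)
print(" ".join(document))
```

Any word sequence can come out of this machine, but nearly all of them are nonsense, which is the gap that topics are meant to close.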
37. BigML, Inc X#MLSEV: Topic Models
Topic Model
Intuition:
• Written documents have meaning; one way to describe meaning is to assign a topic.
• For our random machine, the topic can be thought of as increasing the probability of certain words.
(Each topic is drawn as the same machine: cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…)
Topic: travel
Generated document: airplane passport pizza …
word      probability
travel    23.55%
airplane  2.33%
mars      0.003%
mantle    ϵ
…         ϵ
Topic: space
Generated document: mars quasar lightyear soda
word      probability
space     38.94%
airplane  ϵ
mars      13.43%
mantle    0.05%
…         ϵ
38. BigML, Inc X#MLSEV: Topic Models
Topic Model
• Each text field in a row is concatenated into a document
• The documents are analyzed to generate "k" related topics
• Each topic is represented by a distribution of term probabilities
Example documents: "plate giraffe purple jump…", "airplane passport pizza …"
Topic: "1"
word      probability
travel    23.55%
airplane  2.33%
mars      0.003%
mantle    ϵ
…         ϵ
Topic: "2"
word      probability
space     38.94%
airplane  ϵ
mars      13.43%
mantle    0.05%
…         ϵ
…
Topic: "k"
word       probability
shoe       12.12%
coffee     3.39%
telephone  13.43%
paper      4.11%
…          ϵ
40. BigML, Inc X#MLSEV: Topic Models
Topic Distribution
Intuition:
• Any given document is likely a mixture of the modeled topics…
• This can be represented as a distribution of topic probabilities
Example document: "Will 2020 be the year that humans will embrace space exploration and finally travel to Mars?"
Topic distribution for this document:
• Topic: travel, 11%
• Topic: space, 89%
Topic: travel
word      probability
travel    23.55%
airplane  2.33%
mars      0.003%
mantle    ϵ
…         ϵ
Topic: space
word      probability
space     38.94%
airplane  ϵ
mars      13.43%
mantle    0.05%
…         ϵ
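The mixture idea can be illustrated by running the word machine "in reverse": pick a topic according to the document's topic distribution, then pick a word according to that topic's term probabilities. The sketch below uses the numbers from the slide, but it keeps only the listed words and treats their percentages as relative weights, which is a simplifying assumption. It shows the generative intuition only, not how a topic model is fitted.

```python
import random

# Term weights from the slide (in percent); words shown as ϵ are left out.
TOPICS = {
    "travel": {"travel": 23.55, "airplane": 2.33, "mars": 0.003},
    "space": {"space": 38.94, "mars": 13.43, "mantle": 0.05},
}

# Topic distribution of the example document about travelling to Mars.
DOC_TOPIC_MIX = {"travel": 0.11, "space": 0.89}


def generate_words(topic_mix, topics, n_words, rng):
    """For each word: draw a topic from the mix, then a term from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(topic_mix), weights=list(topic_mix.values()))[0]
        terms = topics[topic]
        words.append(rng.choices(list(terms), weights=list(terms.values()))[0])
    return words


sample = generate_words(DOC_TOPIC_MIX, TOPICS, 20, random.Random(0))
print(" ".join(sample))
```

Since the document is 89% "space", most generated words come from the space topic, with the occasional travel term mixed in.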
42. BigML, Inc X#MLSEV: Topic Models
Prediction?
Clustering: Unlabelled Data -> Batch Centroid -> Centroid Label
Topic Model: Unlabelled Data (Text Fields) -> Batch Topic Distribution -> topic 1 prob, topic 3 prob, …, topic k prob
43. BigML, Inc X#MLSEV: Topic Models
Topic Model Use Cases
• As a preprocessor for other techniques
• Building better models
• Bootstrapping categories for classification
• Recommendation
• Discovery in large, heterogeneous text datasets