December 8-9, 2016
BigML, Inc 2
Poul Petersen
CIO, BigML, Inc.
Association Discovery
Finding Interesting Correlations
Association Discovery
• Algorithm: “Magnum Opus” from Geoff Webb
• Unsupervised Learning: Works with unlabelled
data, like clustering and anomaly detection.
• Learning Task: Find “interesting” relations
between variables.
Unsupervised Learning
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51
• Clustering: similar
• Anomaly Detection: unusual
Association Rules
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51
Rules:
Antecedent                        Consequent
{customer = Bob, account = 3421}  zip = 46140
{class = gas}                     amount < 100
Use Cases
• Market Basket Analysis
• Web usage patterns
• Intrusion detection
• Fraud detection
• Bioinformatics
• Medical risk factors
Magnum Opus
• Select a measure of interest: Leverage, Lift, etc.
• System finds the top-k associations on that measure within constraints
• There must be a statistically significant interaction between antecedent and consequent
• Every item in the antecedent must increase the strength of association
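The "top-k associations on a chosen measure" idea can be illustrated with a brute-force sketch. This is not the Magnum Opus search itself: it only enumerates single-item rules, ranks them by leverage, and applies no statistical significance test. The baskets and the names `leverage` and `top_k_rules` are hypothetical, invented for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical toy baskets, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"beer"},
    {"bread", "butter"},
]


def leverage(a, c):
    """Leverage of the rule {a} -> {c}: support - p(a) * p(c)."""
    n = len(baskets)
    p_a = Fraction(sum(a in b for b in baskets), n)
    p_c = Fraction(sum(c in b for b in baskets), n)
    support = Fraction(sum(a in b and c in b for b in baskets), n)
    return support - p_a * p_c


def top_k_rules(k):
    """Score every ordered pair of distinct items and keep the k best."""
    items = sorted(set().union(*baskets))
    scored = [(leverage(a, c), a, c) for a, c in permutations(items, 2)]
    # Highest leverage first; ties broken alphabetically for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored[:k]


best = top_k_rules(2)
```

Swapping `leverage` for another measure (lift, for example) changes the ranking criterion without touching the enumeration.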
Association Metrics
Coverage
Percentage of instances which match antecedent “A”.
[Diagram: Instances, A, C]
Association Metrics
Support
Percentage of instances which match antecedent “A” and consequent “C”.
[Diagram: Instances, A, C]
Association Metrics
Confidence
Percentage of instances in the antecedent which also contain the consequent.
Confidence = Support / Coverage
[Diagram: Instances, A, C]
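As a rough worked example, coverage, support, and confidence can be computed on the seven transactions from the example table. The rule used here, "customer = Bob implies zip = 46140", and the helper names `antecedent` and `consequent` are assumptions chosen for illustration.

```python
from fractions import Fraction

# Transactions from the example table: (customer, class, zip, amount).
rows = [
    ("Bob", "clothes", 46140, 135),
    ("Bob", "food", 46140, 401),
    ("Alice", "food", 12222, 234),
    ("Sally", "gas", 26339, 94),
    ("Bob", "tech", 21350, 2459),
    ("Bob", "gas", 46140, 83),
    ("Sally", "food", 26339, 51),
]


def antecedent(row):
    """A: customer = Bob (illustrative choice)."""
    return row[0] == "Bob"


def consequent(row):
    """C: zip = 46140 (illustrative choice)."""
    return row[2] == 46140


n = len(rows)
# Coverage: fraction of instances matching A.
coverage = Fraction(sum(antecedent(r) for r in rows), n)
# Support: fraction of instances matching both A and C.
support = Fraction(sum(antecedent(r) and consequent(r) for r in rows), n)
# Confidence: fraction of A-instances that also match C.
confidence = support / coverage
```

Here four of the seven rows match the antecedent and three of those also match the consequent, so the confidence works out to 3/4.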
Association Metrics
Confidence
[Diagram: overlap of A and C within Instances, on a scale from 0% to 100%]
• 0%: A never implies C
• In between: A sometimes implies C
• 100%: A always implies C
Association Metrics
Lift
Ratio of observed support to support if A and C were statistically independent.
Lift = Support / [ p(A) * p(C) ] = Confidence / p(C)
[Diagram: Observed vs. Independent overlap of A and C]
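The two forms of the lift formula can be checked against each other on the example table. This is a minimal sketch; the rule "customer = Bob implies zip = 46140" is an assumption chosen for illustration, and only the two relevant columns are kept.

```python
from fractions import Fraction

# (customer, zip) pairs taken from the example transaction table.
rows = [
    ("Bob", 46140), ("Bob", 46140), ("Alice", 12222), ("Sally", 26339),
    ("Bob", 21350), ("Bob", 46140), ("Sally", 26339),
]

n = len(rows)
p_a = Fraction(sum(c == "Bob" for c, _ in rows), n)    # p(A), i.e. coverage
p_c = Fraction(sum(z == 46140 for _, z in rows), n)    # p(C)
support = Fraction(sum(c == "Bob" and z == 46140 for c, z in rows), n)
confidence = support / p_a

# Both expressions of lift should agree.
lift_from_support = support / (p_a * p_c)
lift_from_confidence = confidence / p_c
```

Both expressions give 7/4 here; a value above 1 means A and C occur together more often than independence would predict.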
Association Metrics
Lift
[Diagram: Observed vs. Independent overlap of A and C along the lift scale]
• Lift < 1: Negative Correlation
• Lift = 1: No Association
• Lift > 1: Positive Correlation
Association Metrics
Leverage
Difference of observed support and support if A and C were statistically independent.
Leverage = Support - [ p(A) * p(C) ]
[Diagram: Observed vs. Independent overlap of A and C]
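A small worked example of leverage on the example table follows. The rule "class = gas implies amount < 100" is used as an illustrative assumption, and only the two relevant columns are kept.

```python
from fractions import Fraction

# (class, amount) pairs taken from the example transaction table.
rows = [
    ("clothes", 135), ("food", 401), ("food", 234), ("gas", 94),
    ("tech", 2459), ("gas", 83), ("food", 51),
]

n = len(rows)
p_a = Fraction(sum(cls == "gas" for cls, _ in rows), n)    # p(A)
p_c = Fraction(sum(amt < 100 for _, amt in rows), n)       # p(C)
support = Fraction(sum(cls == "gas" and amt < 100 for cls, amt in rows), n)

# Observed support minus the support expected under independence.
leverage = support - p_a * p_c
```

The observed support is 2/7 while independence would predict 6/49, so the leverage is 8/49, a positive value.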
Association Metrics
Leverage
[Diagram: Observed vs. Independent overlap of A and C along the leverage scale, labelled from -1…]
• Leverage < 0: Negative Correlation
• Leverage = 0: No Association
• Leverage > 0: Positive Correlation
Use Cases
GOAL: Discover “interesting” rules about what store items
are typically purchased together.
• Dataset of 9,834 grocery cart transactions
• Each row is a list of all items in a cart at
checkout
Association Discovery Demo #1
Use Cases
GOAL: Find general rules that indicate diabetes.
• Dataset of diagnostic measurements of 768
patients.
• Each patient labelled True/False for
diabetes.
Association Discovery Demo #2
Medical Risks
Decision Tree
If plasma glucose > 155
and bmi > 29.32
and diabetes pedigree > 0.32
and insulin <= 629
and age <= 44
then diabetes = TRUE
Association Rule
If plasma glucose > 146
then diabetes = TRUE
Poul Petersen
CIO, BigML, Inc.
Topic Modeling
Discovering Meaning in Text
Unsupervised Learning
[Diagram: Features, Instances]
• Learn from instances
• Each instance has features
• There is no label
• Clustering: Find similar instances
• Anomaly Detection: Find unusual instances
• Association Discovery: Find feature rules
Topic Model
[Diagram: Text Fields]
• Unsupervised algorithm
• Learns only from text fields
• Finds hidden topics that model the text
Questions:
• How is this different from the Text Analysis that BigML already offers?
• What does it output and how do we use it?
• Unsupervised… model?
Text Analysis
“Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.”
Bag of Words
great: appears 4 times
Text Analysis
“Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.”
…  great  afraid  born  achieve  …
…  4      1       1     1        …
…  …      …       …     …        …
Model
• The token “great” occurs more than 3 times
• The token “afraid” occurs no more than once
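The bag-of-words counts above can be reproduced with a short sketch. The `crude_stem` helper is a hypothetical toy stemmer (it only strips a trailing "ness") standing in for whatever stemming the text analysis actually applies, which the slides do not specify.

```python
import re
from collections import Counter

text = ("Be not afraid of greatness: some are born great, some achieve "
        "greatness, and some have greatness thrust upon 'em.")


def crude_stem(token):
    """Toy stemmer (an assumption): strip a trailing 'ness' only."""
    return token[:-4] if token.endswith("ness") else token


# Lowercase, split into alphabetic tokens, stem, then count.
tokens = re.findall(r"[a-z]+", text.lower())
counts = Counter(crude_stem(t) for t in tokens)

# Each document becomes one row of token counts, e.g. for a few tokens:
row = {token: counts[token] for token in ("great", "afraid", "born", "achieve")}
```

With this stemmer, "great" and the three occurrences of "greatness" collapse into a single token with count 4, matching the slide.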
Topic Model Demo #1
TA vs TM
Text Analysis
• Creates thousands of hidden token counts
• Token counts are independently uninteresting
• No semantic importance
• No measure of co-occurrence
Topic Model
• Creates tens of topics that model the text
• Topics are independently interesting
• Semantic meaning extracted
• Support for bigrams
Generative Modeling
• Decision trees are discriminative models
  • Aggressively model the classification boundary
  • Parsimonious: Don’t consider anything you don’t have to
• Topic Models are generative models
  • Come up with a theory of how the data is generated
  • Tweak the theory to fit your data
Generating Documents
• "Machine" that generates a random word with equal probability with each pull.
• Pull a random number of times to generate a document.
• All documents can be generated, but most are nonsense.
[Diagram: machine containing the words cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…]
Generated documents:
• shoe asteroid flashlight pizza…
• plate giraffe purple jump…
• Be not afraid of greatness: some are born great, some achieve greatness…
word        probability
shoe        ϵ
asteroid    ϵ
flashlight  ϵ
pizza       ϵ
…           ϵ
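The uniform word "machine" can be sketched in a few lines. The vocabulary is the handful of words shown on the slide, and the cap of ten words per document is an assumption made for illustration.

```python
import random

# Words visible on the slide; a real vocabulary would be far larger.
VOCAB = ["cat", "shoe", "zebra", "ball", "tree", "jump", "pen", "asteroid",
         "cable", "box", "step", "cabinet", "yellow", "plate", "flashlight"]


def generate_document(rng, max_words=10):
    """Pull the machine a random number of times; every word is equally likely."""
    n_pulls = rng.randint(1, max_words)
    return [rng.choice(VOCAB) for _ in range(n_pulls)]


rng = random.Random(0)  # seeded so repeated runs give the same documents
documents = [generate_document(rng) for _ in range(3)]
```

Every word sequence has some chance of appearing, but almost all of the output is nonsense, which is the point the slide makes.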
Topic Model
Intuition:
• Written documents have meaning; one way to describe meaning is to assign a topic.
• For our random machine, the topic can be thought of as increasing the probability of certain words.
[Diagram: each topic is a machine over the words cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…]
Topic: travel (generates "airplane passport pizza …")
word      probability
travel    23.55%
airplane  2.33%
mars      0.003%
mantle    ϵ
…         ϵ
Topic: space (generates "mars quasar lightyear soda")
word      probability
space     38.94%
airplane  ϵ
mars      13.43%
mantle    0.05%
…         ϵ
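A topic can be sketched as the same random machine with unequal word weights. The probabilities are those listed for the travel topic; the numeric value of `EPSILON` is an assumption, and because only the listed words are included, `random.choices` renormalizes the weights over them rather than over a full vocabulary.

```python
import random
from collections import Counter

EPSILON = 1e-6  # stand-in for the slide's ϵ (assumed value)

# Word probabilities for the "travel" topic, as listed on the slide.
travel_topic = {
    "travel": 0.2355,
    "airplane": 0.0233,
    "mars": 0.00003,
    "mantle": EPSILON,
}


def generate_document(topic, n_words, rng):
    """Draw n_words from the topic, weighting each word by its probability."""
    words = list(topic)
    weights = list(topic.values())
    return rng.choices(words, weights=weights, k=n_words)


doc = generate_document(travel_topic, 1000, random.Random(0))
counts = Counter(doc)
```

In a long enough document, "travel" dominates and "mars" almost never appears, which is what assigning the travel topic means for the machine.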
Topic Model
• Each text field in a row is concatenated into a document
• The documents are analyzed to generate "k" related topics that can model the documents
• Each topic is represented by a distribution of term probabilities
[Diagram: documents such as "airplane passport pizza …" and "plate giraffe purple jump…" modeled by k topic machines over the words cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…]
Topic: "1"
word      probability
travel    23.55%
airplane  2.33%
mars      0.003%
mantle    ϵ
…         ϵ
Topic: "2"
word      probability
space     38.94%
airplane  ϵ
mars      13.43%
mantle    0.05%
…         ϵ
…
Topic: "k"
word       probability
shoe       12.12%
coffee     3.39%
telephone  13.43%
paper      4.11%
…          ϵ
Topic Model Demo #2
Use Cases
• As a preprocessor for other techniques
• Bootstrapping categories for classification
• Recommendation
• Discovery in large, heterogeneous text datasets
Topic Distribution
Intuition:
• Any given document is likely a mixture of the modeled topics…
• This can be represented as a distribution of topic probabilities
Document: "Will 2020 be the year that humans will embrace space exploration and finally travel to Mars?"
Topic: travel (11%)
word      probability
travel    23.55%
airplane  2.33%
mars      0.003%
mantle    ϵ
…         ϵ
Topic: space (89%)
word      probability
space     38.94%
airplane  ϵ
mars      13.43%
mantle    0.05%
…         ϵ
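One way to read the 11% / 89% split is as mixture weights: the chance of a word appearing in the document is the topic-weighted sum of its per-topic probabilities. The sketch below assumes that reading and treats ϵ as 0 for simplicity; the function name `word_probability` is hypothetical.

```python
# Per-topic word probabilities from the slide (ϵ entries treated as 0).
topics = {
    "travel": {"travel": 0.2355, "airplane": 0.0233, "mars": 0.00003},
    "space": {"space": 0.3894, "mars": 0.1343, "mantle": 0.0005},
}

# Topic distribution for the example document about travelling to Mars.
topic_distribution = {"travel": 0.11, "space": 0.89}


def word_probability(word):
    """Mixture probability: sum over topics of p(topic) * p(word | topic)."""
    return sum(
        weight * topics[name].get(word, 0.0)
        for name, weight in topic_distribution.items()
    )


p_mars = word_probability("mars")      # 0.11 * 0.00003 + 0.89 * 0.1343
p_travel = word_probability("travel")  # 0.11 * 0.2355
```

Under this reading, "mars" is likely in the document almost entirely because of the space topic, even though the travel topic also assigns it a small probability.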
Topic Model Demo #3
Batch Topic Distribution
[Diagram: two parallel workflows]
• Clustering: Unlabelled Data → Batch Centroid → Centroid Label
• Topic Model (Text Fields): Unlabelled Data → Batch Topic Distribution → topic 1 prob, topic 3 prob, …, topic k prob
Topic Model Demo #4
Tips
• Setting k
  • Much like k-means, the best value is data specific
  • Too few will agglomerate unrelated topics, too many will partition highly related topics
  • I tend to find the latter more annoying than the former
• Tuning the Model
  • Remove common, useless terms
  • Set term limit higher, use bigrams
BSSML16 L4. Association Discovery and Topic Modeling
