Pattern Mining: Getting the
most out of your log data.
Krishna Sridhar
Staff Data Scientist, Dato Inc.
krishna_srd
• Background
- Machine Learning (ML) Research.
- Ph.D. in Numerical Optimization @Wisconsin
• Now
- Build ML tools for data scientists & developers @Dato.
- Help deploy ML algorithms.
@krishna_srd, @DatoInc
About Me!
45+ and growing fast!
About Us!
Questions?
• (Now) I love questions. Feel free to interrupt for questions!
• (Later) Email me srikris@dato.com.
DAML Talks!
About you?
Creating a model pipeline
Ingest → Transform → Model → Deploy
[Diagram: data exploration and data modeling over unstructured data]
Data Science Workflow
Ingest → Transform → Model → Deploy
Log Journey
Lots of data → Insights → Profits
Log Mining: Pattern Mining
Logs are everywhere!
Machine Learning in Logs
Source: Mining Your Logs - Gaining Insight Through Visualization
Coffee shop
Coffee Shop's Menu
Receipts
Coffee Shop's Menu
Coffee Store Logs
Frequent Pattern Mining
What sets of items were bought together?
Real Applications
Real Applications
Real Applications
Log Mining: Rule Mining
Can we recommend items?
Rule Mining
Real Applications
Log Mining: Feature Extraction
Feature Extraction
[Figure: each receipt becomes a 0/1 vector over the menu, e.g. (0 1 0 0 0 0 1 1 0), (1 1 0 0 1 0 0 0 0)]
Receipt Space → Features in Menu Space → ML
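The encoding above can be sketched in a few lines of Python. The menu and receipt contents here are made-up examples (the item names are not from the talk); the point is the mapping from receipt space to a binary feature vector in menu space.

```python
# Encode receipts (sets of items) as 0/1 feature vectors in "menu space".
# Menu and receipts are illustrative, not from any real dataset.
menu = ["bagel", "coffee", "cream", "donut", "eclair", "latte", "muffin", "scone", "tea"]

def to_feature_vector(receipt, menu):
    """1 if the menu item appears in the receipt, else 0."""
    items = set(receipt)
    return [1 if item in items else 0 for item in menu]

receipts = [
    {"coffee", "muffin", "scone"},
    {"bagel", "coffee", "eclair"},
]
vectors = [to_feature_vector(r, menu) for r in receipts]
print(vectors[0])  # [0, 1, 0, 0, 0, 0, 1, 1, 0]
```

These fixed-length vectors can then be fed to any standard ML model, which is what makes pattern-space features useful.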
3 Useful Data Mining Tasks
Rule MiningPattern Mining Feature Extraction
Demo
Pattern Mining: Explained
Formulating Pattern Mining
N distinct items → 2^N itemsets
Formulating Pattern Mining
Find the top K most frequent sets of length at least L
that occur at least M times.
Formulating Pattern Mining
Find the top K most frequent sets of length at least L
that occur at least M times.
- max_patterns
- min_length
- min_support
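The formulation with these three parameters can be sketched by brute force (exponential in the number of distinct items, so only viable for tiny examples; the algorithms later in the talk scale far better). The function name `top_k_patterns` is illustrative, not a real library API.

```python
from itertools import combinations

def top_k_patterns(transactions, max_patterns, min_length, min_support):
    """Brute force: enumerate every itemset, keep those of length >= min_length
    that occur in >= min_support transactions, return the max_patterns most frequent.
    Exponential in the item count -- a sketch of the problem, not a scalable miner."""
    items = sorted(set().union(*transactions))
    counts = {}
    for size in range(max(min_length, 1), len(items) + 1):
        for cand in combinations(items, size):
            support = sum(1 for t in transactions if set(cand) <= t)
            if support >= min_support:
                counts[cand] = support
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return ranked[:max_patterns]

# The six receipts from the running example.
db = [{"B","C","D"}, {"A","C","D"}, {"A","B","C","D"},
      {"A","D"}, {"B","C","D"}, {"B","C","D"}]
result = top_k_patterns(db, max_patterns=3, min_length=2, min_support=4)
print(result[0])  # (('C', 'D'), 5)
```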
Pattern Mining
N distinct items → 2^N itemsets
Pattern Mining: Principles
Pattern Mining: Principles
Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Is the pattern {C, D} frequent?
M = 4
Patterns
Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{C, D} occurs 5 times
M = 4
Patterns
Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
Patterns
{C, D} occurs 5 times
Frequent!
Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Is the pattern {A, D} frequent?
M = 4
Patterns
Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
Patterns
{A, D} occurs 3 times.
Not frequent.
Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{C, D}: 5 is frequent
M = 4
{A, D}: 3 is not frequent
min_support
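Counting support is just counting containing transactions, which makes the two verdicts above easy to check. A minimal sketch (the `support` helper is illustrative):

```python
def support(pattern, transactions):
    """Number of transactions that contain every item in the pattern."""
    return sum(1 for t in transactions if pattern <= t)

# The six receipts from the slides.
db = [{"B","C","D"}, {"A","C","D"}, {"A","B","C","D"},
      {"A","D"}, {"B","C","D"}, {"B","C","D"}]
M = 4  # min_support

print(support({"C", "D"}, db))  # 5 -> frequent
print(support({"A", "D"}, db))  # 3 -> not frequent
```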
Principle 2: Apriori principle
A pattern is frequent only if all of its subsets are frequent
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{B, C, D} : 4 is frequent therefore
{C, D} : 5 is frequent
M = 4
Principle 2: Apriori principle
A pattern is frequent only if all of its subsets are frequent
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{B, C, D} : 4 is frequent therefore
{C, D} : 5 is frequent
M = 4
Why?
{C, D} must occur at least as
many times as {B, C, D}.
Principle 2: Apriori principle (Contrapositive)
If a pattern is not frequent, then none of its supersets are frequent
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
{A} : 3 is not frequent therefore
{A, D} : 3 is not frequent
Principle 2: Apriori principle (Contrapositive)
If a pattern is not frequent, then none of its supersets are frequent
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
{A} : 3 is not frequent therefore
{A, D} : 3 is not frequent
Why?
{A, D} cannot occur more times
than {A}.
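This anti-monotonicity is exactly what lets an algorithm prune: once a pattern falls below min_support, no superset ever needs to be counted. A small check of that claim on the running example (`support` is the same illustrative helper as before):

```python
def support(pattern, transactions):
    """Number of transactions that contain every item in the pattern."""
    return sum(1 for t in transactions if pattern <= t)

db = [{"B","C","D"}, {"A","C","D"}, {"A","B","C","D"},
      {"A","D"}, {"B","C","D"}, {"B","C","D"}]
M = 4

# A superset's support can never exceed a subset's support.
assert support({"A", "D"}, db) <= support({"A"}, db)

# {A} occurs only 3 < M times, so every superset of {A}
# can be pruned without ever counting it.
if support({"A"}, db) < M:
    print("prune all supersets of {A}")
```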
Two Main Algorithms
• Candidate Generation
- Apriori
- Eclat
• Pattern Growth
- FP-Growth
- TopK FP-Growth
Candidate Generation
Lots of Generalizations
Source: http://www.philippe-fournier-viger.com/spmf/
Candidate Generation
Two phases
1. Candidate generation.
2. Candidate filtering.
Exploit Apriori Principle!
Candidate Generation
{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : ? {B} : ? {C} : ? {D} : ?
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Candidate Generation
{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : ? {B} : ? {C} : ? {D} : ?
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Candidate Generation
{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Candidate Generation
{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Not frequent
Candidate Generation
{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
No need to
explore!
Candidate Generation
{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Candidate Generation
{AB} : ? {AC} : ? {AD} : ? {BC} : 4 {BD} : 4 {CD} : 5
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Candidate Generation
{AB} : ? {AC} : ? {AD} : ? {BC} : 4 {BD} : 4 {CD} : 5
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Two Main Algorithms
• Candidate Generation
- Apriori
- Eclat
• Pattern Growth
- FP-Growth
- TopK FP-Growth
Candidate Generation
Two phases
1. Candidate generation: Enumerate all subsets.
2. Candidate filtering: Eliminate infrequent subsets.
Exploit Apriori Principle!
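The two phases can be sketched as a level-wise loop in the style of Apriori: count candidates with a pass over the data, then join the survivors into the next level, keeping only candidates whose subsets are all frequent. This is a toy sketch for the running example, not a production implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise candidate generation + filtering (Apriori sketch)."""
    items = sorted(set().union(*transactions))
    frequent = {}
    level = [frozenset([i]) for i in items]
    while level:
        # Phase 2 -- candidate filtering: one pass over the data per level.
        survivors = []
        for cand in level:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                frequent[cand] = support
                survivors.append(cand)
        # Phase 1 -- candidate generation: join surviving k-itemsets into
        # (k+1)-candidates whose k-subsets are all frequent (Apriori principle).
        next_level = set()
        for a in survivors:
            for b in survivors:
                union = a | b
                if len(union) == len(a) + 1 and all(
                    frozenset(s) in frequent for s in combinations(union, len(a))
                ):
                    next_level.add(union)
        level = sorted(next_level, key=sorted)
    return frequent

db = [{"B","C","D"}, {"A","C","D"}, {"A","B","C","D"},
      {"A","D"}, {"B","C","D"}, {"B","C","D"}]
result = apriori(db, min_support=4)
```

On this data the miner never counts {A,B}, {A,C}, {A,D}, or any larger superset of {A}, since {A} already failed the support test.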
Pattern Growth
Pattern Growth
Two phases
1. Candidate filtering
2. Conditional database construction.
Avoid full scans over the data & large
candidate sets!
Pattern Growth - Depth First
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : 1 {AC} : 2 {AD} : 3 {BD} : 4 {CD} : 4
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : 0 {ABD} : 1 {ACD} : 2 {BCD} : 2
{BC} : 2
Pattern Growth - Preprocessing
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
Preprocessing
First, count the number of times
each item (singleton) occurs.
Pattern Growth - Depth First
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : ?
Pattern Growth - Depth First
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : ?
Pattern Growth - Depth First
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : ?
Pattern Growth - Depth First
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : ?
No need to
explore!
Pattern Growth - Depth First
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : X {AC} : ? {AD} : ? {BD} : X {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : X
Explore depth
first on {B}
Pattern Growth
{B} : 4
{ } : 6
Conditional Database Construction
DB{} DB{B}
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{C, D}
{D}
{C, D}
{D}
Pattern Growth
{B} : 4
{ } : 6
Candidate Filtering
DB{B}
{C, D}
{D}
{C, D}
{D}
{D} : 4
{C} : 2
DB{}
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
DB{B}
Add {BD} as frequent
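The DB{B} projection shown above can be sketched directly: keep the transactions containing {B}, drop B itself and everything ordered before it, then count singletons in the projected database. The `conditional_db` helper and the A < B < C < D ordering are illustrative.

```python
def conditional_db(transactions, item, order):
    """Project the database on `item`: keep transactions containing it,
    and within them keep only items that come after it in the global order."""
    rank = {x: i for i, x in enumerate(order)}
    return [
        {x for x in t if rank[x] > rank[item]}
        for t in transactions
        if item in t
    ]

# The six receipts from the pattern-growth slides.
db = [{"B","C","D"}, {"A","C","D"}, {"B","D"},
      {"A","C","D"}, {"B","C","D"}, {"A","B","D"}]
db_b = conditional_db(db, "B", order=["A", "B", "C", "D"])

# Candidate filtering inside DB{B}: count singletons.
print(sum(1 for t in db_b if "D" in t))  # 4 -> add {B, D} as frequent
print(sum(1 for t in db_b if "C" in t))  # 2 -> prune {B, C} and its supersets
```

Because each conditional database is smaller than its parent, the recursion avoids full scans over the original data.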
Pattern Growth - Depth First
{C, D}
{D}
{C, D}
{D}
{AB} : X {AC} : ? {AD} : ? {BD} : 4 {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : 2
Explore depth
first on {BD}
Pattern Growth - Depth First
{AB} : X {AC} : ? {AD} : ? {BD} : 4 {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : X {ACD} : ? {BCD} : X
{BC} : 2
Compare & Contrast
• Candidate Generation
+ Better than brute force
+ Filters candidate sets
- Multiple passes over the data
• Pattern Growth
+ Fewer passes over the data
+ Space efficient.
Compare & Contrast
• Candidate Generation
+ Better than brute force
+ Filters candidate sets
- Multiple passes over the data
• Pattern Growth
+ Fewer passes over the data
+ Space efficient.
Better choice
Some cool ideas
FP-Tree Compression
Figures From Florian Verhein’s Slides on FP-Growth
FP-Growth Algorithm
Figures From Florian Verhein’s Slides on FP-Growth
Two phases
1. Candidate filtering.
2. Conditional database construction.
TopK FP-Growth Algorithm
Similar to FP-Growth
1. Dynamically raise min_support.
2. Estimates of min_support greatly help.
Future Work
Distributed FP-Growth
Partition database on item-ids.
Database
Bags + Sequences
Itemset: {Item}
Bags: {Item: quantity}
Sequences: (Item)
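The three transaction representations above map naturally onto basic Python types. The concrete values are illustrative:

```python
# Three ways to represent one coffee-shop log record (values made up):
itemset  = {"coffee", "muffin"}                # which items occurred
bag      = {"coffee": 2, "muffin": 1}          # items with quantities
sequence = ("coffee", "muffin", "coffee")      # items in order, repeats allowed

# An itemset forgets counts and order; a bag keeps counts; a sequence keeps both.
print(sequence.count("coffee"))  # 2
```

Generalizing the miners from itemsets to bags and sequences is what the "× 2" extension work is about.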
Demo: Model built, now what?
Summary
Log Data Mining
≠
Rocket Science
• FP-Growth for finding frequent patterns.
• Find rules from patterns to make predictions.
• Extract features for useful ML in pattern space.
SELECT questions FROM audience
WHERE difficulty == “Easy”
Thanks!

Frequent Pattern Mining - Krishna Sridhar, Feb 2016
