An introduction to frequent pattern mining algorithms and their usage in mining log data. Presented by Krishna Sridhar (Dato) at Seattle DAML meetup, Feb 2016.
6. Creating a model pipeline
Data Science Workflow: Ingest, Transform, Model, Deploy
[Diagram labels: Unstructured Data, exploration, data, modeling]
28. Formulating Pattern Mining
Find the top K most frequent sets of length at least L that occur at least M times.
- K: max_patterns
- L: min_length
- M: min_support
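The formulation above can be sketched as a brute-force search. This is only an illustration of what the three parameters mean, not the toolkit's implementation: the function name `top_k_patterns`, the tie-breaking order, and the toy data are assumptions made for this example.

```python
from itertools import combinations


def top_k_patterns(transactions, max_patterns, min_length, min_support):
    """Brute-force sketch: enumerate every itemset and count how often it occurs."""
    items = sorted(set().union(*transactions))
    counts = {}
    for size in range(min_length, len(items) + 1):
        for candidate in combinations(items, size):
            # A pattern "occurs" in a transaction when it is a subset of it.
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                counts[candidate] = support
    # Most frequent first; ties go to the longer pattern, then alphabetical order
    # (the tie-breaking rule is an assumption of this sketch).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], -len(kv[0]), kv[0]))
    return ranked[:max_patterns]


# Toy data, assumed for illustration.
transactions = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
]
result = top_k_patterns(transactions, max_patterns=3, min_length=2, min_support=4)
```

Enumerating every itemset is exponential in the number of items, which is why the faster algorithms exist.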
32. Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
Patterns:
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
Is the pattern {C, D} frequent?
34. Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
Patterns:
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
{C, D} occurs 5 times. Frequent!
35. Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
Patterns:
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
Is the pattern {A, D} frequent?
36. Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
Patterns:
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
{A, D} occurs 3 times. Not frequent.
37. Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4 (min_support)
{C, D}: 5 is frequent
{A, D}: 3 is not frequent
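The counting behind Principle 1 can be checked with a few lines of Python. This is a minimal sketch: `support` and `is_frequent` are names chosen here for illustration, and a pattern is assumed to "occur" in a set whenever it is a subset of it.

```python
def support(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)


def is_frequent(pattern, transactions, min_support):
    """Principle 1: frequent means the support reaches min_support (M)."""
    return support(pattern, transactions) >= min_support


# The six sets from the slide.
transactions = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
]

cd_support = support({"C", "D"}, transactions)  # 5
ad_support = support({"A", "D"}, transactions)  # 3
cd_frequent = is_frequent({"C", "D"}, transactions, min_support=4)  # True
ad_frequent = is_frequent({"A", "D"}, transactions, min_support=4)  # False
```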
39. Principle 2: Apriori principle
A pattern is frequent only if all of its subsets are frequent
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
{B, C, D}: 4 is frequent, therefore {C, D}: 5 is frequent.
Why? {C, D} must occur at least as many times as {B, C, D}.
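The same argument holds for every subset of {B, C, D}, not only {C, D}. The sketch below checks this on the slide's data; the helper names `support` and `proper_subsets` are assumptions of this example.

```python
from itertools import combinations


def support(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)


def proper_subsets(pattern):
    """Yield every non-empty proper subset of the pattern."""
    items = sorted(pattern)
    for size in range(1, len(items)):
        for combo in combinations(items, size):
            yield frozenset(combo)


transactions = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
]

pattern = frozenset({"B", "C", "D"})
pattern_support = support(pattern, transactions)  # 4
subset_supports = {
    tuple(sorted(s)): support(s, transactions) for s in proper_subsets(pattern)
}
# Any transaction containing {B, C, D} also contains each subset,
# so no subset can have a smaller count than the full pattern.
apriori_holds = all(n >= pattern_support for n in subset_supports.values())
```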
41. Principle 2: Apriori principle (Contrapositive)
If a pattern is not frequent then all supersets are not frequent
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
{A}: 3 is not frequent, therefore {A, D}: 3 is not frequent.
Why? {A, D} cannot occur more times than {A}.
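The contrapositive is what makes pruning possible: once {A} is known to be infrequent, every pattern containing A can be skipped without counting it. A rough sketch of how much of the search space that removes on the slide's four items (variable names are illustrative):

```python
from itertools import combinations


def support(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)


transactions = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
]
items = ["A", "B", "C", "D"]
min_support = 4

# All 15 non-empty patterns over four items.
all_patterns = [
    frozenset(c) for size in range(1, len(items) + 1)
    for c in combinations(items, size)
]

infrequent = frozenset({"A"})  # support 3 < M = 4
pruned = [p for p in all_patterns if infrequent <= p]
remaining = [p for p in all_patterns if not infrequent <= p]

# Sanity check: nothing that was pruned is actually frequent.
pruning_is_safe = all(support(p, transactions) < min_support for p in pruned)
```

Here 8 of the 15 patterns are ruled out by a single count.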
42. Two Main Families of Algorithms
• Candidate Generation
- Apriori
- Eclat
• Pattern Growth
- FP-Growth
- TopK FP-Growth
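To make the candidate-generation family concrete, here is a minimal level-wise sketch in the spirit of Apriori. It is a simplified illustration rather than a tuned implementation: the function name `apriori`, the join strategy, and the data are assumptions of this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise candidate generation: one pass over the data per pattern length."""
    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    singletons = {frozenset([item]) for t in transactions for item in t}
    frequent = {c: n for c, n in count(singletons).items() if n >= min_support}
    result = dict(frequent)
    size = 2
    while frequent:
        prev = set(frequent)
        # Join step: merge frequent patterns of the previous size into candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == size}
        # Prune step: drop any candidate that has an infrequent subset
        # (the contrapositive of the Apriori principle).
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, size - 1))
        }
        frequent = {c: n for c, n in count(candidates).items() if n >= min_support}
        result.update(frequent)
        size += 1
    return result


transactions = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
]
frequent_patterns = apriori(transactions, min_support=4)
```

Each call to `count` is a full pass over the data, so the number of passes grows with the length of the longest frequent pattern.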
70. Compare & Contrast
• Candidate Generation
+ Better than brute force
+ Filters candidate sets
- Multiple passes over the data
• Pattern Growth (better choice)
+ Fewer passes over the data
+ Space efficient
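A compact sketch of the pattern-growth idea, in the style of FP-Growth, shows where the "fewer passes" come from: two passes build a prefix tree, and all further mining happens on that tree instead of the raw data. The `Node` class and function names are assumptions of this example, and it leaves out the optimizations a production implementation would have.

```python
from collections import Counter, defaultdict


class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}


def build_tree(weighted_transactions, min_support):
    # Pass 1: count items and keep only the frequent ones.
    counts = Counter()
    for items, weight in weighted_transactions:
        for item in items:
            counts[item] += weight
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    # Pass 2: insert each transaction, most frequent items first, so that
    # transactions sharing a prefix share tree nodes.
    root, links = Node(None, None), defaultdict(list)
    for items, weight in weighted_transactions:
        ordered = sorted((i for i in items if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
                links[item].append(node.children[item])
            node = node.children[item]
            node.count += weight
    return frequent, links


def fp_growth(weighted_transactions, min_support, suffix=frozenset()):
    frequent, links = build_tree(weighted_transactions, min_support)
    patterns = {}
    for item, count in frequent.items():
        pattern = suffix | {item}
        patterns[pattern] = count
        # Conditional pattern base: the prefix path above each node holding `item`.
        conditional = []
        for node in links[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        # Grow the pattern by mining the (much smaller) conditional base.
        patterns.update(fp_growth(conditional, min_support, pattern))
    return patterns


transactions = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
]
frequent_patterns = fp_growth([(t, 1) for t in transactions], min_support=4)
```

On this data the six sets collapse into a single path D, C, B, which is the space saving the slide refers to.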
79. Summary
Log Data Mining ≠ Rocket Science
• FP-Growth for finding frequent patterns.
• Find rules from patterns to make predictions.
• Extract features in pattern space for use in ML.
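The last two points can be sketched on a toy dataset. This assumes "rules" means association rules scored by confidence, and "features in pattern space" means one 0/1 indicator per frequent pattern. Both readings, along with the function names and data, are assumptions of this example rather than the speaker's stated method.

```python
def support(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)


def rule_confidence(antecedent, consequent, transactions):
    """Confidence of antecedent -> consequent: support(both) / support(antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


def pattern_features(transaction, patterns):
    """One 0/1 feature per pattern: 1 if the pattern occurs in the transaction."""
    return [int(p <= transaction) for p in patterns]


transactions = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
]

# Rule {C, D} -> {B}: of the 5 sets containing C and D, 4 also contain B.
confidence = rule_confidence({"C", "D"}, {"B"}, transactions)

# Feature matrix with one column per chosen frequent pattern.
patterns = [frozenset("B"), frozenset("CD"), frozenset("BCD")]
features = [pattern_features(t, patterns) for t in transactions]
```

A rule with high confidence can be used to predict the consequent when the antecedent is seen, and the feature rows can be fed to any standard classifier.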