Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Mining: Getting the
most out of your log data.
Krishna Sridhar
Staff Data Scientist, Dato Inc.
krishna_srd

• Background
- Machine Learning (ML) Research.
- Ph.D Numerical Optimization @Wisconsin
• Now
- Build ML tools for data-scientists & developers @Dato.
- Help deploy ML algorithms.
@krishna_srd, @DatoInc
About Me!

45+$and$growing$fast!
About Us!

+ =
Questions?
• (Now) I love questions. Feel free to interrupt for questions!
• (Later) Email me srikris@dato.com.
DAML Talks!

Creating a model pipeline
Ingest Transform Model Deploy
Unstructured Data
exploration
data
modeling
Data Science Workflow
Ingest Transform Model Deploy

Log Journey
Lots of data
Insights Profits

Machine Learning in Logs
Source: Mining Your Logs - Gaining Insight Through Visualization

Frequent Pattern Mining
What sets of items were bought together?

Can we recommend items?
Rule Mining

Log Mining: Feature Extraction

Feature Extraction
0 1 0 0 0 0 1 1 0
1 1 0 0 1 0 0 0 0
0 0 1 1 1 0
Receipt Space Features in
Menu Space
ML

3 Useful Data Mining Tasks
Rule MiningPattern Mining Feature Extraction

Formulating Pattern Mining
N distinct items → 2N itemsets

Find the top K most frequent sets of length at least L
that occur at least M times.

Find the top K most frequent sets of length at least L
that occur at least M times.
- max_patterns
- min_length
- min_support

Pattern Mining
N distinct items → 2N itemsets

Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Is the pattern {C, D} frequent?
M = 4
Patterns

{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{C, D} occurs 5 times
M = 4
Patterns

{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
Patterns
{C, D} occurs 5 times
Frequent!

{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Is the pattern {A, D} frequent?
M = 4
Patterns

{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
Patterns
{A, D} occurs 3 times.
Not frequent.

{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{C, D}: 5 is frequent
M = 4
{A, D}: 3 is not frequent
min_support

Principle 2: Apriori principle
A pattern is frequent only if a subset is frequent
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{B, C, D} : 4 is frequent therefore
{C, D} : 5 is frequent
M = 4

Principle 2: Apriori principle
A pattern is frequent only if a subset is frequent
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{B, C, D} : 4 is frequent therefore
{C, D} : 5 is frequent
M = 4
Why?
{C, D} must occur at least as
many times as {B, C, D}.

Principle 2: Apriori principle (Contrapositive)
If a pattern is not frequent then all supersets are not frequent
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
{A} : 3 is not frequent therefore
{A, D} : 3 is not frequent

Principle 2: Apriori principle (Contrapositive)
If a pattern is not frequent then all supersets are not frequent
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
{A} : 3 is not frequent therefore
{A, D} : 3 is not frequent
Why?
{A, D} cannot occur more times
than {A}.

Two Main Algorithms
• Candidate Generation
- Apriori
- Eclat
• Pattern Growth
- FP-Growth
- TopK FP-Growth

Lots of Generalizations
Source: http://www.philippe-fournier-viger.com/spmf/

Candidate Generation
Two phases
1. Candidate generation.
2. Candidate filtering.
Exploit Apriori Principle!

{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : ? {B} : ? {C} : ? {D} : ?
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}

{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}

{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Not frequent

{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
No need to
explore!

{AB} : ? {AC} : ? {AD} : ? {BC} : 4 {BD} : 4 {CD} : 5
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}

Two phases
1. Candidate generation: Enumerate all subsets.
2. Candidate filtering: Eliminate infrequent subsets.
Exploit Apriori Principle!

Pattern Growth
Two phases
1. Candidate filtering
2. Conditional database constructions.
Avoid full scans over the data & large
candidate sets!

Pattern Growth - Depth First {B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : 1 {AC} : 2 {AD} : 3 {BD} : 4 {CD} : 4
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : 0 {ABD} : 1 {ACD} : 2 {BCD} : 2
{BC} : 2

Pattern Growth - Preprocessing {B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
Preprocessing
First, count the number of times
each item (singleton) occurs.

{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : ?

{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : ?
No need to
explore!

{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : X {AC} : ? {AD} : ? {BD} : X {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : X
Explore depth
first on {B}

Pattern Growth
{B} : 4
{ } : 6
Conditional Database Construction
DB{} DB{B}
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{C, D}
{D}
{C, D}
{D}

Pattern Growth
{B} : 4
{ } : 6
Candidate Filtering
DB{B}
{C, D}
{D}
{C, D}
{D}
{D} : 4
{C} : 2
DB{}
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
DB{B}
Add {BD} as frequent

Pattern Growth - Depth First {C, D}
{D}
{C, D}
{D}
{AB} : X {AC} : ? {AD} : ? {BD} : 4 {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : 2
Explore depth
first on {BD}

Pattern Growth - Depth First
{AB} : X {AC} : ? {AD} : ? {BD} : 4 {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : X {ACD} : ? {BCD} : X
{BC} : 2

Compare & Constrast
+ Better than brute force
+ Filters candidate sets
- Multiple passes over the data
• Pattern Growth
+ Fewer passes over the data
+ Space efficient.

Compare & Constrast
+ Better than brute force
+ Filters candidate sets
- Multiple passes over the data
• Pattern Growth
+ Fewer passes over the data
+ Space efficient.
Better choice

FP-Tree Compression
Figures From Florian Verhein’s Slides on FP-Growth

FP-Growth Algorithm
Figures From Florian Verhein’s Slides on FP-Growth
Two phases
1. Candidate filtering.
2. Conditional database constructions.

TopK FP-Growth Algorithm
Similar to FP-Growth
1. Dynamically raise min_support.
2. Estimates of min_support greatly help.

Distributed FP-Growth
Partition database on item-ids.
Database

Bags + Sequences
× 2
Itemset: {Item}
Bags: {Item: quantity}
Sequences : (item)

Summary
Log Data Mining
≠
Rocket Science
• FP-Growth for finding frequent patterns.
• Find rules from patterns to make predictions.
• Extract features for useful ML in pattern space.

SELECT questions FROM audience
WHERE difficulty == “Easy”
Thanks!

Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Similar to Frequent Pattern Mining - Krishna Sridhar, Feb 2016 (20)

More from Seattle DAML meetup

More from Seattle DAML meetup (11)

Recently uploaded

Recently uploaded (20)

Frequent Pattern Mining - Krishna Sridhar, Feb 2016