Want to learn about PrefixSpan for Sequential Pattern Mining and PrefixSpan Implementations with Spark? Then have a look at this presentation and learn how Akanoo uses cutting-edge technology to predict online shopper behavior in real time. Akanoo uses smart algorithms to target and convert visitors while they are still surfing the shop site, aiming for high conversion rates and additional revenue.
5. Motivation
54% buyers
46% non-buyers
What is going on within the last 5 page impressions?
6. Translating clickstream data into patterns
pagetypes: home, overview, product, sale, account, cart, checkout, search, about
cart changes: add, remove, no
Session (identified by SessionId), last five page impressions (PI):
              5th last PI  4th last PI  3rd last PI  2nd last PI  last PI
cart . . .    overview     overview     overview     product      overview
cart . . .    add          no           no           no           no
pattern:
< ( overview, add ), ( overview, no ), ( overview, no ), ( product, no ), ( overview, no ) >
item: a single value such as overview or add
itemset: one pair such as ( overview, add ), i.e. one page impression
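A minimal sketch of this translation step, in plain Scala without Spark. The names `PageImpression`, `toPattern` and `render` are hypothetical and only serve the illustration; they are not taken from the Akanoo code base.

```scala
object ClickstreamToPattern {
  // Hypothetical stand-in for one page impression of a session.
  final case class PageImpression(pageType: String, cartChange: String)

  // One itemset per page impression: the pair (pagetype, cart change).
  type ItemSet = (String, String)

  // Keep only the last n page impressions and turn each into an itemset.
  def toPattern(session: Seq[PageImpression], n: Int = 5): Seq[ItemSet] =
    session.takeRight(n).map(pi => (pi.pageType, pi.cartChange))

  // Render the pattern in the angle-bracket notation used on the slides.
  def render(pattern: Seq[ItemSet]): String =
    pattern.map { case (page, cart) => s"( $page, $cart )" }.mkString("< ", ", ", " >")

  def main(args: Array[String]): Unit = {
    val session = Seq(
      PageImpression("cart", "no"),
      PageImpression("overview", "add"),
      PageImpression("overview", "no"),
      PageImpression("overview", "no"),
      PageImpression("product", "no"),
      PageImpression("overview", "no")
    )
    println(render(toPattern(session)))
  }
}
```

With the six page impressions in `main`, only the last five end up in the pattern, which matches the example pattern on this slide.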
7. Problem definition
pattern 1 < ( overview, no ), ( overview, no ), ( overview, no ), ( product, no ), ( overview, no ) >
pattern 2 < ( home, no ), ( product, no ), ( product, no ), ( product, no ), ( overview, no ) >
...
pattern n < ( overview, no ), ( product, add ), ( product, no ), ( cart, no ), ( checkout, no ) >
Example: < ( overview, no ), ( product, no ) > is a subpattern of patterns 1 and n
< ( overview, no ), ( product, remove ) > is not a subpattern of any of the patterns shown
A pattern has support n if it is a subpattern of n database patterns; it is frequent if its support reaches the minimum support
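The subpattern relation and the support count can be sketched in a few lines of plain Scala. This is an illustration under assumptions: itemsets are modelled as `Set[String]`, and the names `isSubpattern` and `support` are hypothetical.

```scala
object SubpatternSupport {
  type ItemSet = Set[String]   // e.g. Set("overview", "no")
  type Pattern = Seq[ItemSet]

  // sub is a subpattern of full if its itemsets can be matched, in order,
  // to itemsets of full that contain them (gaps are allowed).
  def isSubpattern(sub: Pattern, full: Pattern): Boolean =
    sub.foldLeft(Option(0)) { (searchFrom, itemSet) =>
      searchFrom.flatMap { from =>
        val idx = full.indexWhere(candidate => itemSet.subsetOf(candidate), from)
        if (idx >= 0) Some(idx + 1) else None
      }
    }.isDefined

  // Support: number of database patterns that contain sub as a subpattern.
  def support(sub: Pattern, database: Seq[Pattern]): Int =
    database.count(full => isSubpattern(sub, full))

  private def is(page: String, cart: String): ItemSet = Set(page, cart)

  val pattern1: Pattern = Seq(is("overview", "no"), is("overview", "no"), is("overview", "no"),
    is("product", "no"), is("overview", "no"))
  val pattern2: Pattern = Seq(is("home", "no"), is("product", "no"), is("product", "no"),
    is("product", "no"), is("overview", "no"))
  val patternN: Pattern = Seq(is("overview", "no"), is("product", "add"), is("product", "no"),
    is("cart", "no"), is("checkout", "no"))
  val database: Seq[Pattern] = Seq(pattern1, pattern2, patternN)

  def main(args: Array[String]): Unit = {
    println(support(Seq(is("overview", "no"), is("product", "no")), database))
    println(support(Seq(is("overview", "no"), is("product", "remove")), database))
  }
}
```

Matching each itemset at the earliest possible position is sufficient here, because an earlier match never rules out a later one.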
9. PrefixSpan with a toy example: frequent patterns of length 1
Database
ID  pattern
1   < a (a b c) (a c) >
2   < (a d) c >
3   < (e f) (a b) >
Step 1: Occurrences of base letters
ID       < a >                < b >   < c >         < d >   < e >   < f >
1        (0,0), (1,0), (2,0)  (1,1)   (1,2), (2,1)  --      --      --
2        (0,0)                --      (1,0)         (0,1)   --      --
3        (1,0)                (1,1)   --            --      (0,0)   (0,1)
support  3                    2       2             1       1       1
min support: 2
frequent patterns of length 1 are: < a >, < b >, < c >
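Step 1 can be reproduced with a small plain-Scala sketch on the toy database. Letters are modelled as `Char` and the database as an in-memory `Map`; both choices, and the names `occurrence`, `support` and `frequentLetters`, are assumptions made for this illustration.

```scala
object BaseLetterOccurrences {
  type Pattern = Seq[Seq[Char]]   // a pattern is a sequence of itemsets
  type Occ = Seq[(Int, Int)]      // (itemset index, item index), zero-based

  val database: Map[Int, Pattern] = Map(
    1 -> Seq(Seq('a'), Seq('a', 'b', 'c'), Seq('a', 'c')),
    2 -> Seq(Seq('a', 'd'), Seq('c')),
    3 -> Seq(Seq('e', 'f'), Seq('a', 'b'))
  )

  // All positions at which the letter occurs in the pattern.
  def occurrence(pattern: Pattern, letter: Char): Occ =
    for {
      (itemSet, itemSetIndex) <- pattern.zipWithIndex
      itemIndex = itemSet.indexOf(letter)
      if itemIndex >= 0
    } yield (itemSetIndex, itemIndex)

  // Support: number of database patterns in which the letter occurs at least once.
  def support(letter: Char): Int =
    database.values.count(pattern => occurrence(pattern, letter).nonEmpty)

  def frequentLetters(minSupport: Int): Seq[Char] =
    database.values.flatten.flatten.toSeq.distinct.sorted
      .filter(letter => support(letter) >= minSupport)

  def main(args: Array[String]): Unit =
    println(frequentLetters(minSupport = 2))
}
```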
10. PrefixSpan with a toy example: frequent patterns of length 2
Occurrences of frequent patterns of length 1 (the frequent base letters)
ID  < a >                < b >   < c >
1   (0,0), (1,0), (2,0)  (1,1)   (1,2), (2,1)
2   (0,0)                --      (1,0)
3   (1,0)                (1,1)   --
Step 2: Occurrences of patterns of length 2
ID  <a a>         <a b>  <a c>         <b a>  <b b>  <b c>  <c a>  <c b>  <c c>
1   (1,0), (2,0)  (1,1)  (1,2), (2,1)  (2,0)  --     (2,1)  (2,0)  --     (2,1)
2   --            --     (1,0)         --     --     --     --     --     --
3   --            --     --            --     --     --     --     --     --
frequent patterns of length 2 are:
< a c >
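One extension step can be sketched on plain Scala maps instead of RDDs. The rule used here (keep a suffix occurrence if its itemset lies strictly after the first occurrence of the prefix) follows the append step described later in the deck; the names `append`, `extend` and `lengthTwo` are hypothetical.

```scala
object LengthTwoPatterns {
  type Occ = Seq[(Int, Int)]   // (itemset index, item index)

  // Occurrence tables of the frequent base letters, keyed by database ID.
  val base: Map[Char, Map[Int, Occ]] = Map(
    'a' -> Map(1 -> Seq((0, 0), (1, 0), (2, 0)), 2 -> Seq((0, 0)), 3 -> Seq((1, 0))),
    'b' -> Map(1 -> Seq((1, 1)), 3 -> Seq((1, 1))),
    'c' -> Map(1 -> Seq((1, 2), (2, 1)), 2 -> Seq((1, 0)))
  )

  // Keep suffix occurrences whose itemset comes after the first prefix occurrence.
  def append(prefixOcc: Occ, suffixOcc: Occ): Occ =
    if (prefixOcc.isEmpty) Seq.empty
    else {
      val first = prefixOcc.map(_._1).min
      suffixOcc.filter { case (itemSetIndex, _) => itemSetIndex > first }
    }

  // Occurrence table of the pattern "prefix followed by letter".
  def extend(prefix: Map[Int, Occ], letter: Char): Map[Int, Occ] =
    (for {
      (id, prefixOcc) <- prefix.toSeq
      suffixOcc <- base(letter).get(id).toSeq
      occ = append(prefixOcc, suffixOcc)
      if occ.nonEmpty
    } yield id -> occ).toMap

  val letters: Seq[Char] = Seq('a', 'b', 'c')

  val lengthTwo: Map[String, Map[Int, Occ]] =
    (for { x <- letters; y <- letters } yield s"$x$y" -> extend(base(x), y)).toMap

  // A pattern's support is the number of database IDs with at least one occurrence.
  def frequent(minSupport: Int): Seq[String] =
    lengthTwo.collect { case (p, table) if table.size >= minSupport => p }.toSeq.sorted

  def main(args: Array[String]): Unit =
    println(frequent(minSupport = 2))
}
```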
11. PrefixSpan with a toy example: frequent patterns of length 3
Occurrences of frequent base letters
ID  < a >                < b >   < c >
1   (0,0), (1,0), (2,0)  (1,1)   (1,2), (2,1)
2   (0,0)                --      (1,0)
3   (1,0)                (1,1)   --
Occurrences of frequent patterns of length 2
ID  <a c>
1   (1,2), (2,1)
2   (1,0)
3   --
Step 3: Occurrences of patterns of length 3
ID  <a c a>  <a c b>  <a c c>
1   (2,0)    --       (2,1)
2   --       --       --
3   --       --       --
no frequent patterns of length 3
12. PrefixSpan with a toy example: all frequent patterns
Database
ID  pattern
1   < a (a b c) (a c) >
2   < (a d) c >
3   < (e f) (a b) >
Results: frequent patterns
pattern   support
< a >     3
< b >     2
< c >     2
< a c >   2
min support: 2
14. PrefixSpan with a toy example: all frequent patterns
Database
ID  pattern
1   < a (a b c) (a c) >
2   < (a d) c >
3   < (e f) (a b) >
Results: frequent patterns
pattern     support
< a >       3
< b >       2
< c >       2
< a c >     2
< (a b) >   2   ?
min support: 2
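The pattern < (a b) > puts both letters into the same itemset, so appending a letter after the prefix cannot produce it. One possible way to find it is an "assemble" step, sketched below in plain Scala. The rule shown (keep an occurrence of b if a occurs in the same itemset at a smaller item index) is an assumption for illustration and relies on items being ordered consistently inside each itemset; the name `assemble` is hypothetical.

```scala
object AssembleExample {
  type Occ = Seq[(Int, Int)]   // (itemset index, item index)

  // Keep suffix occurrences that share an itemset with a prefix occurrence
  // and sit at a later item index inside that itemset.
  def assemble(prefixOcc: Occ, suffixOcc: Occ): Occ =
    suffixOcc.filter { case (itemSetIndex, itemIndex) =>
      prefixOcc.exists { case (prefixSet, prefixItem) =>
        prefixSet == itemSetIndex && prefixItem < itemIndex
      }
    }

  // Occurrence tables of < a > and < b > from the toy example, keyed by database ID.
  val occA: Map[Int, Occ] = Map(1 -> Seq((0, 0), (1, 0), (2, 0)), 2 -> Seq((0, 0)), 3 -> Seq((1, 0)))
  val occB: Map[Int, Occ] = Map(1 -> Seq((1, 1)), 3 -> Seq((1, 1)))

  // Occurrence table of < (a b) >: only IDs that appear in both tables can contribute.
  val occAB: Map[Int, Occ] =
    occA.keySet.intersect(occB.keySet).toSeq.sorted
      .map(id => id -> assemble(occA(id), occB(id)))
      .filter { case (_, occ) => occ.nonEmpty }
      .toMap

  def main(args: Array[String]): Unit =
    println(s"support of < (a b) > = ${occAB.size}")
}
```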
15. Main class, main function
Main class:
class SequenceRDDFunctions(
  database: RDD[(SessionId, Pattern)]
)
Main function:
def mineFrequentPatterns(
  patternGenerator: PatternGenerator,
  minSupFraction: Double = 0.05
): List[PatternWithOccRddAndSupport]
Pattern composition
Pattern
case class Pattern(
  elements: Seq[ItemSet]
)
val p = Pattern(
  elements = Seq(a, d, c)
)
ItemSet
case class ItemSet(
  items: Seq[Item])
val a = ItemSet(items =
  Seq(item1)
)
Item
case class Item(
  letter: Letter)
val item1 = Item(
  letter = pageTypeOverviewLetter
)
Letter
type Letter = Int
val pageTypeOverviewLetter: Letter = 1
16. From Session to Pattern (1 / 2)
Session (json)
{
  "SessionID":"49624d7e",
  ...
},
{
  "pageType":"overview",
  ...
},
{
  "pageType":"product",
  ...
},
{
  "pageType":"checkout",
  ...
},
Visit
case class Visit(
  session: Session,
  viewList: List[View]
  …
)
case class View(
  viewId: String,
  time: Long,
  pageType: PageType,
  …)
Views
val a: View = View(
  viewId = "view-id",
  pageType = PageType.OVERVIEW,
  ...
)
val d: View = View(
  viewId = "view-id",
  pageType = PageType.PRODUCT,
  ...
)
val c: View = View(
  viewId = "view-id",
  pageType = PageType.CHECKOUT,
  ...
)
17. From Session to Pattern (2 / 2)
Creating a pattern from a visit
def pageTypeLetterForView(view: View): Letter =
  view.pageType match {
    case PageType.OVERVIEW => pageTypeOverviewLetter
    case PageType.PRODUCT => pageTypeProductLetter
    case PageType.CHECKOUT => pageTypeCheckoutLetter
  }
def generatePattern(visit: Visit): Pattern = {
  // one single-item itemset per view of the visit
  val itemSets = visit.viewList.map { view =>
    ItemSet(Seq(Item(pageTypeLetterForView(view))))
  }
  Pattern(itemSets)
}
Toy alphabet: pageTypeAlphabet
// letters
val pageTypeOverviewLetter: Letter = 1
val pageTypeCheckoutLetter: Letter = 3
val pageTypeProductLetter: Letter = 4
// alphabet
val pageTypeAlphabet: Alphabet = List(
  pageTypeOverviewLetter,
  pageTypeProductLetter,
  pageTypeCheckoutLetter
)
18. Towards a database of patterns
Creating a pattern from a visit
def generatePattern(visit: Visit): Pattern = {
  // one single-item itemset per view of the visit
  val itemSets = visit.viewList.map { view =>
    ItemSet(Seq(Item(pageTypeLetterForView(view))))
  }
  Pattern(itemSets)
}
Creating the database ...
… which is an RDD[(SessionId, Pattern)]
// visits is taken to be an RDD[Visit] here, so that map yields the RDD the signature promises
def mapToPattern(visits: RDD[Visit]): RDD[(SessionId, Pattern)] = {
  visits.map(visit =>
    (visit.session.sessionId, generatePattern(visit)))
}
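The path from visits to a pattern database can be tried out without Spark. The sketch below is self-contained and uses simplified stand-ins: `View` and `Visit` carry only the fields needed here, `PageType` is reduced to three values, the session ID is a plain `String`, and a `Seq` takes the place of the RDD. All of these simplifications are assumptions for illustration.

```scala
object VisitToPattern {
  type Letter = Int

  sealed trait PageType
  object PageType {
    case object OVERVIEW extends PageType
    case object PRODUCT extends PageType
    case object CHECKOUT extends PageType
  }

  // Simplified stand-ins for the case classes shown on the slides.
  final case class View(viewId: String, pageType: PageType)
  final case class Visit(sessionId: String, viewList: List[View])
  final case class Item(letter: Letter)
  final case class ItemSet(items: Seq[Item])
  final case class Pattern(elements: Seq[ItemSet])

  val pageTypeOverviewLetter: Letter = 1
  val pageTypeCheckoutLetter: Letter = 3
  val pageTypeProductLetter: Letter = 4

  def pageTypeLetterForView(view: View): Letter = view.pageType match {
    case PageType.OVERVIEW => pageTypeOverviewLetter
    case PageType.PRODUCT  => pageTypeProductLetter
    case PageType.CHECKOUT => pageTypeCheckoutLetter
  }

  // One single-item itemset per view.
  def generatePattern(visit: Visit): Pattern =
    Pattern(visit.viewList.map(view => ItemSet(Seq(Item(pageTypeLetterForView(view))))))

  // A Seq stands in for the RDD[(SessionId, Pattern)] of the talk.
  def mapToPattern(visits: Seq[Visit]): Seq[(String, Pattern)] =
    visits.map(visit => visit.sessionId -> generatePattern(visit))

  def main(args: Array[String]): Unit = {
    val visit = Visit("49624d7e", List(
      View("view-1", PageType.OVERVIEW),
      View("view-2", PageType.PRODUCT),
      View("view-3", PageType.CHECKOUT)))
    mapToPattern(Seq(visit)).foreach(println)
  }
}
```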
19. Main class, main function
Main class:
class SequenceRDDFunctions(
  database: RDD[(SessionId, Pattern)]
)
Main function:
def mineFrequentPatterns(
  patternGenerator: PatternGenerator,
  minSupFraction: Double = 0.05
): List[PatternWithOccRddAndSupport]
Input params
patternGenerator: a function that lets us grab the alphabet
minSupFraction: Double = 0.05: a pattern is frequent if it occurs in at least 5% of all database patterns
20. Main class, main function
Main class:
class SequenceRDDFunctions(
  database: RDD[(SessionId, Pattern)]
)
Main function:
def mineFrequentPatterns(
  patternGenerator: PatternGenerator,
  minSupFraction: Double = 0.05
): List[PatternWithOccRddAndSupport]
Output
List[PatternWithOccRddAndSupport]
One list item for each frequent pattern; the example below is the item for the length-1 pattern < a >
case class PatternWithOccRddAndSupport(
  pattern: Pattern, occRdd: RDD[(SessionId, Occ)], support: Long)
type Occ = Seq[(ItemSetIndex, ItemIndex)]
pattern 1 (a)
example              SessionId  Occurrence
< a (a b c) (a c) >  1          (0,0), (1,0), (2,0)
< (a d) c >          2          (0,0)
< (e f) (a b) >      3          (1,0)
Support 3
21. How frequent patterns are found (1/6)
1. Calculate the occurrence table of the baseLetters
def occurrencesOfBaseLetters(baseLetter: Letter): RDD[(SessionId, Occ)] = {
  database.mapValues(sessionPattern => {
    sessionPattern.occurrence(Item(baseLetter))
  })
}
case class Pattern(elements: Seq[ItemSet]) {
  // all positions (itemSetIndex, itemIndex) at which the item occurs in this pattern
  def occurrence(item: Item): Occ = {
    elements.zipWithIndex
      .filter({ case (itemSet, itemSetIndex) => itemSet.items.contains(item) })
      .map({ case (itemSet, itemSetIndex) =>
        (itemSetIndex, itemSet.items.indexOf(item))
      })
  }
}
Reminder:
type Occ = Seq[(ItemSetIndex, ItemIndex)]
22. How frequent patterns are found (2/6)
One instance of PatternWithOccRddAndSupport for each letter
letter 1
SessionId Occurrence
1 (1,2), (2,1)
2 (1,0)
3 --
Support 2
letter 2
SessionId Occurrence
1 (1,2), (2,1)
2 (1,0)
3 --
Support 2
ID < a > < b > < c >
1 (0,0), (1,0), (2,0) (1,1) (1,2), (2,1)
2 (0,0) -- (1,0)
3 (1,0) (1,1) --
...
… which corresponds to the letter occurrence table from the theory part
1. Calculate the occurrence table of the baseLetters
23. How frequent patterns are found (3/6)
2. Calculate the occurrence tables of (n+1)-Patterns from n-Patterns
val frequentPatterns = Stream.iterate(frequentBaseLettersAndOcccurences)(previousPatterns => {
  previousPatterns.par.flatMap((previousPattern: PatternWithOccRddAndSupport) => {
    previousPattern.occurrences.persist()
    // creates n+1 patterns with enough support
    val nextPatterns = getNextFrequentPatterns(previousPattern.pattern, previousPattern.occurrences)
    previousPattern.occurrences.unpersist()
    nextPatterns
  }).toList
}).takeWhile(_.nonEmpty).flatten.toList
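The level-wise loop can be tried on the toy example without Spark. The sketch below swaps in `Iterator.iterate` for `Stream.iterate` and plain maps for RDDs, covers only the append step, and uses hypothetical names (`Mined`, `append`, `nextFrequent`). It illustrates the iteration scheme and is not the implementation from the talk.

```scala
object LevelWiseMining {
  type Occ = Seq[(Int, Int)]     // (itemset index, item index)
  type Table = Map[Int, Occ]     // database ID -> occurrences

  final case class Mined(pattern: String, table: Table) {
    def support: Int = table.size
  }

  val minSup = 2

  // Frequent base letters of the toy example with their occurrence tables.
  val frequentLetters: List[Mined] = List(
    Mined("a", Map(1 -> Seq((0, 0), (1, 0), (2, 0)), 2 -> Seq((0, 0)), 3 -> Seq((1, 0)))),
    Mined("b", Map(1 -> Seq((1, 1)), 3 -> Seq((1, 1)))),
    Mined("c", Map(1 -> Seq((1, 2), (2, 1)), 2 -> Seq((1, 0))))
  )

  // Append a base letter to a prefix: keep letter occurrences that lie
  // in an itemset after the first occurrence of the prefix.
  def append(prefix: Mined, letter: Mined): Mined = {
    val table = (for {
      (id, prefixOcc) <- prefix.table.toSeq
      if prefixOcc.nonEmpty
      suffixOcc <- letter.table.get(id).toSeq
      first = prefixOcc.map(_._1).min
      occ = suffixOcc.filter { case (itemSetIndex, _) => itemSetIndex > first }
      if occ.nonEmpty
    } yield id -> occ).toMap
    Mined(prefix.pattern + " " + letter.pattern, table)
  }

  // All frequent patterns that are one letter longer than the given prefix.
  def nextFrequent(prefix: Mined): List[Mined] =
    frequentLetters.map(letter => append(prefix, letter)).filter(_.support >= minSup)

  // Grow level by level until a level yields no frequent pattern.
  val all: List[Mined] =
    Iterator.iterate(frequentLetters)(level => level.flatMap(nextFrequent))
      .takeWhile(_.nonEmpty).toList.flatten

  def main(args: Array[String]): Unit =
    all.foreach(m => println(s"< ${m.pattern} > support ${m.support}"))
}
```

On the toy database this stops after the second level, since no pattern of length 3 reaches the minimum support of 2.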
24. How frequent patterns are found (4/6)
2. Calculate the occurrence tables of (n+1)-Patterns from n-Patterns
// based on pattern of length n, identifies frequent patterns of length n+1 by appending and assembling frequent letters
def getNextFrequentPatterns(previousPattern: Pattern, previousPatternOccurrences: RDD[(SessionId, Occ)]): List[PatternWithOccRddAndSupport] = {
  val appendedFreqPatterns = frequentBaseLettersAndOcccurences.par.map(baseLetterAndOccs => {
    // make <a b> + <b> => <a b b>
    val nextPattern = previousPattern ++ baseLetterAndOccs.pattern
    // joins occurrences of previous pattern and any letter
    val joinedOccurrences: RDD[(SessionId, (Occ, Occ))] = previousPatternOccurrences.join(baseLetterAndOccs.occurrences)
    // returns occurrences of new pattern
    val nextPatternOccurrences: RDD[(SessionId, Occ)] = joinedOccurrences.mapAppendedOccPairToOcc()
    PatternWithOccRddAndSupport(nextPattern, nextPatternOccurrences, nextPatternOccurrences.countSupport())
  }) // only leaves pattern with enough support
    .filter(_.support >= minSup).toList
  appendedFreqPatterns
}
25. How frequent patterns are found (5/6)
2. Calculate the occurrence tables of (n+1)-Patterns from n-Patterns
// based on pattern of length n, identifies frequent patterns of length n+1 by appending and assembling frequent letters
def getNextFrequentPatterns(previousPattern: Pattern, previousPatternOccurrences: RDD[(SessionId, Occ)]): List[PatternWithOccRddAndSupport] = {
  …
  // prefix OCC, suffix OCC
  val joinedOccurrences: RDD[(SessionId, (Occ, Occ))] = previousPatternOccurrences.join(baseLetterAndOccs.occurrences)
  // returns occurrences of new pattern
  val nextPatternOccurrences: RDD[(SessionId, Occ)] = joinedOccurrences.mapAppendedOccPairToOcc()
  ...
def mapAppendedOccPairToOcc(): RDD[(SessionId, Occ)] = {
  self.mapValues((occPair: (Occ, Occ)) => {
    // occP = occ(<a b>) , occS = occ(<c>)
    val (occPrefix, occSuffix) = occPair
    PseudoProjection.pseudoProjectionAppend(occPrefix, occSuffix)
  })
}
joinedOccurrences example entry:
<ac> append <a>
<ac> - prefixOcc: [(1,2), (2,1)]
<a> - suffixOcc: [(0,0), (1,0), (2,0)]
-> (2,0)
26. How frequent patterns are found (6/6)
2. Calculate the occurrence tables of (n+1)-Patterns from n-Patterns
def pseudoProjectionAppend(prefixOcc: Occ, suffixOcc: Occ): Occ =
  suffixOcc.filter({ case (suffixItemSetIndex, suffixItemIndex) =>
    // suffix occurrence after first occurrence of prefix;
    // the nonEmpty guard avoids calling min on an empty prefix occurrence list
    prefixOcc.nonEmpty && suffixItemSetIndex > prefixOcc.map(
      { case (prefixItemSetIndex, _) => prefixItemSetIndex }).min
  })
def mapAppendedOccPairToOcc(): RDD[(SessionId, Occ)] = {
  self.mapValues((occPair: (Occ, Occ)) => {
    // occP = occ(<a b>) , occS = occ(<c>)
    val (occPrefix, occSuffix) = occPair
    PseudoProjection.pseudoProjectionAppend(occPrefix, occSuffix)
  })
}
joinedOccurrences example entry:
<ac> append <a>
<ac> - prefixOcc: [(1,2), (2,1)]
<a> - suffixOcc: [(0,0), (1,0), (2,0)]
-> (2,0)
27. Results: Mining frequent patterns of conversions and abandonments
Chart: frequency of patterns for buyers vs. purchase abandoners
Recommended action: interact with potential abandoners after they have repeatedly visited two overview pages (cart reminder, incentives geared towards completing the purchase)
28. Thank you for your attention!
Any further questions?
Find more information on akanoo.com or
email us at hi@akanoo.com!
Editor's Notes
A motivation for mining frequent patterns in the field of e-commerce:
At one of our biggest customers we have seen that 10% of the users check their cart, but only 54% of them buy.
To understand the behavior on the last 5 impressions, we wanted to mine frequent patterns of conversions and abandonments.
Another example:
Customers typically rent “Star Wars”, then “Empire Strikes Back”, and then “Return of the Jedi”. Note that these rentals need not be consecutive. Customers who rent some other videos in between also support this sequential pattern.
< “Star Wars”, “Empire Strikes Back”, “Return of the Jedi” >
Therefore, we have to translate clickstream data into patterns:
We can extract pagetypes from each page impression and also cart changes such as adding or removing an item from the cart.
By doing so, we translate a clickstream into a pattern where each itemset corresponds to a page impression.
The algorithm starts by finding the frequent base patterns of length 1. In each following step, it extends the patterns mined in the previous step with frequent base patterns and searches for occurrences only in the relevant part of the initial database.
In more detail:
PrefixSpan [5] is the most promising of the pattern-growth methods and is based on recursively constructing the patterns, as shown in figure 5. Its great advantage is the use of projected databases. An α-projected database is the set of subsequences in the database that are suffixes of the sequences that have prefix α. In each step, the algorithm looks for the frequent sequences with prefix α in the corresponding projected database. In this way the search space is reduced in each step, allowing for better performance in the presence of small support thresholds.
In the next slides I will explain the algorithm using an example.
The first step is to calculate the occurrences of the base patterns.
Base patterns are: …
What is the occurrence of pattern < a > in pattern 1?
If we count zero-based, then a occurs in itemset 0 at index 0. It also occurs in itemset 1 at index 0, and in itemset 2 at index 0. And in pattern 3? It appears in itemset 1 at index 0.
We call < a > a prefix.
Each column is a realisation of the < a > pseudo-projected database (of the suffixes).
We will see that in the next steps the pseudo-projected database shrinks.
We keep the table of the frequent base letters.
First, we build candidates by extending the frequent patterns of length 1 with frequent base patterns (pruning).
Then, we calculate the occurrences:
What is the occurrence of pattern < a c > in pattern 1? Well, c occurs in itemset 1 at index 2, and a appears in pattern 1 in itemset 0, before itemset 1. Thus, < a c > occurs at (1,2).
Pruning:
A pattern can only be frequent if each of its subpatterns is frequent.
We keep the table of the frequent base letters and the table of occurrences of patterns of length 2.
First, we build candidates by extending the frequent patterns of length 2 with frequent base patterns (pruning).
Then, we calculate the occurrences:
What is the occurrence of pattern < a c a > in pattern 1?
Well, a occurs in itemset 0, but < a c > cannot appear before it, so we don’t take this occurrence.
a also occurs in itemset 1, but again < a c > cannot appear before it, so we don’t take this occurrence either.
a occurs in itemset 2 and, indeed, < a c > occurs before it at (1,2), so we copy this occurrence.
There are no frequent patterns of length 3.
The support is defined as a fraction of the patternCount of the database.
Indeed: for buyers, too, the pattern overview, overview, overview, overview, overview can occur earlier in the session; however, we can use the purchase probability to separate buyers from non-buyers.