This document discusses using the Krimp algorithm to perform triclustering on triadic data. Krimp is adapted to find triclusters by identifying sets of attribute-condition pairs that compress a dataset. Experiments applying this approach to movie and bibliography datasets found that it produced high-density triclusters quickly, though allowing singleton itemsets led to many more triclusters. While Krimp shows promise for triadic data analysis, there is a tradeoff between coverage and the number of triclusters when larger minimum itemset sizes are required.
Since the advent of the horseshoe priors for regularization, global-local shrinkage methods have proved to be a fertile ground for the development of Bayesian theory and methodology in machine learning. They have achieved remarkable success in computation, and enjoy strong theoretical support. Much of the existing literature has focused on the linear Gaussian case. The purpose of the current talk is to demonstrate that the horseshoe priors are useful more broadly, by reviewing both methodological and computational developments in complex models that are more relevant to machine learning applications. Specifically, we focus on methodological challenges in horseshoe regularization in nonlinear and non-Gaussian models; multivariate models; and deep neural networks. We also outline the recent computational developments in horseshoe shrinkage for complex models along with a list of available software implementations that allows one to venture out beyond the comfort zone of the canonical linear regression problems.
Medical pathology images are visually evaluated by experts for disease diagnosis, but the connection between image features and the state of the cells in an image is typically unknown. To understand this relationship, we describe a multimodal modeling and inference framework that estimates shared latent structure of joint gene expression levels and medical image features. The method is built around probabilistic canonical correlation analysis (PCCA), which is jointly fit to image embeddings that are learned using convolutional neural networks and linear embeddings of paired gene expression data. We finally discuss a set of theoretical and empirical challenges in domain adaptation settings arising from genomics data. (Based on work in collaboration with Gregory Gundersen and Barbara E. Engelhardt.)
We provide a review of the recent literature on statistical risk bounds for deep neural networks. We also discuss some theoretical results that compare the performance of deep ReLU networks to other methods such as wavelets and spline-type methods. The talk will moreover highlight some open problems and sketch possible new directions.
In this paper we discuss the pseudo-integral of a measurable function based on a strict pseudo-addition and pseudo-multiplication. Furthermore, we establish several important properties of the pseudo-integral of a measurable function based on a strict pseudo-addition decomposable measure.
Typically quantifying uncertainty requires many evaluations of a computational model or simulator. If a simulator is computationally expensive and/or high-dimensional, working directly with a simulator often proves intractable. Surrogates of expensive simulators are popular and powerful tools for overcoming these challenges. I will give an overview of surrogate approaches from an applied math perspective and from a statistics perspective with the goal of setting the stage for the "other" community.
A One-Pass Triclustering Approach: Is There any Room for Big Data? (Dmitrii Ignatov)
An efficient one-pass online algorithm for triclustering of binary data (triadic formal contexts) is proposed. This algorithm is a modified version of the basic algorithm of the OAC-triclustering approach, but it has linear time and memory complexities with respect to the cardinality of the underlying ternary relation and can be easily parallelized for the analysis of big datasets. The results of computer experiments show the efficiency of the proposed algorithm.
The goal of this work is to advance our understanding of what new insights can be gained about crypto-tokens by analyzing the topological structure of the Ethereum transaction network. By introducing a novel combination of tools from topological data analysis and functional data depth into blockchain data analytics, we show that the Ethereum network can provide critical insights on price strikes of crypto-tokens that are otherwise largely inaccessible with conventional data sources and traditional analytic methods.
On the Family of Concept Forming Operators in Polyadic FCA (Dmitrii Ignatov)
Triadic Formal Concept Analysis (3FCA) was introduced by Lehmann and Wille almost two decades ago, and many researchers in Data Mining and Formal Concept Analysis work with the notions of closed sets, Galois and closure operators, and closure systems. However, even though different researchers actively work on mining triadic and n-ary relations, a proper closure operator for the enumeration of triconcepts, i.e. maximal triadic cliques of tripartite hypergraphs, has not yet been introduced. In this talk we show that the previously introduced operators for obtaining triconcepts are not always consistent, describe their family, and study their properties. We also introduce the notion of a maximal switching generator to explain why such concept-forming operators are not closure operators: they violate the monotonicity property.
Multi Model Ensemble (MME) predictions are a popular ad-hoc technique for improving predictions of high-dimensional, multi-scale dynamical systems. The heuristic idea behind the MME framework is simple: given a collection of models, one considers predictions obtained through the convex superposition of the individual probabilistic forecasts in the hope of mitigating model error. However, it is not obvious whether this is a viable strategy and which models should be included in the MME forecast in order to achieve the best predictive performance. I will present an information-theoretic approach to this problem which allows for deriving a sufficient condition for improving dynamical predictions within the MME framework; moreover, this formulation gives rise to systematic and practical guidelines for optimising data assimilation techniques which are based on multi-model ensembles. Time permitting, the role and validity of “fluctuation-dissipation” arguments for improving imperfect predictions of externally perturbed non-autonomous systems - with possible applications to climate change considerations - will also be addressed.
Survey slides for contextual bandits.
Main reference: Li Zhou. A Survey on Contextual Multi-armed Bandits. arXiv, 2015. (https://arxiv.org/abs/1508.03326)
Searching for optimal patterns in Boolean tensors (Dmitrii Ignatov)
These are our slides for a spotlight talk at the Learning with Tensors workshop at NIPS 2016. We briefly summarise a comparison of five different triclustering algorithms (TRIAS, TriBox, OACPrime, OACBox, and SpecTric).
For the canonical regression setup where one wants to discover the relationship between Y and a p-dimensional vector x, BART (Bayesian Additive Regression Trees) approximates the conditional mean E[Y|x] with a sum-of-regression-trees model, where each tree is constrained by a regularization prior to be a weak learner. Fitting and inference are accomplished via a scalable iterative Bayesian backfitting MCMC algorithm that generates samples from a posterior. Effectively, BART is a nonparametric Bayesian regression approach which uses dimensionally adaptive random basis elements. Motivated by ensemble methods in general, and boosting algorithms in particular, BART is defined by a statistical model: a prior and a likelihood. This approach enables full posterior inference including point and interval estimates of the unknown regression function as well as the marginal effects of potential predictors. By keeping track of predictor inclusion frequencies, BART can also be used for model-free variable selection. To further illustrate the modeling flexibility of BART, we introduce two elaborations, MBART and HBART. Exploiting the potential monotonicity of E[Y|x] in components of x, MBART incorporates such monotonicity with a multivariate basis of monotone trees. To allow for the possibility of heteroscedasticity, HBART incorporates an additional product of regression trees model component for the conditional.
Digital Signal Processing [ECEG-3171] - Ch1_L07 (Rediet Moges)
This Digital Signal Processing lecture material is the property of the author (Rediet M.). It is not for publication, nor is it to be sold or reproduced.
For the discovery of a regression relationship between y and x, a vector of p potential predictors, the flexible nonparametric nature of BART (Bayesian Additive Regression Trees) allows for a much richer set of possibilities than restrictive parametric approaches. To exploit the potential monotonicity of the predictors, we introduce mBART, a constrained version of BART that incorporates monotonicity with a multivariate basis of monotone trees, thereby avoiding the further confines of a full parametric form. Using mBART to estimate such effects yields (i) function estimates that are smoother and more interpretable, (ii) better out-of-sample predictive performance and (iii) less post-data uncertainty. By using mBART to simultaneously estimate both the increasing and the decreasing regions of a predictor, mBART opens up a new approach to the discovery and estimation of the decomposition of a function into its monotone components.
(This is joint work with H. Chipman, R. McCulloch and T. Shively).
Conventional Implicature via Dependent Type Semantics (Daisuke BEKKI)
Guest lecture in "expressive content" course (by Eric McCready and Daniel Gutzmann) in the 27th European Summer School in Logic, Language and Information (ESSLLI 2015), Barcelona, Spain.
Quarterly national accounts: methodological advances and prospects for innovation
Seminar
Rome, 21 April 2016
Istat, Aula Magna
Via Cesare Balbo, 14
Similar to Turning Krimp into a Triclustering Technique on Sets of Attribute-Condition Pairs that Compress (20)
Interpretable Concept-Based Classification with Shapley Values (Dmitrii Ignatov)
The slides contain our talk on Shapley values as an interpretable machine learning technique for the JSM method, a rule-based classification and reasoning technique; the values rank the particular attributes of an undetermined example under classification.
https://doi.org/10.1007/978-3-030-57855-8_7
These are the opening slides of the 8th International Conference on Analysis of Images, Social Networks and Texts (AIST 2019). We summarise general facts about the AIST conference series. See the http://aistconf.org website for more details.
A short introduction to Sequential Pattern Mining (in Russian). We consider frequent and frequent closed sequences along with two algorithms (SPADE and PrefixSpan). A demographic case study is provided as well. One can find links and references to relevant literature and software. We mainly follow the Han & Kamber Data Mining book (2nd edition, Chapter 8.3).
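The summary above mentions SPADE and PrefixSpan; as a minimal, hypothetical illustration of the underlying notion they build on (the support of a sequential pattern), not of either algorithm itself:

```python
# Sketch: support counting for sequential patterns (illustrative only,
# not SPADE or PrefixSpan; all names here are hypothetical).
# A pattern is a subsequence of a sequence if its items occur in the
# same order, possibly with gaps in between.
def is_subsequence(pattern, seq):
    it = iter(seq)
    # `item in it` advances the iterator, enforcing left-to-right order
    return all(item in it for item in pattern)

def support(pattern, db):
    """Number of sequences in db containing pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in db)

db = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
# ["a", "c"] occurs in the first two sequences, so with min_sup = 2
# it counts as a frequent sequential pattern.
```

SPADE and PrefixSpan avoid this brute-force scan by using id-lists and prefix-projected databases, respectively, but the support definition they optimise is the one above.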
NIPS 2016, Tensor-Learn@NIPS, and IEEE ICDM 2016 (Dmitrii Ignatov)
Some photo impressions from NIPS & ICDM 2016 in Barcelona mixed with workshops like Learning with Tensors (http://tensor-learn.org/) and related stuff.
Experimental Economics and Machine Learning workshop (Dmitrii Ignatov)
This presentation summarises recent activities around the EEML workshop organisation. It is a successful event which attracts economists and computer scientists who would like to use recent advances in machine learning and data mining to understand human behavior in different domains related to Economics and Social Science.
Pattern-based classification of demographic sequences (Dmitrii Ignatov)
We have proposed prefix-based gapless sequential patterns for the classification of demographic sequences. In comparison to black-box machine learning techniques, this approach provides interpretable patterns suitable for treatment by professional demographers. As the pattern language, we have used Pattern Structures, an extension of Formal Concept Analysis to complex data such as sequences, graphs, and intervals.
This paper presents an interesting idea of how to compute a consensus of several k-partitions of a set by finding an antichain in the concept lattice of an appropriate formal context.
AIST is a scientific conference on Analysis of Images, Social Networks, and Texts. The conference is intended for computer scientists and practitioners whose research interests involve Internet mathematics and other related fields of data science. Similar to the previous year, the conference will be focused on applications of data mining and machine learning techniques to various problem domains: image processing, analysis of social networks, and natural language processing. We hope that the participants will benefit from the interdisciplinary nature of the conference and exchange experience.
In our previous work, an efficient one-pass online algorithm for triclustering of binary data (triadic formal contexts) was proposed. This algorithm is a modified version of the basic algorithm of the OAC-triclustering approach; it has linear time and memory complexities. In this paper we parallelise it via the MapReduce framework in order to make it suitable for big datasets. The results of computer experiments show the efficiency of the proposed algorithm; for example, it outperforms the online counterpart on the Bibsonomy dataset with ≈800,000 triples.
Context-Aware Recommender System Based on Boolean Matrix Factorisation (Dmitrii Ignatov)
In this work we propose and study an approach for collaborative filtering which is based on Boolean matrix factorisation and exploits additional (context) information about users and items. To avoid similarity loss in the case of a Boolean representation, we use an adjusted type of projection of a target user onto the obtained factor space.
We have compared the proposed method with an SVD-based approach on the MovieLens dataset. The experiments demonstrate that the proposed method has better MAE and Precision and comparable Recall and F-measure. We also report an increase in quality when context information is present.
Pattern Mining and Machine Learning for Demographic Sequences (Dmitrii Ignatov)
In this talk, we present the results of our first studies applying pattern mining and machine learning to the analysis of demographic sequences in Russia, based on data covering 11 generations from 1930 to 1984. The main goal is not prediction or the data mining methods themselves, but rather the extraction of interesting patterns and knowledge acquisition from substantial datasets of demographic data. We use decision trees as a technique for demographic event prediction and emerging patterns for finding significant and potentially useful sequences.
RAPS: A Recommender Algorithm Based on Pattern Structures (Dmitrii Ignatov)
We propose a new algorithm for recommender systems with numeric ratings which is based on Pattern Structures (RAPS). As input, the algorithm takes a rating matrix, e.g., one that contains movies rated by users. For a target user, the algorithm returns a ranked list of items (movies) based on the user's previous ratings and the ratings of other users. We compare the results of the proposed algorithm in terms of precision and recall with Slope One, one of the state-of-the-art item-based algorithms, on the MovieLens dataset; RAPS demonstrates the best or comparable quality.
Mining frequent itemsets (of attributes or products) and association rules (Dmitrii Ignatov)
A short introduction to association rule mining in terms of Formal Concept Analysis. Example tasks: near-duplicate document detection, website traffic analysis, contextual advertising.
Boolean matrix factorisation for collaborative filtering (Dmitrii Ignatov)
We propose a new approach for collaborative filtering which is based on Boolean Matrix Factorisation (BMF) and Formal Concept Analysis. In a series of experiments on real data (the MovieLens dataset) we compare the approach with an SVD-based one in terms of Mean Absolute Error (MAE). One of the experimental findings is that binary-scaled rating data suffice for BMF to obtain almost the same MAE as the SVD-based algorithm achieves on non-scaled data.
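As a minimal illustration of the Boolean matrix product behind BMF (a sketch only, not the paper's factorisation algorithm): reconstruction replaces the sum of an ordinary matrix product with logical OR.

```python
# Sketch: Boolean matrix product used to reconstruct a binary matrix
# from its Boolean factors. (A o B)[i][j] = OR over l of (A[i][l] AND B[l][j]).
def bool_product(A, B):
    k = len(B)
    return [[int(any(A[i][l] and B[l][j] for l in range(k)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# Toy user-factor matrix A and factor-item matrix B (names hypothetical).
A = [[1, 0],
     [1, 1]]
B = [[1, 1, 0],
     [0, 1, 1]]
R = bool_product(A, B)  # binary "user likes item" matrix from the factors
# R == [[1, 1, 0], [1, 1, 1]]
```

In the binary-scaled setting of the abstract, the factorisation task is the reverse direction: find Boolean A and B whose product approximates the observed rating matrix R.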
Online Recommender System for Radio Station Hosting: Experimental Results Rev... (Dmitrii Ignatov)
We present a new recommender system developed for the Russian interactive radio network FMhost based on a previously proposed model. The underlying model combines a collaborative user-based approach with information from the tags of listened tracks in order to match user and radio station profiles. It follows an adaptive online learning strategy based on the user history. We compare the proposed algorithms with an industry-standard technique based on singular value decomposition (SVD) in terms of precision, recall, and NDCG measures; experiments show that in our case the fusion-based approach gives the best results.
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx (RASHMI M G)
Abnormal or anomalous secondary growth in plants. It defines secondary growth as an increase in plant girth due to vascular cambium or cork cambium. Anomalous secondary growth does not follow the normal pattern of a single vascular cambium producing xylem internally and phloem externally.
Seminar of U.V. Spectroscopy by SAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light absorbed by the analyte.
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V... (Wasswaderrick3)
In this book, we use conservation of energy techniques on a fluid element to derive the Modified Bernoulli equation of flow with viscous or friction effects. We derive the general equation of flow/velocity and then from this we derive the Poiseuille flow equation, the transition flow equation and the turbulent flow equation. In situations where there are no viscous effects, the equation reduces to the Bernoulli equation. From experimental results, we are able to include other terms in the Bernoulli equation. We also look at cases where pressure gradients exist. We use the Modified Bernoulli equation to derive equations of flow rate for pipes of different cross sectional areas connected together. We also extend our techniques of energy conservation to a sphere falling in a viscous medium under the effect of gravity. We demonstrate Stokes' equation of terminal velocity and the turbulent flow equation. We look at a way of calculating the time taken for a body to fall in a viscous medium. We also look at the general equation of terminal velocity.
Richard's adventures in two entangled wonderlands (Richard Gill)
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt... (Sérgio Sacani)
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io’s surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high resolution imaging of Io’s surface using adaptive optics at visible wavelengths.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige... (University of Maribor)
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
Nucleophilic Addition of carbonyl compounds.pptx (SSR02)
Nucleophilic addition is the most important reaction of carbonyls. Not just aldehydes and ketones, but also carboxylic acid derivatives in general.
Carbonyls undergo addition reactions with a large range of nucleophiles.
Comparing the relative basicity of the nucleophile and the product is extremely helpful in determining how reversible the addition reaction is. Reactions with Grignards and hydrides are irreversible. Reactions with weak bases like halides and carboxylates generally don’t happen.
Electronic effects (inductive effects, electron donation) have a large impact on reactivity.
Large groups adjacent to the carbonyl will slow the rate of reaction.
Neutral nucleophiles can also add to carbonyls, although their additions are generally slower and more reversible. Acid catalysis is sometimes employed to increase the rate of addition.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx (MAGOTI ERNEST)
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and ‘70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation makes them the most convenient, least labor-intensive, live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poor-quality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for the cultivation of fish, crustacean, and shellfish larvae. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
Nutraceutical market, scope and growth: Herbal drug technology (Lokesh Patil)
As consumer awareness of health and wellness rises, the nutraceutical market—which includes goods like functional foods, drinks, and dietary supplements that provide health advantages beyond basic nutrition—is growing significantly. As healthcare expenses rise, the population ages, and people increasingly want natural and preventive health solutions, this industry is growing quickly. Further driving market expansion are product formulation innovations and the use of cutting-edge technology for customized nutrition. With its worldwide reach, the nutraceutical industry is expected to keep growing and to offer significant opportunities for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
BREEDING METHODS FOR DISEASE RESISTANCE.pptx (RASHMI M G)
Plant breeding for disease resistance is a strategy to reduce crop losses caused by disease. Plants have an innate immune system that allows them to recognize pathogens and provide resistance. However, breeding for long-lasting resistance often involves combining multiple resistance genes.
Mudde & Rovira Kaltwasser. - Populism in Europe and the Americas - Threat Or...
Turning Krimp into a Triclustering Technique on Sets of Attribute-Condition Pairs that Compress
1. Turning Krimp into a Triclustering Technique on Sets of Attribute-Condition Pairs that Compress
Maxim Yurov and Dmitry I. Ignatov
National Research University Higher School of Economics, Moscow, Russia
Data Analysis and AI Dept. & Intelligent Systems and Structural Analysis Lab @
Computer Science Faculty
IJCRS 2017
Olsztyn, Poland
03.07.2017
3. Research Domain
Frequent Itemset Mining (FIM) is one of the basic problems in Data Mining.
One of the first FIM tasks was market basket analysis (Agrawal et al., 1993).
One of the first FIM algorithms was Apriori (Agrawal et al., 1994).
4. Frequent Itemset Mining
Problem: a humongous number of frequent itemsets, which complicates the search for the most interesting patterns among them.
Q: How to solve it?
A: For example, use the Minimum Description Length (MDL) principle:
MDL principle
The best set of frequent itemsets compresses the input data best (Siebes A., Vreeken J., van Leeuwen M., Itemsets that compress, 2011).
7. Krimp Algorithm
Input
A database D of transactions over a set of items I (like purchases in a supermarket).
Code Table
The code table CT is a table with two columns: itemsets on the left and their codes on the right.
The left column contains at least all singleton itemsets.
The codes are unique.
9. Figure: Code table example. The width of the Code column shows the
length of the code. I = {A, B, C}. NB: the Usage column is not part
of the code table.2
2 Siebes A., Vreeken J., van Leeuwen M.: Itemsets that compress (2011).
10. Figure: Example of a database, its cover, and the encoded database
based on the code table from Fig. 1. I = {A, B, C}.3
3 Siebes A., Vreeken J., van Leeuwen M.: Itemsets that compress (2011).
11. Figure: Example of the standard code table for the database from
Fig. 2, its cover, and the encoded database.4
4 Siebes A., Vreeken J., van Leeuwen M.: Itemsets that compress (2011).
12. Minimal Coding Set Problem
Let I be a set of items, D be a dataset of transactions (itemsets)
over I, cover be a coverage function, and F be a set of candidate
itemsets. Find the minimal coding set CS ⊆ F such that the
resulting code table CT minimizes the total size of the encoded
database plus the code table, L(D, CT):
L(D, CT) = L(D | CT) + L(CT | D)
L(CT | D) = Σ_{X ∈ CT : usage_D(X) ≠ 0} (L(code_ST(X)) + L(code_CT(X)))
L(D | CT) = Σ_{t ∈ D} L(t | CT)
L(t | CT) = Σ_{X ∈ cover(CT, t)} L(code_CT(X))
L(code_CT(X)) = |code_CT(X)|
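A runnable sketch of computing L(D, CT), under two simplifying assumptions not stated on the slide: code lengths are Shannon-optimal for the observed usage counts, and the L(code_ST(X)) part of L(CT | D) is omitted for brevity; `cover` is a greedy cover function.

```python
from math import log2

def cover(ct, t):
    """Greedily cover transaction t with itemsets from ct
    (ct is assumed to be sorted in Standard Cover Order)."""
    remaining, used = set(t), []
    for X in ct:
        if X <= remaining:
            used.append(X)
            remaining -= X
    return used

def total_encoded_size(D, ct):
    """L(D, CT) = L(D|CT) + L(CT|D), with L(CT|D) reduced to the
    CT-code side (the standard-code part is omitted here)."""
    covers = [cover(ct, t) for t in D]
    usage = {}
    for c in covers:
        for X in c:
            usage[X] = usage.get(X, 0) + 1
    total = sum(usage.values())
    length = {X: -log2(u / total) for X, u in usage.items()}
    l_d_given_ct = sum(length[X] for c in covers for X in c)
    l_ct_given_d = sum(length.values())  # one CT code per used itemset
    return l_d_given_ct + l_ct_given_d

D = [frozenset("AB"), frozenset("ABC"), frozenset("A")]
ct = [frozenset("AB"), frozenset("A"), frozenset("B"), frozenset("C")]
size = total_encoded_size(D, ct)
```

Here {A, B} covers two transactions and the singletons mop up the rest, so the pattern {A, B} earns a short code and lowers the total size.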
13. Krimp algorithm
The algorithmic strategy
It starts with the standard code table ST, which contains only the
singletons X ∈ I.
Then it adds the other itemsets (candidates) from F one by one.
If the resulting code table achieves better compression, Krimp
keeps it and continues the search. Otherwise, Krimp discards this
itemset.
14. Krimp algorithm
Standard Cover Order
Let us order X ∈ CT by decreasing cardinality, then by decreasing
support, and finally in lectic order:
|X| ↓ suppD(X) ↓ lexicographically ↑
Standard Candidate Order
Frequent and long itemsets have priority:
suppD(X) ↓ |X| ↓ lexicographically ↑
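Both orders can be written as Python sort keys (a sketch; `supp` is an assumed dict mapping each itemset to its support in D):

```python
def standard_cover_key(X, supp):
    # |X| descending, support descending, lexicographic ascending
    return (-len(X), -supp[X], sorted(X))

def standard_candidate_key(X, supp):
    # support descending, |X| descending, lexicographic ascending
    return (-supp[X], -len(X), sorted(X))

supp = {frozenset("AB"): 5, frozenset("ABC"): 3, frozenset("C"): 5}
F = sorted(supp, key=lambda X: standard_candidate_key(X, supp))
# {A, B} precedes {C}: at equal support the longer itemset wins;
# {A, B, C} comes last because its support is lowest.
```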
15. Krimp algorithm
Input: a transaction database D and a candidate set F over an
input set of items I.
Output: a heuristic solution to the Minimal Coding Set Problem,
the code table CT.
CT ← StandardCodeTable(D)
F0 ← F in Standard Candidate Order
for F ∈ F0 \ {{i} | i ∈ I} do
  CTc ← (CT ∪ F) in Standard Cover Order
  if L(D, CTc) < L(D, CT) then
    CT ← CTc
  end
end
return CT
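The loop above, translated into a self-contained Python sketch. It reuses the same simplified scoring assumptions as before (Shannon-optimal code lengths from usage counts; the standard-code part of L(CT | D) omitted), so it illustrates the search strategy rather than reproducing the original implementation.

```python
from math import log2

def cover(ct, t):
    remaining, used = set(t), []
    for X in ct:                       # ct is kept in Standard Cover Order
        if X <= remaining:
            used.append(X)
            remaining -= X
    return used

def L(D, ct):
    """Simplified total encoded size L(D, CT)."""
    covers = [cover(ct, t) for t in D]
    usage = {}
    for c in covers:
        for X in c:
            usage[X] = usage.get(X, 0) + 1
    total = sum(usage.values())
    length = {X: -log2(u / total) for X, u in usage.items()}
    return sum(length[X] for c in covers for X in c) + sum(length.values())

def krimp(D, F, items):
    singles = [frozenset({i}) for i in items]
    supp = {X: sum(X <= t for t in D) for X in set(F) | set(singles)}
    cover_key = lambda X: (-len(X), -supp[X], sorted(X))
    cand_key = lambda X: (-supp[X], -len(X), sorted(X))
    ct = sorted(singles, key=cover_key)           # standard code table ST
    for X in sorted((X for X in F if len(X) > 1), key=cand_key):
        ct_c = sorted(ct + [X], key=cover_key)
        if L(D, ct_c) < L(D, ct):                 # keep only if it compresses better
            ct = ct_c
    return ct

D = [frozenset("AB")] * 4 + [frozenset("C")]
CT = krimp(D, [frozenset("AB")], "ABC")
assert frozenset("AB") in CT                      # {A, B} pays for itself
```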
16. Krimp algorithm
Figure: The scheme of Krimp.
5 Siebes A., Vreeken J., van Leeuwen M.: Itemsets that compress (2011).
17. Triadic Data
Folksonomy is a ternary relation over sets of objects, attributes and
conditions.6
From a ternary relation to a dyadic one
(Obj., Attr., Cond.) → (Obj., Attr. × Cond.),
where A × B is the Cartesian product of A and B.
6 T. Vander Wal: Folksonomy Coinage and Definition (2007) –
http://vanderwal.net/folksonomy.html
19. Data
1. A sample of Top-250 movies from www.IMDB.com.
The objects are movie titles, the attributes are keywords, and
the conditions are genres.
2. A sample from bibliography sharing system BibSonomy.org.
The objects are users, the attributes are tags, and the
conditions are electronic bookmarks.
20. Example of data transformation
If there is a movie description in terms of keywords and genres
{Star Wars} × {Princess, Empire} × {Adventure, Sci-Fi, Action},
then this piece of data can be transformed into object-attribute
form as follows:
{Star Wars} × {(Princess, Adventure), (Princess, Sci-Fi),
(Princess, Action), (Empire, Adventure),
(Empire, Sci-Fi), (Empire, Action)}.
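This flattening is just the Cartesian product of the keyword set and the genre set, e.g. with itertools:

```python
from itertools import product

keywords = ["Princess", "Empire"]
genres = ["Adventure", "Sci-Fi", "Action"]
pairs = set(product(keywords, genres))   # the new dyadic attributes
assert len(pairs) == 2 * 3
assert ("Empire", "Sci-Fi") in pairs
```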
21. Biclustering
[Mirkin, 1995]
Coinage of the term bicluster
The term bicluster(ing) was proposed by B. Mirkin in the book
Mathematical Classification and Clustering. Kluwer Academic
Publishers (1996).
p. 296
The term biclustering refers to simultaneous clustering of
both row and column sets in a data matrix. Biclustering
addresses the problems of aggregate representation of the
basic features of interrelation between rows and columns
as expressed in the data.
22. Concept-based biclustering
[D. Ignatov and S. Kuznetsov, 2010]
Let K = (G, M, I ⊆ G × M) be a formal context.
Definition 1
If (g, m) ∈ I, then (m′, g′) is called an object-attribute bicluster or
OA-bicluster with density ρ(m′, g′) = |I ∩ (m′ × g′)| / (|m′| · |g′|).7
7 (·)′ : 2^G → 2^M and (·)′ : 2^M → 2^G are the derivation operators applied to
{g} ⊆ G and {m} ⊆ M in the sense of [Ganter & Wille, 1999].
23. Geometric interpretation of OA-bicluster: connection with RST
[D. Ignatov and S. Kuznetsov, 2010]
Figure: Geometric view of the OA-bicluster (m′, g′), with the labeled
regions g, m, g′, m′, g′′, m′′.
24. Triadic FCA and Triclustering
[Lehmann & Wille, 1993]
Consider K = (G, M, B, J ⊆ G × M × B), a triadic context; in
what follows we will refer to a triset T = (X, Y, Z) with X ⊆ G,
Y ⊆ M, Z ⊆ B as an object-attribute-condition tricluster or simply
tricluster8.
8 Ignatov, D.I., Gnatyshak, D.V., Kuznetsov, S.O., Mirkin, B.G.: Triadic
formal concept analysis and triclustering: searching for optimal patterns.
Machine Learning 101(1-3) (2015) 271–302
25. KRIMP-based triclusters
Each coding set of (attribute, condition) pairs found by Krimp
is contained as a coding block in the description of some
object g ∈ G.
Let S be a coding set returned by Krimp that consists of n
attribute-condition pairs from M × B.
Then the first component X of the corresponding tricluster is
X = {g | (g, m, b) ∈ J for all (m, b) ∈ S}.
The remaining two components are the projections of S:
Y = {m | ∃b : (m, b) ∈ S} and Z = {b | ∃m : (m, b) ∈ S}.
S is not necessarily equal to Y × Z, so some amount of
missing triples is allowed inside T = (X, Y, Z). The quality of
such a tricluster can be assessed by its density.
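A sketch of this construction over a ternary relation J stored as a set of (object, attribute, condition) triples; the function and dataset names are illustrative:

```python
def tricluster_from_coding_set(J, objects, S):
    """Build T = (X, Y, Z) from a coding set S of (attribute, condition)
    pairs: X collects the objects whose descriptions contain all of S;
    Y and Z are the projections of S."""
    X = {g for g in objects if all((g, m, b) in J for (m, b) in S)}
    Y = {m for (m, _) in S}
    Z = {b for (_, b) in S}
    return X, Y, Z

# Toy data in the spirit of the IMDB example:
J = {("Star Wars", "Princess", "Sci-Fi"),
     ("Star Wars", "Empire", "Sci-Fi"),
     ("Return of the Jedi", "Princess", "Sci-Fi"),
     ("Return of the Jedi", "Empire", "Sci-Fi"),
     ("The Terminator", "Cyborg", "Sci-Fi")}
objects = {g for (g, _, _) in J}
S = {("Princess", "Sci-Fi"), ("Empire", "Sci-Fi")}
X, Y, Z = tricluster_from_coding_set(J, objects, S)
```

Here both Star Wars films contain every pair in S, so X holds exactly those two movies.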
26. Quality measures
Density
ρ(T_i) = |J ∩ (X × Y × Z)| / (|X| · |Y| · |Z|)
For the tricluster collection:
ρ(T) = (Σ_{T_i ∈ T} ρ(T_i)) / |T|
Coverage
coverage(T, K) = |(∪_{(X,Y,Z) ∈ T} X × Y × Z) ∩ J| / |J|
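Both measures in a short sketch, assuming J is a set of (object, attribute, condition) triples and a tricluster is a triple of sets (X, Y, Z):

```python
from itertools import product

def density(J, T):
    """Fraction of the tricluster's cells that are actual triples of J."""
    X, Y, Z = T
    inside = sum((g, m, b) in J for (g, m, b) in product(X, Y, Z))
    return inside / (len(X) * len(Y) * len(Z))

def coverage(J, triclusters):
    """Fraction of J covered by at least one tricluster."""
    covered = {t for (X, Y, Z) in triclusters
               for t in product(X, Y, Z) if t in J}
    return len(covered) / len(J)

J = {(1, "a", "x"), (1, "b", "x"), (2, "a", "x"), (2, "b", "x"), (3, "c", "y")}
T = ({1, 2}, {"a", "b"}, {"x"})
assert density(J, T) == 1.0        # every cell of T is in J
assert coverage(J, [T]) == 4 / 5   # one of five triples stays uncovered
```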
30. Examples of triclusters
Three triclusters extracted by Krimp from IMDB dataset.
Tricluster 1.
Keyword-genre component:
{(Princess,Adventure), (Princess,Fantasy), (Empire,Sci-Fi),
(Empire,Adventure), (Empire,Action), (Princess,Sci-Fi),
(Princess,Action), (Empire,Fantasy), (Death Star,Sci-Fi),
(Death Star,Fantasy), (Death Star,Adventure),
(Death Star,Action)},
(2,2)
Movies component:
{Star Wars: Episode VI – Return of the Jedi (1983),
Star Wars (1977)}
31. Examples of triclusters
Three triclusters extracted by Krimp from IMDB dataset.
Tricluster 2.
Keyword-genre component:
{(Future,Sci-Fi), (Future,Thriller), (Future,Action), (Cyborg,Thriller),
(Cyborg,Sci-Fi), (Cyborg,Action), (The Terminator,Thriller),
(The Terminator,Sci-Fi), (The Terminator,Action) },
(2,2)
Movies component:
{The Terminator (1984), Terminator 2: Judgment Day (1991)}
Tricluster 3.
Keyword-genre component:
{(Gotham,Thriller), (Gotham,Drama), (Gotham,Crime), (Gotham,Action),
(Batman,Thriller), (Batman,Drama), (Batman,Crime), (Batman,Action)},
(2,2)
Movies component:
{Batman Begins (2005), The Dark Knight (2008)}.
32. Conclusion
Krimp (or its descendants) can be considered a prospective
method for triadic data analysis.
The positive features:
fast computation time (although on datasets of rather
moderate size with the lowest minimal support minsup = 2);
absolutely dense triclusters (however, this may not be the case
for sparse and noisy datasets);
we can select a rather small set of “large” triclusters (e.g., by
imposing higher support for non-singletons).
The negative features:
the strong trade-off between coverage and the number of
triclusters (when switching from coding sets with singletons to
itemsets of higher size);
an even higher number of triclusters than the number of
triconcepts when the usage of singletons is allowed.