An efficient one-pass online algorithm for triclustering of binary data (triadic formal contexts) is proposed. The algorithm is a modified version of the basic algorithm of the OAC-triclustering approach, but it has linear time and memory complexity with respect to the cardinality of the underlying ternary relation and can be easily parallelised for the analysis of big datasets. The results of computer experiments show the efficiency of the proposed algorithm.
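The one-pass idea can be sketched in a few lines (a simplified illustration, not the authors' implementation; the dictionary-based indexing and hash-based deduplication are our assumptions):

```python
from collections import defaultdict

def online_oac_triclusters(triples):
    """One-pass sketch: index each pair of components to the set of
    third components, then emit one candidate tricluster per triple
    via the three 'prime' sets."""
    users = defaultdict(set)   # (tag, resource) -> users
    tags = defaultdict(set)    # (user, resource) -> tags
    ress = defaultdict(set)    # (user, tag) -> resources
    for u, t, r in triples:    # single pass over the ternary relation
        users[(t, r)].add(u)
        tags[(u, r)].add(t)
        ress[(u, t)].add(r)
    seen = set()
    for u, t, r in triples:
        tri = (frozenset(users[(t, r)]),
               frozenset(tags[(u, r)]),
               frozenset(ress[(u, t)]))
        if tri not in seen:    # hashing keeps the pass near-linear
            seen.add(tri)
            yield tri
```

Each triple contributes one candidate tricluster built from its three "prime" sets, so time and memory stay linear in the number of triples up to hashing.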
Context-Aware Recommender System Based on Boolean Matrix Factorisation (Dmitrii Ignatov)
In this work, we propose and study an approach to collaborative filtering that is based on Boolean matrix factorisation and exploits additional (context) information about users and items. To avoid similarity loss in the Boolean representation, we use an adjusted type of projection of a target user into the obtained factor space.
We compared the proposed method with an SVD-based approach on the MovieLens dataset. The experiments demonstrate that the proposed method achieves better MAE and Precision and comparable Recall and F-measure. We also report an increase in quality when context information is present.
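The adjusted projection can be illustrated with a small sketch (the coverage threshold and the max-based reconstruction are our assumptions, not the paper's exact formulas):

```python
import numpy as np

def project_user(user, item_factors, threshold=0.8):
    """Project a Boolean user vector onto Boolean factors.
    Instead of requiring a factor's items to be fully contained in the
    user's items (which loses similarity), score each factor by the
    fraction of its items the user has rated.
    user: (n_items,) 0/1 vector; item_factors: (k, n_items) 0/1."""
    sizes = item_factors.sum(axis=1)              # items per factor
    overlap = item_factors @ user                 # items shared with user
    weights = np.where(sizes > 0, overlap / np.maximum(sizes, 1), 0.0)
    membership = (weights >= threshold).astype(int)
    # reconstruct predicted items as the Boolean product of memberships
    scores = (item_factors * membership[:, None]).max(axis=0)
    return membership, scores
```

For example, a user matching all items of the first factor but only half of the second is assigned to the first factor only, and the prediction is read off that factor's items.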
Accelerating Pseudo-Marginal MCMC using Gaussian Processes (Matt Moores)
The grouped independence Metropolis-Hastings (GIMH) and Markov chain within Metropolis (MCWM) algorithms are pseudo-marginal methods used to perform Bayesian inference in latent variable models. These methods replace intractable likelihood calculations with unbiased estimates within Markov chain Monte Carlo algorithms. The GIMH method has the posterior of interest as its limiting distribution, but suffers from poor mixing if it is too computationally intensive to obtain high-precision likelihood estimates. The MCWM algorithm has better mixing properties, but less theoretical support. In this paper we accelerate the GIMH method by using a Gaussian process (GP) approximation to the log-likelihood and train this GP using a short pilot run of the MCWM algorithm. Our new method, GP-GIMH, is illustrated on simulated data from a stochastic volatility and a gene network model. Our approach produces reasonable estimates of the univariate and bivariate posterior distributions, and the posterior correlation matrix in these examples with at least an order of magnitude improvement in computing time.
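The pseudo-marginal mechanism behind GIMH can be sketched as follows (a generic toy, not the paper's GP-accelerated method; the Gaussian random-walk proposal and step size are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def gimh(log_lik_hat, log_prior, theta0, n_iter=2000, step=0.5):
    """Grouped-independence MH sketch: the unbiased likelihood
    estimate is *recycled* for the current state (GIMH); re-estimating
    it at every iteration would give MCWM instead.
    log_lik_hat(theta) returns a noisy, unbiased log-likelihood estimate."""
    theta, ll = theta0, log_lik_hat(theta0)
    chain = []
    for _ in range(n_iter):
        prop = theta + step * rng.normal()
        ll_prop = log_lik_hat(prop)          # fresh estimate at proposal
        log_alpha = (ll_prop + log_prior(prop)) - (ll + log_prior(theta))
        if np.log(rng.random()) < log_alpha:
            theta, ll = prop, ll_prop        # keep the estimate (GIMH)
        chain.append(theta)
    return np.array(chain)
```

Recycling the estimate is what makes the chain exact for the true posterior, at the cost of sticky mixing when the estimate is noisy, which is the trade-off the GP approximation targets.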
1. Motivation: why do we need low-rank tensors
2. Tensors of the second order (matrices)
3. CP, Tucker and tensor train tensor formats
4. Many classical kernels have (or can be approximated in) a low-rank tensor format
5. Post-processing: computation of mean, variance, level sets, frequency
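Point 3's tensor-train format can be illustrated with a minimal TT-SVD sketch (NumPy assumed; the truncation tolerance `eps` is our choice):

```python
import numpy as np

def tt_decompose(tensor, eps=1e-10):
    """Tensor-train via sequential truncated SVD (TT-SVD sketch).
    Returns a list of 3-way cores G_k of shape (r_{k-1}, n_k, r_k)."""
    dims = tensor.shape
    cores, r = [], 1
    mat = tensor
    for n in dims[:-1]:
        mat = mat.reshape(r * n, -1)
        U, S, Vt = np.linalg.svd(mat, full_matrices=False)
        rank = max(1, int((S > eps * S[0]).sum()))
        cores.append(U[:, :rank].reshape(r, n, rank))
        mat = S[:rank, None] * Vt[:rank]   # carry the remainder forward
        r = rank
    cores.append(mat.reshape(r, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the cores back into a full tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.reshape([c.shape[1] for c in cores])
```

Storage drops from the product of the mode sizes to a sum of core sizes, which is the "exponential to linear" gain mentioned for the PDE application.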
Full paper: https://arxiv.org/pdf/1804.02339.pdf
We propose and analyze a novel adaptive step size variant of the Davis-Yin three-operator splitting, a method that can solve optimization problems composed of a sum of a smooth term for which we have access to its gradient and an arbitrary number of potentially non-smooth terms for which we have access to their proximal operator. The proposed method leverages local information of the objective function, allowing for larger step sizes while preserving the convergence properties of the original method. It only requires two extra function evaluations per iteration and does not depend on any step size hyperparameter besides an initial estimate. We provide a convergence rate analysis of this method, showing a sublinear convergence rate for general convex functions and linear convergence under stronger assumptions, matching the best known rates of its non-adaptive variant. Finally, an empirical comparison with related methods on six different problems illustrates the computational advantage of the adaptive step size strategy.
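For reference, the underlying fixed-step Davis-Yin iteration looks as follows (a sketch: the paper's adaptive step-size rule is omitted, so `step` is a constant the caller must tune):

```python
import numpy as np

def davis_yin(grad_h, prox_f, prox_g, z0, step, n_iter=500):
    """Fixed-step Davis-Yin three-operator splitting for
    min f(x) + g(x) + h(x), with h smooth and f, g proximable."""
    z = z0.copy()
    for _ in range(n_iter):
        x = prox_f(z, step)
        y = prox_g(2 * x - z - step * grad_h(x), step)
        z = z + y - x
    return prox_f(z, step)
```

For example, minimising 0.5(x - 3)^2 + |x| over x in [0, 2] (soft-thresholding for the l1 prox, clipping for the box prox) converges to x = 2.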
Faster Practical Block Compression for Rank/Select Dictionaries (Rakuten Group, Inc.)
We present faster practical encoding and decoding procedures for block compression. Such encoding and decoding procedures are important to efficiently support rank/select queries on compressed bit vectors. This paper was presented at the 24th International Symposium on String Processing and Information Retrieval (SPIRE 2017) in Palermo, Italy.
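The rank/select queries such structures support can be sketched with an uncompressed block index (the bit vector here is stored verbatim; the paper's contribution, compressing the blocks themselves, is omitted):

```python
class RankDict:
    """Rank/select sketch: store cumulative popcounts per fixed-size
    block and scan inside one block at query time, giving O(block)
    queries with n/block extra counters."""
    def __init__(self, bits, block=64):
        self.bits, self.block = bits, block
        self.super = [0]                       # cumulative popcounts
        for i in range(0, len(bits), block):
            self.super.append(self.super[-1] + sum(bits[i:i + block]))

    def rank1(self, i):
        """Number of 1s in bits[0:i]."""
        b = i // self.block
        return self.super[b] + sum(self.bits[b * self.block:i])

    def select1(self, k):
        """Position of the k-th 1 (1-based), by binary search on rank."""
        lo, hi = 0, len(self.bits)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid + 1) < k:
                lo = mid + 1
            else:
                hi = mid
        return lo
```

Practical designs replace the in-block scan with popcount instructions and store the blocks in compressed form, which is where faster encoding/decoding pays off.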
In this talk, we discuss some recent advances in probabilistic schemes for high-dimensional PIDEs. It is known that traditional PDE solvers, e.g., finite element and finite difference methods, do not scale well with increasing dimension. The idea of probabilistic schemes is to link a wide class of nonlinear parabolic PIDEs to stochastic Lévy processes via a nonlinear version of the Feynman-Kac theory. As such, the solution of the PIDE can be represented by a conditional expectation (i.e., a high-dimensional integral) with respect to a stochastic dynamical system driven by Lévy processes. In other words, we can solve the PIDEs by performing high-dimensional numerical integration. A variety of quadrature methods could be applied, including MC, QMC, sparse grids, etc. Probabilistic schemes have been used in many application problems, e.g., particle transport in plasmas (Vlasov-Fokker-Planck equations), nonlinear filtering (Zakai equations), and option pricing.
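The quadrature idea can be shown on the simplest case, the heat equation u_t = 0.5 u_xx with u(0, .) = g, whose Feynman-Kac representation is u(t, x) = E[g(x + W_t)] (a Brownian toy without the jumps and nonlinearity of the PIDE setting; sample size is our choice):

```python
import numpy as np

rng = np.random.default_rng(1)

def feynman_kac_heat(g, t, x, n_samples=200_000):
    """Monte Carlo estimate of u(t, x) = E[g(x + W_t)], the
    Feynman-Kac solution of u_t = 0.5 u_xx with u(0, .) = g."""
    w = rng.normal(0.0, np.sqrt(t), n_samples)   # W_t ~ N(0, t)
    return g(x + w).mean()
```

With g(y) = y^2 the exact solution is u(t, x) = x^2 + t, so the estimate at t = 1, x = 2 should be close to 5; replacing plain MC by QMC or sparse grids changes only the quadrature rule.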
This paper presents an interesting idea of how to compute a consensus of several k-partitions of a set by finding an antichain in the concept lattice of an appropriate formal context.
AIST is a scientific conference on Analysis of Images, Social Networks, and Texts. The conference is intended for computer scientists and practitioners whose research interests involve Internet mathematics and other related fields of data science. Similar to the previous year, the conference will be focused on applications of data mining and machine learning techniques to various problem domains: image processing, analysis of social networks, and natural language processing. We hope that the participants will benefit from the interdisciplinary nature of the conference and exchange experience.
In our previous work, an efficient one-pass online algorithm for triclustering of binary data (triadic formal contexts) was proposed. This algorithm is a modified version of the basic algorithm of the OAC-triclustering approach; it has linear time and memory complexity. In this paper we parallelise it via the MapReduce framework in order to make it suitable for big datasets. The results of computer experiments show the efficiency of the proposed algorithm; for example, it outperforms its online counterpart on the BibSonomy dataset with ≈800,000 triples.
Experimental Economics and Machine Learning workshop (Dmitrii Ignatov)
This presentation summarises recent activities around the organisation of the EEML workshop. It is a successful event which attracts economists and computer scientists who would like to use recent advances in machine learning and data mining to understand human behavior in different domains related to Economics and Social Science.
On the Family of Concept-Forming Operators in Polyadic FCA (Dmitrii Ignatov)
Triadic Formal Concept Analysis (3FCA) was introduced by Lehmann and Wille almost two decades ago, and many researchers in Data Mining and Formal Concept Analysis work with the notions of closed sets, Galois and closure operators, and closure systems. However, to date, even though different researchers actively work on mining triadic and n-ary relations, a proper closure operator for the enumeration of triconcepts, i.e. maximal triadic cliques of tripartite hypergraphs, has not been introduced. In this talk we show that the previously introduced operators for obtaining triconcepts are not always consistent, describe their family, and study their properties. We also introduce the notion of a maximal switching generator to explain why such concept-forming operators are not closure operators, due to violation of the monotonicity property.
Boolean matrix factorisation for collaborative filtering (Dmitrii Ignatov)
We propose a new approach to collaborative filtering based on Boolean Matrix Factorisation (BMF) and Formal Concept Analysis. In a series of experiments on real data (the MovieLens dataset) we compare the approach with an SVD-based one in terms of Mean Absolute Error (MAE). One of the experimental findings is that binary-scaled rating data is enough for BMF to obtain almost the same MAE as the SVD-based algorithm achieves on non-scaled data.
NIPS 2016, Tensor-Learn@NIPS, and IEEE ICDM 2016 (Dmitrii Ignatov)
Some photo impressions from NIPS & ICDM 2016 in Barcelona mixed with workshops like Learning with Tensors (http://tensor-learn.org/) and related stuff.
Pattern-based classification of demographic sequences (Dmitrii Ignatov)
We have proposed prefix-based gapless sequential patterns for the classification of demographic sequences. In comparison to black-box machine learning techniques, this approach provides interpretable patterns suitable for treatment by professional demographers. As the underlying formalism, we used Pattern Structures, an extension of Formal Concept Analysis to complex data such as sequences, graphs, and intervals.
A short introduction to Sequential Pattern Mining, in Russian. We consider frequent and frequent closed sequences along with two algorithms (SPADE and PrefixSpan). A demographic case study is provided as well, and one can find links and references to relevant literature, libraries, and implementations of some basic algorithms. The exposition mainly follows the Han & Kamber Data Mining book (2nd edition, Chapter 8.3).
Mining frequent itemsets (products) and association rules (Dmitrii Ignatov)
A brief introduction to association rule mining in terms of Formal Concept Analysis. Example applications: near-duplicate document detection, website traffic analysis, and contextual advertising.
RAPS: A Recommender Algorithm Based on Pattern Structures (Dmitrii Ignatov)
We propose a new algorithm for recommender systems with numeric ratings based on Pattern Structures (RAPS). As input, the algorithm takes a rating matrix, e.g., one containing movies rated by users. For a target user, the algorithm returns a ranked list of items (movies) based on the user's previous ratings and the ratings of other users. We compare the results of the proposed algorithm, in terms of precision and recall, with Slope One, one of the state-of-the-art item-based algorithms, on the MovieLens dataset; RAPS demonstrates the best or comparable quality.
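The pattern-structure machinery underlying RAPS can be hinted at with the interval "meet" of two users' rating descriptions (a sketch of the similarity operation only, not the full RAPS pipeline; the dictionary representation is our assumption):

```python
def interval_meet(d1, d2):
    """Meet of two interval descriptions over shared items: the
    smallest interval covering both users' ratings per item, the
    similarity operation of interval pattern structures."""
    return {i: (min(d1[i][0], d2[i][0]), max(d1[i][1], d2[i][1]))
            for i in d1.keys() & d2.keys()}
```

Tight resulting intervals over many shared items indicate similar users, whose ratings can then back a recommendation for the target user.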
Pattern Mining and Machine Learning for Demographic Sequences (Dmitrii Ignatov)
In this talk, we present the results of our first studies applying pattern mining and machine learning to the analysis of demographic sequences in Russia, based on data covering 11 generations from 1930 to 1984. The main goal is not prediction or the data mining methods themselves, but rather the extraction of interesting patterns and knowledge acquisition from substantial demographic datasets. We use decision trees as a technique for demographic event prediction and emerging patterns for finding significant and potentially useful sequences.
Online Recommender System for Radio Station Hosting: Experimental Results Rev... (Dmitrii Ignatov)
We present a new recommender system developed for the Russian interactive radio network FMhost, based on a previously proposed model. The underlying model combines a collaborative user-based approach with information from tags of listened tracks in order to match user and radio station profiles. It follows an adaptive online learning strategy based on the user history. We compare the proposed algorithms with an industry-standard technique based on singular value decomposition (SVD) in terms of precision, recall, and NDCG measures; the experiments show that in our case the fusion-based approach gives the best results.
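NDCG, one of the reported measures, can be computed as follows (log2 discount, one of several common DCG conventions):

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k for one ranked list of graded relevances: DCG of the
    list divided by the DCG of the ideal (sorted) ordering."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list scores 1.0; pushing relevant items down the ranking lowers the score because of the logarithmic position discount.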
Searching for optimal patterns in Boolean tensors (Dmitrii Ignatov)
These are our slides for a spotlight talk at the Learning with Tensors workshop at NIPS 2016. We briefly summarise a comparison of five different triclustering algorithms (TRIAS, TriBox, OACPrime, OACBox, and SpecTric).
Turning Krimp into a Triclustering Technique on Sets of Attribute-Condition P... (Dmitrii Ignatov)
Mining ternary relations or triadic Boolean tensors is one of the recent trends in knowledge discovery that allows one to take into account various modalities of input object-attribute data.
For example, in movie databases like IMDb, an analyst may find not only movies grouped by specific genres but also see their common keywords. In so-called folksonomies, users can be grouped according to their shared resources and used tags. In gene expression analysis, genes can be grouped along with tissue samples and time intervals, providing comprehensible patterns. However, pattern explosion effects are seriously aggravated by even one more dimension. In this paper, we continue our previous study on searching for a smaller collection of "optimal" patterns in triadic data with respect to a set of quality criteria such as pattern cardinality, density, diversity, coverage, etc. We show how a simple data preprocessing step has enabled us to use a frequent itemset mining algorithm.
My talk at the International Conference on Monte Carlo Methods and Applications (MCM2023), devoted to advances in mathematical aspects of stochastic simulation and Monte Carlo methods, held at Sorbonne Université on June 28, 2023. The talk is about my recent works (i) "Numerical Smoothing with Hierarchical Adaptive Sparse Grids and Quasi-Monte Carlo Methods for Efficient Option Pricing" (link: https://doi.org/10.1080/14697688.2022.2135455) and (ii) "Multilevel Monte Carlo with Numerical Smoothing for Robust and Efficient Computation of Probabilities and Densities" (link: https://arxiv.org/abs/2003.05708).
To describe the dynamics taking place in networks that structurally change over time, we propose an approach to search for attributes whose value changes impact the topology of the graph. In several applications, it appears that the variations of a group of attributes are often followed by some structural changes in the graph that, one may assume, they generate. We formalize the triggering pattern discovery problem as a method jointly rooted in sequence mining and graph analysis. We apply our approach to three real-world dynamic graphs of different natures: a co-authoring network, an airline network, and a social bookmarking system, assessing the relevancy of the triggering pattern mining approach.
Simple representations for learning: factorizations and similarities (Gael Varoquaux)
Real-life data seldom comes in the ideal form for statistical learning. This talk focuses on high-dimensional problems for signals and discrete entities: when dealing with many correlated signals or entities, it is useful to extract representations that capture these correlations.
Matrix factorization models provide simple but powerful representations. They are used for recommender systems across discrete entities such as users and products, or to learn good dictionaries to represent images. However, they entail large computing costs on very high-dimensional data, databases with many products, or high-resolution images. I will present an algorithm to factorize huge matrices based on stochastic subsampling that gives up to 10-fold speed-ups [1].
With discrete entities, the explosion of dimensionality may be due to variations in how a smaller number of categories are represented. Such a problem of "dirty categories" is typical of uncurated data sources. I will discuss how encoding this data based on similarities recovers a useful category structure with no preprocessing. I will show how it interpolates between one-hot encoding and techniques used in character-level natural language processing.
[1] A. Mensch, J. Mairal, B. Thirion, G. Varoquaux. Stochastic subsampling for factorizing huge matrices. IEEE Transactions on Signal Processing 66(1), 113-128.
[2] P. Cerda, G. Varoquaux, B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning (2018): 1-18.
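The similarity-encoding idea of [2] can be sketched with n-gram Jaccard similarity (the trigram size and the Jaccard choice are illustrative assumptions; the paper studies several string similarities):

```python
def ngrams(s, n=3):
    """Set of character n-grams of a padded, lowercased string."""
    s = f" {s.lower()} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity_encode(value, categories, n=3):
    """Replace one-hot's exact match with an n-gram Jaccard similarity
    to each known category, so dirty variants of a category
    (e.g. 'Sr. Engineer' vs 'senior engineer') get close vectors."""
    g = ngrams(value, n)
    out = []
    for c in categories:
        gc = ngrams(c, n)
        out.append(len(g & gc) / len(g | gc) if g | gc else 0.0)
    return out
```

An exact match still yields 1.0 in its own column, so the encoding interpolates between one-hot encoding and a character-level representation, as described above.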
Typically quantifying uncertainty requires many evaluations of a computational model or simulator. If a simulator is computationally expensive and/or high-dimensional, working directly with a simulator often proves intractable. Surrogates of expensive simulators are popular and powerful tools for overcoming these challenges. I will give an overview of surrogate approaches from an applied math perspective and from a statistics perspective with the goal of setting the stage for the "other" community.
Simplifying Gaussian Mixture Models Via Entropic Quantization (EUSIPCO 2009) (Frank Nielsen)
Slides for the paper presented at EUSIPCO 2009:
Simplifying Gaussian Mixture Models Via Entropic Quantization
http://www.eurasip.org/Proceedings/Eusipco/Eusipco2009/contents/papers/1569187249.pdf
We apply the tensor train (TT) data format to solve an elliptic PDE with uncertain coefficients. We reduce complexity and storage from exponential to linear in the dimension. Post-processing in the TT format is also provided.
Interpretable Concept-Based Classification with Shapley Values (Dmitrii Ignatov)
The slides contain our talk on Shapley values as an interpretable machine learning technique for the JSM-method, a rule-based classification and reasoning technique, used for ranking particular attributes of an undetermined example under classification.
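Shapley values themselves can be computed exactly by enumerating coalitions (exponential in the number of attributes, which is fine for ranking a handful of them; this generic sketch is not the paper's JSM integration):

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: for each player, average its marginal
    contribution value(S + {p}) - value(S) over all coalitions S,
    weighted by |S|! (n - |S| - 1)! / n!."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(set(S) | {p}) - value(set(S)))
        phi[p] = total
    return phi
```

In the classification setting, `value` would score a subset of attributes (e.g., by how strongly the induced rules support a class), and the resulting phi values rank the attributes.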
https://doi.org/10.1007/978-3-030-57855-8_7
These are the opening slides of the 8th International Conference on Analysis of Images, Social Networks and Texts (AIST 2019). We summarise general facts about the AIST conference series. See the http://aistconf.org website for more details.
Social Learning in Networks: Extraction of Deterministic Rules (Dmitrii Ignatov)
In this talk, we want to introduce experimental economics to the field of data mining and vice versa. It continues related work on mining deterministic behaviour rules of human subjects in data gathered from experiments. Game-theoretic predictions partially fail to work with this data: equilibria, also known as game-theoretic predictions, succeed only with experienced subjects in specific games, conditions which are rarely given. Contemporary experimental economics offers a number of alternative models apart from game theory. In the relevant literature, these models are always biased by philosophical plausibility considerations and are claimed to fit the data. An agnostic data mining approach to the problem is introduced in this paper: the philosophical plausibility considerations follow after the correlations are found. No biases other than determinism are assumed. The dataset of the paper "Social Learning in Networks" by Choi et al. (2012) is used for evaluation. As a result, we come up with new findings. As future work, the design of a new infrastructure is discussed.
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making. They monitor common gases, weather parameters, and particulates.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... (Ana Luísa Pinho)
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and their capacity to enable complex behavior composed of discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization.
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Cancer Cell Metabolism: Special Reference to the Lactate Pathway (AADYARAJPANDEY1)
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy they need to function.
Energy is stored in the bonds of glucose and when glucose is broken down, much of that energy is released.
Cells utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules of a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Krebs cycle. The Krebs cycle allows cells to “burn” the pyruvate made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis, Krebs cycle, oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
In cancer cells:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
Introduction to the Warburg phenomenon:
Warburg effect: cancer cells are usually highly glycolytic ("glucose addiction") and take up more glucose from outside than normal cells do.
Otto Heinrich Warburg (8 October 1883 – 1 August 1970) was awarded the Nobel Prize in Physiology or Medicine in 1931 for his "discovery of the nature and mode of action of the respiratory enzyme."
Warburg effect: the tendency of cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg observed that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
(May 29th, 2024) Advancements in Intravital Microscopy - Insights for Preclini... (Scintica Instrumentation)
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows for ultra-fast, high-resolution imaging of cellular processes over time and space, studied in their natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provides insights into the progression of disease, response to treatments, or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM Technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system's unique features and user-friendly software enable researchers to probe fast dynamic biological processes such as immune cell tracking and cell-cell interaction, as well as vascularization and tumor metastasis, with exceptional detail. This webinar will also give an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo and allowing for the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancement of novel therapeutic strategies.
What are greenhouse gases, and how many gases affect the Earth? (moosaasad1975)
An overview of what greenhouse gases are, how they affect the Earth and its environment, the future of the environment and the Earth, and how they influence weather and climate.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt... (Sérgio Sacani)
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io’s surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high-resolution imaging of Io’s surface using adaptive optics at visible wavelengths.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.Sérgio Sacani
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
Seminar of U.V. Spectroscopy by SAMIR PANDASAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that measures the amount of light absorbed by the analyte.
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest
imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters
spanning 0.4−0.9µm) and novel JWST images with 14 filters spanning 0.8−5µm, including 7 mediumband filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data
at > 2.3µm to construct an ultradeep image, reaching as deep as ≈ 31.4 AB mag in the stack and
30.3-31.0 AB mag (5σ, r = 0.1” circular aperture) in individual filters. We measure photometric
redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts
z = 11.5−15. These objects show compact half-light radii of R1/2 ∼ 50−200 pc, stellar masses of
M⋆ ∼ 10⁷−10⁸ M⊙, and star-formation rates of SFR ∼ 0.1−1 M⊙ yr⁻¹. Our search finds no candidates
at 15 < z < 20, placing upper limits at these redshifts. We develop a forward modeling approach to
infer the properties of the evolving luminosity function without binning in redshift or luminosity that
marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the
impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results,
and that the luminosity function normalization and UV luminosity density decline by a factor of ∼ 2.5
from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical
models for evolution of the dark matter halo mass function.
A One-Pass Triclustering Approach: Is There any Room for Big Data?
1. A One-Pass Triclustering Approach: Is There any Room
for Big Data?
Dmitry V. Gnatyshak1 Dmitry I. Ignatov1 Sergei O. Kuznetsov1 Lhouari
Nourine2
National Research University Higher School of Economics, Russian Federation
Blaise Pascal University, LIMOS, CNRS, France
10.10.2014
Dmitry V. Gnatyshak et al. A One-Pass Triclustering Approach 10.10.2014 1 / 26
6. Outline
1 Motivation
2 Prime OAC-triclustering
Formal concept analysis
Basic algorithm
Online version of the algorithm
3 Experiments
Description of the experiments
Datasets
Results
4 Conclusion
7. Motivation
Large amounts of multimodal data:
Gene expression data
Folksonomies
. . .
Non-binary data can be scaled (possibly increasing the dimensionality)
Growing volumes of big data: fast algorithms are required (linear or sublinear, one-pass)
Existing methods find all p-clusters satisfying some conditions (often an exponential number of them)
9. Prime OAC-triclustering
Formal concept analysis: triadic case
Definition
Let G, M, B be some sets. Let the ternary relation I be a subset of their Cartesian
product: I ⊆ G × M × B. Then the tuple K = (G, M, B, I) is called a triadic
formal context.
G — a set of objects, M — a set of attributes, B — a set of conditions.
[Cross-table of the triadic context: rows g1–g4 (objects), attribute columns m1–m3 repeated under each condition b1, b2, b3; crosses mark the triples of I]
10. Prime OAC-triclustering
Formal concept analysis: triadic case
Definition
Galois operators (prime operators) are defined in the same way as in the dyadic case:
2^G → 2^(M×B)
2^M → 2^(G×B)
2^B → 2^(G×M)
2^G × 2^M → 2^B
2^G × 2^B → 2^M
2^M × 2^B → 2^G
11. Prime OAC-triclustering
Formal concept analysis: triadic case
({g1, g2}, {m1,m2})′ = {b1, b3}
12. Prime OAC-triclustering
Formal concept analysis: triadic case
m2′ = {(g1, b1), (g2, b1), (g3, b1), (g1, b2), (g1, b3), (g2, b3), (g4, b3)}
13. Prime OAC-triclustering
Formal concept analysis: triadic case
Definition
The triple (X, Y, Z) is called a triadic formal concept of the context
K = (G, M, B, I) if X ⊆ G, Y ⊆ M, Z ⊆ B, (X, Y)′ = Z, (X, Z)′ = Y, and
(Y, Z)′ = X.
X is called the (formal) extent, Y the (formal) intent, and Z the (formal) modus.
14. Prime OAC-triclustering
Basic algorithm
This method uses the following types of prime operators (for the context
K = (G,M,B, I )):
(g,m)′ = {b ∈ B | (g,m, b) ∈ I },
(g, b)′ = {m ∈ M | (g,m, b) ∈ I },
(m, b)′ = {g ∈ G | (g,m, b) ∈ I }
Definition
Then the triple T = ((m, b)′, (g, b)′, (g, m)′) is called a prime OAC-tricluster based
on the triple (g, m, b) ∈ I. The components of the tricluster are called, respectively, its extent,
intent, and modus. The triple (g, m, b) is called the generating triple of the tricluster T.
Definition
Density of a tricluster: ρ(X, Y, Z) = |I ∩ (X × Y × Z)| / (|X| · |Y| · |Z|)
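The pairwise prime operators and the density measure above can be sketched in a few lines of Python (a minimal illustration, not the authors' implementation; the toy context below is hypothetical):

```python
from itertools import product

# A toy triadic context given as a set of (object, attribute, condition) triples.
I = {("g1", "m1", "b1"), ("g1", "m2", "b1"),
     ("g2", "m1", "b1"), ("g2", "m2", "b2")}

def prime_gm(g, m):  # (g, m)' = {b in B | (g, m, b) in I}
    return {b for (gg, mm, b) in I if (gg, mm) == (g, m)}

def prime_gb(g, b):  # (g, b)' = {m in M | (g, m, b) in I}
    return {m for (gg, m, bb) in I if (gg, bb) == (g, b)}

def prime_mb(m, b):  # (m, b)' = {g in G | (g, m, b) in I}
    return {g for (g, mm, bb) in I if (mm, bb) == (m, b)}

def density(X, Y, Z):
    # rho(X, Y, Z) = |I ∩ (X × Y × Z)| / (|X| · |Y| · |Z|)
    filled = sum(1 for t in product(X, Y, Z) if t in I)
    return filled / (len(X) * len(Y) * len(Z))

# The prime OAC-tricluster generated by the triple (g1, m1, b1):
T = (prime_mb("m1", "b1"), prime_gb("g1", "b1"), prime_gm("g1", "m1"))
```

Here T comes out as ({g1, g2}, {m1, m2}, {b1}); its density is 3/4, since the cell (g2, m2, b1) is empty in this toy context.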
15. Prime OAC-triclustering
Basic algorithm
An example of a tricluster based on the triple (g̃, m̃, b̃):
16. Prime OAC-triclustering
Basic algorithm
Require: K = (G,M, B, I ) — triadic context;
ρmin — density threshold
Ensure: T = {T = (X, Y, Z)}
1: T := ∅
2: for all (g,m) : g ∈ G,m ∈ M do
3: PrimesObjAttr [g,m] = (g,m)′
4: end for
5: for all (g, b) : g ∈ G,b ∈ B do
6: PrimesObjCond[g, b] = (g, b)′
7: end for
8: for all (m, b) : m ∈ M,b ∈ B do
9: PrimesAttrCond[m, b] = (m, b)′
10: end for
11: for all (g,m, b) ∈ I do
12: T = (PrimesAttrCond[m, b], PrimesObjCond[g, b], PrimesObjAttr [g,m])
13: Tkey = hash(T)
14: if Tkey ∉ T.keys ∧ ρ(T) ≥ ρmin then
15: T [Tkey] := T
16: end if
17: end for
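The basic algorithm above translates almost line-for-line into Python. The following is a hedged sketch (the `frozenset` key plays the role of hash(T); the dictionary names follow the slides):

```python
from collections import defaultdict

def oac_triclusters(I, rho_min=0.0):
    """Basic prime OAC-triclustering over a set of triples I."""
    # Precompute the three families of prime sets, one pass over I each.
    primes_gm = defaultdict(set)  # (g, m) -> {b}
    primes_gb = defaultdict(set)  # (g, b) -> {m}
    primes_mb = defaultdict(set)  # (m, b) -> {g}
    for (g, m, b) in I:
        primes_gm[(g, m)].add(b)
        primes_gb[(g, b)].add(m)
        primes_mb[(m, b)].add(g)

    def density(X, Y, Z):
        filled = sum(1 for g in X for m in Y for b in Z if (g, m, b) in I)
        return filled / (len(X) * len(Y) * len(Z))

    result = {}
    for (g, m, b) in I:
        T = (primes_mb[(m, b)], primes_gb[(g, b)], primes_gm[(g, m)])
        key = tuple(frozenset(s) for s in T)  # plays the role of hash(T)
        if key not in result and density(*T) >= rho_min:
            result[key] = T
    return list(result.values())
```

For example, on the toy context {(g1,m1,b1), (g1,m2,b1), (g2,m1,b1), (g2,m2,b2)} this yields four distinct triclusters, three of which have density 1.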
17. Prime OAC-triclustering
Online version of the algorithm
Let K = (G,M,B, I ) be a triadic context. We do not know G, M, B, I , or their
cardinalities.
Input on each iteration: {(g,m, b)} = J ⊆ I .
Goal — maintain an updated version of the results and efficiently update them
when new triples are received.
We need to keep in memory the results of prime operators’ application (prime
sets):
PrimesObjAttr — dictionary with elements of type ((g,m), {b ∈ B}), g ∈ G,
m ∈ M;
PrimesObjCond — dictionary with elements of type ((g, b), {m ∈ M}),
g ∈ G, b ∈ B;
PrimesAttrCond — dictionary with elements of type ((m, b), {g ∈ G}),
m ∈ M, b ∈ B.
18. Prime OAC-triclustering
Online version of the algorithm
Remark
In this case we need to consider triclusters based on different generating triples as
different, even if their extents, intents, and modi are equal.
19. Prime OAC-triclustering
Online version of the algorithm
Algorithm of triples addition (standard):
Require: J — a set of triples to add;
T = {T = (∗X, ∗Y , ∗Z)} — current tricluster set;
PrimesObjAttr , PrimesObjCond, PrimesAttrCond;
Ensure: T = {T = (∗X, ∗Y , ∗Z)};
PrimesObjAttr , PrimesObjCond, PrimesAttrCond;
1: for all (g, m, b) ∈ J do
2: PrimesObjAttr[g, m] := PrimesObjAttr[g, m] ∪ {b}
3: PrimesObjCond[g, b] := PrimesObjCond[g, b] ∪ {m}
4: PrimesAttrCond[m, b] := PrimesAttrCond[m, b] ∪ {g}
5: T := T ∪ {(&PrimesAttrCond[m, b], &PrimesObjCond[g, b], &PrimesObjAttr[g, m])}
6: end for
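The pointer semantics of the addition step (the & symbols above) can be imitated in Python by storing each tricluster as a triple of references to the shared, mutable prime sets, so later additions are visible inside already-created triclusters. A sketch under that assumption, with names mirroring the slides:

```python
from collections import defaultdict

primes_obj_attr = defaultdict(set)   # (g, m) -> {b}
primes_obj_cond = defaultdict(set)   # (g, b) -> {m}
primes_attr_cond = defaultdict(set)  # (m, b) -> {g}
triclusters = []  # each entry holds references to three live prime sets

def add_triples(J):
    """Online addition of a batch J of (g, m, b) triples."""
    for (g, m, b) in J:
        primes_obj_attr[(g, m)].add(b)
        primes_obj_cond[(g, b)].add(m)
        primes_attr_cond[(m, b)].add(g)
        # Python sets are mutable objects, so this stores pointers, not
        # copies: future additions update existing triclusters as well.
        triclusters.append((primes_attr_cond[(m, b)],
                            primes_obj_cond[(g, b)],
                            primes_obj_attr[(g, m)]))
```

After `add_triples([("g1", "m1", "b1")])` the first tricluster is ({g1}, {m1}, {b1}); adding ("g2", "m1", "b1") afterwards enlarges that same tricluster's extent to {g1, g2} through the shared set.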
20. Prime OAC-triclustering
Online version of the algorithm
Algorithm of triples removal (optional):
Triclusters must be kept in the dictionary with generating triples being keys.
Require: J — a set of triples to remove;
T = {(key(T),T = (∗X, ∗Y , ∗Z))} — current tricluster dictionary;
PrimesObjAttr , PrimesObjCond, PrimesAttrCond;
Ensure: T = {T = (∗X, ∗Y , ∗Z)};
PrimesObjAttr , PrimesObjCond, PrimesAttrCond;
1: for all (g, m, b) ∈ J do
2: T := T \ {T[(g, m, b)]}
3: PrimesObjAttr[g, m] := PrimesObjAttr[g, m] \ {b}
4: PrimesObjCond[g, b] := PrimesObjCond[g, b] \ {m}
5: PrimesAttrCond[m, b] := PrimesAttrCond[m, b] \ {g}
6: end for
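Removal, as the slide notes, only works when triclusters are kept in a dictionary keyed by their generating triples. A hypothetical minimal version (it assumes each triple was added exactly once):

```python
from collections import defaultdict

primes_obj_attr = defaultdict(set)   # (g, m) -> {b}
primes_obj_cond = defaultdict(set)   # (g, b) -> {m}
primes_attr_cond = defaultdict(set)  # (m, b) -> {g}
triclusters = {}  # generating triple -> tricluster (shared prime sets)

def add_triple(g, m, b):
    primes_obj_attr[(g, m)].add(b)
    primes_obj_cond[(g, b)].add(m)
    primes_attr_cond[(m, b)].add(g)
    triclusters[(g, m, b)] = (primes_attr_cond[(m, b)],
                              primes_obj_cond[(g, b)],
                              primes_obj_attr[(g, m)])

def remove_triple(g, m, b):
    del triclusters[(g, m, b)]           # T := T \ {T[(g, m, b)]}
    primes_obj_attr[(g, m)].discard(b)   # (g, m)' \ {b}
    primes_obj_cond[(g, b)].discard(m)   # (g, b)' \ {m}
    primes_attr_cond[(m, b)].discard(g)  # (m, b)' \ {g}
```

Because the surviving triclusters reference the same prime sets, shrinking a prime set here shrinks every tricluster that points to it.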
21. Prime OAC-triclustering
Online version of the algorithm
If a user has asked for output, we may need to remove triclusters with the
same extent, intent, and modus at the post-processing stage. At this stage we can
also check various conditions (for instance, a minimal density condition).
Require: T = {T = (∗X, ∗Y , ∗Z)} — current tricluster set;
Ensure: T = {T = (∗X, ∗Y , ∗Z)} — processed tricluster hash-set;
1: for all T ∈ T do
2: Compute hash(T)
3: if hash(T) ∉ T then
4: T := T ∪ {T}
5: end if
6: end for
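The post-processing pass can be sketched as follows (a hedged illustration; `frozenset` triples stand in for the hash values on the slide, and the density check is folded in):

```python
def postprocess(triclusters, I, rho_min=0.0):
    """Deduplicate triclusters and apply a minimal-density condition."""
    def density(X, Y, Z):
        filled = sum(1 for g in X for m in Y for b in Z if (g, m, b) in I)
        return filled / (len(X) * len(Y) * len(Z))

    seen = set()
    out = []
    for (X, Y, Z) in triclusters:
        key = (frozenset(X), frozenset(Y), frozenset(Z))  # hash(T)
        if key not in seen and density(X, Y, Z) >= rho_min:
            seen.add(key)
            out.append((X, Y, Z))
    return out
```

This keeps one representative per distinct (extent, intent, modus) triple, in first-seen order.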
22. Prime OAC-triclustering
Online version of the algorithm
Remark 1
To allow efficient access to the prime sets, the dictionaries PrimesObjAttr,
PrimesObjCond, and PrimesAttrCond must be implemented as hash tables.
Remark 2
For an efficient computation of triclusters' hash values we can store the hash
value of each prime set along with the set itself. The hash value of a tricluster
is then some function of its prime sets' hash values (for instance, their sum
with non-repeating coefficients).
It is important not to use an LSH (Locality-Sensitive Hashing) function here:
deduplication requires exact equality of hash values for equal triclusters.
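Remark 2 can be illustrated by caching one hash per prime set and combining the three with fixed distinct coefficients; the coefficients below are arbitrary illustrative choices, not values from the slides:

```python
def set_hash(s):
    # Order-independent hash of a set; Python's frozenset hash fits.
    return hash(frozenset(s))

def tricluster_hash(X, Y, Z):
    # Combine the cached component hashes with distinct (non-repeating)
    # coefficients so swapping extent/intent/modus changes the value.
    return 3 * set_hash(X) + 5 * set_hash(Y) + 7 * set_hash(Z)
```

In the online setting, set_hash values would be kept alongside the prime sets and updated on each addition, so tricluster_hash is a constant-time combination rather than a fresh pass over the sets.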
23. Prime OAC-triclustering
Online version of the algorithm
Complexities:
Time complexity: O(|I |) (as there is a constant number of operations on
each step);
More precisely: 8|I | operations in total;
1 Modification of 3 prime sets (3);
2 Creation of a new tricluster (1);
3 Addition of pointers to its extent, intent, and modus (3);
4 Addition of the tricluster to the set of all triclusters (1).
Memory complexity: O(|I |) (as we need to keep in memory only the prime sets:
at most |I | elements in each dictionary, plus keys).
24. Prime OAC-triclustering
Online version of the algorithm
Example:
25. Prime OAC-triclustering
Online version of the algorithm
→ (g1,m1, b1)
1 PrimesObjAttr = {((g1,m1), {b1})}
2 PrimesObjCond = {((g1, b1), {m1})}
3 PrimesAttrCond = {((m1, b1), {g1})}
4 T := T ∪ {PrimesAttrCond[m1, b1], PrimesObjCond[g1, b1], PrimesObjAttr [g1,m1]}
26. Prime OAC-triclustering
Online version of the algorithm
→ (g1,m2, b1)
1 PrimesObjAttr = {((g1,m1), {b1}), ((g1,m2), {b1})}
2 PrimesObjCond = {((g1, b1), {m1,m2})}
3 PrimesAttrCond = {((m1, b1), {g1}), ((m2, b1), {g1})}
4 T := T ∪ {PrimesAttrCond[m2, b1], PrimesObjCond[g1, b1], PrimesObjAttr [g1,m2]}
27. Prime OAC-triclustering
Online version of the algorithm
→ (g2,m1, b1)
1 PrimesObjAttr = {((g1,m1), {b1}), ((g1,m2), {b1}), ((g2,m1), {b1})}
2 PrimesObjCond = {((g1, b1), {m1,m2}), ((g2, b1), {m1})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1})}
4 T := T ∪ {PrimesAttrCond[m1, b1], PrimesObjCond[g2, b1], PrimesObjAttr [g2,m1]}
28. Prime OAC-triclustering
Online version of the algorithm
→ (g2,m2, b1)
1 PrimesObjAttr = {((g1,m1), {b1}), ((g1,m2), {b1}), ((g2,m1), {b1}), ((g2,m2), {b1})}
2 PrimesObjCond = {((g1, b1), {m1,m2}), ((g2, b1), {m1,m2})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2})}
4 T := T ∪ {PrimesAttrCond[m2, b1], PrimesObjCond[g2, b1], PrimesObjAttr [g2,m2]}
29. Prime OAC-triclustering
Online version of the algorithm
→ (g3,m3, b1)
1 PrimesObjAttr =
{((g1,m1), {b1}), ((g1,m2), {b1}), ((g2,m1), {b1}), ((g2,m2), {b1}), ((g3,m3), {b1})}
2 PrimesObjCond = {((g1, b1), {m1,m2}), ((g2, b1), {m1,m2}), ((g3, b1), {m3})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3})}
4 T := T ∪ {PrimesAttrCond[m3, b1], PrimesObjCond[g3, b1], PrimesObjAttr [g3,m3]}
30. Prime OAC-triclustering
Online version of the algorithm
→ (g1,m2, b2)
1 PrimesObjAttr = {((g1,m1), {b1}), ((g1,m2), {b1, b2}), ((g2,m1),
{b1}), ((g2,m2), {b1}), ((g3,m3), {b1})}
2 PrimesObjCond = {((g1, b1), {m1,m2}), ((g2, b1), {m1,m2}), ((g3, b1),
{m3}), ((g1, b2), {m2})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1),
{g3}), ((m2, b2), {g1})}
4 T := T ∪ {PrimesAttrCond[m2, b2], PrimesObjCond[g1, b2], PrimesObjAttr [g1,m2]}
31. Prime OAC-triclustering
Online version of the algorithm
→ (g2,m1, b2)
1 PrimesObjAttr = {((g1,m1), {b1}), ((g1,m2), {b1, b2}), ((g2,m1), {b1, b2}),
((g2,m2), {b1}), ((g3,m3), {b1})}
2 PrimesObjCond = {((g1, b1), {m1,m2}), ((g2, b1), {m1,m2}), ((g3, b1), {m3}),
((g1, b2), {m2}), ((g2, b2), {m1})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3}),
((m2, b2), {g1}), ((m1, b2), {g2})}
4 T := T ∪ {PrimesAttrCond[m1, b2], PrimesObjCond[g2, b2], PrimesObjAttr [g2,m1]}
32. Prime OAC-triclustering
Online version of the algorithm
→ (g2,m2, b2)
1 PrimesObjAttr = {((g1,m1), {b1}), ((g1,m2), {b1, b2}), ((g2,m1), {b1, b2}),
((g2,m2), {b1, b2}), ((g3,m3), {b1})}
2 PrimesObjCond = {((g1, b1), {m1,m2}), ((g2, b1), {m1,m2}), ((g3, b1), {m3}),
((g1, b2), {m2}), ((g2, b2), {m1,m2})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3}),
((m2, b2), {g1, g2}), ((m1, b2), {g2})}
4 T := T ∪ {PrimesAttrCond[m2, b2], PrimesObjCond[g2, b2], PrimesObjAttr [g2,m2]}
34. Prime OAC-triclustering
Online version of the algorithm
Postprocessing:
1 T(g1,m1,b1) = ({g1, g2}, {m1, m2}, {b1}) ← add
2 T(g1,m2,b1) = ({g1, g2}, {m1, m2}, {b1, b2}) ← add
3 T(g2,m1,b1) = ({g1, g2}, {m1, m2}, {b1, b2}) ← the same as T(g1,m2,b1), skip
4 T(g2,m2,b1) = ({g1, g2}, {m1, m2}, {b1, b2}) ← the same as T(g1,m2,b1), skip
5 T(g3,m3,b1) = ({g3}, {m3}, {b1, b2}) ← add
6 T(g1,m2,b2) = ({g1, g2}, {m2}, {b1, b2}) ← add
7 T(g2,m1,b2) = ({g2}, {m1, m2}, {b1, b2}) ← add
8 T(g2,m2,b2) = ({g1, g2}, {m1, m2}, {b1, b2}) ← the same as T(g1,m2,b1), skip
9 T(g3,m3,b2) = ({g3}, {m3}, {b1, b2}) ← the same as T(g3,m3,b1), skip
35. Prime OAC-triclustering
Online version of the algorithm
The final output set of triclusters:
1 T1 = ({g1, g2}, {m1,m2}, {b1})
2 T2 = ({g1, g2}, {m1,m2}, {b1, b2})
3 T3 = ({g3}, {m3}, {b1, b2})
4 T4 = ({g1, g2}, {m2}, {b1, b2})
5 T5 = ({g2}, {m1,m2}, {b1, b2})
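Putting the pieces together, the whole running example from the preceding slides can be replayed in a few lines of Python. This is a sketch, not the authors' code, but it reproduces the five triclusters listed above, in the same order:

```python
from collections import defaultdict

# The nine triples streamed in the example.
stream = [("g1", "m1", "b1"), ("g1", "m2", "b1"), ("g2", "m1", "b1"),
          ("g2", "m2", "b1"), ("g3", "m3", "b1"), ("g1", "m2", "b2"),
          ("g2", "m1", "b2"), ("g2", "m2", "b2"), ("g3", "m3", "b2")]

p_gm, p_gb, p_mb = defaultdict(set), defaultdict(set), defaultdict(set)
generated = []  # one tricluster reference per generating triple

for (g, m, b) in stream:  # online addition of each triple
    p_gm[(g, m)].add(b)
    p_gb[(g, b)].add(m)
    p_mb[(m, b)].add(g)
    # Store references to the shared prime sets (the pointer semantics).
    generated.append((p_mb[(m, b)], p_gb[(g, b)], p_gm[(g, m)]))

seen, final = set(), []  # post-processing: drop duplicate triclusters
for (X, Y, Z) in generated:
    key = (frozenset(X), frozenset(Y), frozenset(Z))
    if key not in seen:
        seen.add(key)
        final.append(key)
```

Running this leaves exactly the five triclusters T1–T5 above in `final`, confirming the step-by-step trace.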
37. Experiments
Description of the experiments
Goals:
Show that the online algorithm for p-dimensional attribute clustering
outperforms the basic algorithm
Confirm the complexity estimates
For each dataset and each version of the algorithm, 11 experiments were conducted:
one per density threshold (from 0 to 1 in steps of 0.1). To measure running time
more precisely, each algorithm was run 5 times on each context and the average
result was recorded.
Additional tests were run to check performance on big datasets and to confirm the
linearity of the online algorithm.
All experiments were conducted on a computer with an Intel Core i7-351U 2.40
GHz processor, 8 GB RAM, and the Windows 8 operating system.
38. Experiments
Datasets
5 pseudo-random uniform contexts of size 50 × 50 × 50. The probability of each
triple's presence varied from 0.02 to 0.1 in steps of 0.02
10 pseudo-random uniform contexts with average density equal to 0.001.
Cardinalities of the sets varied from 100 to 1000 in steps of 100
Top-250 list of IMDB (Internet Movie Database) (triples: (movie, tag,
genre))
Sample of 3000 triples of the first 100 000 triples of Bibsonomy.org dataset
(triples: (user, bookmark, tag))
43. Experiments
Results
Densities for the contexts:
46. Conclusion
The prime OAC-triclustering approach and its basic algorithm were described
A one-pass, linear-time online version of the basic algorithm was proposed
The efficiency of the online algorithm and the complexity estimates of both
algorithms were confirmed experimentally.
47. Thank you!
Questions?