Statistical Characteristics of Modified Stochastic Algorithm

Vilnius University
Institute of Mathematics and Informatics

Loreta Savulioniene

Structure
• Data mining
• Steps of the Apriori algorithm
• Association rules
• Modified stochastic algorithm for mining frequent subsequences
• Computer Modeling

Introduction (1)

Discovering new knowledge consists of several steps:
• Data selection;
• Data preparation for analysis;
• Application of algorithms to discover knowledge;
• Presentation of new knowledge.

Introduction (2)
• Data mining is the research and analysis of large amounts of
data using automated or semi-automated methods in order
to find important relations in the data and to discover models
and association rules.
• Data mining is defined as a method of acquiring,
tracking and discovering new meanings in data.

Introduction (3)
All algorithms used for frequent sequence mining can be
classified into two groups:
• Exact algorithms;
• Approximate algorithms.

Apriori algorithm
• Frequent one-element itemsets are found in the first step of
the Apriori algorithm.
• Each subsequent step of the algorithm consists of two parts:
  • generating potentially frequent (candidate) itemsets;
  • determining which candidate itemsets are frequent.
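
The two-part step above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the talk: the function name `apriori`, the toy `baskets` data and the choice to treat `min_supp` as an absolute transaction count are all assumptions.

```python
from itertools import combinations


def apriori(transactions, min_supp):
    """Return {itemset: count} for every itemset found in >= min_supp transactions.

    min_supp is treated as an absolute count (an assumption for this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # First step: count one-element itemsets and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_supp}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Part 1: generate candidate k-itemsets by joining frequent (k-1)-itemsets;
        # a candidate is kept only if all of its (k-1)-subsets are frequent.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Part 2: count each candidate in the database and keep the frequent ones.
        frequent = {}
        for c in candidates:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_supp:
                frequent[c] = n
        result.update(frequent)
        k += 1
    return result


# Toy database: each string is one transaction, each letter one item.
baskets = ["ABC", "AC", "AB", "BC", "ABC"]
found = apriori(baskets, min_supp=3)
```

With this toy data every single item and every pair reaches the threshold, while the triple {A, B, C} occurs in only two transactions and is dropped.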

Association rules (1)
Let I = {i1, i2, …, in} be a set of items.
Let D be a database of transactions, where each transaction T
consists of a set of items such that T⊆ I.
Given itemset X⊆ I, transaction T contains X if and only if X
⊆ T.
Definition 1. An association rule is an implication of the
form X⇒Y, where X⊆ I, Y⊆ I and X∩Y=∅ .
Definition 2. The association rule X⇒Y holds in D with
confidence conf if the probability that a transaction in D which
contains X also contains Y is conf.

Definition 3. The association rule X⇒Y has support supp in
D if the probability that a transaction in D contains both X and
Y is supp.
Definition 4. The confidence conf of the association rule X⇒Y is
the value:
conf(X⇒Y) = supp(X∪Y) / supp(X). (1)
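
Support and confidence can be estimated on a small database as follows. This is a sketch under assumptions: probabilities are replaced by relative frequencies over the transactions, confidence is taken as the ratio of supp(X∪Y) to supp(X) (the usual empirical form of the conditional probability in Definition 2), and the names `support`, `confidence` and `baskets` are illustrative only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= frozenset(t))
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Estimate of P(Y in T | X in T) for the rule X => Y."""
    supp_x = support(transactions, x)
    if supp_x == 0:
        raise ValueError("X never occurs, so the confidence is undefined")
    return support(transactions, frozenset(x) | frozenset(y)) / supp_x


# Toy database: each string is one transaction, each letter one item.
baskets = ["ABC", "AC", "AB", "BC", "ABC"]
supp_ab = support(baskets, "AB")           # A and B together: 3 of 5 transactions
conf_a_b = confidence(baskets, "A", "B")   # of the 4 transactions with A, 3 contain B
```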


Discovering association rules consists of two steps:
1. Discovering frequent itemsets.
2. Creating association rules from the identified
frequent itemsets.

Modified stochastic algorithm for mining
frequent subsequences (1)
• Let us analyse a database D of length M.
• Randomly selected subsets of random length l, each
containing at least one frequent element determined by
the Apriori algorithm, are analysed.
• Assume that the length of an analysed subset is distributed
according to the geometric distribution with the
parameter q, and that the gap between two adjacent
subsets is also distributed according to the geometric
distribution with the parameter p.

Modified stochastic algorithm for mining
frequent subsequences (2)
The average analysed subset length is:
l = q/(1 − q), (2)
and the average length of the gap between adjacent subsets is:
t = p/(1 − p). (3)
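
Formulas (2) and (3) agree with the mean of a geometric distribution counted from zero. The slides do not state the parametrisation, so the following short derivation assumes P(L = k) = (1 − q)·q^k for k = 0, 1, 2, …; the same argument with p in place of q gives (3).

```latex
\begin{aligned}
\mathbb{E}[L]
  &= \sum_{k=0}^{\infty} k\,(1-q)\,q^{k}
   = (1-q)\,q \sum_{k=1}^{\infty} k\,q^{k-1} \\
  &= (1-q)\,q \cdot \frac{1}{(1-q)^{2}}
   = \frac{q}{1-q}.
\end{aligned}
```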
Let us randomly choose N subsets (samples) of various lengths
from the database D. The frequencies ci of subsets of the
appropriate length are calculated using formula (4):
ci = Ni/N, where i = 1, 2, …, n. (4)
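
A possible sampling loop is sketched below. It rests on several assumptions that the slides do not confirm: the database is a single string, geometric values are counted from zero, the walk wraps around at the end of the database, and Ni is read as the number of times the i-th distinct subset was drawn. The names `geometric`, `sample_subsets` and `frequencies` are illustrative, not the author's.

```python
import math
import random
from collections import Counter


def geometric(rng, r):
    """Draw k = 0, 1, 2, ... with P(k) = (1 - r) * r**k, whose mean is r / (1 - r)."""
    u = 1.0 - rng.random()                  # uniform on (0, 1]
    return int(math.log(u) / math.log(r))


def sample_subsets(database, frequent_items, q, p, n_samples, seed=0):
    """Collect up to n_samples chunks that contain at least one frequent item."""
    if not database:
        raise ValueError("database must be non-empty")
    rng = random.Random(seed)
    samples, pos = [], 0
    for _ in range(100 * n_samples):        # hard cap so the loop always ends
        if len(samples) == n_samples:
            break
        pos = (pos + geometric(rng, p)) % len(database)   # skip a random gap
        length = geometric(rng, q)                         # random subset length
        chunk = database[pos:pos + length]
        pos += length
        if any(item in frequent_items for item in chunk):
            samples.append(chunk)
    return samples


def frequencies(samples):
    """Relative frequency c_i = N_i / N of each distinct sampled subset."""
    n = len(samples)
    return {s: c / n for s, c in Counter(samples).items()}


# Toy database; the frequent items would normally come from the Apriori first step.
database = "ABCABDABCE" * 50
samples = sample_subsets(database, frequent_items={"A", "B"}, q=0.75, p=0.5,
                         n_samples=200)
freqs = frequencies(samples)
```

The fixed seed makes the run repeatable; in practice q and p would be tuned so that the mean length q/(1 − q) and mean gap p/(1 − p) suit the database.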

Statistical Characteristics of Modified
Stochastic Algorithm (1)
We have two independent subset samples of sizes n1 and n2.
The first sample contains k1 and the second k2 elements with the required attribute value.
The hypotheses:
H0: p1 = p2 (5)
H1: p1 ≠ p2 (6)

Criterion Statistics u
Criterion statistics u is estimated according to this formula (7):
$$ u = \frac{d_1 - d_2}{\sqrt{\dfrac{k_1 + k_2}{n_1 + n_2} \cdot \left(1 - \dfrac{k_1 + k_2}{n_1 + n_2}\right) \cdot \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad (7) $$

If we denote d = (k1 + k2)/(n1 + n2), the formula becomes (8):

$$ u = \frac{d_1 - d_2}{\sqrt{d \cdot (1 - d) \cdot \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad (8) $$
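
The statistic u can be computed as in the sketch below. The slides do not define d1 and d2; the sketch assumes they are the sample proportions d1 = k1/n1 and d2 = k2/n2, and that the denominator is the usual pooled standard error (the square root of the product shown in the formula). The function name and the example counts are illustrative.

```python
import math


def u_statistic(k1, n1, k2, n2):
    """Two-sample statistic for H0: p1 = p2 with a pooled proportion."""
    d1, d2 = k1 / n1, k2 / n2           # assumed meaning of d1, d2: sample proportions
    d = (k1 + k2) / (n1 + n2)           # pooled proportion d
    se = math.sqrt(d * (1 - d) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("pooled proportion is 0 or 1; the statistic is undefined")
    return (d1 - d2) / se


# Hypothetical counts: 45 of 100 in the first sample, 30 of 100 in the second.
u = u_statistic(k1=45, n1=100, k2=30, n2=100)   # about 2.19
```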

Criterion Statistics z
Criterion statistics z is estimated according to this formula (9):

$$ z = \left(2 \arcsin\sqrt{d_1} - 2 \arcsin\sqrt{d_2}\right) \cdot \sqrt{\frac{n_1 \cdot n_2}{n_1 + n_2}} \qquad (9) $$
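
The statistic z can be sketched in the same way. The sketch assumes the standard arcsine (angular) transformation of the sample proportions d1 = k1/n1 and d2 = k2/n2, with square roots under both the arcsine arguments and the sample-size factor; the function name and the example counts are illustrative.

```python
import math


def z_statistic(k1, n1, k2, n2):
    """Arcsine-transformed statistic for comparing two proportions."""
    d1, d2 = k1 / n1, k2 / n2                    # assumed: sample proportions
    phi1 = 2 * math.asin(math.sqrt(d1))          # angular transform of d1
    phi2 = 2 * math.asin(math.sqrt(d2))          # angular transform of d2
    return (phi1 - phi2) * math.sqrt(n1 * n2 / (n1 + n2))


# Hypothetical counts: 45 of 100 in the first sample, 30 of 100 in the second.
z = z_statistic(k1=45, n1=100, k2=30, n2=100)    # about 2.20
```

For these counts z comes out close to the pooled statistic u computed from the same data, which is the expected behaviour for moderate proportions and equal sample sizes.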

Assumption Evaluation
After the criterion statistic is computed, the hypothesis is
evaluated through its P-value. When the alternative is two-sided
(H1: p1 ≠ p2), the P-value corresponding to the obtained value u
is calculated as follows (10):
P = 2·(1 − NORMSDIST(ABS(u))). (10)
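
Outside a spreadsheet the same P-value can be obtained from the error function. The sketch below takes the two-sided P-value as twice the upper tail of the standard normal distribution; `norm_cdf` stands in for the spreadsheet function NORMSDIST, and both function names are illustrative.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function (what NORMSDIST returns)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def two_sided_p(u):
    """P-value for the two-sided alternative H1: p1 != p2."""
    return 2.0 * (1.0 - norm_cdf(abs(u)))


p_value = two_sided_p(1.96)   # close to 0.05
```

H0 is rejected at significance level α when the P-value is smaller than α.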

Computer Modeling (1)

Transaction number   Item title   Quantity
...                  ...          ...
1001                 I            1
1001                 J            1
1001                 T            1
...                  ...          ...
1002                 A            2
1002                 C            2
...                  ...          ...

ABCDEFGHIJKLMPRSTUV
ACEGIKM
ABTUV
..............................
ABCDEF
CDEFGHIJKLMPRST
............
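
The listing above appears to hold one string of item letters per transaction. Assuming that reading, a transaction table with the columns transaction number, item title and quantity could be turned into such strings as sketched below. The function name `rows_to_sequences`, the alphabetical ordering of items and the decision to ignore the quantity are assumptions; only the table rows visible on the slide are used.

```python
from collections import defaultdict


def rows_to_sequences(rows):
    """Group (transaction number, item title, quantity) rows into one string per transaction."""
    baskets = defaultdict(set)
    for transaction_id, item, _quantity in rows:
        baskets[transaction_id].add(item)      # quantity is ignored in this sketch
    # Items are joined in alphabetical order (an assumption about the file layout).
    return {tid: "".join(sorted(items)) for tid, items in baskets.items()}


# Only the rows that are visible in the transaction table; elided rows are omitted.
rows = [
    (1001, "I", 1),
    (1001, "J", 1),
    (1001, "T", 1),
    (1002, "A", 2),
    (1002, "C", 2),
]
sequences = rows_to_sequences(rows)
```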

This file is processed by the modified stochastic algorithm
with 50 ≤ min_supp ≤ 600.
The average processing time of the algorithm is 2 min 20 s.

Conclusion
• The modified stochastic algorithm is based on the analysis
of randomly chosen subsets that include at least one
frequent element determined by the Apriori algorithm.
• The algorithm is applied to the market basket
problem.
• The most frequent market basket consists of 6 items.
