IMPROVING DATA MINING EFFICIENCY BY PREDICTIVE ITEMSETS

Tzung-Pei Hong, Dept. of Electrical Engineering, National Univ. of Kaohsiung, tphong@nuk.edu.tw
Chyan-Yuan Horng, Inst. of Info. Engineering, I-Shou University, hcy@ms4.url.com.tw
Shyue-Liang Wang, Dept. of Info. Management, National Univ. of Kaohsiung, slwang@nuk.edu.tw
ABSTRACT

In this paper, we propose a novel mining algorithm to improve the efficiency of finding large itemsets. The proposed algorithm is based on Denwattana and Getta's prediction concept and considers the data dependency in the given transactions. It aims at efficiently finding any p levels of large itemsets by scanning a database twice except for the first level. A new reasonable estimation method is proposed to predict promising and non-promising candidate itemsets flexibly. The proposed approach estimates for each level a different support threshold, which is derived from a data dependency parameter. Also, it determines whether an item should be included in a promising candidate itemset directly by the supports of items, which are easily obtained in the phase of finding large 1-itemsets.

1. INTRODUCTION

Among the mining techniques proposed, finding association rules from transaction databases is the most commonly seen [1][3][5][8][9][10][11][13][14]. In the past, many algorithms for mining association rules from transactions were proposed, most of which were executed in level-wise processes. Denwattana and Getta then proposed an interesting mining algorithm to reduce the number of database scans in the process of finding large itemsets. Their approach first found the itemsets of single items by one database scan. After that, the approach could find each p more levels of large itemsets by two database scans, where p could be arbitrarily assigned. Denwattana and Getta's approach used a heuristic estimation method to predict the possibly large itemsets. If the prediction was valid, then the approach was efficient in finding the actually large itemsets. The estimation method was thus very critical to the efficiency of the approach.

In this paper, we propose a novel mining algorithm to improve the efficiency of finding large itemsets. Our approach is based on Denwattana and Getta's prediction concept and considers the data dependency in the given transactions. As Denwattana and Getta's approach did, our proposed mining algorithm aims at efficiently finding any p levels of large itemsets by scanning a database twice except for the first level. A new reasonable estimation method is proposed to predict promising and non-promising candidate itemsets flexibly. Different from the approach in [7], our proposed approach estimates for each level a different support threshold, which is derived from a data dependency parameter. Also, our proposed approach determines whether an item should be included in a promising candidate itemset directly by the supports of items, which are easily obtained in the phase of finding large 1-itemsets.

2. REVIEW OF MINING ASSOCIATION RULES

One application of data mining is to induce association rules from transaction data, such that the presence of certain items in a transaction will imply the presence of certain other items. To achieve this purpose, Agrawal and his co-workers proposed several mining algorithms based on the concept of large itemsets to find association rules in transaction data [1][3][5]. They divided the mining process into two phases. In the first phase, candidate itemsets were generated and counted by scanning the transaction data. If the count of an itemset appearing in the transactions was larger than a pre-defined threshold value (called the minimum support), the itemset was considered a large itemset. Itemsets containing only one item were processed first. Large itemsets containing single items were then combined to form candidate itemsets containing two items. This process was repeated until all large itemsets had been found. In the second phase, association rules were induced from the large itemsets found in the first phase. All possible association combinations for each large itemset were formed, and those with calculated confidence values larger than a predefined threshold (called the minimum confidence) were output as association rules.

3. REVIEW OF DENWATTANA AND GETTA'S APPROACH

Denwattana and Getta proposed an approach to find each p levels of large itemsets by two database scans [7]. Their approach partitioned candidate itemsets into two parts: positive candidate itemsets (C+) and negative candidate itemsets (C−). Positive candidate itemsets were guessed to be large and negative candidate itemsets were guessed to be small. Two parameters, called the m-element transaction threshold tt and the frequency threshold tf, were used to judge whether an item could compose a positive candidate itemset. For each integer j, j ≤ tt, the frequency of each item appearing in transactions with j items was found. If at least one frequency of an item was larger than or equal to the fixed frequency threshold tf, the item could be used to compose a positive candidate itemset.
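The item-qualification test described above can be sketched as follows. This is only an illustration of our reading of [7]: the function name `qualifies` and the toy data are our own assumptions, and the frequency is taken here as a raw count over the transactions of each length.

```python
def qualifies(item, transactions, tt, tf):
    """Return True if `item` may compose a positive candidate itemset:
    for some transaction length j <= tt, the item's frequency among
    transactions with exactly j items reaches the threshold tf."""
    for j in range(1, tt + 1):
        freq = sum(1 for t in transactions if len(t) == j and item in t)
        if freq >= tf:
            return True
    return False

# Toy data: item 'A' appears in two of the 2-item transactions.
transactions = [{'A', 'B'}, {'A', 'C'}, {'B', 'C', 'D'}]
print(qualifies('A', transactions, tt=2, tf=2))  # True
print(qualifies('D', transactions, tt=2, tf=2))  # False
```

Note that 'D' fails here only because its sole occurrence is in a 3-item transaction, which the thresholds tt = 2 never examine.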
Their approach first found the large itemsets of single items by one database scan. After that, the approach could find each p more levels of large itemsets by two database scans, where p could be arbitrarily assigned. Denwattana and Getta's approach then formed the positive candidate 2-itemsets C2+, each of which had its two items satisfying the above criteria. The remaining candidate 2-itemsets not in C2+ formed C2−. C3+ and C3− were then formed from only C2+ in a similar way. The positive candidate 2-itemsets which were subsets of the itemsets in C3+ were then removed from C2+. The same process was repeated until p levels of positive and negative candidate itemsets were formed. The database was then scanned to check whether the itemsets in the positive candidate itemsets were actually large and whether the itemsets in the negative candidate itemsets were actually small. The itemsets incorrectly guessed were then expanded and processed by a second database scan.

In this paper, we propose a novel mining algorithm based on Denwattana and Getta's prediction concept. A new reasonable estimation method is proposed to predict promising and non-promising candidate itemsets flexibly. Different from the approach in [7], our proposed approach estimates for each level a different support threshold, which is derived from a data dependency parameter. Also, our proposed approach determines whether an item should be included in a promising candidate itemset directly by the supports of items, which are easily obtained in the phase of finding large 1-itemsets.

4. NOTATION

The notation used in this paper is defined below.

n: the number of transactions;
m: the number of items;
α: the minimum support value;
β: the minimum confidence value;
Ai: the i-th item, 1 ≤ i ≤ m;
counti: the number of occurrences of Ai in the set of transactions;
supporti: the support of Ai, calculated as counti / n;
p: the number of levels to be processed in a phase;
r: the number of items in the itemsets currently being processed;
r': the number of items in the itemsets at the end of a phase;
w: the dependency parameter used to predict the possible support threshold (called the predicting minimum support) for an item to appear in an r-itemset;
Pr: the set of items predicted to appear in r-itemsets;
Lr: the set of large itemsets with r items;
Cr: the set of candidate itemsets with r items;
Cr+: the set of promising candidate itemsets in Cr with each item existing in Pr;
Cr−: the set of non-promising candidate itemsets in Cr with at least one item not existing in Pr;
NC+: the itemsets to be checked in the second database scan due to the incorrectly predicted promising itemsets in a phase;
NC−: the itemsets to be checked in the second database scan due to the incorrectly predicted non-promising itemsets in a phase.

5. THEORETICAL FOUNDATION

From a probabilistic viewpoint, items must usually have greater support values to be covered in large itemsets with more items. At one extreme, if total dependency relations exist in the transactions, an appearance of one item will certainly imply the appearance of another. In this case, the support thresholds for an item to appear in large itemsets with different numbers of items are the same. At the other extreme, if the items are totally independent, then the support thresholds for an item to appear in large itemsets with different numbers of items should be set at different values. In this case, the support threshold for an item to appear in a large itemset with r items can easily be derived as below.

Since all the r items in a large r-itemset must be large 1-itemsets, all the supports of the r items must be larger than or equal to the predefined minimum support α. Since the items are assumed totally independent, the support of the r-itemset is s1 × s2 × ... × sr, where si is the actual support of the i-th item in the itemset. If this r-itemset is large, its support must be larger than or equal to α. Thus:

s1 × s2 × ... × sr ≥ α.

If the predictive support threshold for an item to appear in a large r-itemset is α_r, then:

s1 × s2 × ... × sr ≥ α_r × α_r × ... × α_r ≥ α.

Thus:

α_r^r ≥ α.

It implies:

α_r ≥ α^(1/r).

Therefore, if the items are totally independent, the support threshold of an item should be expected to be α^(1/r) for being included in a large r-itemset.

Since the transactions are seldom totally dependent or totally independent, a data dependency parameter w, ranging between 0 and 1, is then used to calculate the predictive support threshold of an item for appearing in a large r-itemset as:

wα + (1 − w)α^(1/r).

A larger w value represents a stronger item relationship existing in the transactions. w = 1 means total dependency among transaction items and w = 0 means total independency. The proposed approach thus uses a different predictive support threshold for each item to be included in promising itemsets with different numbers of items.
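The predictive support threshold above is straightforward to compute. The sketch below (the function name is ours) reproduces the values used later in the example of Section 7, where α = 0.35 and w = 0.5.

```python
def predicting_minimum_support(alpha, w, r):
    """Weighted average of the totally-dependent threshold (alpha)
    and the totally-independent threshold (alpha ** (1/r))."""
    return w * alpha + (1 - w) * alpha ** (1.0 / r)

for r in (2, 3, 4):
    print(r, round(predicting_minimum_support(0.35, 0.5, r), 3))
# r = 2 -> 0.471 and r = 3 -> 0.527, matching Section 7.
```

With w = 1 the threshold collapses to α for every r (total dependency), and with w = 0 it becomes α^(1/r) (total independency), as derived above.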
6. THE PROPOSED MINING ALGORITHM

The proposed mining algorithm aims at efficiently finding any p levels of large itemsets by scanning a database twice except for the first level. The support of each item from the first database scan is judged directly to predict whether the item will appear in an itemset. The proposed method uses a higher predicting minimum support for each item to be included in a promising itemset with more items. Itemsets with different numbers of items thus have different predicting minimum supports for an item. A predicting minimum support is calculated as the weighted average of the possible minimum supports for totally dependent data and for totally independent data. A data dependency parameter, ranging between 0 and 1, is used as the weight. A mining process similar to that proposed in [7] can then be adopted to find the p levels of large itemsets. The details of the proposed mining algorithm are described below.

The proposed mining algorithm:

INPUT: A set of n transactions with m items, a minimum support value α, a minimum confidence value β, a dependency parameter w, and a level number p.
OUTPUT: A set of association rules.

STEP 1: Calculate the number (counti) of each item Ai appearing in the set of transactions; set the support (supporti) of each item Ai as counti / n.
STEP 2: Check whether the support of each item Ai is larger than or equal to the predefined minimum support α. If the support of Ai is equal to or greater than α, put Ai in the set of large 1-itemsets L1.
STEP 3: Set r = 1, where r is used to represent the number of items in the itemsets currently being processed.
STEP 4: Set r' = 1, where r' is used to record the number of items at the end of a phase.
STEP 5: Set P1 = L1, where Pr is used to predict the items to be included in r-itemsets.
STEP 6: Generate the candidate set Cr+1 from Lr in a way similar to that in the Apriori algorithm [3]. That is, the algorithm first joins Lr and Lr, assuming that r−1 items in the itemsets are the same and the other one is different. It then keeps in Cr+1 the itemsets which have all their sub-itemsets of r items existing in Lr.
STEP 7: Set r = r + 1.
STEP 8: Check whether the support of each item Ai in Pr−1 is larger than or equal to the predicting minimum support wα + (1 − w)α^(1/r) to be included in predicted r-itemsets. If the support of Ai is equal to or greater than the predicting minimum support, put Ai in the set of predicted large items (Pr) for r-itemsets.
STEP 9: Form the promising candidate itemsets Cr+ by choosing from Cr the itemsets with each item existing in Pr.
STEP 10: Set the non-promising candidate itemsets: Cr− = Cr − Cr+.
STEP 11: Set r = r + 1.
STEP 12: Generate the candidate set Cr from C(r−1)+ in a way similar to that in the Apriori algorithm [3]. That is, the algorithm first joins C(r−1)+ and C(r−1)+, assuming that r−2 items in the two itemsets are the same and the other one is different. It then keeps in Cr the itemsets which have all their sub-itemsets of r−1 items existing in C(r−1)+.
STEP 13: Check whether the support of each item Ai in Pr−1 is larger than or equal to the predicting minimum support wα + (1 − w)α^(1/r). If the support of Ai is equal to or greater than the predicting minimum support, put Ai in the set of predicted large items (Pr) for r-itemsets.
STEP 14: Set the non-promising candidate itemsets: Cr− = Cr − Cr+.
STEP 15: Remove the itemsets in C(r−1)+ which are subsets of any itemset in Cr+.
STEP 16: Repeat STEPs 11 to 15 until r = r' + p.
STEP 17: Scan the database to check whether the promising candidate itemsets C(r'+1)+ to Cr+ are actually large and whether the non-promising candidate itemsets C(r'+1)− to Cr− are actually not large. Put the actually large itemsets in the corresponding sets L(r'+1) to Lr.
STEP 18: Find all the proper subsets with r'+1 to i items for each itemset which is not large in Ci+, r'+1 ≤ i ≤ r; keep the proper subsets which are not among the existing large itemsets; denote them as NC+.
STEP 19: Find all the proper supersets with i to r items for each itemset which is large in Ci−, r'+1 ≤ i ≤ r; the supersets must also have all their sub-itemsets of r' items existing in Lr' and cannot include any sub-itemset among the non-large itemsets in Ci+ and Ci− checked in STEP 17; denote them as NC−.
STEP 20: Scan the database to check whether the itemsets in NC+ and NC− are large; add the large itemsets to the corresponding sets L(r'+1) to Lr.
STEP 21: If Lr is not null, set r' = r' + p and go to STEP 5 for another phase; otherwise do the next step.
STEP 22: Add the non-redundant subsets of the large itemsets to the corresponding sets L2 to Lr.
STEP 23: Derive the association rules with confidence values larger than or equal to β from the large itemsets L2 to Lr.
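STEPs 8 to 10 above can be sketched as follows. This is an illustrative sketch, not the authors' implementation; all names are ours, and the supports correspond to those used in the example of the next section.

```python
def partition_candidates(candidates, supports, prev_predicted, alpha, w, r):
    """STEP 8: keep in P_r the previously predicted items whose support
    reaches the predicting minimum support for r-itemsets.
    STEPs 9-10: split C_r into promising candidates (every item in P_r)
    and non-promising candidates (at least one item outside P_r)."""
    threshold = w * alpha + (1 - w) * alpha ** (1.0 / r)
    p_r = {item for item in prev_predicted if supports[item] >= threshold}
    c_plus = [c for c in candidates if set(c) <= p_r]
    c_minus = [c for c in candidates if not set(c) <= p_r]
    return p_r, c_plus, c_minus

# Item supports as in the example, alpha = 0.35, w = 0.5, r = 2:
supports = {'A': 0.9, 'B': 0.8, 'C': 0.7, 'D': 0.4,
            'E': 0.9, 'F': 0.7, 'G': 0.2, 'H': 0.5}
p1 = {'A', 'B', 'C', 'D', 'E', 'F', 'H'}
p2, c2_plus, c2_minus = partition_candidates(
    [('A', 'B'), ('A', 'D'), ('D', 'E')], supports, p1, 0.35, 0.5, 2)
print(sorted(p2))         # ['A', 'B', 'C', 'E', 'F', 'H']
print(c2_plus, c2_minus)  # [('A', 'B')] [('A', 'D'), ('D', 'E')]
```

Here D falls out of the predicted set because its support 0.4 is below the 2-itemset threshold of about 0.471, so any candidate containing D becomes non-promising.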
After STEP 22, all the large itemsets for the transactions have been determined. The final association rules can then be derived in STEP 23 in the same way as in other mining approaches.

7. AN EXAMPLE

In this section, an example is given to illustrate the proposed data-mining algorithm. Assume a database including 10 transactions is shown in Table 1.

Table 1. A database used as an example

ID  Items
1   ABCDEFH
2   ABFGH
3   ABCEF
4   ABEH
5   ABCDEF
6   ABCE
7   BCDEFH
8   ACEF
9   ADEGH
10  ABCEF

Each transaction is composed of a transaction identifier and the items purchased. There are eight items, respectively A, B, C, D, E, F, G and H, to be purchased. The proposed algorithm processes this set of transactions as follows. The count and support of each item are found by scanning the database. The results are shown in Table 2.

Table 2. The count and support of each item

Item  Count  Support
A     9      0.9
B     8      0.8
C     7      0.7
D     4      0.4
E     9      0.9
F     7      0.7
G     2      0.2
H     5      0.5

The support of each item is compared with the predefined minimum support value α. Assume α is set at 0.35 in this example. Since the supports of {A}, {B}, {C}, {D}, {E}, {F} and {H} are larger than or equal to 0.35, they are put in L1. Also, P1 is the same as L1, which is {A, B, C, D, E, F, H}. The candidate set C2 is then formed from L1. Assume in this example the dependency parameter w is set at 0.5. The predicting minimum support value for each item to be in promising 2-itemsets is calculated as:

α' = 0.5 × 0.35 + (1 − 0.5) × 0.35^(1/2) = 0.471.

The support of each item in P1 is then compared with 0.471. Since the supports of {A}, {B}, {C}, {E}, {F} and {H} are larger than 0.471, P2 is {A, B, C, E, F, H}.

The itemsets in C2 with each item existing in P2 are chosen to form the promising candidate itemsets C2+. For example, AB is in C2+ since both A and B are in P2. AD is, however, not in C2+ since D is not in P2, although A is in P2. C2+ is thus formed as follows:

C2+ = {AB, AC, AE, AF, AH, BC, BE, BF, BH, CE, CF, CH, EF, EH, FH}.

The non-promising candidate itemsets C2− are found as:

C2− = C2 − C2+ = {AD, BD, CD, DE, DF, DH}.

Similarly, the predicting minimum support value for each item to be in promising 3-itemsets is calculated as 0.527. C3+ and C3− are thus formed as follows:

C3+ = {ABC, ABE, ABF, ACE, ACF, AEF, BCE, BCF, BEF, CEF}.
C3− = C3 − C3+ = {ABH, ACH, AEH, AFH, BEH, BFH, CEH, CFH}.

Since all the itemsets except AH, BH, CH and EH in C2+ are subsets of itemsets in C3+, they are removed from C2+. Thus:

C2+ = {AH, BH, CH, EH}.

Similarly,

C4+ = {ABCE, ABCF, ABEF, ACEF, BCEF}.
C4− = C4 − C4+ = φ.

The itemsets in C3+ which are subsets of itemsets in C4+ are removed from C3+. Thus:

C3+ = φ.

The database is then scanned to check whether the promising candidate itemsets of C2+ to C4+ are actually large and whether the non-promising candidate itemsets of C2− to C4− are actually not large. The set of large 2-itemsets in C2+ is {AH, BH, EH} and in C2− is {DE}. AH, BH, EH and DE are then put in L2. The itemset CH in C2+ and the itemset DE in C2− are incorrectly predicted. Similarly, L3 is found to be φ and L4 is found to be {ABCE, ABCF, ABEF, ACEF, BCEF}. All the itemsets in C3+, C3−, C4+ and C4− are predicted correctly.

Next, the proper subsets of the itemsets incorrectly predicted in Ci+ are generated. Since only {CH} is incorrectly predicted in this example, its proper subsets with 2 to 4 items and not in the existing large itemsets are φ. Thus NC+ = φ.

Also, the proper supersets of the itemsets incorrectly predicted in Ci− are generated. Since only {DE} is incorrectly predicted in this example, its proper supersets with 3 items and not in the existing large itemsets are {ADE}, {BDE}, {CDE}, {FDE} and {HDE}. Since all of the above supersets contain at least one sub-itemset which is not large in C2− (from STEP 17), they cannot possibly be large. The proper supersets of {DE} with 4 items and not in the existing large itemsets are likewise not possibly large. Thus NC− = φ.
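The hand computations above can be checked mechanically. The short script below (an illustration; the variable names are ours) reproduces the counts of Table 2 and the predicted item set P2 from the transactions of Table 1.

```python
from collections import Counter

# Table 1: the ten example transactions.
transactions = ['ABCDEFH', 'ABFGH', 'ABCEF', 'ABEH', 'ABCDEF',
                'ABCE', 'BCDEFH', 'ACEF', 'ADEGH', 'ABCEF']
counts = Counter(item for t in transactions for item in t)
supports = {item: counts[item] / len(transactions) for item in counts}
print(counts['A'], supports['A'])  # 9 0.9, as in Table 2

# P2: the items of P1 whose support reaches the predicting minimum
# support of 0.471 for 2-itemsets (alpha = 0.35, w = 0.5).
p1 = {'A', 'B', 'C', 'D', 'E', 'F', 'H'}
p2 = {i for i in p1 if supports[i] >= 0.471}
print(sorted(p2))  # ['A', 'B', 'C', 'E', 'F', 'H']
```

Only one pass over the transactions is needed for the counts, which is exactly the single first-level scan the algorithm relies on.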
The database is then scanned to find the large itemsets from NC+ and NC−. Since both NC+ and NC− are empty in this example, no scan is needed. The large itemsets L2 to L4 are then found as follows:

L2 = {AH, BH, EH, DE},
L3 = φ, and
L4 = {ABCE, ABCF, ABEF, ACEF, BCEF}.

Since L4 = {ABCE, ABCF, ABEF, ACEF, BCEF}, which is not null, the next phase is executed. STEPs 5 to 20 are then repeated for L5 to L7. The results are shown as follows:

L5 = {ABCEF},
L6 = φ, and
L7 = φ.

The non-redundant subsets of the found large itemsets are added to the corresponding sets L2 to L5. The final large itemsets L2 to L5 are then found as follows:

L2 = {AB, AC, AE, AF, AH, BC, BE, BF, BH, CE, CF, DE, EF, EH}.
L3 = {ABC, ABE, ABF, ACE, ACF, AEF, BCE, BCF, BEF, CEF}.
L4 = {ABCE, ABCF, ABEF, ACEF, BCEF}.
L5 = {ABCEF}.

8. EXPERIMENTAL RESULTS

The experiments were implemented in VB on a Pentium-IV 2.0 GHz personal computer. There were 8 items to be purchased. 10000 transactions were run by the proposed algorithm and by the Apriori algorithm. These transactions were randomly generated, with each item having a different appearance probability. An item could not be generated twice in a transaction.

Experiments were then made to compare the performance of the proposed approach and the Apriori approach to show the effect of predictive itemsets. The relationships between execution time and minimum support for the proposed algorithm with w = 0.5 and for the Apriori algorithm are shown in Figure 1.

[Figure 1: A comparison of the proposed algorithm with w = 0.5 and the Apriori algorithm; x-axis: minimum support (0 to 1), y-axis: execution time (0 to 2000 sec.).]

From Figure 1, it is easily seen that the proposed algorithm is more efficient than the Apriori algorithm when the minimum support value lies below about 0.7. This is because when the minimum support values are quite large, the numbers of large itemsets become very small, and the time saved by pruning candidate itemsets in the proposed algorithm does not cover the additional overhead. The proposed algorithm is thus suitable for low or middle minimum support values.

9. CONCLUSIONS

In this paper, we have proposed a novel mining algorithm to improve the efficiency of finding large itemsets. The proposed algorithm can efficiently find any p levels of large itemsets by scanning a database twice except for the first level. The proposed approach estimates for each level a different support threshold, which is derived from a data dependency parameter. It determines whether an item should be included in a promising candidate itemset directly by the supports of items, which are easily obtained in the phase of finding large 1-itemsets. An example has also been given to illustrate the algorithm clearly. From the results of the proposed mining algorithm on the example, different data dependency parameter values will yield the same large itemsets, but different predictive effects. Thus, if the data dependency relationships in transactions can be well utilized, the proposed algorithm can help raise the performance of data mining. In the future, we will continue extending the proposed approach to finding sequential patterns.

ACKNOWLEDGEMENT

This research was supported by the National Science Council of the Republic of China under contract NSC91-2213-E-390-001.

REFERENCES

[1] R. Agrawal, T. Imielinski and A. Swami, "Mining association rules between sets of items in large databases," The ACM SIGMOD Conference, pp. 207-216, Washington DC, USA, 1993.
[2] R. Agrawal, T. Imielinski and A. Swami, "Database mining: a performance perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, pp. 914-925, 1993.
[3] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," The International Conference on Very Large Data Bases, pp. 487-499, 1994.
[4] R. Agrawal and R. Srikant, "Mining sequential
patterns," The Eleventh IEEE International Conference on Data Engineering, pp. 3-14, 1995.
[5] R. Agrawal, R. Srikant and Q. Vu, "Mining association rules with item constraints," The Third International Conference on Knowledge Discovery in Databases and Data Mining, pp. 67-73, Newport Beach, California, 1997.
[6] M.S. Chen, J. Han and P.S. Yu, "Data mining: an overview from a database perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866-883, 1996.
[7] N. Denwattana and J.R. Getta, "A parameterised algorithm for mining association rules," The Twelfth Australasian Database Conference, pp. 45-51, 2001.
[8] T. Fukuda, Y. Morimoto, S. Morishita and T. Tokuyama, "Mining optimized association rules for numeric attributes," The ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 182-191, 1996.
[9] J. Han and Y. Fu, "Discovery of multiple-level association rules from large databases," The Twenty-first International Conference on Very Large Data Bases, pp. 420-431, Zurich, Switzerland, 1995.
[10] H. Mannila, H. Toivonen and A.I. Verkamo, "Efficient algorithms for discovering association rules," The AAAI Workshop on Knowledge Discovery in Databases, pp. 181-192, 1994.
[11] J.S. Park, M.S. Chen and P.S. Yu, "Using a hash-based method with transaction trimming for mining association rules," IEEE Transactions on Knowledge and Data Engineering, Vol. 9, No. 5, pp. 812-825, 1997.
[12] L. Shen, H. Shen and L. Cheng, "New algorithms for efficient mining of association rules," The Seventh Symposium on the Frontiers of Massively Parallel Computation, pp. 234-241, 1999.
[13] R. Srikant and R. Agrawal, "Mining generalized association rules," The Twenty-first International Conference on Very Large Data Bases, pp. 407-419, Zurich, Switzerland, 1995.
[14] R. Srikant and R. Agrawal, "Mining quantitative association rules in large relational tables," The 1996 ACM SIGMOD International Conference on Management of Data, pp. 1-12, Montreal, Canada, 1996.