SlideShare a Scribd company logo
1 of 24
Download to read offline
An efficient algorithm for mining frequent
inter-transaction patterns
Anthony J.T. Lee *, Chun-Sheng Wang
Department of Information Management, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan, ROC
Received 3 May 2006; received in revised form 1 March 2007; accepted 10 March 2007
Abstract
In this paper, we propose an efficient method for mining all frequent inter-transaction patterns. The method consists of
two phases. First, we devise two data structures: a dat-list, which stores the item information used to find frequent inter-
transaction patterns; and an ITP-tree, which stores the discovered frequent inter-transaction patterns. In the second phase,
we apply an algorithm, called ITP-Miner (Inter-Transaction Patterns Miner), to mine all frequent inter-transaction pat-
terns. By using the ITP-tree, the algorithm requires only one database scan and can localize joining, pruning, and support
counting to a small number of dat-lists. The experiment results show that the ITP-Miner algorithm outperforms the FITI
(First Intra Then Inter) algorithm by one order of magnitude.
Ó 2007 Elsevier Inc. All rights reserved.
Keywords: Association rules; Data mining; Inter-transaction patterns
1. Introduction
Association rule mining is an important problem in the data mining field [1,2,4,5,8,10–
13,15,16,20,22,25,27]. Traditional association analysis is intra-transactional because it focuses on association
relationships among itemsets within a transaction. For example, an intra-transaction association rule might
be: ‘‘If the stock prices of Microsoft and IBM go up, the price of Intel is likely to go up on the same day.’’
However, intra-transactional approaches cannot capture a rule like: ‘‘If the stock prices of Microsoft and
IBM go up, the price of Intel is likely to go up two days later.’’ Inter-transaction association rule mining
[6,7,14,17,18,23,24] extends the association rules to describe association relationships among itemsets across
several transactions.
Many algorithms for mining inter-transaction association rules have been proposed. Lu et al. [17] and Feng
et al. [6] applied inter-transaction association rule mining algorithms to the prediction of trends in meteoro-
logical and stock market data. Lu et al. [17,18] proposed the EH-Apriori algorithm, which uses the Apriori
0020-0255/$ - see front matter Ó 2007 Elsevier Inc. All rights reserved.
doi:10.1016/j.ins.2007.03.007
*
Corresponding author.
E-mail addresses: jtlee@ntu.edu.tw, d91725001@ntu.edu.tw (A.J.T. Lee).
Information Sciences 177 (2007) 3453–3476
www.elsevier.com/locate/ins
algorithm to discover frequent inter-transaction itemsets. To enhance their algorithm’s efficiency, the authors
used a hashing technique to reduce the number of candidate itemsets of length two. Feng et al. [7] used tem-
plates to reduce the number of rules. More recently, Tung et al. [23,24] developed an algorithm, called FITI
(First Intra Then Inter), which discovers frequent intra-transaction itemsets and uses them to generate fre-
quent inter-transaction itemsets.
All the algorithms for mining inter-transaction association rules developed thus far have been based on
Apriori-like breadth-first search (BFS) approaches that search for frequent itemsets level by level. At each
level, a database must be scanned once to determine the support for each candidate itemset. It has been shown
that Apriori-like approaches [19,22] perform well in finding frequent intra-transaction itemsets when the item-
sets are short. However, when mining long frequent itemsets, or using very small support thresholds, the per-
formance of such algorithms often deteriorates dramatically. The reason is that a frequent itemset of length k
implies the presence of 2k
À 2 additional frequent sub itemsets, each of which must be examined. Moreover,
since Apriori-like approaches may generate a large number of candidate patterns at each level, they are prone
to memory shortage during the mining process. We observe that Apriori-like methods for finding frequent
inter-transaction itemsets have the same drawbacks as those for finding frequent intra-transaction itemsets.
Therefore, in this paper, we propose an efficient method for mining a complete set of frequent inter-trans-
action itemsets (patterns). The method consists of two phases. First, we find all frequent items. For each fre-
quent item found, we construct a dat-list that records the item information required for finding the frequent
inter-transaction patterns. Then, we devise a data structure, called an ITP-tree, to store the discovered fre-
quent inter-transaction patterns. In the second phase, we propose an algorithm, called ITP-Miner, to effi-
ciently find all frequent inter-transaction patterns in a depth-first search manner. By using the ITP-tree and
dat-lists to mine the frequent inter-transaction patterns, the ITP-Miner algorithm requires only one database
scan and can localize joining, pruning, and support counting to a small number of dat-lists. Therefore, it is
more efficient than the FITI algorithm.
The remainder of this paper is organized as follows. In Section 2, we describe the problem of mining fre-
quent inter-transaction patterns. Section 3 introduces our proposed algorithm, ITP-Miner, for mining fre-
quent inter-transaction patterns. We explain the basic concept of the algorithm and give an example to
illustrate how it works. Section 4 describes the performance evaluation. Finally, we present the conclusions
in Section 5.
2. Problem description
In this section, we introduce the notations and describe the problem of mining frequent inter-transaction
patterns.
Definition 1. Let I be a set of data items, and N be a set of non-negative integers called domain attributes. A
transaction database consists of a set of transactions, where a transaction is in the form of ht; si; t 2 N and
s  I; t is called a dimensional attribute (or dat), and s is called an itemset.
The dimensional attribute describes the properties associated with the data items, such as time and location.
As it is an ordinal, it can be divided into intervals of equal length. For example, time can be divided into days,
weeks, etc. These intervals can be represented by non-negative integers 0, 1, 2, and so on. An itemset is denoted
by fu1; u2; . . . ; ukg, where ui is an item, 1 6 i 6 k; and items in an itemset are listed in alphabetical order, i.e., we
write fa; c; dg instead of fc; a; dg. Table 1 shows a transaction database containing six transactions.
Definition 2. An itemset s ¼ fu1; u2; . . . ; ukg at the dimensional attribute t is called an extended itemset and
denoted by Dts ¼ fu1ðtÞ; u2ðtÞ; . . . ; ukðtÞg.
Before mining inter-transaction association rules, we need to find the frequent inter-transaction itemsets
that span several transactions. Since an inter-transaction itemset can span many intervals, discovering all such
itemsets would require a lot of resources, but a user may only be interested in rules that span a certain number
of intervals. Therefore, to avoid wasting resources by mining unwanted rules, we introduce a parameter called
maxspan. When mining for inter-transaction association rules, we only mine rules whose span is equal to or
less than the maxspan intervals.
3454 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
Definition 3. Let Dt0
a0; Dt1
a1; . . . ; Dtn
an be extended itemsets, where t0  t1  Á Á Á  tn, and tn À t0 is not greater
than a user-specified maximum span threshold maxspan. Then, let a ¼ D0a0 [ Dt1Àt0
a1 [ Á Á Á [ DtnÀt0
an, where [
is an operator that combines multiple extended itemsets into single unit, a is called an inter-transaction itemset
(or a pattern) and t0 is called the reference point. An item in the inter-transaction itemset is called an extended
item. Since a takes t0 as the reference point, we say that a starts from t0. Note that the reference point of an
inter-transaction itemset is defined as the smallest dimensional attribute of the extended itemsets in the inter-
transaction itemset.
For example, let us consider the database as shown in Table 1, where fað1Þ; bð1Þg is the extended itemset for
the first transaction. Let maxspan = 1. The first two transactions form an inter-transaction itemset,
fað0Þ; bð0Þ; að1Þ; cð1Þ; dð1Þg, where we take the dimensional attribute of the first transaction as the reference
point.
Definition 4. Let uðiÞ and vðjÞ be two extended items. If (i ¼ j and u ¼ v), we can say that uðiÞ ¼ vðjÞ. Also, if
(i ¼ j and u  v) or ði  jÞ, we say that uðiÞ  vðjÞ.
Definition 5. Let a ¼ fu0ð0Þ; u1ði1Þ; u2ði2Þ; . . . ; unðinÞg and b ¼ fv0ð0Þ; v1ðj1Þ; v2ðj2Þ; . . . ; vnðjnÞg be two patterns,
where n P 1. We say that a ¼ b if uiðiiÞ ¼ viðjiÞ for 0 6 i 6 n, where i0 ¼ j0 ¼ 0. We also say that a  b, if (1)
u0ð0Þ  v0(0); or (2) there exists k ðP 0Þ such that uiðiiÞ ¼ viðjiÞ for 0 6 i 6 k  n, and ukþ1ðikþ1Þ  vkþ1ðjkþ1Þ.
Definition 6. Let a ¼ fu0ð0Þ; u1ði1Þ; u2ði2Þ; . . . ; unðinÞg and b ¼ fv0ð0Þ; v1ðj1Þ; v2ðj2Þ; . . . ; vmðjmÞg be two patterns,
where 1 6 n 6 m. a is a subset of b if we find n extended items vk1ðjk1Þ; vk2ðjk2Þ; . . . ; vknðjknÞ in b such that
u0ð0Þ ¼ v0ð0Þ; u1ði1Þ ¼ vk1ðjk1Þ; u2ði2Þ ¼ vk2ðjk2Þ; . . . ; and unðinÞ ¼ vknðjknÞ. We can also say that b contains a.
For example, both fbð0Þ; að1Þ; cð3Þg and fað0Þ; dð1Þg are subsets of fað0Þ; bð0Þ; að1Þ; dð1Þ; bð3Þ; cð3Þg.
Definition 7. In a transaction database D, the number of transactions is denoted by jDj. Let a be a pattern, and
Ta be all inter-transaction itemsets in D, where every inter-transaction itemset in T a contains a. The count of a,
count(a), is denoted by jTaj and the support of a, sup(a), is denoted by countðaÞ=jDj. If sup(a) is not less than the
user-specified minimum support threshold minsup, a is called a frequent inter-transaction pattern (or frequent
pattern for short).
Definition 8. The number of extended items in a pattern is called the length of the pattern. A pattern of length
k is called a k-pattern.
For example, the length of fað0Þ; að1Þ; dð1Þ; bð3Þ; cð3Þg is equal to 5.
Definition 9. An inter-transaction association rule is written in the form of a ! b, where both a and a [ b are
frequent patterns, a  b ¼ /, conf ða ! bÞ is not less than the user-specified minimum confidence, and the con-
fidence of the rule is defined as supða [ bÞ=supðaÞ.
Suppose that minsup = 50% and maxspan = 1. Let us consider the database D shown in Table 1, where the
database contains six transactions (i.e., jDj ¼ 6). To compute the support of fað0Þ; bð0Þ; dð1Þg, we first form an
inter-transaction itemset fað0Þ; bð0Þ; að1Þ; cð1Þ; dð1Þg by using the first two transactions and find that it
contains fað0Þ; bð0Þ; dð1Þg. Thus, countðfað0Þ; bð0Þ; dð1ÞgÞ ¼ 1. Next, we form an inter-transaction itemset
Table 1
A transaction database
Dat Itemset
1 {a,b}
2 {a,c,d}
3 {a}
4 {a,b,c,d}
5 {a,b,d}
6 {a,d}
A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3455
fað0Þ; bð0Þ; cð0Þ; dð0Þ; að1Þ; bð1Þ; dð1Þg by using the fourth and fifth transactions, and find that it contains
fað0Þ; bð0Þ; dð1Þg. Thus, the count of fað0Þ; bð0Þ; dð1Þg is increased by one. Finally, we form an inter-
transaction itemset fað0Þ; bð0Þ; dð0Þ; að1Þ; dð1Þg by using the last two transactions and find that it contains
fað0Þ; bð0Þ; dð1Þg. Thus, the count of fað0Þ; bð0Þ; dð1Þg is increased by one. That is, countðfað0Þ;
bð0Þ; dð1ÞgÞ ¼ 3, and supðfað0Þ; bð0Þ; dð1ÞgÞ ¼ 3=6 ¼ 0:5. Consequently, fað0Þ; bð0Þ; dð1Þg is a frequent
pattern. Similarly, we find that supðfað0Þ; bð0ÞgÞ ¼ 0:5. That is, fað0Þ; bð0Þg is also a frequent pattern.
Therefore, we have an inter-transaction association rule fað0Þ; bð0Þg ! fdð1Þg with the confidence = 0.5/
0.5 = 100%.
The objective of mining frequent inter-transaction patterns is to find a complete set of frequent patterns in a
transaction database with respect to user-specified minsup and maxspan thresholds.
3. The ITP-Miner algorithm
We now introduce the ITP-Miner algorithm. In Section 3.1 we describe a data structure, called a dat-list,
and a search tree, called an ITP-tree. We then explain the main concept of our proposed algorithm in Section
3.2, and describe it in detail in Section 3.3. The memory management and complexity analysis of ITP-Miner
are discussed in Section 3.4. Finally, we give an example to demonstrate how the algorithm works in Section
3.5.
3.1. The dat-list and ITP-tree
We have devised a data structure, called the dimensional attribute list (dat-list), to store the dimensional
attributes of a frequent pattern during the mining process. Let aht1; t2; . . . ; tni be a dat-list, where a is a pattern
and ht1; t2; . . . ; tni is a list of dats in which the inter-transaction itemset starting from ti contains a; 1 6 i 6 n.
For example, fað0Þ; dð1Þgh1; 3; 4; 5i is a dat-list containing 4 dats, 1, 3, 4, and 5; i.e., the inter-transaction item-
sets starting from the first, third, fourth, and fifth transactions in the database shown in Table 1 contain the
pattern fað0Þ; dð1Þg. Table 2 shows all dat-lists of the frequent patterns found in the database shown in Table
1, where minsup = 50% and maxspan = 1. Note that the count of a pattern is equal to the number of dats
contained in its dat-list. For example, the support counts of fað0Þ; dð1Þgh1; 3; 4; 5i and fað0Þ; bð0Þ;
að1Þ; dð1Þgh1; 4; 5i are 4 and 3, respectively.
Next, we introduce a data structure called an ITP-tree, which stores the dat-lists generated during the min-
ing process.
Definition 10. An ITP-tree is a search tree denoted by T. A node in T represents a dat-list, aht1; t2; . . . ; tni,
where n P 1; a is a frequent pattern, and the extended items in a are listed in increasing order. The parent of
node aht1; t2; . . . ; tni is the dat-list b  t0
1; t0
2; . . . ; t0
m , where m P n, and b is the pattern obtained by removing
the last extended item from a. The root of T is a null pattern, i.e., {}. Siblings with the same parent are sorted
in increasing order.
Consider the complete ITP-tree shown in Fig. 1, where the database contains only two items, a and b; the
maxspan is equal to 1; and the list of dats associated with each dat-list is omitted. Note that all possible inter-
transaction patterns of the ITP-tree in Fig. 1 can be enumerated in a depth-first search manner.
Table 2
The dat-lists
Length Dat-list
1 fað0Þgh1; 2; 3; 4; 5; 6i, fbð0Þgh1; 4; 5i, fdð0Þgh2; 4; 5; 6i
2 fað0Þ; bð0Þgh1; 4; 5i, fað0Þ; dð0Þgh2; 4; 5; 6i, fað0Þ; að1Þgh1; 2; 3; 4; 5i, fað0Þ; dð1Þgh1; 3; 4; 5i, fbð0Þ; að1Þgh1; 4; 5i,
fbð0Þ; dð1Þgh1; 4; 5i, fdð0Þ; að1Þgh2; 4; 5i
3 fað0Þ; bð0Þ; að1Þgh1; 4; 5i, fað0Þ; bð0Þ; dð1Þgh1; 4; 5i, fað0Þ; dð0Þ; að1Þgh2; 4; 5i, fað0Þ; að1Þ; dð1Þgh1; 3; 4; 5i,
fbð0Þ; að1Þ; dð1Þgh1; 4; 5i
4 fað0Þ; bð0Þ; að1Þ; dð1Þgh1; 4; 5i
3456 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
3.2. Main concept
First, we consider the generation of a candidate 2-pattern by joining two frequent 1-patterns.
Definition 11. If both a and b are frequent 1-patterns, a is joinable to b.
Let t1 and t2 be the dats in the dat-lists of a ¼ fuð0Þg and b ¼ fvð0Þg, respectively. If (u  v and
0 6 t2 À t1 6 maxspan) or (u P v and 0  t2 À t1 6 maxspan), then: (1) we generate a candidate 2-pattern
fuð0Þ; vðt2 À t1Þg if it has not been generated already; (2) we add t1 to the candidate’s dat-list; and (3) we
increase the count of fuð0Þ; vðt2 À t1Þg by one. For each candidate 2-pattern generated, we check that its
support is not less than the minsup. If this is the case, it is a frequent 2-pattern and we store its dat-list in the
ITP-tree. Thus, for a frequent 2-pattern fuð0Þ; vðwÞg; 0 6 w 6 maxspan, we store its dat-list as a child of a’s
dat-list in the ITP-tree.
Fig. 2 illustrates how we find a frequent 2-pattern by joining the dat-lists of two frequent 1-patterns
fað0Þgh1; 2; 3; 4; 5; 6i and fbð0Þgh2; 4; 5; 6i, where minsup = 50% and maxspan = 1. Here, we find two frequent
2-patterns, fað0Þ; bð0Þg and fað0Þ; bð1Þg. In Fig. 2(a), the dats used to generate pattern fað0Þ; bð0Þg are
connected by black arrows, while the dats used to generate pattern fað0Þ; bð1Þg are connected by dashed
arrows. Fig. 2(b) shows the ITP-tree formed by the dat-lists of the patterns found. Note that there are six dats
in fað0Þgh1; 2; 3; 4; 5; 6i; however, there are only four in fað0Þ; bð0Þgh2; 4; 5; 6i and fað0Þ; bð1Þgh1; 3; 4; 5i.
Definition 12. Let a ¼ fu0ð0Þ; u1ði1Þ; . . . ; ukÀ1ðikÀ1Þg and b ¼ fv0ð0Þ; v1ðj1Þ; . . . ; vkÀ1ðjkÀ1Þg be two frequent
k-patterns, k  1. a is joinable to b if the first k À 1 extended items of a are equal to those of b, and the last
extended item of a is less than that of b. The joined pattern is obtained by appending the last extended item of
b to a; that is, the joined pattern is fu0ð0Þ; u1ði1Þ; . . . ; ukÀ1ðikÀ1Þ; vkÀ1ðjkÀ1Þg.
We now consider how to find frequent ðk þ 1Þ-patterns by joining two joinable k-patterns. Assume there
are two dat-lists of 2-patterns, fað0Þ; bð0Þgh1; 4; 5i and fað0Þ; að1Þgh1; 2; 3; 4; 5i. Note that the first extended
item of fað0Þ; bð0Þg is equal to that of fað0Þ; að1Þg. To join both 2-patterns, we match the dats in
fað0Þ; bð0Þgh1; 4; 5i with the dats in fað0Þ; að1Þgh1; 2; 3; 4; 5i. If a dat in the former is equal to a dat in the latter,
{} Length-0
{a(0)} {b(0)} Length-1
{a(0),b(0)} {a(0),a(1)} {a(0),b(1)} {b(0),a(1)} {b(0),b(1)} Length-2
{a(0),b(0),a(1)} {a(0),b(0),b(1)}{a(0),a(1),b(1)} {b(0),a(1),b(1)} Length-3
{a(0),b(0),a(1),b(1)} Length-4
Fig. 1. An ITP-tree.
{a(0)}1, 2, 3, 4, 5, 6
{b(0)} 2, 4, 5, 6
(a) Dat-lists (b) ITP-tree
{}
{a(0)}1,2,3,4,5,6
{a(0),b(0)}2,4,5,6 {a(0),b(1)}1,3,4,5
Fig. 2. Generating 2-patterns by joining two 1-patterns: (a) dat-lists and (b) ITP-tree.
A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3457
the count of the joined pattern ðfað0Þ; bð0Þ; að1ÞgÞ is increased by one. By matching every dat in
fað0Þ; bð0Þgh1; 4; 5i with its counterpart in fað0Þ; að1Þgh1; 2; 3; 4; 5i, we find that the count of
fað0Þ; bð0Þ; að1Þg is equal to 3. Thus, it is a frequent 3-pattern. We store fað0Þ; bð0Þ; að1Þgh1; 4; 5i in the
ITP-tree as shown in Fig. 3, where h1; 4; 5i are the matched dats between h1; 4; 5i and h1; 2; 3; 4; 5i.
The ITP-Miner algorithm generates the frequent patterns in a depth-first search manner. We use the fre-
quent patterns obtained at this level (frequent k-pattern) to generate the frequent patterns at the next level
(frequent ðk þ 1Þ-pattern). To do so, we first define the joinable and extended groups for a frequent k-pattern.
A frequent k-pattern, a, only needs to join the patterns in its joinable group to generate the frequent ðk þ 1Þ-
pattern extended from a. Generating the frequent ðk þ 1Þ-pattern in this manner can localize joining, pruning,
and support counting to a small number of dat-lists.
Definition 13. Let a be a frequent k-pattern, where k P 1. The joinable group of a is JðaÞ ¼ fa1; a2; . . . ; ang,
where a is joinable to ai, which is a frequent k-pattern, 1 6 i 6 n. Also, the extended group of a is
EðaÞ ¼ fb1; b2; . . . ; bmg, where every bj is a frequent ðk þ 1Þ-pattern extended from a; 1 6 j 6 m. Since bj is
extended from a, its dat-list will be a child of a’s dat-list in the ITP-tree.
Lemma 1. Let a be a frequent k-pattern; and b be a frequent ðk þ 1Þ-pattern that is extended from a, where
k P 1. Then, we have: (1) EðaÞ ¼ fhjh ¼ ða join cÞ, h is frequent, and c 2 JðaÞg; and (2) JðbÞ ¼ fhjh is greater
than b, and h 2 EðaÞg.
We can use Lemma 1 to generate frequent patterns and store their dat-lists in the ITP-tree, where the fre-
quent patterns are generated in a depth-first search manner. The generation of frequent patterns consists of
two steps. Let a be a frequent k-pattern. First, a is joined to each c in JðaÞ, after which we can find all the
frequent ðk þ 1Þ-pattern patterns extended from a, referred to as EðaÞ. Next, for each frequent ðk þ 1Þ-pattern
b in EðaÞ; JðbÞ is the set containing the patterns in EðaÞ that are greater than b. We can therefore join b to
every c in JðbÞ to find all the frequent (k + 2)-patterns extended from b, referred to as EðbÞ. Let a ¼ b and
k ¼ k þ 1. We perform the second step recursively in a depth-first search manner until no more frequent pat-
terns can be generated. This is the main concept of our proposed method.
Let us consider the example shown in Fig. 4, where a ¼ fað0Þ; bð0Þg and b ¼ fað0Þ; bð0Þ; að1Þg. From the
figure, we know that JðaÞ ¼ ffað0Þ; dð0Þg; fað0Þ; að1Þg; fað0Þ; dð1Þgg; EðaÞ ¼ ffað0Þ; bð0Þ; að1Þg; fað0Þ; bð0Þ;
dð1Þgg; JðbÞ ¼ ffað0Þ; bð0Þ; dð1Þgg; and EðbÞ ¼ ffað0Þ; bð0Þ; að1Þ; dð1Þgg. The figure shows that EðaÞ can be
obtained by joining a with every pattern in JðaÞ, while EðbÞ can be obtained by joining b with every pattern
in JðbÞ.
3.3. The ITP-Miner algorithm
The ITP-Miner algorithm is comprised of three functions: Join2, DFS, and JoinK. The algorithm and its
functions are shown in Figs. 5–8, respectively. The Join2 function is used to find frequent 2-patterns by joining
two frequent 1-patterns, while the JoinK function is used to find frequent ðk þ 1Þ-pattern by joining two fre-
quent k-patterns.
{a(0)}1,2,3,4,5,6
{a(0),b(0)}1, 4, 5 {a(0),b(0)}1,4,5 {a(0),a(1)}1,2,3,4,5
{a(0),a(1)}1,2,3,4,5 {a(0),b(0),a(1)}1,4,5
{}
Fig. 3. Generating a 3-pattern by joining two frequent 2-patterns.
3458 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
In Step 1 of Fig. 5, the database is scanned to find all frequent 1-patterns and their dat-lists, after which a
hash table, H2, is constructed and the dat-lists are inserted into the ITP-tree, T. In Step 4, two frequent 1-pat-
terns, a and c, are joined to generate frequent 2-patterns and obtain E(a). Note that the joinable group of a
frequent 1-pattern is equal to the set containing all frequent 1-patterns. In Step 5, for each frequent 2-pattern
b, the DFS function is applied recursively in a depth-first search manner to find all frequent inter-transaction
patterns.
When running the ITP-Miner algorithm, we use the following pruning strategies to reduce the search space
and speed up the mining process.
(1) Infrequent 2-pattern pruning. In Step 1 of the algorithm (Fig. 5), if we generate n 1-patterns in Step 1, we
will perform n2
join operations in Step 4. Since many 2-patterns are infrequent, it is wasteful to perform
so many operations. To resolve the problem, we use a hashing approach [24] to check if a pair of 1-pat-
terns can generate a frequent 2-pattern before joining them. In Step 1, therefore, we construct a hash
table, H2, when searching for 1-patterns and dat-lists. Every bucket of H2 aggregates the counts of
{a(0)}1,2,3,4,5,6 Length-1
{a(0),b(0)} {a(0),d(0)} {a(0),a(1)} {a(0),d(1)} Length-2
1,4,5 2,4,5,6 1,2,3,4,5 1,3,4,5
{a(0),b(0),d(0)} {a(0),b(0),a(1)} {a(0),b(0),d(1)} Length-3
4,5 1,4,5 1,4,5
{a(0),b(0),a(1)} {a(0),b(0),d(1)} Length-3
1,4,5 1,4,5
{a(0),b(0),a(1),d(1)} Length-4
1,4,5
β J(β)
E(β)
α J(α)
E(α)
First Step
Second Step
Fig. 4. Generating frequent patterns in an ITP-tree.
Algorithm: ITP-Miner
Input: transaction database D, minimum support minsup, maximum span maxspan.
Output: all frequent inter-transaction patterns FP.
Method:
1 Scan D to find all frequent 1-patterns and their dat-lists. Construct a hash table, H2, and
insert the dat-lists into the ITP-tree T;
2 for each frequent 1-pattern α do
3 Insert α into FP;
4 for each frequent 1-pattern γ in J(α) do call Join2(α, γ, T, minsup, maxspan, |D|) to
get E(α);
5 for each frequent 2-pattern β∈E(α) do call DFS(β, FP, T, minsup, maxspan, |D|);
6 end for
7 Output FP.
Fig. 5. The ITP-Miner algorithm
A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3459
all length-2 candidates hashed to it. Given a candidate 2-pattern generated by items u0
and v0
with dats tu0
and tv0 , respectively, its hash value g is computed by g ¼ ððu0
à ðmaxspan þ 1Þ þ ðtv0 À tu0 ÞÞÃ
N1 þ v0
Þ mod Hashsize, where 0 6 tv0 À tu0 6 maxspan, N1 is the number of distinct items in the database,
and Hashsize is the size of the hash table. The count of H2½gŠ is then incremented by one. In Step 1 of
Fig. 6, before joining two 1-patterns, i.e., a ¼ fuð0Þg and c ¼ fvð0Þg, we compute all possible hash values
for both patterns by using h ¼ ððu à ðmaxspan þ 1Þ þ wÞ Ã N1 þ vÞ mod Hashsize, where 0 6 w 6
maxspan. If H2½hŠ is less than minsup à jDj for each possible hash value, h, we do not need to join a
and c to generate a 2-pattern.
(2) Infrequent k-pattern pruning. Before joining two k-patterns when k  1, we use the hash table H2 to filter
out the join operations. In Step 1 of Fig. 8, before joining two k-patterns, b ¼ fu0ð0Þ;
u1ði1Þ; . . . ; ukÀ1ðikÀ1Þg and c ¼ fv0ð0Þ; v1ðj1Þ; . . . ; vkÀ1ðjkÀ1Þg, we know that b joining c will yield
fu0ð0Þ; u1ði1Þ; . . . ; ukÀ1ðikÀ1Þ; vkÀ1ðjkÀ1Þg. Since fukÀ1ðikÀ1Þ; vkÀ1ðjkÀ1Þg is a subset of the joined pattern, it
can be transformed into a 2-pattern, fukÀ1ð0Þ; vkÀ1ðjkÀ1 À ikÀ1Þg, by taking ikÀ1 as the reference point.
Thus, we can compute its hash value using h ¼ ððukÀ1 Ã ðmaxspan þ 1Þ þ ðjkÀ1 À ikÀ1ÞÞÃ
Function: Join2(α, γ, T, minsup, maxspan, |D|)
Input: two frequent 1-patterns α ={u(0)} and γ ={v(0)}, the ITP-tree T, minimum support
minsup, maximum span maxspan, and the number of transactions in the database |D|.
Output: the ITP-tree T.
1 if there exists w, 0  w  maxspan, such that H2[((u * (maxspan+1) + w) * N1 + v) mod
Hashsize]  minsup * |D| then
2 for each dat tα∈ α’s dat-list do
3 for each dat tγ ∈γ’s dat-list do
4 if (u  v and 0  tγ-tα  maxspan) or (u  v and 0  tγ-tα  maxspan) then
5 Generate a candidate 2-pattern {u(0),v(tγ-tα)} if it has not been generated
already;
6 Add tα to its dat-list;
7 end if
8 end for
9 end for
10 for each candidate 2-pattern generated do
11 if it is frequent then
12 Insert its dat-list into T to be a child of α’s dat-list;
13 end if
14 end for
15 end if;
Fig. 6. The Join2 function
Function: DFS(β, FP, T, minsup, maxspan, |D|)
Input: a frequent pattern β, all frequent inter-transaction patterns FP, the ITP-tree T,
minimum support minsup, maximum span maxspan, and the number of transactions in
the database |D|.
Output: the ITP-tree T.
1 Insert β into FP;
2 for each pattern γ ∈J(β), call JoinK(β, γ, T, minsup, maxspan, |D|) to get E(β);
3 for each pattern θ∈E(β), call DFS(θ, FP, T, minsup, maxspan, |D|);
4 Remove β ’s dat-list from T;
Fig. 7. The DFS function.
3460 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
N1 þ vkÀ1Þ mod Hashsize. If H2½hŠ is less than minsup à jDj, we do not need to join b and c to generate a
ðk þ 1Þ-pattern.
(3) Unmatched dat pruning. In the ITP-Miner algorithm, we join two joinable k-patterns to generate fre-
quent ðk þ 1Þ-patterns and construct a dat-list for each of the latter. In Step 6 of Fig. 6 and Step 4 of
Fig. 8, when we construct the dat-list for a pattern, we only add the matched dats of both k-patterns
to the dat-list of the newly generated ðk þ 1Þ-pattern.
3.4. Memory management and complexity analysis of ITP-Miner
In Step 1 of Fig. 5, ITP-Miner only scans the database once to find all frequent 1-patterns and store their
dat-lists in the memory, where the space for these dat-lists is smaller than that for the original database. From
Definition 12, when k  1 we observe that a k-pattern, b, is only joinable to siblings larger than b in an ITP-
Tree. Since the siblings are sorted in increasing order and ITP-Miner processes the branches of an ITP-Tree in
a depth-first search manner, in Step 4 of Fig. 7, when the algorithm leaves the DFS function, the memory
space of b’s dat-list can be released from T immediately. Therefore, ITP-Miner’s memory requirements are
not substantial. If the memory requirements exceed available memory, ITP-Miner can be modified to read
and write temporary dat-lists from and to a disk, similar to the methods used in [21].
Assume that the number of frequent 1-patterns, the number of all frequent patterns, and the length of the
longest frequent pattern be jL1j; jFPj, and mp respectively. Let us consider a pattern a of various lengths in an
ITP-Tree T. (1) If a is a 1-pattern, from Definition 11, we observe that the number of join operations for a is
bounded by jL1j à ðmaxspan þ 1Þ. (2) If a is a 2-pattern, also from Definition 11, we observe that the number of
a’s siblings, jJðaÞj, is bounded by jL1j à ðmaxspan þ 1Þ, which is the maximum number of join operations for a
by Definition 12. (3) If a is a k-pattern, where k  2, from Lemma 1, we observe that the number of a’s siblings
is bounded by that of a’s parent in T, so the number of join operations for a is also bounded by
jL1j à ðmaxspan þ 1Þ. Therefore, by the above analyses, we conclude that for any pattern a in an ITP-Tree
T, the maximum number of join operations for a is bounded by jL1j à ðmaxspan þ 1Þ. Since the number of
nodes in T is equal to jFPj and the length of a dat-lists is bounded by OðjDjÞ, the time complexity of ITP-
Miner is bounded by OðjFPj à jL1j à ðmaxspan þ 1Þ Ã jDjÞ. Likewise, we can observe that for any pattern a in
an ITP-Tree T, the maximum number of JðaÞ that must be kept in the memory for join operations is bounded
by jL1j à ðmaxspan þ 1Þ. Since ITP-Miner processes branches in a depth-first search manner, the maximum
number of nodes kept in the memory for the recursive function calls is bounded by OðmpÞ. Therefore, the
space complexity of ITP-Miner is bounded by Oðmp à jL1j à ðmaxspan þ 1Þ Ã jDj þ HashsizeÞ, where Hashsize
is the memory space required by H2.
Function: JoinK(β, γ, T, minsup, maxspan, |D|)
Input: two frequent k-patterns β ={u0(0),u1(i1),...,uk-1(ik-1)} and γ ={v0(0),v1(j1),...,vk-1(jk-1)},
the ITP-tree T, minimum support minsup, maximum span maxspan, and the number of
transactions in the database |D|.
Output: the ITP-tree T.
1 if H2[((uk-1 * (maxspan+1) + (jk-1 - ik-1)) * N1 + vk-1) mod Hashsize]  minsup * |D|, then
2 Let θ = {u0(0),u1(i1),...,uk-1(ik-1),vk-1(jk-1)};
3 for each dat x∈β’s dat-list do
4 Check if any dat y∈γ’s dat-list such that y = x. If yes, increase θ’s count by one
and add x to θ’s dat-list;
5 end for
6 if θ is frequent then
7 Insert its dat-list into T to be a child of β’s dat-list;
8 end if
9 end if;
Fig. 8. The JoinK function.
A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3461
3.5. An example
Let us consider the example as shown in Table 1, where minsup ¼ 50% and maxspan = 1. The frequent
inter-transaction patterns and their dat-lists generated during the mining processes are shown in Fig. 9. In
order to explain how the hash table H2 is constructed, we assume that N1 ¼ 4; Hashsize ¼ 19, and the identity
number of each item is a ¼ 1; b ¼ 2; c ¼ 3, and d = 4. When the first transaction fa; bg is read from the data-
base, it forms a candidate 2-pattern, fað0Þ; bð0Þg. Its hash value gfað0Þ;bð0Þg ¼ ðða à ðmaxspan þ 1Þ þ ðtb À taÞÞÃ
N1 þ bÞ mod Hashsize ¼ ðð1 Ã ð1 þ 1Þ þ ð1 À 1ÞÞ Ã 4 þ 2Þ mod 19 ¼ 10, so the count of H2[10] is incremented
by 1, where ta and tb are the dats of a and b, respectively. Then, when the second transaction fa; c; dg is read
from the database, three candidate 2-patterns can be formed: fað0Þ; cð0Þg; fað0Þ; dð0Þg, and fcð0Þ; dð0Þg.
Meanwhile, the candidate 2-patterns formed by the first and second transactions are listed as follows:
fað0Þ; að1Þg; fað0Þ; cð1Þg; fað0Þ; dð1Þg; fbð0Þ; að1Þg; fbð0Þ; cð1Þg, and fbð0Þ; dð1Þg. The hash values of all these
candidate 2-patterns can be computed and the counts in H2 can be incremented in the same way as above.
Table 3 shows the content of H2 after the database has been scanned.
During the database is scanned, we also compute the count for each 1-pattern. The counts of patterns
fað0Þg; fbð0Þg; fcð0Þg, and fdð0Þg, are equal to 6, 3, 2, and 4, respectively. Since the minsup is 50%, a pattern
is frequent if its count is not less than 3. Consequently, fað0Þgh1; 2; 3; 4; 5; 6i, fbð0Þgh1; 4; 5i, and
fdð0Þgh2; 4; 5; 6i are inserted into the ITP-tree as shown in Fig. 9. Note that, since we have three frequent
1-patterns: fað0Þg; fbð0Þg, and fdð0Þg, there are three joinable groups: Jðfað0ÞgÞ ¼ ffað0Þg; fbð0Þg; fdð0Þgg,
Jðfbð0ÞgÞ ¼ ffað0Þg; fbð0Þg; fdð0Þgg, and Jðfdð0ÞgÞ ¼ ffað0Þg; fbð0Þg; fdð0Þgg.
We now describe how to join fað0Þg to every pattern in Jðfað0ÞgÞ to find frequent 2-patterns. For
fað0Þgh1; 2; 3; 4; 5; 6i and fbð0Þgh1; 4; 5i, according to the ‘‘infrequent 2-pattern pruning’’ strategy, we com-
pute all possible hash values: hfað0Þ;bð0Þg ¼ ðð1 Ã ð1 þ 1Þ þ 0Þ Ã 4 þ 2Þ mod 19 ¼ 10 and hfað0Þ;bð1Þg ¼
ðð1 Ã ð1 þ 1Þ þ 1Þ Ã 4 þ 2Þ mod 19 ¼ 14. Pattern fað0Þ; bð0Þg cannot be pruned since H2½10Š ¼ 5  3; however,
pattern fað0Þ; bð1Þg can be pruned since H2½14Š ¼ 2  3. The dat-lists of fað0Þg and fbð0Þg share three dats,
h1; 4; 5i; therefore, by joining them we obtain a frequent 2-pattern fað0Þ; bð0Þg with a list of dats, h1; 4; 5i. Sim-
ilarly, by joining fað0Þgh1; 2; 3; 4; 5; 6i and fdð0Þgh2; 4; 5; 6i, we obtain fað0Þ; dð0Þgh2; 4; 5; 6i. Moreover,
fað0Þ; að1Þgh1; 2; 3; 4; 5i can be generated, since fað0Þg’s dat-list contains five dats h1; 2; 3; 4; 5i, each of which
is one less than fað0Þg’s dats h2; 3; 4; 5; 6i; fað0Þ; dð1Þgh1; 3; 4; 5i can be generated in the same manner. Con-
sequently, Eðfað0ÞgÞ ¼ ffað0Þ; bð0Þg; fað0Þ; dð0Þg; fað0Þ; að1Þg; fað0Þ; dð1Þgg, as shown in Fig. 9. Note that
according to the ‘‘unmatched dat pruning’’ strategy, we only add the matched dats of both 1-patterns to
the dat-list of the newly generated 2-pattern.
Since we have Eðfað0ÞgÞ, we can join every 2-pattern b in Eðfað0ÞgÞ to every 2-pattern in JðbÞ to get
frequent 3-patterns. Consider the pattern fað0Þ; bð0Þg. We know that the patterns joinable to fað0Þ; bð0Þg
are those in Eðfað0ÞgÞ that are larger than fað0Þ; bð0Þg. In other words, Jðfað0Þ; bð0ÞgÞ ¼ ffað0Þ; dð0Þg;
fað0Þ; að1Þg; fað0Þ; dð1Þgg. According to the ‘‘infrequent k-pattern pruning’’ strategy, for joining
{} Length-0
{a(0)}1,2,3,4,5,6 {b(0)}1,4,5 {d(0)}2,4,5,6 Length-1
{a(0),b(0)} {a(0),d(0)} {a(0),a(1)} {a(0),d(1)} {b(0),a(1)} {b(0),d(1)} {d(0),a(1)} Length-2
1,4,5 2,4,5,6 1,2,3,4,5 1,3,4,5 1,4,5 1,4,5 2,4,5
{a(0),b(0),a(1)} {a(0),b(0),d(1)} {a(0),d(0),a(1)} {a(0),a(1),d(1)} {b(0),a(1),d(1)} Length-3
1,4,5 1,4,5 2,4,5 1,3,4,5 1,4,5
{a(0),b(0),a(1),d(1)} Length-4
1,4,5
Fig. 9. The ITP-tree for the database shown in Table 1.
3462 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
fað0Þ; bð0Þg and fað0Þ; dð0Þg, the last extended item of the former and that of the latter can be combined to
form a 2-pattern, fbð0Þ; dð0Þg, which is a subset of the joined pattern, fað0Þ; bð0Þ; dð0Þg. Since
hfbð0Þ;dð0Þg ¼ ðð2 Ã ð1 þ 1Þ þ 0Þ Ã 4 þ 4Þ mod 19 ¼ 1 and H2½1Š ¼ 2  3; fað0Þ; bð0Þ; dð0Þg can be pruned. Simi-
larly, for joining fað0Þ; bð0Þg and fað0Þ; að1Þg, since hfbð0Þ;að1Þg ¼ ðð2 Ã ð1 þ 1Þ þ 1Þ Ã 4 þ 1Þ mod 19 ¼ 2 and
H2½2Š ¼ 5 P 3; fað0Þ; bð0Þ; að1Þg cannot be pruned. For joining fað0Þ; bð0Þg and fað0Þ; dð1Þg, since
hfbð0Þ;dð1Þg ¼ ðð2 Ã ð1 þ 1Þ þ 1Þ Ã 4 þ 4Þ mod 19 ¼ 5 and H2½5Š ¼ 3 P 3, fað0Þ; bð0Þ; dð1Þg also cannot be
pruned. In this stage, the join operation is performed by matching the dat-lists of two joinable patterns, so
we join fað0Þ; bð0Þgh1; 4; 5i and fað0Þ; að1Þgh1; 2; 3; 4; 5i to get fað0Þ; bð0Þ; að1Þgh1; 4; 5i; and join
fað0Þ; bð0Þgh1; 4; 5i and fað0Þ; dð1Þgh1; 3; 4; 5i to get fað0Þ; bð0Þ; dð1Þggh1; 4; 5i. Therefore we get
Eðfað0Þ; bð0ÞgÞ ¼ ffað0Þ; bð0Þ; að1Þg; fað0Þ; bð0Þ; dð1Þgg, as shown in Fig. 9. After obtaining Eðfað0Þ; bð0ÞgÞ,
we join every 3-pattern b in Eðfað0Þ; bð0ÞgÞ and every 3-pattern in JðbÞ to find frequent 4-patterns. According
to the ‘‘infrequent k-pattern pruning’’ strategy, for joining fað0Þ; bð0Þ; að1Þg and fað0Þ; bð0Þ; dð1Þg, the last
extended item of the former and that of the latter can be combined to form a 2-pattern, fað1Þ; dð1Þg, which
is a subset of the joined pattern, fað0Þ; bð0Þ; að1Þ; dð1Þg. Pattern fað1Þ; dð1Þg can be transformed into
fað0Þ; dð0Þg by taking dat = 1 as the reference point. Since hfað0Þ;dð0Þg ¼ ðð1 Ã ð1 þ 1Þ þ 0Þ Ã 4þ
4Þ mod 19 ¼ 12 and H2½12Š ¼ 4 P 3, fað0Þ; bð0Þ; að1Þ; dð1Þg cannot be pruned. We join fað0Þ; bð0Þ;
að1Þgh1; 4; 5i and fað0Þ; bð0Þ; dð1Þgh1; 4; 5i to find frequent 4-patterns, and obtain fað0Þ; bð0Þ;
að1Þ; dð1Þgh1; 4; 5i, as shown in Fig. 9. The other two 3-patterns, fað0Þ; dð0Þ; að1Þg and fað0Þ; að1Þ; dð1Þg,
can be obtained in the same way.
A similar procedure can be applied to the nodes containing fbð0Þg and fdð0Þg by calling the DFS function
recursively. Fig. 9 shows the complete set of frequent patterns and their dat-lists. Starting from the node con-
taining fbð0Þg, the patterns found include fbð0Þg; fbð0Þ; að1Þg; fbð0Þ; dð1Þg, and fbð0Þ; að1Þ; dð1Þg. Meanwhile,
starting from the node containing fdð0Þg, the patterns found include fdð0Þg, and fdð0Þ; að1Þg. The mining
process is terminated once the node containing fdð0Þ; að1Þg has been processed, since no more frequent pat-
terns can be generated.
4. Performance evaluation
We evaluated the performance of the ITP-Miner algorithm on three synthetic datasets and four real data-
sets. The former were generated with different parameters synthetically, while the latter comprised real stock
index trading data collected from Yahoo [26], click-stream data from a small dot-com company called Gazel-
le.com [9], and a dataset of game state information [3]. We compared the ITP-Miner algorithm with the FITI
algorithm [24]. All experiments were performed on an IBM Compatible PC with an Intel Pentium IV CPU
2.8 GHz, 1G-byte main memory running on Microsoft Windows XP. Both algorithms were implemented
using Microsoft Visual C++ 6.0.
We have implemented two versions of the ITP-Miner and FITI algorithms, memory-based and disk-based.
The memory-based ITP-Miner algorithm was designed so that all data structures, including frequent patterns,
dat-lists, an ITP-Tree, and a hash table, were kept in the main memory during execution. In addition, a dat-list
was nominated as an integer array to facilitate data matching. The memory-based FITI algorithm, which con-
sisted of three phases, was designed so that it kept all structures, including candidate patterns, frequent pat-
terns, the FILT, the FIT tables, and a hash table, in the main memory during the execution [24]. In Phase I,
the Apriori algorithm was implemented as in [2], the ItemSet Hash Table was nominated as an itemset pointer
array, and the links in the FILT were pointers. Note that the Apriori algorithm scans the database repeatedly
to search for frequent intra-transaction itemsets. In Phase II, the database was transformed into an FIT table,
Table 3
The content of H2 for the database shown in Table 1 with Hashsize = 19
H2[0] H2[1] H2[2] H2[3] H2[4] H2[5] H2[6] H2[7] H2[8] H2[9]
2 2 5 1 1 3 0 0 0 2
H2[10] H2[11] H2[12] H2[13] H2[14] H2[15] H2[16] H2[17] H2[18]
5 3 4 6 2 2 4 0 3
A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3463
which was nominated as a 3-dimensional pointer array, where each pointer in a cell referred to a frequent
itemset in the ItemSet Hash Table. In Phase III, the level-wise mining of frequent inter-transaction itemsets
used the Apriori principle to scan the FIT table in the memory.
The disk-based versions of both algorithms were modified from the corresponding memory-based versions
so that the location information of the found patterns was stored on a disk during mining process. For the
disk-based FITI algorithm, the FIT tables were stored on a disk, where a table Fk of size k was created as
a file [24]. For the disk-based ITP-Miner algorithm, the dat-lists were stored on a disk. The experimental
results of the memory-based version are shown in Sections 4.2, 4.3, and 4.4, while the experimental results
of the disk-based version are shown in Section 4.5.
4.1. Generation of synthetic data
We use the same method as that in [24] to generate synthetic transactions. The process involves two steps:
first, we generate potential frequent itemsets, after which we generate transactions in the database from those
itemsets. The parameters used to generate synthetic data are shown in Table 4.
The potential frequent inter-transaction itemsets, Litem, may span several transactions. The length of each
itemset follows a Poisson distribution whose mean is equal to L, and the maximum length of an itemset is
MaxI. Every itemset in Litem is associated with a relative distance ranging from 0 to M(maxspan). Hence,
we can generate jDj transactions from Litem. The length of each transaction is selected from a Poisson distri-
bution whose mean is equal to A, and the maximum length of any transaction is MaxT.
We generated three synthetic datasets using the parameters shown in Table 5, the first two are similar to the
datasets used in [24]. The first dataset, Dataset 1, had 10,000 transactions, with an average of 5 items per trans-
action. Dataset 2, on the other hand, had only 1000 transactions, but each one contained 100 items on aver-
age. The last dataset, Dataset 3 with the largest maxspan = 15, had 10,000 transactions with an average of 15
items per transaction and contained 1000 distinct items.
4.2. Experiments on synthetic data
In this section, we compare the ITP-Miner algorithm with the FITI algorithm by varying one parameter,
while maintaining the other parameters at the default values shown in Table 5.
Fig. 10 shows the minimum support versus the run time for Dataset 1. The minimum support varies
between 0.04% and 0.1%, and the ITP-Miner algorithm runs 10–50 times faster than the FITI algorithm.
Fig. 11 illustrates the minimum support versus the run time for Dataset 2. In this case, the support varies
between 10% and 15%, and the ITP-Miner algorithm runs 3-8 times faster than the FITI algorithm.
Fig. 12 shows the same experiment for Dataset 3 where the support varies between 0.1% and 0.2%, and the
ITP-Miner algorithm runs 3–12 times faster than the FITI algorithm. The ITP-Miner algorithm outperforms
the FITI algorithm because the latter generates a huge number of candidates when the support is small. To
count the support of candidates, in Phase I, the FITI uses the Apriori algorithm to count the support of
intra-transaction candidates by repeatedly scanning the database. In Phase III, whenever the FITI algorithm
reads a new transaction from the FIT tables [24], each inter-transaction candidate must be matched with max-
Table 4
Parameters
Parameter Meaning
jDj Total number of transactions
A Average length of the transactions
F Number of potential frequent inter-transaction itemsets
MaxT Maximum length of the transactions
L Length of each potential frequent inter-transaction itemset
MaxI Maximum length of the potential frequent inter-transaction itemsets
S Number of distinct items
M Maxspan
3464 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
Table 5
Parameter settings for the three synthetic datasets
Parameter Dataset 1 Dataset 2 Dataset 3
jDj 10,000 1000 10,000
A 5 100 15
F 1000 1000 2000
MaxT 10 200 30
L 5 6 10
MaxI 10 10 20
S 500 500 1000
M 3 5 15
0
10
20
30
40
50
60
70
0.04% 0.06% 0.08% 0.10%
Minimum support
Runtime(sec)
FITI
ITP-Miner
Fig. 10. Minimum support versus run time, Dataset 1 with maxspan = 3.
0
20
40
60
80
100
120
140
10% 11% 12% 13% 14% 15%
Minimum support
Runtime(sec)
FITI
ITP-Miner
Fig. 11. Minimum support versus run time, Dataset 2 with maxspan = 5.
0
50
100
150
200
250
0.10% 0.12% 0.14% 0.16% 0.18% 0.20%
Minimum support
Runtime(sec)
FITI
ITP-Miner
Fig. 12. Minimum support versus run time, Dataset 3 with maxspan = 15.
A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3465
span + 1 consecutive transactions. Moreover, support counting of the FITI algorithm is quite time-consuming
because it needs to perform subset matching. In contrast, the ITP-Miner algorithm counts the supports by
joining the dat-lists of joinable patterns, and only needs to scan the database once; thus, it is more efficient
than the FITI algorithm.
Figs. 13–15 show the number of transactions in the database versus the run time for Dataset 1, Dataset 2,
and Dataset 3, respectively. The number of transactions varies from 10K to 2000K for Dataset 1, from 500 to
20K for Dataset 2, and from 10K to 1000K for Dataset 3. For Dataset 1, minsup = 0.05% and maxspan = 3;
for Dataset 2, minsup = 11% and maxspan = 5; however, for Dataset 3, minsup = 0.12% and maxspan = 15.
The run time of both algorithms increases linearly as the number of transactions increases; however, the
ITP-Miner algorithm runs 4–40 times faster than the FITI algorithm.
Next, we illustrate the impact of the maxspan on the performance of the algorithms for three datasets in
Figs. 16–18. As the maxspan increases, more candidates and patterns are generated. Thus, the run time of both
algorithms increases. We observe that the run time of the ITP-Miner algorithm increases slowly as the max-
span increases. For the FITI algorithm, as the maxspan increases, the range of the FIT tables [24] that must be
scanned to count the candidates’ supports increases. In addition, the number of generated candidates increases
rapidly in a breadth-first search manner. Hence, the time required to count the candidates’ supports increases.
In contrast, by joining patterns in the ITP-tree, the ITP-Miner algorithm uses a depth-first search manner to
avoid having to scan the database and count the support for a large number of candidates. Thus, the ITP-
Miner algorithm performs efficiently when the maxspan increases.
We also investigated the effect of the average length of a transaction on both algorithms. The results are
shown in Fig. 19. When the average length of a transaction increases, the run time increases in both cases.
The reason is that under the same support threshold, if the average transaction length increases, more patterns
are generated; therefore, more candidates must be counted. In the FITI algorithm, it is necessary to perform
subset matching with the transactions in order to count the support for each candidate. The longer the trans-
action, the more subset matching required. Instead of performing subset matching to count the support for
0
200
400
600
800
1000
0 500000 1000000 1500000 2000000
Number of transactions
Runtime(sec)
FITI
ITP-Miner
Fig. 13. Number of transactions versus run time, Dataset 1 with minsup = 0.05%, maxspan = 3.
0
100
200
300
400
500
0 5000 10000 15000 20000
Number of transactions
Runtime(sec)
FITI
ITP-Miner
Fig. 14. Number of transactions versus run time, Dataset 2 with minsup = 11%, maxspan = 5.
3466 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
0
300
600
900
1200
1500
0 250000 500000 750000 1000000
Number of transactions
Runtime(sec)
FITI
ITP-Miner
Fig. 15. Number of transactions versus run time, Dataset 3 with minsup = 0.12%, maxspan = 15.
0
20
40
60
80
100
0 1 2 3 4 5 6 7 8
Maxspan
Runtime(sec)
FITI
ITP-Miner
Fig. 16. Maxspan versus run time, Dataset 1 with minsup = 0.05%.
0
20
40
60
80
100
120
140
0 1 2 3 4 5 6 7 8
Maxspan
Runtime(sec)
FITI
ITP-Miner
Fig. 17. Maxspan versus run time, Dataset 2 with minsup = 11%.
0
10
20
30
40
50
0 3 6 9 12 15 18
Maxspan
Runtime(sec)
FITI
ITP-Miner
Fig. 18. Maxspan versus run time, Dataset 3 with minsup = 0.12%.
A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3467
each candidate, the ITP-Miner algorithm counts the support by joining the dat-lists of the patterns; thus, it is
more efficient than the FITI algorithm.
Fig. 20 shows the memory usage of both algorithms as the number of transactions increases from 10K to
2000K for Dataset 1. The FITI algorithm requires 3–7 times more memory than the ITP-Miner algorithm.
Figs. 21–23 show, respectively, the memory usage as the minimum support increases from 0.04% to 0.2%
for Dataset 1, from 10% to 16% for Dataset 2, and from 0.1% to 0.2% for Dataset 3. The FITI algorithm
requires 2–20 times more memory for Dataset 1, 2–35 times more memory for Dataset 2, and 16-36 times more
memory for Dataset 3, because it has to store the FILT data structure, the FIT table [24], and the information
about candidate patterns.
In summary, since the ITP-Miner algorithm employs the ITP-tree and dat-lists to mine frequent patterns, it
only requires one database scan and can localize candidate joining, pruning, and support counting to joinable
patterns. As the ‘‘infrequent 2-pattern pruning’’ and ‘‘infrequent k-pattern pruning’’ strategies are used to
check if both patterns are joinable, the ITP-Miner algorithm avoids many unnecessary join operations. It out-
performs the FITI algorithm by one order of magnitude and requires less main memory storage.
4.3. Experiments on real data
We now compare the FITI and the ITP-Miner algorithms by mining four datasets. The first two datasets,
called WINNER and LOSER, are similar to those used in [24]. The datasets consist of ten stock exchange
indices for 3,607 trading days from January 1, 1991 to December 31, 2005. The indices are the ASX All Ordi-
naries Index (ASX), CAC40 Index (CAC), DAX Index (DAX), Dow Jones Index (DOW), FT-SE 100 Index
(FTS), Hang Seng Index (HSI), NASDAQ Index (NDQ), Nikkei 225 Index (NKY), Swiss Market Index
(SMI), and Singapore ST Index (STI). The first dataset, WINNER, comprises stock indices that rise for
the day, while the other dataset, LOSER, consists of those that fall. The third dataset, called GAZELLE,
was obtained from the KDDCup 2000 competition and corresponds to click-stream data for the Gazelle.com
0
30
60
90
120
150
5 10 15 20
Average transaction lengthRuntime(sec)
FITI
ITP-Miner
Fig. 19. Average transaction length versus run time, Dataset 1 with minsup = 0.4%, maxspan = 3.
0
50
100
150
200
250
300
350
0 500000 1000000 1500000 2000000
Number of transactions
Memory(MByte)
FITI
ITP-Miner
Fig. 20. Number of transactions versus memory usage, Dataset 1 with minsup = 0.05%, maxspan = 3.
3468 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
web-site [9]. This dataset, which can be downloaded from http://www.ecn.purdue.edu/KDDCUP/, contains
59,601 transactions and 498 distinct items; the average transaction length is 2.5, and the maximum transaction
length is 267. Finally, the fourth dataset, called CHESS, was compiled from the game state information [3],
which can be downloaded from http://www.ics.uci.edu/~mlearn/MLRepository.html. The dataset contains
3196 transactions and 76 distinct items, where the length of every transaction is equal to 37.
Figs. 24–26 illustrate the run time versus the minimum support for the WINNER, LOSER, and GAZELLE
datasets, respectively. The ITP-Miner algorithm runs 2–5 times faster than the FITI algorithm for the WIN-
NER dataset, 3–8 times faster for the LOSER dataset, and 4–77 times faster for the GAZELLE dataset. Figs.
27–29 present the run time versus maxspan for the three datasets. Similarly, the ITP-Miner algorithm runs 2–5
times faster than the FITI algorithm for the WINNER dataset, 3-6 times faster for the LOSER dataset, and 6–
10 times faster for the GAZELLE dataset. Thus, the experimental results show that the ITP-Miner algorithm
outperforms the FITI algorithm in all cases.
0
50
100
150
200
0.04% 0.08% 0.12% 0.16% 0.20%
Minimum support
Memory(MByte)
FITI
ITP-Miner
Fig. 21. Minimum support versus memory usage, Dataset 1 with maxspan = 3.
0
100
200
300
400
500
10% 12% 14% 16%
Minimum support
Memory(MByte)
FITI
ITP-Miner
Fig. 22. Minimum support versus memory usage, Dataset 2 with maxspan = 5.
0
200
400
600
800
0.10% 0.12% 0.14% 0.16% 0.18% 0.20%
Minimum support
Memory(MByte)
FITI
ITP-Miner
Fig. 23. Minimum support versus memory usage, Dataset 3 with maxspan = 15.
A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3469
0
100
200
300
400
6% 8% 10% 12% 14%
Minimum supportRuntime(sec)
FITI
ITP-Miner
Fig. 24. WINNER: run time versus minimum support with maxspan = 4.
0
100
200
300
400
500
600
700
800
4% 6% 8% 10% 12%
Minimum support
Runtime(sec)
FITI
ITP-Miner
Fig. 25. LOSER: run time versus minimum support with maxspan = 4.
0
100
200
300
400
0.06% 0.07% 0.08% 0.09% 0.10%
Minimum support
Runtime(sec)
FITI
ITP-Miner
Fig. 26. GAZELLE: run time versus minimum support with maxspan = 4.
0
40
80
120
160
0 1 2 3 4 5 6
Maxspan
Runtime(sec)
FITI
ITP-Miner
Fig. 27. WINNER: run time versus maxspan with minsup = 10%.
3470 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
When mining the WINNER and the LOSER datasets, some interesting inter-transaction rules were found.
For WINNER, a sample rule found was ‘‘{FTS(0),DOW(2),NDQ(2)} ! {ASX(3)}’’. That is, if FTS rises on
the first day, DOW and NDQ rise on the third day, then ASX will rise on the fourth day with support = 14%
and confidence = 71%. On the other hand, for LOSER, a sample rule found was ‘‘{NKY(0),-
DOW(1),NDQ(1)} ! {ASX(3)}’’. That is, if NKY falls on the first day, DOW and NDO fall on the second
day, then ASX will fall on the fourth day with support = 11% and confidence = 70%. Then, for the
GAZELLE dataset, we found that the run time of the FITI algorithm increases slowly when maxspan  1,
as shown in Fig. 29. This is because the customer behavior patterns in the GAZELLE dataset are quite inde-
pendent. Thus, few inter-transaction patterns are generated when maxspan  1.
For the CHESS dataset, Fig. 30 shows the run time versus the minimum support and Fig. 31 shows the run
time versus maxspan. In both figures, we observe that when the minimum support is higher then 93% or the
maxspan is lower than 2, the FITI algorithm is faster than the ITP-Miner algorithm. The reason is that
0
30
60
90
120
0 1 2 3 4 5 6
Maxspan
Runtime(sec)
FITI
ITP-Miner
Fig. 28. LOSER: run time versus maxspan with minsup = 9%.
0
20
40
60
80
100
0 4321 6 7 85
Maxspan
Runtime(sec)
FITI
ITP-Miner
Fig. 29. GAZELLE: run time versus maxspan with minsup = 0.07%.
0
1
10
100
1000
88% 90% 92% 94% 96% 98%
Minimum support
Runtime(sec)
FITI
ITP-Miner
Fig. 30. CHESS: run time versus minimum support with maxspan = 1.
A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3471
because the CHESS dataset contains only 76 distinct items and is very dense, where short patterns tend to
appear very frequently. When the minimum support is high or the maxspan is low, the branches of short pat-
terns in an ITP-Tree are flourishing. Thus, it is time-consuming for the ITP-Miner algorithm to generate fre-
quent patterns in this case. On the other hand, when the minimum support is low or the maxspan is high, more
candidates are generated. Thus, the FITI algorithm spends much time in support counting in this case.
4.4. Experiments on mining intra-transaction patterns
When maxspan is set to 0, inter-transaction pattern mining is the same as traditional intra-transaction pat-
tern mining, so the FITI algorithm only uses the Apriori algorithm to mine frequent patterns [24]. Therefore,
in this section, we compare the ITP-Miner and the Apriori algorithms with maxspan = 0. Fig. 32 shows the
minimum support versus the run time for Dataset 2, where the minimum support varies between 5% and
10%. The ITP-Miner algorithm runs 10–50 times faster than the Apriori algorithm. Fig. 33 illustrates the min-
imum support versus the run time for GAZELLE, where the minimum support varies between 0.06% and
0.08%. The ITP-Miner algorithm runs 5-416 times faster than the Apriori algorithm. Besides Dataset 2 and
GAZELLE, the experiments on other datasets had similar results. In all cases, the ITP-Miner algorithm out-
performed the Aprori algorithm by several orders of magnitude.
We observe that there are several reasons why ITP-Miner outperforms Aprori. First, Apriori must scan the
whole database to find frequent patterns, which incurs high I/O costs. In contrast, ITP-Miner uses a simple
join operation for dat-lists. Because of unmatched dat pruning, as the length of a frequent pattern increases,
the size of its dat-list decreases, resulting in a very fast join operation. Second, to count support by subset
matching, Apriori uses a complex hash tree [2], which has poor locality and is quite time-consuming. On
the other hand, ITP-Miner has good locality, and a join operation only requires a linear scan of two dat-lists.
0
0 1 2 3
1
10
100
1000
10000
Maxspan
Runtime(sec)
FITI
ITP-Miner
Fig. 31. CHESS: run time versus maxspan with minsup = 95%.
0
40
80
120
160
200
5% 6% 7% 8% 9% 10%
Minimum support
Runtime(sec)
Apriori
ITP-Miner
Fig. 32. Minimum support versus run time, Dataset 2 with maxspan = 0.
3472 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
Third, as the minimum support is decreased, more candidates are generated. Since Apriori generates frequent
patterns in the breadth-first search manner, it requires more memory requirement and computing resources to
handle the very large number of candidates at certain levels. In contrast, ITP-Miner generates frequent pattern
in the depth-first search manner and efficiently join patterns in an extended group, so the number of searched
patterns is quite limited.
4.5. Experiments on the disk-based ITP-Miner and FITI algorithms
In the above experiments, we implement the memory-based ITP-Miner and FITI methods and compare
their performance. In this section, we modify memory-based to disk-based for both FITI and ITP-Miner
methods and use two datasets, Dataset 20
and WINNER100, to compare their performance. Note that the
transactions in Dataset 20
are generated in similar way like those in Dataset 2 and the number of transactions
in Dataset 20
is 100 times as many as that in Dataset 2, and the transactions in WINNER100 are replicated
from WINNER by 100 times.
Figs. 34 and 35 show the minimum support versus the run time and the maxspan versus the run time for
Dataset 20
with jDj ¼ 100; 000, where the former varies minimum support between 10% and 15% with max-
span = 5 and the latter varies maxspan between 0 and 8 with minsup = 11%. In both figures, the disk-based
ITP-Miner algorithm runs 3–30 times faster than the disk-based FITI algorithm. Similarly, Figs. 36 and 37
illustrate the minimum support versus the run time and the maxspan versus the run time for WINNER100
with jDj ¼ 360; 700, where the former varies minimum support between 6% and 14% with maxspan = 4
and the latter varies maxspan between 0 and 6 with minsup = 10%. In both figures, the disk-based ITP-Miner
algorithm runs 2–10 times faster than the disk-based FITI algorithm. Besides Dataset 20
and WINNER100,
the experiments on other datasets had similar results. In all cases, the disk-based ITP-Miner algorithm outper-
formed the disk-based FITI algorithm by one order of magnitude.
0
1000
2000
3000
4000
5000
0.060% 0.065% 0.070% 0.075% 0.080%
Minimum support
Runtime(sec)
Apriori
ITP-Miner
Fig. 33. Minimum support versus run time, GAZELLE with maxspan = 0.
0
800
1600
2400
3200
4000
10% 11% 12% 13% 14% 15%
Minimum support
Runtime(sec)
FITI
ITP-Miner
Fig. 34. Minimum support versus run time, Dataset 20
with maxspan = 5 and jDj ¼ 100; 000.
A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3473
5. Conclusions and future work
In this paper, we have proposed an efficient method for mining all frequent inter-transaction patterns. The
method consists of two phases. First, we design a data structure, called a dat-list, to record the dimensional
attribute information of a frequent item in a database. Then, we devise a data structure, called an ITP-tree, to
store the frequent patterns found. Second, we propose an efficient algorithm, called ITP-Miner, to mine the
frequent patterns in a depth-first search manner. By using the ITP-tree and dat-lists to mine frequent patterns,
our proposed algorithm only requires one database scan and can localize joining, pruning, and support count-
ing to a small number of dat-lists. Therefore, our proposed algorithm is more efficient than the FITI algo-
0
4000
8000
12000
16000
20000
24000
6% 8% 10% 12% 14%
Minimum support
Runtime(sec)
FITI
ITP-Miner
Fig. 36. Minimum support versus run time, WINNER100 with maxspan = 4 and jDj ¼ 360; 700.
0
500
1000
1500
2000
2500
3000
3500
0 87654321
MaxspanRuntime(sec)
FITI
ITP-Miner
Fig. 35. Maxspan versus run time, Dataset 20
with minsup = 11% and jDj ¼ 100; 000.
0
0 1 2 3 4 5 6
2000
4000
6000
8000
10000
Maxspan
Runtime(sec)
FITI
ITP-Miner
Fig. 37. Maxspan versus run time, WINNER100 with minsup = 10% and jDj ¼ 360; 700.
3474 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
rithm. The experiment results show that ITP-Miner outperforms the FITI algorithm by one order of magni-
tude and requires less main memory storage space.
Although we have shown that the ITP-Miner algorithm can efficiently mine frequent inter-transaction pat-
terns, there are still some issues to be addressed in future research. First, we may also extend the algorithm
from 1-dimensional transaction databases to higher dimension databases. Second, without generalization,
too many patterns may be mined and they may be too detailed. However, by generalizing with a concept hier-
archy, we may be able to obtain patterns or rules that are more abstract and meaningful.
Acknowledgements
The authors are grateful to the anonymous referees for their helpful comments and suggestions. This re-
search was supported in part by the National Science Council of Republic of China under Grant No. NSC
95-2416-H-002-053.
References
[1] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings of ACM
SIGMOD, 1993, pp. 207–216.
[2] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of International Conference on Very Large Data
Bases, 1994, pp. 487–499.
[3] R.J. Bayardo, Efficiently mining long patterns from databases, in: Proceedings of ACM SIGMOD, 1998, pp. 85–93.
[4] G. Chen, Q. Wei, Fuzzy association rules and the extended mining algorithms, Information Sciences 147 (1–4) (2002) 201–228.
[5] D.K.Y. Chiu, A.K.C. Wong, Multiple pattern associations for interpreting structural and functional characteristics of biomolecules,
Information Sciences 167 (1–4) (2004) 23–39.
[6] L. Feng, T.S. Dillon, J. Liu, Inter-transactional association rules for multi-dimensional contests for prediction and their application to
studying meteorological data, Data and Knowledge Engineering 37 (1) (2001) 85–115.
[7] L. Feng, J.X. Yu, H. Lu, J. Han, A template model for multidimensional inter-transactional association rules, The VLDB Journal 11
(2) (2002) 153–175.
[8] P.Y. Hsu, Y.L. Chen, C.C. Ling, Algorithms for mining association rules in bag databases, Information Sciences 166 (1–4) (2004) 31–
47.
[9] R. Kohavi, C. Broadley, B. Frasca, L. Mason, Z. Zheng, KDD-cup 2000 organizers’ report: peeling the onion, SIGKDD
Explorations 2 (2) (2000) 86–98.
[10] A.J.T. Lee, R.W. Hong, W.M. Ko, W.K. Tsao, H.H. Lin, Mining spatial association rules in image databases, Information Sciences
177 (7) (2007) 1593–1608.
[11] A.J.T. Lee, W.C. Lin, C.S. Wang, Mining association rules with multi-dimensional constraints, The Journal of Systems and Software
79 (1) (2006) 79–92.
[12] A.J.T. Lee, Y.T. Wang, Efficient data mining for calling path patterns in GSM networks, Information Systems 28 (8) (2003) 929–948.
[13] G. Lee, W. Yang, J.M. Lee, A parallel algorithm for mining multiple partial periodic patterns, Information Sciences 176 (24) (2006)
3591–3609.
[14] Q. Li, L. Feng, A. Wong, From intra-transaction to generalized inter-transaction: landscaping multidimensional contexts in
association rule mining, Information Sciences 172 (3–4) (2005) 361–395.
[15] D. Littau, D. Boley, Streaming data reduction using low-memory factored representations, Information Sciences 176 (14) (2006)
2016–2041.
[16] Y. Li, S. Zhu, X.S. Wang, S. Jajodia, Looking into the seeds of time: discovering temporal patterns in large transaction sets,
Information Sciences 176 (8) (2006) 1003–1031.
[17] H. Lu, J. Han, L. Feng, Stock movement prediction and n-dimensional inter-transaction association rules, in: Proceedings of ACM
SIGMOD Workshop on Research Issues on Data Mining and Knowledge, 1998, pp. 1–7.
[18] H. Lu, L. Feng, J. Han, Beyond intratransaction association analysis: mining multidimensional inter-transaction association rules,
ACM Transactions on Information Systems 18 (4) (2000) 423–454.
[19] A. Savasere, E. Omiecinski, S. Navathe, An efficient algorithm for mining association rules in large databases, in: Proceedings of
International Conference on Very Large Data Bases, 1995, pp. 432–443.
[20] Li. Shen, Hong Shen, Ling Cheng, New algorithms for efficient mining of association rules, Information Sciences 118 (1–4) (1999)
251–268.
[21] P. Shenoy, J.R. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, D. Shah, Turbo-charging vertical mining of large databases, in:
Proceedings of ACM SIGMOD, 2000, pp. 22–33.
[22] Y.J. Tsay, J.Y. Chiang, An efficient cluster and decomposition algorithm for mining association rules, Information Sciences 160 (1–4)
(2004) 161–171.
[23] A.K.H. Tung, H. Lu, J. Han, L. Feng, Breaking the barrier of transactions: mining inter-transaction association rules, in: Proceedings
of ACM International Conference on Knowledge and Data Discovery, 1999, pp. 423–454.
A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3475
[24] A.K.H. Tung, H. Lu, J. Han, L. Feng, Efficient mining of intertransaction association rules, IEEE Transactions on Knowledge and
Data Engineering 15 (1) (2003) 43–56.
[25] C.Y. Wang, S.S. Tseng, T.P. Hong, Flexible online association rule mining based on multidimensional pattern relations, Information
Sciences 176 (12) (2006) 1752–1780.
[26] Yahoo, Finance, Available from: http://yahoo.com.
[27] J.X. Yu, Z. Chong, H. Lu, Z. Zhang, A. Zhou, A false negative approach to mining frequent itemsets from high speed transactional
data streams, Information Sciences 176 (14) (2006) 1986–2015.
3476 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476

More Related Content

Similar to An efficient algorithm for mining frequent inter transaction patterns

Ijsrdv1 i2039
Ijsrdv1 i2039Ijsrdv1 i2039
Ijsrdv1 i2039ijsrd.com
 
Actionable Insights Through Association Mining of Exchange Rates: A Case Study
Actionable Insights Through Association Mining of Exchange Rates: A Case StudyActionable Insights Through Association Mining of Exchange Rates: A Case Study
Actionable Insights Through Association Mining of Exchange Rates: A Case StudyGurdal Ertek
 
Actionable Insights Through Association Mining of Exchange Rates: A Case Study
Actionable Insights Through Association Mining of Exchange Rates: A Case StudyActionable Insights Through Association Mining of Exchange Rates: A Case Study
Actionable Insights Through Association Mining of Exchange Rates: A Case Studyertekg
 
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using HadoopImplementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using HadoopBRNSSPublicationHubI
 
Efficient Temporal Association Rule Mining
Efficient Temporal Association Rule MiningEfficient Temporal Association Rule Mining
Efficient Temporal Association Rule MiningIJMER
 
A Quantified Approach for large Dataset Compression in Association Mining
A Quantified Approach for large Dataset Compression in Association MiningA Quantified Approach for large Dataset Compression in Association Mining
A Quantified Approach for large Dataset Compression in Association MiningIOSR Journals
 
Ijiret ashwini-kc-deadlock-detection-in-homogeneous-distributed-database-systems
Ijiret ashwini-kc-deadlock-detection-in-homogeneous-distributed-database-systemsIjiret ashwini-kc-deadlock-detection-in-homogeneous-distributed-database-systems
Ijiret ashwini-kc-deadlock-detection-in-homogeneous-distributed-database-systemsIJIR JOURNALS IJIRUSA
 
BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES
BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASESBINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES
BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASESIJDKP
 
BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES
BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES
BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES IJDKP
 
A Survey on Identification of Closed Frequent Item Sets Using Intersecting Al...
A Survey on Identification of Closed Frequent Item Sets Using Intersecting Al...A Survey on Identification of Closed Frequent Item Sets Using Intersecting Al...
A Survey on Identification of Closed Frequent Item Sets Using Intersecting Al...IOSR Journals
 
An improvised frequent pattern tree
An improvised frequent pattern treeAn improvised frequent pattern tree
An improvised frequent pattern treeIJDKP
 
Comparative study of frequent item set in data mining
Comparative study of frequent item set in data miningComparative study of frequent item set in data mining
Comparative study of frequent item set in data miningijpla
 
Comparative analysis of association rule generation algorithms in data streams
Comparative analysis of association rule generation algorithms in data streamsComparative analysis of association rule generation algorithms in data streams
Comparative analysis of association rule generation algorithms in data streamsIJCI JOURNAL
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Scienceresearchinventy
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Scienceresearchinventy
 

Similar to An efficient algorithm for mining frequent inter transaction patterns (20)

Ijsrdv1 i2039
Ijsrdv1 i2039Ijsrdv1 i2039
Ijsrdv1 i2039
 
Actionable Insights Through Association Mining of Exchange Rates: A Case Study
Actionable Insights Through Association Mining of Exchange Rates: A Case StudyActionable Insights Through Association Mining of Exchange Rates: A Case Study
Actionable Insights Through Association Mining of Exchange Rates: A Case Study
 
Ijcet 06 06_003
Ijcet 06 06_003Ijcet 06 06_003
Ijcet 06 06_003
 
Actionable Insights Through Association Mining of Exchange Rates: A Case Study
Actionable Insights Through Association Mining of Exchange Rates: A Case StudyActionable Insights Through Association Mining of Exchange Rates: A Case Study
Actionable Insights Through Association Mining of Exchange Rates: A Case Study
 
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using HadoopImplementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
 
Efficient Temporal Association Rule Mining
Efficient Temporal Association Rule MiningEfficient Temporal Association Rule Mining
Efficient Temporal Association Rule Mining
 
Efficient Temporal Association Rule Mining
Efficient Temporal Association Rule MiningEfficient Temporal Association Rule Mining
Efficient Temporal Association Rule Mining
 
A04010105
A04010105A04010105
A04010105
 
A Quantified Approach for large Dataset Compression in Association Mining
A Quantified Approach for large Dataset Compression in Association MiningA Quantified Approach for large Dataset Compression in Association Mining
A Quantified Approach for large Dataset Compression in Association Mining
 
G0361034038
G0361034038G0361034038
G0361034038
 
Ijiret ashwini-kc-deadlock-detection-in-homogeneous-distributed-database-systems
Ijiret ashwini-kc-deadlock-detection-in-homogeneous-distributed-database-systemsIjiret ashwini-kc-deadlock-detection-in-homogeneous-distributed-database-systems
Ijiret ashwini-kc-deadlock-detection-in-homogeneous-distributed-database-systems
 
BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES
BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASESBINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES
BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES
 
BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES
BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES
BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES
 
A Survey on Identification of Closed Frequent Item Sets Using Intersecting Al...
A Survey on Identification of Closed Frequent Item Sets Using Intersecting Al...A Survey on Identification of Closed Frequent Item Sets Using Intersecting Al...
A Survey on Identification of Closed Frequent Item Sets Using Intersecting Al...
 
An improvised frequent pattern tree
An improvised frequent pattern treeAn improvised frequent pattern tree
An improvised frequent pattern tree
 
Comparative study of frequent item set in data mining
Comparative study of frequent item set in data miningComparative study of frequent item set in data mining
Comparative study of frequent item set in data mining
 
Comparative analysis of association rule generation algorithms in data streams
Comparative analysis of association rule generation algorithms in data streamsComparative analysis of association rule generation algorithms in data streams
Comparative analysis of association rule generation algorithms in data streams
 
Ds
DsDs
Ds
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Science
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Science
 

Recently uploaded

Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionPriyansha Singh
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 

Recently uploaded (20)

Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorption
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 

An efficient algorithm for mining frequent inter transaction patterns

  • 1. An efficient algorithm for mining frequent inter-transaction patterns Anthony J.T. Lee *, Chun-Sheng Wang Department of Information Management, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan, ROC Received 3 May 2006; received in revised form 1 March 2007; accepted 10 March 2007 Abstract In this paper, we propose an efficient method for mining all frequent inter-transaction patterns. The method consists of two phases. First, we devise two data structures: a dat-list, which stores the item information used to find frequent inter- transaction patterns; and an ITP-tree, which stores the discovered frequent inter-transaction patterns. In the second phase, we apply an algorithm, called ITP-Miner (Inter-Transaction Patterns Miner), to mine all frequent inter-transaction pat- terns. By using the ITP-tree, the algorithm requires only one database scan and can localize joining, pruning, and support counting to a small number of dat-lists. The experiment results show that the ITP-Miner algorithm outperforms the FITI (First Intra Then Inter) algorithm by one order of magnitude. Ó 2007 Elsevier Inc. All rights reserved. Keywords: Association rules; Data mining; Inter-transaction patterns 1. Introduction Association rule mining is an important problem in the data mining field [1,2,4,5,8,10– 13,15,16,20,22,25,27]. Traditional association analysis is intra-transactional because it focuses on association relationships among itemsets within a transaction. For example, an intra-transaction association rule might be: ‘‘If the stock prices of Microsoft and IBM go up, the price of Intel is likely to go up on the same day.’’ However, intra-transactional approaches cannot capture a rule like: ‘‘If the stock prices of Microsoft and IBM go up, the price of Intel is likely to go up two days later.’’ Inter-transaction association rule mining [6,7,14,17,18,23,24] extends the association rules to describe association relationships among itemsets across several transactions. Many algorithms for mining inter-transaction association rules have been proposed. Lu et al. [17] and Feng et al. [6] applied inter-transaction association rule mining algorithms to the prediction of trends in meteoro- logical and stock market data. Lu et al. [17,18] proposed the EH-Apriori algorithm, which uses the Apriori 0020-0255/$ - see front matter Ó 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.ins.2007.03.007 * Corresponding author. E-mail addresses: jtlee@ntu.edu.tw, d91725001@ntu.edu.tw (A.J.T. Lee). Information Sciences 177 (2007) 3453–3476 www.elsevier.com/locate/ins
  • 2. algorithm to discover frequent inter-transaction itemsets. To enhance their algorithm’s efficiency, the authors used a hashing technique to reduce the number of candidate itemsets of length two. Feng et al. [7] used tem- plates to reduce the number of rules. More recently, Tung et al. [23,24] developed an algorithm, called FITI (First Intra Then Inter), which discovers frequent intra-transaction itemsets and uses them to generate fre- quent inter-transaction itemsets. All the algorithms for mining inter-transaction association rules developed thus far have been based on Apriori-like breadth-first search (BFS) approaches that search for frequent itemsets level by level. At each level, a database must be scanned once to determine the support for each candidate itemset. It has been shown that Apriori-like approaches [19,22] perform well in finding frequent intra-transaction itemsets when the item- sets are short. However, when mining long frequent itemsets, or using very small support thresholds, the per- formance of such algorithms often deteriorates dramatically. The reason is that a frequent itemset of length k implies the presence of 2k À 2 additional frequent sub itemsets, each of which must be examined. Moreover, since Apriori-like approaches may generate a large number of candidate patterns at each level, they are prone to memory shortage during the mining process. We observe that Apriori-like methods for finding frequent inter-transaction itemsets have the same drawbacks as those for finding frequent intra-transaction itemsets. Therefore, in this paper, we propose an efficient method for mining a complete set of frequent inter-trans- action itemsets (patterns). The method consists of two phases. First, we find all frequent items. For each fre- quent item found, we construct a dat-list that records the item information required for finding the frequent inter-transaction patterns. Then, we devise a data structure, called an ITP-tree, to store the discovered fre- quent inter-transaction patterns. In the second phase, we propose an algorithm, called ITP-Miner, to effi- ciently find all frequent inter-transaction patterns in a depth-first search manner. By using the ITP-tree and dat-lists to mine the frequent inter-transaction patterns, the ITP-Miner algorithm requires only one database scan and can localize joining, pruning, and support counting to a small number of dat-lists. Therefore, it is more efficient than the FITI algorithm. The remainder of this paper is organized as follows. In Section 2, we describe the problem of mining fre- quent inter-transaction patterns. Section 3 introduces our proposed algorithm, ITP-Miner, for mining fre- quent inter-transaction patterns. We explain the basic concept of the algorithm and give an example to illustrate how it works. Section 4 describes the performance evaluation. Finally, we present the conclusions in Section 5. 2. Problem description In this section, we introduce the notations and describe the problem of mining frequent inter-transaction patterns. Definition 1. Let I be a set of data items, and N be a set of non-negative integers called domain attributes. A transaction database consists of a set of transactions, where a transaction is in the form of ht; si; t 2 N and s I; t is called a dimensional attribute (or dat), and s is called an itemset. The dimensional attribute describes the properties associated with the data items, such as time and location. As it is an ordinal, it can be divided into intervals of equal length. For example, time can be divided into days, weeks, etc. These intervals can be represented by non-negative integers 0, 1, 2, and so on. An itemset is denoted by fu1; u2; . . . ; ukg, where ui is an item, 1 6 i 6 k; and items in an itemset are listed in alphabetical order, i.e., we write fa; c; dg instead of fc; a; dg. Table 1 shows a transaction database containing six transactions. Definition 2. An itemset s ¼ fu1; u2; . . . ; ukg at the dimensional attribute t is called an extended itemset and denoted by Dts ¼ fu1ðtÞ; u2ðtÞ; . . . ; ukðtÞg. Before mining inter-transaction association rules, we need to find the frequent inter-transaction itemsets that span several transactions. Since an inter-transaction itemset can span many intervals, discovering all such itemsets would require a lot of resources, but a user may only be interested in rules that span a certain number of intervals. Therefore, to avoid wasting resources by mining unwanted rules, we introduce a parameter called maxspan. When mining for inter-transaction association rules, we only mine rules whose span is equal to or less than the maxspan intervals. 3454 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
  • 3. Definition 3. Let Dt0 a0; Dt1 a1; . . . ; Dtn an be extended itemsets, where t0 t1 Á Á Á tn, and tn À t0 is not greater than a user-specified maximum span threshold maxspan. Then, let a ¼ D0a0 [ Dt1Àt0 a1 [ Á Á Á [ DtnÀt0 an, where [ is an operator that combines multiple extended itemsets into single unit, a is called an inter-transaction itemset (or a pattern) and t0 is called the reference point. An item in the inter-transaction itemset is called an extended item. Since a takes t0 as the reference point, we say that a starts from t0. Note that the reference point of an inter-transaction itemset is defined as the smallest dimensional attribute of the extended itemsets in the inter- transaction itemset. For example, let us consider the database as shown in Table 1, where fað1Þ; bð1Þg is the extended itemset for the first transaction. Let maxspan = 1. The first two transactions form an inter-transaction itemset, fað0Þ; bð0Þ; að1Þ; cð1Þ; dð1Þg, where we take the dimensional attribute of the first transaction as the reference point. Definition 4. Let uðiÞ and vðjÞ be two extended items. If (i ¼ j and u ¼ v), we can say that uðiÞ ¼ vðjÞ. Also, if (i ¼ j and u v) or ði jÞ, we say that uðiÞ vðjÞ. Definition 5. Let a ¼ fu0ð0Þ; u1ði1Þ; u2ði2Þ; . . . ; unðinÞg and b ¼ fv0ð0Þ; v1ðj1Þ; v2ðj2Þ; . . . ; vnðjnÞg be two patterns, where n P 1. We say that a ¼ b if uiðiiÞ ¼ viðjiÞ for 0 6 i 6 n, where i0 ¼ j0 ¼ 0. We also say that a b, if (1) u0ð0Þ v0(0); or (2) there exists k ðP 0Þ such that uiðiiÞ ¼ viðjiÞ for 0 6 i 6 k n, and ukþ1ðikþ1Þ vkþ1ðjkþ1Þ. Definition 6. Let a ¼ fu0ð0Þ; u1ði1Þ; u2ði2Þ; . . . ; unðinÞg and b ¼ fv0ð0Þ; v1ðj1Þ; v2ðj2Þ; . . . ; vmðjmÞg be two patterns, where 1 6 n 6 m. a is a subset of b if we find n extended items vk1ðjk1Þ; vk2ðjk2Þ; . . . ; vknðjknÞ in b such that u0ð0Þ ¼ v0ð0Þ; u1ði1Þ ¼ vk1ðjk1Þ; u2ði2Þ ¼ vk2ðjk2Þ; . . . ; and unðinÞ ¼ vknðjknÞ. We can also say that b contains a. For example, both fbð0Þ; að1Þ; cð3Þg and fað0Þ; dð1Þg are subsets of fað0Þ; bð0Þ; að1Þ; dð1Þ; bð3Þ; cð3Þg. Definition 7. In a transaction database D, the number of transactions is denoted by jDj. Let a be a pattern, and Ta be all inter-transaction itemsets in D, where every inter-transaction itemset in T a contains a. The count of a, count(a), is denoted by jTaj and the support of a, sup(a), is denoted by countðaÞ=jDj. If sup(a) is not less than the user-specified minimum support threshold minsup, a is called a frequent inter-transaction pattern (or frequent pattern for short). Definition 8. The number of extended items in a pattern is called the length of the pattern. A pattern of length k is called a k-pattern. For example, the length of fað0Þ; að1Þ; dð1Þ; bð3Þ; cð3Þg is equal to 5. Definition 9. An inter-transaction association rule is written in the form of a ! b, where both a and a [ b are frequent patterns, a b ¼ /, conf ða ! bÞ is not less than the user-specified minimum confidence, and the con- fidence of the rule is defined as supða [ bÞ=supðaÞ. Suppose that minsup = 50% and maxspan = 1. Let us consider the database D shown in Table 1, where the database contains six transactions (i.e., jDj ¼ 6). To compute the support of fað0Þ; bð0Þ; dð1Þg, we first form an inter-transaction itemset fað0Þ; bð0Þ; að1Þ; cð1Þ; dð1Þg by using the first two transactions and find that it contains fað0Þ; bð0Þ; dð1Þg. Thus, countðfað0Þ; bð0Þ; dð1ÞgÞ ¼ 1. Next, we form an inter-transaction itemset Table 1 A transaction database Dat Itemset 1 {a,b} 2 {a,c,d} 3 {a} 4 {a,b,c,d} 5 {a,b,d} 6 {a,d} A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3455
  • 4. fað0Þ; bð0Þ; cð0Þ; dð0Þ; að1Þ; bð1Þ; dð1Þg by using the fourth and fifth transactions, and find that it contains fað0Þ; bð0Þ; dð1Þg. Thus, the count of fað0Þ; bð0Þ; dð1Þg is increased by one. Finally, we form an inter- transaction itemset fað0Þ; bð0Þ; dð0Þ; að1Þ; dð1Þg by using the last two transactions and find that it contains fað0Þ; bð0Þ; dð1Þg. Thus, the count of fað0Þ; bð0Þ; dð1Þg is increased by one. That is, countðfað0Þ; bð0Þ; dð1ÞgÞ ¼ 3, and supðfað0Þ; bð0Þ; dð1ÞgÞ ¼ 3=6 ¼ 0:5. Consequently, fað0Þ; bð0Þ; dð1Þg is a frequent pattern. Similarly, we find that supðfað0Þ; bð0ÞgÞ ¼ 0:5. That is, fað0Þ; bð0Þg is also a frequent pattern. Therefore, we have an inter-transaction association rule fað0Þ; bð0Þg ! fdð1Þg with the confidence = 0.5/ 0.5 = 100%. The objective of mining frequent inter-transaction patterns is to find a complete set of frequent patterns in a transaction database with respect to user-specified minsup and maxspan thresholds. 3. The ITP-Miner algorithm We now introduce the ITP-Miner algorithm. In Section 3.1 we describe a data structure, called a dat-list, and a search tree, called an ITP-tree. We then explain the main concept of our proposed algorithm in Section 3.2, and describe it in detail in Section 3.3. The memory management and complexity analysis of ITP-Miner are discussed in Section 3.4. Finally, we give an example to demonstrate how the algorithm works in Section 3.5. 3.1. The dat-list and ITP-tree We have devised a data structure, called the dimensional attribute list (dat-list), to store the dimensional attributes of a frequent pattern during the mining process. Let aht1; t2; . . . ; tni be a dat-list, where a is a pattern and ht1; t2; . . . ; tni is a list of dats in which the inter-transaction itemset starting from ti contains a; 1 6 i 6 n. For example, fað0Þ; dð1Þgh1; 3; 4; 5i is a dat-list containing 4 dats, 1, 3, 4, and 5; i.e., the inter-transaction item- sets starting from the first, third, fourth, and fifth transactions in the database shown in Table 1 contain the pattern fað0Þ; dð1Þg. Table 2 shows all dat-lists of the frequent patterns found in the database shown in Table 1, where minsup = 50% and maxspan = 1. Note that the count of a pattern is equal to the number of dats contained in its dat-list. For example, the support counts of fað0Þ; dð1Þgh1; 3; 4; 5i and fað0Þ; bð0Þ; að1Þ; dð1Þgh1; 4; 5i are 4 and 3, respectively. Next, we introduce a data structure called an ITP-tree, which stores the dat-lists generated during the min- ing process. Definition 10. An ITP-tree is a search tree denoted by T. A node in T represents a dat-list, aht1; t2; . . . ; tni, where n P 1; a is a frequent pattern, and the extended items in a are listed in increasing order. The parent of node aht1; t2; . . . ; tni is the dat-list b t0 1; t0 2; . . . ; t0 m , where m P n, and b is the pattern obtained by removing the last extended item from a. The root of T is a null pattern, i.e., {}. Siblings with the same parent are sorted in increasing order. Consider the complete ITP-tree shown in Fig. 1, where the database contains only two items, a and b; the maxspan is equal to 1; and the list of dats associated with each dat-list is omitted. Note that all possible inter- transaction patterns of the ITP-tree in Fig. 1 can be enumerated in a depth-first search manner. Table 2 The dat-lists Length Dat-list 1 fað0Þgh1; 2; 3; 4; 5; 6i, fbð0Þgh1; 4; 5i, fdð0Þgh2; 4; 5; 6i 2 fað0Þ; bð0Þgh1; 4; 5i, fað0Þ; dð0Þgh2; 4; 5; 6i, fað0Þ; að1Þgh1; 2; 3; 4; 5i, fað0Þ; dð1Þgh1; 3; 4; 5i, fbð0Þ; að1Þgh1; 4; 5i, fbð0Þ; dð1Þgh1; 4; 5i, fdð0Þ; að1Þgh2; 4; 5i 3 fað0Þ; bð0Þ; að1Þgh1; 4; 5i, fað0Þ; bð0Þ; dð1Þgh1; 4; 5i, fað0Þ; dð0Þ; að1Þgh2; 4; 5i, fað0Þ; að1Þ; dð1Þgh1; 3; 4; 5i, fbð0Þ; að1Þ; dð1Þgh1; 4; 5i 4 fað0Þ; bð0Þ; að1Þ; dð1Þgh1; 4; 5i 3456 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
  • 5. 3.2. Main concept First, we consider the generation of a candidate 2-pattern by joining two frequent 1-patterns. Definition 11. If both a and b are frequent 1-patterns, a is joinable to b. Let t1 and t2 be the dats in the dat-lists of a ¼ fuð0Þg and b ¼ fvð0Þg, respectively. If (u v and 0 6 t2 À t1 6 maxspan) or (u P v and 0 t2 À t1 6 maxspan), then: (1) we generate a candidate 2-pattern fuð0Þ; vðt2 À t1Þg if it has not been generated already; (2) we add t1 to the candidate’s dat-list; and (3) we increase the count of fuð0Þ; vðt2 À t1Þg by one. For each candidate 2-pattern generated, we check that its support is not less than the minsup. If this is the case, it is a frequent 2-pattern and we store its dat-list in the ITP-tree. Thus, for a frequent 2-pattern fuð0Þ; vðwÞg; 0 6 w 6 maxspan, we store its dat-list as a child of a’s dat-list in the ITP-tree. Fig. 2 illustrates how we find a frequent 2-pattern by joining the dat-lists of two frequent 1-patterns fað0Þgh1; 2; 3; 4; 5; 6i and fbð0Þgh2; 4; 5; 6i, where minsup = 50% and maxspan = 1. Here, we find two frequent 2-patterns, fað0Þ; bð0Þg and fað0Þ; bð1Þg. In Fig. 2(a), the dats used to generate pattern fað0Þ; bð0Þg are connected by black arrows, while the dats used to generate pattern fað0Þ; bð1Þg are connected by dashed arrows. Fig. 2(b) shows the ITP-tree formed by the dat-lists of the patterns found. Note that there are six dats in fað0Þgh1; 2; 3; 4; 5; 6i; however, there are only four in fað0Þ; bð0Þgh2; 4; 5; 6i and fað0Þ; bð1Þgh1; 3; 4; 5i. Definition 12. Let a ¼ fu0ð0Þ; u1ði1Þ; . . . ; ukÀ1ðikÀ1Þg and b ¼ fv0ð0Þ; v1ðj1Þ; . . . ; vkÀ1ðjkÀ1Þg be two frequent k-patterns, k 1. a is joinable to b if the first k À 1 extended items of a are equal to those of b, and the last extended item of a is less than that of b. The joined pattern is obtained by appending the last extended item of b to a; that is, the joined pattern is fu0ð0Þ; u1ði1Þ; . . . ; ukÀ1ðikÀ1Þ; vkÀ1ðjkÀ1Þg. We now consider how to find frequent ðk þ 1Þ-patterns by joining two joinable k-patterns. Assume there are two dat-lists of 2-patterns, fað0Þ; bð0Þgh1; 4; 5i and fað0Þ; að1Þgh1; 2; 3; 4; 5i. Note that the first extended item of fað0Þ; bð0Þg is equal to that of fað0Þ; að1Þg. To join both 2-patterns, we match the dats in fað0Þ; bð0Þgh1; 4; 5i with the dats in fað0Þ; að1Þgh1; 2; 3; 4; 5i. If a dat in the former is equal to a dat in the latter, {} Length-0 {a(0)} {b(0)} Length-1 {a(0),b(0)} {a(0),a(1)} {a(0),b(1)} {b(0),a(1)} {b(0),b(1)} Length-2 {a(0),b(0),a(1)} {a(0),b(0),b(1)}{a(0),a(1),b(1)} {b(0),a(1),b(1)} Length-3 {a(0),b(0),a(1),b(1)} Length-4 Fig. 1. An ITP-tree. {a(0)}1, 2, 3, 4, 5, 6 {b(0)} 2, 4, 5, 6 (a) Dat-lists (b) ITP-tree {} {a(0)}1,2,3,4,5,6 {a(0),b(0)}2,4,5,6 {a(0),b(1)}1,3,4,5 Fig. 2. Generating 2-patterns by joining two 1-patterns: (a) dat-lists and (b) ITP-tree. A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3457
  • 6. the count of the joined pattern ðfað0Þ; bð0Þ; að1ÞgÞ is increased by one. By matching every dat in fað0Þ; bð0Þgh1; 4; 5i with its counterpart in fað0Þ; að1Þgh1; 2; 3; 4; 5i, we find that the count of fað0Þ; bð0Þ; að1Þg is equal to 3. Thus, it is a frequent 3-pattern. We store fað0Þ; bð0Þ; að1Þgh1; 4; 5i in the ITP-tree as shown in Fig. 3, where h1; 4; 5i are the matched dats between h1; 4; 5i and h1; 2; 3; 4; 5i. The ITP-Miner algorithm generates the frequent patterns in a depth-first search manner. We use the fre- quent patterns obtained at this level (frequent k-pattern) to generate the frequent patterns at the next level (frequent ðk þ 1Þ-pattern). To do so, we first define the joinable and extended groups for a frequent k-pattern. A frequent k-pattern, a, only needs to join the patterns in its joinable group to generate the frequent ðk þ 1Þ- pattern extended from a. Generating the frequent ðk þ 1Þ-pattern in this manner can localize joining, pruning, and support counting to a small number of dat-lists. Definition 13. Let a be a frequent k-pattern, where k P 1. The joinable group of a is JðaÞ ¼ fa1; a2; . . . ; ang, where a is joinable to ai, which is a frequent k-pattern, 1 6 i 6 n. Also, the extended group of a is EðaÞ ¼ fb1; b2; . . . ; bmg, where every bj is a frequent ðk þ 1Þ-pattern extended from a; 1 6 j 6 m. Since bj is extended from a, its dat-list will be a child of a’s dat-list in the ITP-tree. Lemma 1. Let a be a frequent k-pattern; and b be a frequent ðk þ 1Þ-pattern that is extended from a, where k P 1. Then, we have: (1) EðaÞ ¼ fhjh ¼ ða join cÞ, h is frequent, and c 2 JðaÞg; and (2) JðbÞ ¼ fhjh is greater than b, and h 2 EðaÞg. We can use Lemma 1 to generate frequent patterns and store their dat-lists in the ITP-tree, where the fre- quent patterns are generated in a depth-first search manner. The generation of frequent patterns consists of two steps. Let a be a frequent k-pattern. First, a is joined to each c in JðaÞ, after which we can find all the frequent ðk þ 1Þ-pattern patterns extended from a, referred to as EðaÞ. Next, for each frequent ðk þ 1Þ-pattern b in EðaÞ; JðbÞ is the set containing the patterns in EðaÞ that are greater than b. We can therefore join b to every c in JðbÞ to find all the frequent (k + 2)-patterns extended from b, referred to as EðbÞ. Let a ¼ b and k ¼ k þ 1. We perform the second step recursively in a depth-first search manner until no more frequent pat- terns can be generated. This is the main concept of our proposed method. Let us consider the example shown in Fig. 4, where a ¼ fað0Þ; bð0Þg and b ¼ fað0Þ; bð0Þ; að1Þg. From the figure, we know that JðaÞ ¼ ffað0Þ; dð0Þg; fað0Þ; að1Þg; fað0Þ; dð1Þgg; EðaÞ ¼ ffað0Þ; bð0Þ; að1Þg; fað0Þ; bð0Þ; dð1Þgg; JðbÞ ¼ ffað0Þ; bð0Þ; dð1Þgg; and EðbÞ ¼ ffað0Þ; bð0Þ; að1Þ; dð1Þgg. The figure shows that EðaÞ can be obtained by joining a with every pattern in JðaÞ, while EðbÞ can be obtained by joining b with every pattern in JðbÞ. 3.3. The ITP-Miner algorithm The ITP-Miner algorithm is comprised of three functions: Join2, DFS, and JoinK. The algorithm and its functions are shown in Figs. 5–8, respectively. The Join2 function is used to find frequent 2-patterns by joining two frequent 1-patterns, while the JoinK function is used to find frequent ðk þ 1Þ-pattern by joining two fre- quent k-patterns. {a(0)}1,2,3,4,5,6 {a(0),b(0)}1, 4, 5 {a(0),b(0)}1,4,5 {a(0),a(1)}1,2,3,4,5 {a(0),a(1)}1,2,3,4,5 {a(0),b(0),a(1)}1,4,5 {} Fig. 3. Generating a 3-pattern by joining two frequent 2-patterns. 3458 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
  • 7. In Step 1 of Fig. 5, the database is scanned to find all frequent 1-patterns and their dat-lists, after which a hash table, H2, is constructed and the dat-lists are inserted into the ITP-tree, T. In Step 4, two frequent 1-pat- terns, a and c, are joined to generate frequent 2-patterns and obtain E(a). Note that the joinable group of a frequent 1-pattern is equal to the set containing all frequent 1-patterns. In Step 5, for each frequent 2-pattern b, the DFS function is applied recursively in a depth-first search manner to find all frequent inter-transaction patterns. When running the ITP-Miner algorithm, we use the following pruning strategies to reduce the search space and speed up the mining process. (1) Infrequent 2-pattern pruning. In Step 1 of the algorithm (Fig. 5), if we generate n 1-patterns in Step 1, we will perform n2 join operations in Step 4. Since many 2-patterns are infrequent, it is wasteful to perform so many operations. To resolve the problem, we use a hashing approach [24] to check if a pair of 1-pat- terns can generate a frequent 2-pattern before joining them. In Step 1, therefore, we construct a hash table, H2, when searching for 1-patterns and dat-lists. Every bucket of H2 aggregates the counts of {a(0)}1,2,3,4,5,6 Length-1 {a(0),b(0)} {a(0),d(0)} {a(0),a(1)} {a(0),d(1)} Length-2 1,4,5 2,4,5,6 1,2,3,4,5 1,3,4,5 {a(0),b(0),d(0)} {a(0),b(0),a(1)} {a(0),b(0),d(1)} Length-3 4,5 1,4,5 1,4,5 {a(0),b(0),a(1)} {a(0),b(0),d(1)} Length-3 1,4,5 1,4,5 {a(0),b(0),a(1),d(1)} Length-4 1,4,5 β J(β) E(β) α J(α) E(α) First Step Second Step Fig. 4. Generating frequent patterns in an ITP-tree. Algorithm: ITP-Miner Input: transaction database D, minimum support minsup, maximum span maxspan. Output: all frequent inter-transaction patterns FP. Method: 1 Scan D to find all frequent 1-patterns and their dat-lists. Construct a hash table, H2, and insert the dat-lists into the ITP-tree T; 2 for each frequent 1-pattern α do 3 Insert α into FP; 4 for each frequent 1-pattern γ in J(α) do call Join2(α, γ, T, minsup, maxspan, |D|) to get E(α); 5 for each frequent 2-pattern β∈E(α) do call DFS(β, FP, T, minsup, maxspan, |D|); 6 end for 7 Output FP. Fig. 5. The ITP-Miner algorithm A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3459
  • 8. all length-2 candidates hashed to it. Given a candidate 2-pattern generated by items u0 and v0 with dats tu0 and tv0 , respectively, its hash value g is computed by g ¼ ððu0 à ðmaxspan þ 1Þ þ ðtv0 À tu0 ÞÞà N1 þ v0 Þ mod Hashsize, where 0 6 tv0 À tu0 6 maxspan, N1 is the number of distinct items in the database, and Hashsize is the size of the hash table. The count of H2½gŠ is then incremented by one. In Step 1 of Fig. 6, before joining two 1-patterns, i.e., a ¼ fuð0Þg and c ¼ fvð0Þg, we compute all possible hash values for both patterns by using h ¼ ððu à ðmaxspan þ 1Þ þ wÞ Ã N1 þ vÞ mod Hashsize, where 0 6 w 6 maxspan. If H2½hŠ is less than minsup à jDj for each possible hash value, h, we do not need to join a and c to generate a 2-pattern. (2) Infrequent k-pattern pruning. Before joining two k-patterns when k 1, we use the hash table H2 to filter out the join operations. In Step 1 of Fig. 8, before joining two k-patterns, b ¼ fu0ð0Þ; u1ði1Þ; . . . ; ukÀ1ðikÀ1Þg and c ¼ fv0ð0Þ; v1ðj1Þ; . . . ; vkÀ1ðjkÀ1Þg, we know that b joining c will yield fu0ð0Þ; u1ði1Þ; . . . ; ukÀ1ðikÀ1Þ; vkÀ1ðjkÀ1Þg. Since fukÀ1ðikÀ1Þ; vkÀ1ðjkÀ1Þg is a subset of the joined pattern, it can be transformed into a 2-pattern, fukÀ1ð0Þ; vkÀ1ðjkÀ1 À ikÀ1Þg, by taking ikÀ1 as the reference point. Thus, we can compute its hash value using h ¼ ððukÀ1 à ðmaxspan þ 1Þ þ ðjkÀ1 À ikÀ1ÞÞà Function: Join2(α, γ, T, minsup, maxspan, |D|) Input: two frequent 1-patterns α ={u(0)} and γ ={v(0)}, the ITP-tree T, minimum support minsup, maximum span maxspan, and the number of transactions in the database |D|. Output: the ITP-tree T. 1 if there exists w, 0 w maxspan, such that H2[((u * (maxspan+1) + w) * N1 + v) mod Hashsize] minsup * |D| then 2 for each dat tα∈ α’s dat-list do 3 for each dat tγ ∈γ’s dat-list do 4 if (u v and 0 tγ-tα maxspan) or (u v and 0 tγ-tα maxspan) then 5 Generate a candidate 2-pattern {u(0),v(tγ-tα)} if it has not been generated already; 6 Add tα to its dat-list; 7 end if 8 end for 9 end for 10 for each candidate 2-pattern generated do 11 if it is frequent then 12 Insert its dat-list into T to be a child of α’s dat-list; 13 end if 14 end for 15 end if; Fig. 6. The Join2 function Function: DFS(β, FP, T, minsup, maxspan, |D|) Input: a frequent pattern β, all frequent inter-transaction patterns FP, the ITP-tree T, minimum support minsup, maximum span maxspan, and the number of transactions in the database |D|. Output: the ITP-tree T. 1 Insert β into FP; 2 for each pattern γ ∈J(β), call JoinK(β, γ, T, minsup, maxspan, |D|) to get E(β); 3 for each pattern θ∈E(β), call DFS(θ, FP, T, minsup, maxspan, |D|); 4 Remove β ’s dat-list from T; Fig. 7. The DFS function. 3460 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
  • 9. N1 þ vkÀ1Þ mod Hashsize. If H2½hŠ is less than minsup à jDj, we do not need to join b and c to generate a ðk þ 1Þ-pattern. (3) Unmatched dat pruning. In the ITP-Miner algorithm, we join two joinable k-patterns to generate fre- quent ðk þ 1Þ-patterns and construct a dat-list for each of the latter. In Step 6 of Fig. 6 and Step 4 of Fig. 8, when we construct the dat-list for a pattern, we only add the matched dats of both k-patterns to the dat-list of the newly generated ðk þ 1Þ-pattern. 3.4. Memory management and complexity analysis of ITP-Miner In Step 1 of Fig. 5, ITP-Miner only scans the database once to find all frequent 1-patterns and store their dat-lists in the memory, where the space for these dat-lists is smaller than that for the original database. From Definition 12, when k 1 we observe that a k-pattern, b, is only joinable to siblings larger than b in an ITP- Tree. Since the siblings are sorted in increasing order and ITP-Miner processes the branches of an ITP-Tree in a depth-first search manner, in Step 4 of Fig. 7, when the algorithm leaves the DFS function, the memory space of b’s dat-list can be released from T immediately. Therefore, ITP-Miner’s memory requirements are not substantial. If the memory requirements exceed available memory, ITP-Miner can be modified to read and write temporary dat-lists from and to a disk, similar to the methods used in [21]. Assume that the number of frequent 1-patterns, the number of all frequent patterns, and the length of the longest frequent pattern be jL1j; jFPj, and mp respectively. Let us consider a pattern a of various lengths in an ITP-Tree T. (1) If a is a 1-pattern, from Definition 11, we observe that the number of join operations for a is bounded by jL1j à ðmaxspan þ 1Þ. (2) If a is a 2-pattern, also from Definition 11, we observe that the number of a’s siblings, jJðaÞj, is bounded by jL1j à ðmaxspan þ 1Þ, which is the maximum number of join operations for a by Definition 12. (3) If a is a k-pattern, where k 2, from Lemma 1, we observe that the number of a’s siblings is bounded by that of a’s parent in T, so the number of join operations for a is also bounded by jL1j à ðmaxspan þ 1Þ. Therefore, by the above analyses, we conclude that for any pattern a in an ITP-Tree T, the maximum number of join operations for a is bounded by jL1j à ðmaxspan þ 1Þ. Since the number of nodes in T is equal to jFPj and the length of a dat-lists is bounded by OðjDjÞ, the time complexity of ITP- Miner is bounded by OðjFPj à jL1j à ðmaxspan þ 1Þ Ã jDjÞ. Likewise, we can observe that for any pattern a in an ITP-Tree T, the maximum number of JðaÞ that must be kept in the memory for join operations is bounded by jL1j à ðmaxspan þ 1Þ. Since ITP-Miner processes branches in a depth-first search manner, the maximum number of nodes kept in the memory for the recursive function calls is bounded by OðmpÞ. Therefore, the space complexity of ITP-Miner is bounded by Oðmp à jL1j à ðmaxspan þ 1Þ Ã jDj þ HashsizeÞ, where Hashsize is the memory space required by H2. Function: JoinK(β, γ, T, minsup, maxspan, |D|) Input: two frequent k-patterns β ={u0(0),u1(i1),...,uk-1(ik-1)} and γ ={v0(0),v1(j1),...,vk-1(jk-1)}, the ITP-tree T, minimum support minsup, maximum span maxspan, and the number of transactions in the database |D|. Output: the ITP-tree T. 1 if H2[((uk-1 * (maxspan+1) + (jk-1 - ik-1)) * N1 + vk-1) mod Hashsize] minsup * |D|, then 2 Let θ = {u0(0),u1(i1),...,uk-1(ik-1),vk-1(jk-1)}; 3 for each dat x∈β’s dat-list do 4 Check if any dat y∈γ’s dat-list such that y = x. If yes, increase θ’s count by one and add x to θ’s dat-list; 5 end for 6 if θ is frequent then 7 Insert its dat-list into T to be a child of β’s dat-list; 8 end if 9 end if; Fig. 8. The JoinK function. A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3461
  • 10. 3.5. An example Let us consider the example as shown in Table 1, where minsup ¼ 50% and maxspan = 1. The frequent inter-transaction patterns and their dat-lists generated during the mining processes are shown in Fig. 9. In order to explain how the hash table H2 is constructed, we assume that N1 ¼ 4; Hashsize ¼ 19, and the identity number of each item is a ¼ 1; b ¼ 2; c ¼ 3, and d = 4. When the first transaction fa; bg is read from the data- base, it forms a candidate 2-pattern, fað0Þ; bð0Þg. Its hash value gfað0Þ;bð0Þg ¼ ðða à ðmaxspan þ 1Þ þ ðtb À taÞÞà N1 þ bÞ mod Hashsize ¼ ðð1 à ð1 þ 1Þ þ ð1 À 1ÞÞ Ã 4 þ 2Þ mod 19 ¼ 10, so the count of H2[10] is incremented by 1, where ta and tb are the dats of a and b, respectively. Then, when the second transaction fa; c; dg is read from the database, three candidate 2-patterns can be formed: fað0Þ; cð0Þg; fað0Þ; dð0Þg, and fcð0Þ; dð0Þg. Meanwhile, the candidate 2-patterns formed by the first and second transactions are listed as follows: fað0Þ; að1Þg; fað0Þ; cð1Þg; fað0Þ; dð1Þg; fbð0Þ; að1Þg; fbð0Þ; cð1Þg, and fbð0Þ; dð1Þg. The hash values of all these candidate 2-patterns can be computed and the counts in H2 can be incremented in the same way as above. Table 3 shows the content of H2 after the database has been scanned. During the database is scanned, we also compute the count for each 1-pattern. The counts of patterns fað0Þg; fbð0Þg; fcð0Þg, and fdð0Þg, are equal to 6, 3, 2, and 4, respectively. Since the minsup is 50%, a pattern is frequent if its count is not less than 3. Consequently, fað0Þgh1; 2; 3; 4; 5; 6i, fbð0Þgh1; 4; 5i, and fdð0Þgh2; 4; 5; 6i are inserted into the ITP-tree as shown in Fig. 9. Note that, since we have three frequent 1-patterns: fað0Þg; fbð0Þg, and fdð0Þg, there are three joinable groups: Jðfað0ÞgÞ ¼ ffað0Þg; fbð0Þg; fdð0Þgg, Jðfbð0ÞgÞ ¼ ffað0Þg; fbð0Þg; fdð0Þgg, and Jðfdð0ÞgÞ ¼ ffað0Þg; fbð0Þg; fdð0Þgg. We now describe how to join fað0Þg to every pattern in Jðfað0ÞgÞ to find frequent 2-patterns. For fað0Þgh1; 2; 3; 4; 5; 6i and fbð0Þgh1; 4; 5i, according to the ‘‘infrequent 2-pattern pruning’’ strategy, we com- pute all possible hash values: hfað0Þ;bð0Þg ¼ ðð1 à ð1 þ 1Þ þ 0Þ Ã 4 þ 2Þ mod 19 ¼ 10 and hfað0Þ;bð1Þg ¼ ðð1 à ð1 þ 1Þ þ 1Þ Ã 4 þ 2Þ mod 19 ¼ 14. Pattern fað0Þ; bð0Þg cannot be pruned since H2½10Š ¼ 5 3; however, pattern fað0Þ; bð1Þg can be pruned since H2½14Š ¼ 2 3. The dat-lists of fað0Þg and fbð0Þg share three dats, h1; 4; 5i; therefore, by joining them we obtain a frequent 2-pattern fað0Þ; bð0Þg with a list of dats, h1; 4; 5i. Sim- ilarly, by joining fað0Þgh1; 2; 3; 4; 5; 6i and fdð0Þgh2; 4; 5; 6i, we obtain fað0Þ; dð0Þgh2; 4; 5; 6i. Moreover, fað0Þ; að1Þgh1; 2; 3; 4; 5i can be generated, since fað0Þg’s dat-list contains five dats h1; 2; 3; 4; 5i, each of which is one less than fað0Þg’s dats h2; 3; 4; 5; 6i; fað0Þ; dð1Þgh1; 3; 4; 5i can be generated in the same manner. Con- sequently, Eðfað0ÞgÞ ¼ ffað0Þ; bð0Þg; fað0Þ; dð0Þg; fað0Þ; að1Þg; fað0Þ; dð1Þgg, as shown in Fig. 9. Note that according to the ‘‘unmatched dat pruning’’ strategy, we only add the matched dats of both 1-patterns to the dat-list of the newly generated 2-pattern. Since we have Eðfað0ÞgÞ, we can join every 2-pattern b in Eðfað0ÞgÞ to every 2-pattern in JðbÞ to get frequent 3-patterns. Consider the pattern fað0Þ; bð0Þg. We know that the patterns joinable to fað0Þ; bð0Þg are those in Eðfað0ÞgÞ that are larger than fað0Þ; bð0Þg. In other words, Jðfað0Þ; bð0ÞgÞ ¼ ffað0Þ; dð0Þg; fað0Þ; að1Þg; fað0Þ; dð1Þgg. According to the ‘‘infrequent k-pattern pruning’’ strategy, for joining {} Length-0 {a(0)}1,2,3,4,5,6 {b(0)}1,4,5 {d(0)}2,4,5,6 Length-1 {a(0),b(0)} {a(0),d(0)} {a(0),a(1)} {a(0),d(1)} {b(0),a(1)} {b(0),d(1)} {d(0),a(1)} Length-2 1,4,5 2,4,5,6 1,2,3,4,5 1,3,4,5 1,4,5 1,4,5 2,4,5 {a(0),b(0),a(1)} {a(0),b(0),d(1)} {a(0),d(0),a(1)} {a(0),a(1),d(1)} {b(0),a(1),d(1)} Length-3 1,4,5 1,4,5 2,4,5 1,3,4,5 1,4,5 {a(0),b(0),a(1),d(1)} Length-4 1,4,5 Fig. 9. The ITP-tree for the database shown in Table 1. 3462 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
  • 11. fað0Þ; bð0Þg and fað0Þ; dð0Þg, the last extended item of the former and that of the latter can be combined to form a 2-pattern, fbð0Þ; dð0Þg, which is a subset of the joined pattern, fað0Þ; bð0Þ; dð0Þg. Since hfbð0Þ;dð0Þg ¼ ðð2 Ã ð1 þ 1Þ þ 0Þ Ã 4 þ 4Þ mod 19 ¼ 1 and H2½1Š ¼ 2 3; fað0Þ; bð0Þ; dð0Þg can be pruned. Simi- larly, for joining fað0Þ; bð0Þg and fað0Þ; að1Þg, since hfbð0Þ;að1Þg ¼ ðð2 Ã ð1 þ 1Þ þ 1Þ Ã 4 þ 1Þ mod 19 ¼ 2 and H2½2Š ¼ 5 P 3; fað0Þ; bð0Þ; að1Þg cannot be pruned. For joining fað0Þ; bð0Þg and fað0Þ; dð1Þg, since hfbð0Þ;dð1Þg ¼ ðð2 Ã ð1 þ 1Þ þ 1Þ Ã 4 þ 4Þ mod 19 ¼ 5 and H2½5Š ¼ 3 P 3, fað0Þ; bð0Þ; dð1Þg also cannot be pruned. In this stage, the join operation is performed by matching the dat-lists of two joinable patterns, so we join fað0Þ; bð0Þgh1; 4; 5i and fað0Þ; að1Þgh1; 2; 3; 4; 5i to get fað0Þ; bð0Þ; að1Þgh1; 4; 5i; and join fað0Þ; bð0Þgh1; 4; 5i and fað0Þ; dð1Þgh1; 3; 4; 5i to get fað0Þ; bð0Þ; dð1Þggh1; 4; 5i. Therefore we get Eðfað0Þ; bð0ÞgÞ ¼ ffað0Þ; bð0Þ; að1Þg; fað0Þ; bð0Þ; dð1Þgg, as shown in Fig. 9. After obtaining Eðfað0Þ; bð0ÞgÞ, we join every 3-pattern b in Eðfað0Þ; bð0ÞgÞ and every 3-pattern in JðbÞ to find frequent 4-patterns. According to the ‘‘infrequent k-pattern pruning’’ strategy, for joining fað0Þ; bð0Þ; að1Þg and fað0Þ; bð0Þ; dð1Þg, the last extended item of the former and that of the latter can be combined to form a 2-pattern, fað1Þ; dð1Þg, which is a subset of the joined pattern, fað0Þ; bð0Þ; að1Þ; dð1Þg. Pattern fað1Þ; dð1Þg can be transformed into fað0Þ; dð0Þg by taking dat = 1 as the reference point. Since hfað0Þ;dð0Þg ¼ ðð1 Ã ð1 þ 1Þ þ 0Þ Ã 4þ 4Þ mod 19 ¼ 12 and H2½12Š ¼ 4 P 3, fað0Þ; bð0Þ; að1Þ; dð1Þg cannot be pruned. We join fað0Þ; bð0Þ; að1Þgh1; 4; 5i and fað0Þ; bð0Þ; dð1Þgh1; 4; 5i to find frequent 4-patterns, and obtain fað0Þ; bð0Þ; að1Þ; dð1Þgh1; 4; 5i, as shown in Fig. 9. The other two 3-patterns, fað0Þ; dð0Þ; að1Þg and fað0Þ; að1Þ; dð1Þg, can be obtained in the same way. A similar procedure can be applied to the nodes containing fbð0Þg and fdð0Þg by calling the DFS function recursively. Fig. 9 shows the complete set of frequent patterns and their dat-lists. Starting from the node con- taining fbð0Þg, the patterns found include fbð0Þg; fbð0Þ; að1Þg; fbð0Þ; dð1Þg, and fbð0Þ; að1Þ; dð1Þg. Meanwhile, starting from the node containing fdð0Þg, the patterns found include fdð0Þg, and fdð0Þ; að1Þg. The mining process is terminated once the node containing fdð0Þ; að1Þg has been processed, since no more frequent pat- terns can be generated. 4. Performance evaluation We evaluated the performance of the ITP-Miner algorithm on three synthetic datasets and four real data- sets. The former were generated with different parameters synthetically, while the latter comprised real stock index trading data collected from Yahoo [26], click-stream data from a small dot-com company called Gazel- le.com [9], and a dataset of game state information [3]. We compared the ITP-Miner algorithm with the FITI algorithm [24]. All experiments were performed on an IBM Compatible PC with an Intel Pentium IV CPU 2.8 GHz, 1G-byte main memory running on Microsoft Windows XP. Both algorithms were implemented using Microsoft Visual C++ 6.0. We have implemented two versions of the ITP-Miner and FITI algorithms, memory-based and disk-based. The memory-based ITP-Miner algorithm was designed so that all data structures, including frequent patterns, dat-lists, an ITP-Tree, and a hash table, were kept in the main memory during execution. In addition, a dat-list was nominated as an integer array to facilitate data matching. The memory-based FITI algorithm, which con- sisted of three phases, was designed so that it kept all structures, including candidate patterns, frequent pat- terns, the FILT, the FIT tables, and a hash table, in the main memory during the execution [24]. In Phase I, the Apriori algorithm was implemented as in [2], the ItemSet Hash Table was nominated as an itemset pointer array, and the links in the FILT were pointers. Note that the Apriori algorithm scans the database repeatedly to search for frequent intra-transaction itemsets. In Phase II, the database was transformed into an FIT table, Table 3 The content of H2 for the database shown in Table 1 with Hashsize = 19 H2[0] H2[1] H2[2] H2[3] H2[4] H2[5] H2[6] H2[7] H2[8] H2[9] 2 2 5 1 1 3 0 0 0 2 H2[10] H2[11] H2[12] H2[13] H2[14] H2[15] H2[16] H2[17] H2[18] 5 3 4 6 2 2 4 0 3 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3463
  • 12. which was nominated as a 3-dimensional pointer array, where each pointer in a cell referred to a frequent itemset in the ItemSet Hash Table. In Phase III, the level-wise mining of frequent inter-transaction itemsets used the Apriori principle to scan the FIT table in the memory. The disk-based versions of both algorithms were modified from the corresponding memory-based versions so that the location information of the found patterns was stored on a disk during mining process. For the disk-based FITI algorithm, the FIT tables were stored on a disk, where a table Fk of size k was created as a file [24]. For the disk-based ITP-Miner algorithm, the dat-lists were stored on a disk. The experimental results of the memory-based version are shown in Sections 4.2, 4.3, and 4.4, while the experimental results of the disk-based version are shown in Section 4.5. 4.1. Generation of synthetic data We use the same method as that in [24] to generate synthetic transactions. The process involves two steps: first, we generate potential frequent itemsets, after which we generate transactions in the database from those itemsets. The parameters used to generate synthetic data are shown in Table 4. The potential frequent inter-transaction itemsets, Litem, may span several transactions. The length of each itemset follows a Poisson distribution whose mean is equal to L, and the maximum length of an itemset is MaxI. Every itemset in Litem is associated with a relative distance ranging from 0 to M(maxspan). Hence, we can generate jDj transactions from Litem. The length of each transaction is selected from a Poisson distri- bution whose mean is equal to A, and the maximum length of any transaction is MaxT. We generated three synthetic datasets using the parameters shown in Table 5, the first two are similar to the datasets used in [24]. The first dataset, Dataset 1, had 10,000 transactions, with an average of 5 items per trans- action. Dataset 2, on the other hand, had only 1000 transactions, but each one contained 100 items on aver- age. The last dataset, Dataset 3 with the largest maxspan = 15, had 10,000 transactions with an average of 15 items per transaction and contained 1000 distinct items. 4.2. Experiments on synthetic data In this section, we compare the ITP-Miner algorithm with the FITI algorithm by varying one parameter, while maintaining the other parameters at the default values shown in Table 5. Fig. 10 shows the minimum support versus the run time for Dataset 1. The minimum support varies between 0.04% and 0.1%, and the ITP-Miner algorithm runs 10–50 times faster than the FITI algorithm. Fig. 11 illustrates the minimum support versus the run time for Dataset 2. In this case, the support varies between 10% and 15%, and the ITP-Miner algorithm runs 3-8 times faster than the FITI algorithm. Fig. 12 shows the same experiment for Dataset 3 where the support varies between 0.1% and 0.2%, and the ITP-Miner algorithm runs 3–12 times faster than the FITI algorithm. The ITP-Miner algorithm outperforms the FITI algorithm because the latter generates a huge number of candidates when the support is small. To count the support of candidates, in Phase I, the FITI uses the Apriori algorithm to count the support of intra-transaction candidates by repeatedly scanning the database. In Phase III, whenever the FITI algorithm reads a new transaction from the FIT tables [24], each inter-transaction candidate must be matched with max- Table 4 Parameters Parameter Meaning jDj Total number of transactions A Average length of the transactions F Number of potential frequent inter-transaction itemsets MaxT Maximum length of the transactions L Length of each potential frequent inter-transaction itemset MaxI Maximum length of the potential frequent inter-transaction itemsets S Number of distinct items M Maxspan 3464 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
  • 13. Table 5 Parameter settings for the three synthetic datasets Parameter Dataset 1 Dataset 2 Dataset 3 jDj 10,000 1000 10,000 A 5 100 15 F 1000 1000 2000 MaxT 10 200 30 L 5 6 10 MaxI 10 10 20 S 500 500 1000 M 3 5 15 0 10 20 30 40 50 60 70 0.04% 0.06% 0.08% 0.10% Minimum support Runtime(sec) FITI ITP-Miner Fig. 10. Minimum support versus run time, Dataset 1 with maxspan = 3. 0 20 40 60 80 100 120 140 10% 11% 12% 13% 14% 15% Minimum support Runtime(sec) FITI ITP-Miner Fig. 11. Minimum support versus run time, Dataset 2 with maxspan = 5. 0 50 100 150 200 250 0.10% 0.12% 0.14% 0.16% 0.18% 0.20% Minimum support Runtime(sec) FITI ITP-Miner Fig. 12. Minimum support versus run time, Dataset 3 with maxspan = 15. A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3465
  • 14. span + 1 consecutive transactions. Moreover, support counting of the FITI algorithm is quite time-consuming because it needs to perform subset matching. In contrast, the ITP-Miner algorithm counts the supports by joining the dat-lists of joinable patterns, and only needs to scan the database once; thus, it is more efficient than the FITI algorithm. Figs. 13–15 show the number of transactions in the database versus the run time for Dataset 1, Dataset 2, and Dataset 3, respectively. The number of transactions varies from 10K to 2000K for Dataset 1, from 500 to 20K for Dataset 2, and from 10K to 1000K for Dataset 3. For Dataset 1, minsup = 0.05% and maxspan = 3; for Dataset 2, minsup = 11% and maxspan = 5; however, for Dataset 3, minsup = 0.12% and maxspan = 15. The run time of both algorithms increases linearly as the number of transactions increases; however, the ITP-Miner algorithm runs 4–40 times faster than the FITI algorithm. Next, we illustrate the impact of the maxspan on the performance of the algorithms for three datasets in Figs. 16–18. As the maxspan increases, more candidates and patterns are generated. Thus, the run time of both algorithms increases. We observe that the run time of the ITP-Miner algorithm increases slowly as the max- span increases. For the FITI algorithm, as the maxspan increases, the range of the FIT tables [24] that must be scanned to count the candidates’ supports increases. In addition, the number of generated candidates increases rapidly in a breadth-first search manner. Hence, the time required to count the candidates’ supports increases. In contrast, by joining patterns in the ITP-tree, the ITP-Miner algorithm uses a depth-first search manner to avoid having to scan the database and count the support for a large number of candidates. Thus, the ITP- Miner algorithm performs efficiently when the maxspan increases. We also investigated the effect of the average length of a transaction on both algorithms. The results are shown in Fig. 19. When the average length of a transaction increases, the run time increases in both cases. The reason is that under the same support threshold, if the average transaction length increases, more patterns are generated; therefore, more candidates must be counted. In the FITI algorithm, it is necessary to perform subset matching with the transactions in order to count the support for each candidate. The longer the trans- action, the more subset matching required. Instead of performing subset matching to count the support for 0 200 400 600 800 1000 0 500000 1000000 1500000 2000000 Number of transactions Runtime(sec) FITI ITP-Miner Fig. 13. Number of transactions versus run time, Dataset 1 with minsup = 0.05%, maxspan = 3. 0 100 200 300 400 500 0 5000 10000 15000 20000 Number of transactions Runtime(sec) FITI ITP-Miner Fig. 14. Number of transactions versus run time, Dataset 2 with minsup = 11%, maxspan = 5. 3466 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
  • 15. 0 300 600 900 1200 1500 0 250000 500000 750000 1000000 Number of transactions Runtime(sec) FITI ITP-Miner Fig. 15. Number of transactions versus run time, Dataset 3 with minsup = 0.12%, maxspan = 15. 0 20 40 60 80 100 0 1 2 3 4 5 6 7 8 Maxspan Runtime(sec) FITI ITP-Miner Fig. 16. Maxspan versus run time, Dataset 1 with minsup = 0.05%. 0 20 40 60 80 100 120 140 0 1 2 3 4 5 6 7 8 Maxspan Runtime(sec) FITI ITP-Miner Fig. 17. Maxspan versus run time, Dataset 2 with minsup = 11%. 0 10 20 30 40 50 0 3 6 9 12 15 18 Maxspan Runtime(sec) FITI ITP-Miner Fig. 18. Maxspan versus run time, Dataset 3 with minsup = 0.12%. A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3467
  • 16. each candidate, the ITP-Miner algorithm counts the support by joining the dat-lists of the patterns; thus, it is more efficient than the FITI algorithm. Fig. 20 shows the memory usage of both algorithms as the number of transactions increases from 10K to 2000K for Dataset 1. The FITI algorithm requires 3–7 times more memory than the ITP-Miner algorithm. Figs. 21–23 show, respectively, the memory usage as the minimum support increases from 0.04% to 0.2% for Dataset 1, from 10% to 16% for Dataset 2, and from 0.1% to 0.2% for Dataset 3. The FITI algorithm requires 2–20 times more memory for Dataset 1, 2–35 times more memory for Dataset 2, and 16-36 times more memory for Dataset 3, because it has to store the FILT data structure, the FIT table [24], and the information about candidate patterns. In summary, since the ITP-Miner algorithm employs the ITP-tree and dat-lists to mine frequent patterns, it only requires one database scan and can localize candidate joining, pruning, and support counting to joinable patterns. As the ‘‘infrequent 2-pattern pruning’’ and ‘‘infrequent k-pattern pruning’’ strategies are used to check if both patterns are joinable, the ITP-Miner algorithm avoids many unnecessary join operations. It out- performs the FITI algorithm by one order of magnitude and requires less main memory storage. 4.3. Experiments on real data We now compare the FITI and the ITP-Miner algorithms by mining four datasets. The first two datasets, called WINNER and LOSER, are similar to those used in [24]. The datasets consist of ten stock exchange indices for 3,607 trading days from January 1, 1991 to December 31, 2005. The indices are the ASX All Ordi- naries Index (ASX), CAC40 Index (CAC), DAX Index (DAX), Dow Jones Index (DOW), FT-SE 100 Index (FTS), Hang Seng Index (HSI), NASDAQ Index (NDQ), Nikkei 225 Index (NKY), Swiss Market Index (SMI), and Singapore ST Index (STI). The first dataset, WINNER, comprises stock indices that rise for the day, while the other dataset, LOSER, consists of those that fall. The third dataset, called GAZELLE, was obtained from the KDDCup 2000 competition and corresponds to click-stream data for the Gazelle.com 0 30 60 90 120 150 5 10 15 20 Average transaction lengthRuntime(sec) FITI ITP-Miner Fig. 19. Average transaction length versus run time, Dataset 1 with minsup = 0.4%, maxspan = 3. 0 50 100 150 200 250 300 350 0 500000 1000000 1500000 2000000 Number of transactions Memory(MByte) FITI ITP-Miner Fig. 20. Number of transactions versus memory usage, Dataset 1 with minsup = 0.05%, maxspan = 3. 3468 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
  • 17. web-site [9]. This dataset, which can be downloaded from http://www.ecn.purdue.edu/KDDCUP/, contains 59,601 transactions and 498 distinct items; the average transaction length is 2.5, and the maximum transaction length is 267. Finally, the fourth dataset, called CHESS, was compiled from the game state information [3], which can be downloaded from http://www.ics.uci.edu/~mlearn/MLRepository.html. The dataset contains 3196 transactions and 76 distinct items, where the length of every transaction is equal to 37. Figs. 24–26 illustrate the run time versus the minimum support for the WINNER, LOSER, and GAZELLE datasets, respectively. The ITP-Miner algorithm runs 2–5 times faster than the FITI algorithm for the WIN- NER dataset, 3–8 times faster for the LOSER dataset, and 4–77 times faster for the GAZELLE dataset. Figs. 27–29 present the run time versus maxspan for the three datasets. Similarly, the ITP-Miner algorithm runs 2–5 times faster than the FITI algorithm for the WINNER dataset, 3-6 times faster for the LOSER dataset, and 6– 10 times faster for the GAZELLE dataset. Thus, the experimental results show that the ITP-Miner algorithm outperforms the FITI algorithm in all cases. 0 50 100 150 200 0.04% 0.08% 0.12% 0.16% 0.20% Minimum support Memory(MByte) FITI ITP-Miner Fig. 21. Minimum support versus memory usage, Dataset 1 with maxspan = 3. 0 100 200 300 400 500 10% 12% 14% 16% Minimum support Memory(MByte) FITI ITP-Miner Fig. 22. Minimum support versus memory usage, Dataset 2 with maxspan = 5. 0 200 400 600 800 0.10% 0.12% 0.14% 0.16% 0.18% 0.20% Minimum support Memory(MByte) FITI ITP-Miner Fig. 23. Minimum support versus memory usage, Dataset 3 with maxspan = 15. A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3469
  • 18. 0 100 200 300 400 6% 8% 10% 12% 14% Minimum supportRuntime(sec) FITI ITP-Miner Fig. 24. WINNER: run time versus minimum support with maxspan = 4. 0 100 200 300 400 500 600 700 800 4% 6% 8% 10% 12% Minimum support Runtime(sec) FITI ITP-Miner Fig. 25. LOSER: run time versus minimum support with maxspan = 4. 0 100 200 300 400 0.06% 0.07% 0.08% 0.09% 0.10% Minimum support Runtime(sec) FITI ITP-Miner Fig. 26. GAZELLE: run time versus minimum support with maxspan = 4. 0 40 80 120 160 0 1 2 3 4 5 6 Maxspan Runtime(sec) FITI ITP-Miner Fig. 27. WINNER: run time versus maxspan with minsup = 10%. 3470 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
  • 19. When mining the WINNER and the LOSER datasets, some interesting inter-transaction rules were found. For WINNER, a sample rule found was ‘‘{FTS(0),DOW(2),NDQ(2)} ! {ASX(3)}’’. That is, if FTS rises on the first day, DOW and NDQ rise on the third day, then ASX will rise on the fourth day with support = 14% and confidence = 71%. On the other hand, for LOSER, a sample rule found was ‘‘{NKY(0),- DOW(1),NDQ(1)} ! {ASX(3)}’’. That is, if NKY falls on the first day, DOW and NDO fall on the second day, then ASX will fall on the fourth day with support = 11% and confidence = 70%. Then, for the GAZELLE dataset, we found that the run time of the FITI algorithm increases slowly when maxspan 1, as shown in Fig. 29. This is because the customer behavior patterns in the GAZELLE dataset are quite inde- pendent. Thus, few inter-transaction patterns are generated when maxspan 1. For the CHESS dataset, Fig. 30 shows the run time versus the minimum support and Fig. 31 shows the run time versus maxspan. In both figures, we observe that when the minimum support is higher then 93% or the maxspan is lower than 2, the FITI algorithm is faster than the ITP-Miner algorithm. The reason is that 0 30 60 90 120 0 1 2 3 4 5 6 Maxspan Runtime(sec) FITI ITP-Miner Fig. 28. LOSER: run time versus maxspan with minsup = 9%. 0 20 40 60 80 100 0 4321 6 7 85 Maxspan Runtime(sec) FITI ITP-Miner Fig. 29. GAZELLE: run time versus maxspan with minsup = 0.07%. 0 1 10 100 1000 88% 90% 92% 94% 96% 98% Minimum support Runtime(sec) FITI ITP-Miner Fig. 30. CHESS: run time versus minimum support with maxspan = 1. A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3471
  • 20. because the CHESS dataset contains only 76 distinct items and is very dense, where short patterns tend to appear very frequently. When the minimum support is high or the maxspan is low, the branches of short pat- terns in an ITP-Tree are flourishing. Thus, it is time-consuming for the ITP-Miner algorithm to generate fre- quent patterns in this case. On the other hand, when the minimum support is low or the maxspan is high, more candidates are generated. Thus, the FITI algorithm spends much time in support counting in this case. 4.4. Experiments on mining intra-transaction patterns When maxspan is set to 0, inter-transaction pattern mining is the same as traditional intra-transaction pat- tern mining, so the FITI algorithm only uses the Apriori algorithm to mine frequent patterns [24]. Therefore, in this section, we compare the ITP-Miner and the Apriori algorithms with maxspan = 0. Fig. 32 shows the minimum support versus the run time for Dataset 2, where the minimum support varies between 5% and 10%. The ITP-Miner algorithm runs 10–50 times faster than the Apriori algorithm. Fig. 33 illustrates the min- imum support versus the run time for GAZELLE, where the minimum support varies between 0.06% and 0.08%. The ITP-Miner algorithm runs 5-416 times faster than the Apriori algorithm. Besides Dataset 2 and GAZELLE, the experiments on other datasets had similar results. In all cases, the ITP-Miner algorithm out- performed the Aprori algorithm by several orders of magnitude. We observe that there are several reasons why ITP-Miner outperforms Aprori. First, Apriori must scan the whole database to find frequent patterns, which incurs high I/O costs. In contrast, ITP-Miner uses a simple join operation for dat-lists. Because of unmatched dat pruning, as the length of a frequent pattern increases, the size of its dat-list decreases, resulting in a very fast join operation. Second, to count support by subset matching, Apriori uses a complex hash tree [2], which has poor locality and is quite time-consuming. On the other hand, ITP-Miner has good locality, and a join operation only requires a linear scan of two dat-lists. 0 0 1 2 3 1 10 100 1000 10000 Maxspan Runtime(sec) FITI ITP-Miner Fig. 31. CHESS: run time versus maxspan with minsup = 95%. 0 40 80 120 160 200 5% 6% 7% 8% 9% 10% Minimum support Runtime(sec) Apriori ITP-Miner Fig. 32. Minimum support versus run time, Dataset 2 with maxspan = 0. 3472 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
  • 21. Third, as the minimum support is decreased, more candidates are generated. Since Apriori generates frequent patterns in the breadth-first search manner, it requires more memory requirement and computing resources to handle the very large number of candidates at certain levels. In contrast, ITP-Miner generates frequent pattern in the depth-first search manner and efficiently join patterns in an extended group, so the number of searched patterns is quite limited. 4.5. Experiments on the disk-based ITP-Miner and FITI algorithms In the above experiments, we implement the memory-based ITP-Miner and FITI methods and compare their performance. In this section, we modify memory-based to disk-based for both FITI and ITP-Miner methods and use two datasets, Dataset 20 and WINNER100, to compare their performance. Note that the transactions in Dataset 20 are generated in similar way like those in Dataset 2 and the number of transactions in Dataset 20 is 100 times as many as that in Dataset 2, and the transactions in WINNER100 are replicated from WINNER by 100 times. Figs. 34 and 35 show the minimum support versus the run time and the maxspan versus the run time for Dataset 20 with jDj ¼ 100; 000, where the former varies minimum support between 10% and 15% with max- span = 5 and the latter varies maxspan between 0 and 8 with minsup = 11%. In both figures, the disk-based ITP-Miner algorithm runs 3–30 times faster than the disk-based FITI algorithm. Similarly, Figs. 36 and 37 illustrate the minimum support versus the run time and the maxspan versus the run time for WINNER100 with jDj ¼ 360; 700, where the former varies minimum support between 6% and 14% with maxspan = 4 and the latter varies maxspan between 0 and 6 with minsup = 10%. In both figures, the disk-based ITP-Miner algorithm runs 2–10 times faster than the disk-based FITI algorithm. Besides Dataset 20 and WINNER100, the experiments on other datasets had similar results. In all cases, the disk-based ITP-Miner algorithm outper- formed the disk-based FITI algorithm by one order of magnitude. 0 1000 2000 3000 4000 5000 0.060% 0.065% 0.070% 0.075% 0.080% Minimum support Runtime(sec) Apriori ITP-Miner Fig. 33. Minimum support versus run time, GAZELLE with maxspan = 0. 0 800 1600 2400 3200 4000 10% 11% 12% 13% 14% 15% Minimum support Runtime(sec) FITI ITP-Miner Fig. 34. Minimum support versus run time, Dataset 20 with maxspan = 5 and jDj ¼ 100; 000. A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3473
  • 22. 5. Conclusions and future work In this paper, we have proposed an efficient method for mining all frequent inter-transaction patterns. The method consists of two phases. First, we design a data structure, called a dat-list, to record the dimensional attribute information of a frequent item in a database. Then, we devise a data structure, called an ITP-tree, to store the frequent patterns found. Second, we propose an efficient algorithm, called ITP-Miner, to mine the frequent patterns in a depth-first search manner. By using the ITP-tree and dat-lists to mine frequent patterns, our proposed algorithm only requires one database scan and can localize joining, pruning, and support count- ing to a small number of dat-lists. Therefore, our proposed algorithm is more efficient than the FITI algo- 0 4000 8000 12000 16000 20000 24000 6% 8% 10% 12% 14% Minimum support Runtime(sec) FITI ITP-Miner Fig. 36. Minimum support versus run time, WINNER100 with maxspan = 4 and jDj ¼ 360; 700. 0 500 1000 1500 2000 2500 3000 3500 0 87654321 MaxspanRuntime(sec) FITI ITP-Miner Fig. 35. Maxspan versus run time, Dataset 20 with minsup = 11% and jDj ¼ 100; 000. 0 0 1 2 3 4 5 6 2000 4000 6000 8000 10000 Maxspan Runtime(sec) FITI ITP-Miner Fig. 37. Maxspan versus run time, WINNER100 with minsup = 10% and jDj ¼ 360; 700. 3474 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476
  • 23. rithm. The experiment results show that ITP-Miner outperforms the FITI algorithm by one order of magni- tude and requires less main memory storage space. Although we have shown that the ITP-Miner algorithm can efficiently mine frequent inter-transaction pat- terns, there are still some issues to be addressed in future research. First, we may also extend the algorithm from 1-dimensional transaction databases to higher dimension databases. Second, without generalization, too many patterns may be mined and they may be too detailed. However, by generalizing with a concept hier- archy, we may be able to obtain patterns or rules that are more abstract and meaningful. Acknowledgements The authors are grateful to the anonymous referees for their helpful comments and suggestions. This re- search was supported in part by the National Science Council of Republic of China under Grant No. NSC 95-2416-H-002-053. References [1] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings of ACM SIGMOD, 1993, pp. 207–216. [2] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of International Conference on Very Large Data Bases, 1994, pp. 487–499. [3] R.J. Bayardo, Efficiently mining long patterns from databases, in: Proceedings of ACM SIGMOD, 1998, pp. 85–93. [4] G. Chen, Q. Wei, Fuzzy association rules and the extended mining algorithms, Information Sciences 147 (1–4) (2002) 201–228. [5] D.K.Y. Chiu, A.K.C. Wong, Multiple pattern associations for interpreting structural and functional characteristics of biomolecules, Information Sciences 167 (1–4) (2004) 23–39. [6] L. Feng, T.S. Dillon, J. Liu, Inter-transactional association rules for multi-dimensional contests for prediction and their application to studying meteorological data, Data and Knowledge Engineering 37 (1) (2001) 85–115. [7] L. Feng, J.X. Yu, H. Lu, J. Han, A template model for multidimensional inter-transactional association rules, The VLDB Journal 11 (2) (2002) 153–175. [8] P.Y. Hsu, Y.L. Chen, C.C. Ling, Algorithms for mining association rules in bag databases, Information Sciences 166 (1–4) (2004) 31– 47. [9] R. Kohavi, C. Broadley, B. Frasca, L. Mason, Z. Zheng, KDD-cup 2000 organizers’ report: peeling the onion, SIGKDD Explorations 2 (2) (2000) 86–98. [10] A.J.T. Lee, R.W. Hong, W.M. Ko, W.K. Tsao, H.H. Lin, Mining spatial association rules in image databases, Information Sciences 177 (7) (2007) 1593–1608. [11] A.J.T. Lee, W.C. Lin, C.S. Wang, Mining association rules with multi-dimensional constraints, The Journal of Systems and Software 79 (1) (2006) 79–92. [12] A.J.T. Lee, Y.T. Wang, Efficient data mining for calling path patterns in GSM networks, Information Systems 28 (8) (2003) 929–948. [13] G. Lee, W. Yang, J.M. Lee, A parallel algorithm for mining multiple partial periodic patterns, Information Sciences 176 (24) (2006) 3591–3609. [14] Q. Li, L. Feng, A. Wong, From intra-transaction to generalized inter-transaction: landscaping multidimensional contexts in association rule mining, Information Sciences 172 (3–4) (2005) 361–395. [15] D. Littau, D. Boley, Streaming data reduction using low-memory factored representations, Information Sciences 176 (14) (2006) 2016–2041. [16] Y. Li, S. Zhu, X.S. Wang, S. Jajodia, Looking into the seeds of time: discovering temporal patterns in large transaction sets, Information Sciences 176 (8) (2006) 1003–1031. [17] H. Lu, J. Han, L. Feng, Stock movement prediction and n-dimensional inter-transaction association rules, in: Proceedings of ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge, 1998, pp. 1–7. [18] H. Lu, L. Feng, J. Han, Beyond intratransaction association analysis: mining multidimensional inter-transaction association rules, ACM Transactions on Information Systems 18 (4) (2000) 423–454. [19] A. Savasere, E. Omiecinski, S. Navathe, An efficient algorithm for mining association rules in large databases, in: Proceedings of International Conference on Very Large Data Bases, 1995, pp. 432–443. [20] Li. Shen, Hong Shen, Ling Cheng, New algorithms for efficient mining of association rules, Information Sciences 118 (1–4) (1999) 251–268. [21] P. Shenoy, J.R. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, D. Shah, Turbo-charging vertical mining of large databases, in: Proceedings of ACM SIGMOD, 2000, pp. 22–33. [22] Y.J. Tsay, J.Y. Chiang, An efficient cluster and decomposition algorithm for mining association rules, Information Sciences 160 (1–4) (2004) 161–171. [23] A.K.H. Tung, H. Lu, J. Han, L. Feng, Breaking the barrier of transactions: mining inter-transaction association rules, in: Proceedings of ACM International Conference on Knowledge and Data Discovery, 1999, pp. 423–454. A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476 3475
  • 24. [24] A.K.H. Tung, H. Lu, J. Han, L. Feng, Efficient mining of intertransaction association rules, IEEE Transactions on Knowledge and Data Engineering 15 (1) (2003) 43–56. [25] C.Y. Wang, S.S. Tseng, T.P. Hong, Flexible online association rule mining based on multidimensional pattern relations, Information Sciences 176 (12) (2006) 1752–1780. [26] Yahoo, Finance, Available from: http://yahoo.com. [27] J.X. Yu, Z. Chong, H. Lu, Z. Zhang, A. Zhou, A false negative approach to mining frequent itemsets from high speed transactional data streams, Information Sciences 176 (14) (2006) 1986–2015. 3476 A.J.T. Lee, C.-S. Wang / Information Sciences 177 (2007) 3453–3476