3. Introduction
Data mining is the analysis of large quantities of data to extract interesting patterns such as:
groups of data records (cluster analysis)
unusual records (anomaly detection)
dependencies (association rules)
Association rule mining, first proposed in [2], is a popular and well-researched data mining method for discovering interesting relations between variables in large databases.
4. Association Rule Learning
The problem of association rule mining [2] is defined as follows:
Let I = {i1, i2, …, in} be a set of n attributes called items.
Let D = {t1, t2, …, tm} be a set of transactions called the database.
Each transaction t in D has a unique transaction ID and contains a subset of the items in I.
A rule is defined as an implication of the form X → Y, where X, Y ⊆ I and X ∩ Y = ∅.
An example rule for a supermarket could be {butter, bread} → {milk}: if butter and bread are bought, then customers also buy milk.
5. Constraints
The best-known constraints are minimum thresholds on support and confidence [3].
The support of an item-set X is defined as the number of transactions in the data set which contain the item-set. It is written as supp(X).
The confidence of a rule is defined as conf(X → Y) = supp(X ∪ Y) / supp(X).
Association rule generation [16, 17] can be split into two steps:
i) First, we apply a user-defined minimum support on the database to find all the frequent item-sets.
ii) Second, these frequent item-sets and the user-defined minimum confidence are used to form the rules.
For finding the frequent item-sets we use the Apriori algorithm [4, 5].
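As a minimal sketch, the support and confidence definitions above can be written directly in Python; the five-transaction supermarket database below is a made-up illustration, not data from these slides.

```python
# Hypothetical supermarket database: each transaction is a set of items.
transactions = [
    {"butter", "bread", "milk"},
    {"butter", "bread", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "bread"},
]

def supp(itemset):
    # supp(X): number of transactions containing every item of X
    return sum(1 for t in transactions if itemset <= t)

def conf(x, y):
    # conf(X -> Y) = supp(X u Y) / supp(X)
    return supp(x | y) / supp(x)

print(supp({"butter", "bread"}))            # 3 of the 5 transactions
print(conf({"butter", "bread"}, {"milk"}))  # 2/3
```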
6. An Example
Transaction ID | milk | bread | butter | beer
1              | 1    | 1     | 0      | 0
2              | 0    | 0     | 1      | 0
3              | 0    | 0     | 0      | 1
4              | 1    | 1     | 1      | 0
5              | 0    | 1     | 0      | 0
supp(milk) = 2/5, supp(bread) = 3/5, supp(butter) = 2/5, supp(beer) = 1/5
Rule {milk, bread} → {butter} has confidence
= supp(milk, bread, butter) / supp(milk, bread) = (1/5) / (2/5) = 50%
8. Apriori Algorithm
Apriori [11] is a classic algorithm for finding the frequent item-sets over transactional databases.
It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item-sets, as long as those item-sets appear sufficiently often in the database, i.e. satisfy the minimum support.
• Frequent Item-set Property:
Any subset of a frequent item-set is frequent.
The algorithm is divided into two parts:
Generating the candidate item-sets
Generating the large (frequent) item-sets
9. Apriori Algorithm Contd.
Lk: set of frequent item-sets of size k (with min support)
Ck: set of candidate item-sets of size k (potentially frequent item-sets)

L1 = {frequent items of size 1};
for (k = 1; Lk != ∅; k++) do
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
return ∪k Lk;
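The level-wise loop above can be sketched in Python. Here min_support is taken as an absolute transaction count, and the four-transaction toy database is the one from the later rule-discovery example; both are implementation choices for the sketch.

```python
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-item-sets
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items
               if sum(1 for t in transactions if i in t) >= min_support}
    all_frequent = set(current)
    k = 1
    while current:
        # Join frequent k-item-sets into (k+1)-candidates, then prune by
        # the frequent item-set property: every k-subset must be frequent.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        # Count support by scanning the database once per level
        current = {c for c in candidates
                   if sum(1 for t in transactions if c <= t) >= min_support}
        all_frequent |= current
        k += 1
    return all_frequent

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
freq = apriori(db, min_support=2)
print(frozenset({2, 3, 5}) in freq)  # True: {2,3,5} occurs in T2 and T3
```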
11. Generation of Candidates
Input: Li-1: set of frequent item-sets of size i-1
Output: Ci: set of candidate item-sets of size i

Ci = ∅;
for each item-set J in Li-1 do
    for each item-set K in Li-1 such that K ≠ J do
        if i-2 of the elements in J and K are equal then
            if all subsets of J ∪ K are in Li-1 then
                Ci = Ci ∪ {J ∪ K};
return Ci;
12. Example of finding Candidates
Say L3 consists of the item-sets {abc, abd, acd, ace, bcd}.
To generate C4 from L3:
abcd from abc and abd
acde from acd and ace
Pruning the candidate set:
acde is removed because ade is not in L3.
Hence C4 contains only {abcd}.
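The join-and-prune step can be checked in code against this example. Representing item-sets as frozensets of single-letter items is an implementation choice, not from the slides.

```python
from itertools import combinations

def generate_candidates(prev_level, i):
    prev = {frozenset(s) for s in prev_level}
    # Join: any two frequent (i-1)-item-sets whose union has size i
    joined = {a | b for a in prev for b in prev if len(a | b) == i}
    # Prune: every (i-1)-subset of a candidate must itself be frequent
    return {c for c in joined
            if all(frozenset(s) in prev for s in combinations(c, i - 1))}

L3 = ["abc", "abd", "acd", "ace", "bcd"]
C4 = generate_candidates(L3, 4)
print(C4)  # only {a, b, c, d} survives; acde is pruned since ade is not in L3
```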
13. Discovering Rules
for each frequent item-set I do
    for each rule C → (I - C), where C ⊂ I, do
        if (support(I) / support(C) >= min_conf) then   [ since C ∪ (I - C) = I ]
            output the rule C → (I - C), with confidence = support(I) / support(C)
            and support = support(I)
14. Example of Discovering Rules
Database D:
TID | Items
T1  | 1 3 4
T2  | 2 3 5
T3  | 1 2 3 5
T4  | 2 5
Consider the 3-item-set {I2, I3, I5}:
Support of {I2, I3, I5} = 2
{I2, I3} → I5   confidence = 2/2 = 100%
{I2, I5} → I3   confidence = 2/3 = 67%
{I3, I5} → I2   confidence = 2/2 = 100%
I2 → {I3, I5}   confidence = 2/3 = 67%
I3 → {I2, I5}   confidence = 2/3 = 67%
I5 → {I2, I3}   confidence = 2/3 = 67%
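As a sketch, the rule-discovery loop can be run against database D above; the helper names are our own.

```python
from itertools import combinations

# Database D: transactions T1..T4 from the example above
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def support(itemset):
    # Number of transactions containing all items of the item-set
    return sum(1 for t in D if set(itemset) <= t)

def rules_from(itemset, min_conf):
    out = []
    for r in range(1, len(itemset)):
        for c in combinations(sorted(itemset), r):
            conf = support(itemset) / support(c)  # supp(I) / supp(C)
            if conf >= min_conf:
                out.append((set(c), set(itemset) - set(c), conf))
    return out

for lhs, rhs, conf in rules_from({2, 3, 5}, min_conf=1.0):
    print(sorted(lhs), "->", sorted(rhs), f"{conf:.0%}")
# prints the two 100%-confidence rules: {2,3} -> {5} and {3,5} -> {2}
```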
15. Advantages:
i) The Apriori method is useful when the data size is huge, as it uses a level-wise search to find the frequent item-sets.
ii) Apriori uses breadth-first search to count candidate item-sets efficiently.
Disadvantages:
i) The Apriori algorithm needs to scan the whole database.
ii) The computational complexity increases as the number of candidates increases.
16. Proposed Work
1. Modified Search Algorithm
2. Modified Association Rule Generation for Classification of Data
17. Modified Search Algorithm
1. Add a tag field to each transaction in the database.
Format: if a transaction is <T1>, it is modified into <T1, tag>.
2. The tag contains the first, middle and last item of the transaction.
3. Example: for a transaction <I4, I5, I6, I9, I11, I12>, the tag field will be <I4, I6, I12>.
18. Modified Search Algorithm Contd.
Step 1: Create a TAG field for each transaction in the dataset. The TAG field contains three values: <starting value, middle value, end value>.
Step 2: For each item to search for in the dataset, first check whether the item is greater than or equal to the starting value and less than or equal to the end value.
Step 3: If the item does not satisfy both conditions in Step 2, do not search that particular transaction. If it satisfies both conditions, go to Step 4.
Step 4: Check whether the item to be searched matches the middle element. If it matches, go to Step 6. If it does not match, go to Step 5.
Step 5: Calculate the difference of the item to be searched from the starting, middle and end values. Choose the smallest of these three differences, reduce the search range of the data-set accordingly, and go to Step 4. If the difference from any element is 0, go to Step 6.
Step 6: Increase the count by 1 for that particular item found in that particular transaction.
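One possible Python reading of Steps 1-6 is sketched below. The narrowing rule (after the usual halving around the middle element, discard a further half toward whichever end the item is closer to) is our interpretation of Step 5, not code from the paper; the 30-number list is the example used on the following slides.

```python
def tag_search(transaction, item):
    """Return (found, iterations); the transaction must be sorted."""
    lo, hi = 0, len(transaction) - 1
    # Step 2/3: the <start, middle, end> tag bounds let us skip
    # the whole transaction without searching it.
    if item < transaction[lo] or item > transaction[hi]:
        return False, 0
    iterations = 0
    while lo <= hi:
        iterations += 1
        mid = (lo + hi) // 2
        # Step 4/5: zero difference from start, middle or end -> found.
        if item in (transaction[lo], transaction[mid], transaction[hi]):
            return True, iterations  # Step 6: caller increments the count
        # Usual halving around the middle element
        if item < transaction[mid]:
            hi = mid - 1
        else:
            lo = mid + 1
        # Step 5 (our interpretation): discard a further half, keeping the
        # side nearer the item, so each iteration removes ~3/4 of the range.
        if lo < hi:
            m2 = (lo + hi) // 2
            if item >= transaction[m2]:
                lo = m2
            else:
                hi = m2 - 1
    return False, iterations

data = [10, 11, 12, 21, 22, 31, 33, 37, 39, 41, 45, 46, 49, 51, 54, 57,
        61, 67, 69, 71, 78, 79, 81, 101, 103, 105, 107, 109, 111, 127]
print(tag_search(data, 51))  # (True, 2), matching the example's 2 iterations
```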
19. Example:
We randomly take 30 numbers for the example:
(10, 11, 12, 21, 22, 31, 33, 37, 39, 41, 45, 46, 49, 51, 54, 57, 61, 67, 69, 71, 78, 79, 81, 101, 103, 105, 107, 109, 111, 127)
We need to find 51 among these data.
1st Iteration: the middle element is 54.
Since 51 < 54, the range would be 10-51. But we also calculate the differences: 51 differs from 10 by 41 and from 54 by only 3, so the item (51) is much closer to 54 than to 10. The range can therefore be narrowed to 33-51, since at most the middle position of the range 10-51 can be equal to the item (51).
20. Example:
2nd Iteration: the middle element of the range is 45.
Since 51 > 45, the range would be 46-51. But again we calculate the differences: the difference of the item (51) from 45 is 6, and from the end value 51 it is 0. So the search ends, and the counter for the item is increased by 1.
So we can see that in only 2 iterations we can find the data we need.
21. Example:
Comparison with Binary Search:
(10, 11, 12, 21, 22, 31, 33, 37, 39, 41, 45, 46, 49, 51, 54, 57, 61, 67, 69, 71, 78, 79, 81, 101, 103, 105, 107, 109, 111, 127)
For binary search we have the following iterations:
1st iteration: (check 51 <, >, = 54) result: 51 < 54, search in the range 10 to 51
2nd iteration: (check 51 <, >, = 33) result: 51 > 33, search in the range 37 to 51
3rd iteration: (check 51 <, >, = 45) result: 51 > 45, search in the range 46 to 51
4th iteration: (check 51 <, >, = 49) result: 51 > 49, search in the range 51 to 51
5th iteration: (check 51 <, >, = 51) result: 51 = 51, search ends, data found
Conclusion:
From the comparison it is clear that our proposed search algorithm can find the desired data in fewer iterations, and hence in less time.
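For reference, a standard binary search instrumented to count iterations reproduces the five iterations listed above on the same 30 numbers.

```python
def binary_search(arr, item):
    """Classic binary search over a sorted list, returning (found, iterations)."""
    lo, hi = 0, len(arr) - 1
    iterations = 0
    while lo <= hi:
        iterations += 1
        mid = (lo + hi) // 2
        if arr[mid] == item:
            return True, iterations
        if item < arr[mid]:
            hi = mid - 1
        else:
            lo = mid + 1
    return False, iterations

data = [10, 11, 12, 21, 22, 31, 33, 37, 39, 41, 45, 46, 49, 51, 54, 57,
        61, 67, 69, 71, 78, 79, 81, 101, 103, 105, 107, 109, 111, 127]
print(binary_search(data, 51))  # (True, 5), the same 5 iterations as above
```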
22. Modified Association Rule Generation for Classification of Data
Issues: a) Minimal number of rules
b) Maximum number of data classified correctly
Example:
I1 | I2 | I3 | I4 | DECISION
1  | 2  | 3  | 4  | 1
1  | 2  | 6  | 7  | 1
1  | 3  | 5  | 8  | 2
2  | 5  | 6  | 9  | 2
1  | 2  | 3  | 6  | 3
For item value 1 there are 3 decisions: 1, 2 and 3. We calculate count(1,1), count(1,2) and count(1,3), and support(1) = max(count(1,1), count(1,2), count(1,3)).
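The count/support computation above can be sketched on the slide's five-row table. The slides do not specify whether a value is matched in a particular column or anywhere in the row; this sketch counts it anywhere in the row, which is an assumption.

```python
# Each row of the example table: (item values, decision)
rows = [
    ((1, 2, 3, 4), 1),
    ((1, 2, 6, 7), 1),
    ((1, 3, 5, 8), 2),
    ((2, 5, 6, 9), 2),
    ((1, 2, 3, 6), 3),
]

def count(value, decision):
    # count(v, d): rows where value v occurs and the decision is d
    return sum(1 for items, d in rows if value in items and d == decision)

def support(value):
    # support(v) = max over all decisions, as defined on the slide
    return max(count(value, d) for d in {d for _, d in rows})

print(count(1, 1), count(1, 2), count(1, 3))  # 2 1 1
print(support(1))                             # 2
```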
23. Modified Association Rule Generation for Classification of Data
Algorithm:
Step 1: Let k = 1.
Step 2: Generate frequent item-sets of length 1 (GOTO Step 11).
Step 3: Repeat until no new frequent item-sets are identified:
(i) Generate length-(k+1) candidate item-sets from length-k frequent item-sets.
(ii) Prune candidate item-sets containing subsets of length k that are infrequent.
(iii) Count the support of each candidate by scanning the DB (GOTO Step 11).
(iv) Eliminate candidates that are infrequent, leaving only those that are frequent.
Step 11: For each item in the dataset, calculate the number of times the item is present in the whole data-set along with its corresponding decision values (for example, I2→D1, I2→D2 or I2→D3).
Step 12: Find the maximum of the calculated supports for each item.
Step 13: Return the support for the item.
24. Experimental Results
We use the IRIS data-set from the UCI Machine Learning Repository.
Total number of instances: 148
Three classes are available: Iris Setosa (A), Iris Versicolour (B), Iris Virginica (C)
We first classify this data-set with the existing algorithms (Decision Table, PART, One-R) using the Weka tool, and then compare them with the proposed algorithm.
25. Conclusion
Comparative Study:
ALGORITHM       | Correctly Classified | Incorrectly Classified | Total Instances Classified
DECISION TABLE  | 134                  | 13                     | 147
ONE-R           | 136                  | 11                     | 147
PART            | 134                  | 13                     | 147
Proposed Method | 138                  | 10                     | 148
From this comparative study we can say that our proposed algorithm classifies the data-set more accurately than the existing algorithms.
26. Future Scope
In future we will try to optimize the searching technique for the Apriori algorithm.
We will also try to optimize the generated rule set so that it contains fewer rules.
27. References
1. Piatetsky-Shapiro, G. (1991). "Discovery, analysis, and presentation of strong rules". In Piatetsky-Shapiro, G. and Frawley, W. J. (eds.), Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA.
2. Agrawal, R., Imieliński, T., Swami, A. (1993). "Mining association rules between sets of items in large databases". Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD '93), pp. 207.
3. Liu, B., Hsu, W., Ma, Y. (1998). "Integrating classification and association rule mining". American Association for Artificial Intelligence.
4. Agrawal, R., Faloutsos, C., Swami, A. N. (1994). "Efficient similarity search in sequence databases".
5. Lomet, D. (ed.), Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms (FODO), Chicago, Illinois, pp. 69-84. Springer Verlag.
6. en.wikipedia.org/wiki/Binary_search_algorithm
7. Press, W. H., Flannery, B. P., Teukolsky, S. A., Vetterling, W. T. (1988). Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, pp. 98-99.
8. Hipp, J., Güntzer, U., Nakhaeizadeh, G. (2000). "Algorithms for association rule mining - a general survey and comparison". SIGKDD Explorations Newsletter 2(1), June 2000, pp. 58-64.
9. Pingping, W., Cuiru, W., Baoyi, W., Zhenxing, Z. (2003). "Data mining technology and its application in university education system". Computer Engineering, June 2003, pp. 87-89.
10. Taorong, Q., Xiaoming, B., Liping, Z. (2006). "An Apriori algorithm based on granular computing and its application in library management system". Control & Automation, pp. 218-221.
28. References Contd.
11. Agrawal, R., Srikant, R. (1994). "Fast algorithms for mining association rules". In Proc. VLDB 1994, pp. 487-499.
12. Chai, S., Jia, Y., Yang, C. (2007). "The research of improved Apriori algorithm for mining association rules". Service Systems and Service Management, 2007 International Conference on, IEEE.
13. Kumar, K. S., Manicka Chezian, R. (2012). "A survey on association rule mining using Apriori algorithm". International Journal of Computer Applications 45(5), pp. 47-50.
14. Saggar, M., Agrawal, A. K., Lad, A. (2004). "Optimization of association rule mining using improved genetic algorithms". In Systems, Man and Cybernetics, 2004 IEEE International Conference on, Vol. 4, pp. 3725-3729. IEEE.
15. Christian, A. J., Martin, G. P. (2010). "Optimization of association rules with genetic algorithms". In Chilean Computer Science Society (SCCC), 2010 XXIX International Conference of the, pp. 193-197. IEEE.
16. Hipp, J., Güntzer, U., Nakhaeizadeh, G. (2000). "Algorithms for association rule mining - a general survey and comparison". ACM SIGKDD Explorations Newsletter 2(1), pp. 58-64.
17. Mitra, S., Acharya, T. (2003). "Data Mining: Multimedia, Soft Computing, and Bioinformatics". Wiley-Interscience, pp. 7-8.