Your SlideShare is downloading. ×
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.


Saving this for later?

Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime - even offline.

Text the download link to your phone

Standard text messaging rates apply



Published on

Published in: Technology

  • Be the first to comment

  • Be the first to like this

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. International Journal of Engineering Research and Development e-ISSN: 2278-067X, p-ISSN: 2278-800X, Volume 10, Issue 3 (March 2014), PP.01-05 1 Profit and Preferable Itemsets- A Study Goguri. Rashmitha1 , L.Vandana2 , Saturi. Rajesh3 , Mahipal Reddy Pulyala4 1,2,3 Assistant Professor, Dept Of Cse, Nalla Narasimha Reddy Educational Society's Group Of Institutions, Hyderabad, Ap, India. 4 Assistant Professor,Dept Of Cse, Vaagdevi College Of Engineering,Bollikunta,Warangal,Ap,India. __________________________________________________________________________________________ Abstract:- High utility itemset mining is a research area of utility based descriptive data mining, aimed at finding itemset that contributes most to the total utility. A specialized form of high utility itemset mining is utility-frequent itemset mining, which – in addition to subjectively defined utility – also takes into account itemset frequencies. This paper presents a profitable and preferable within the given utility and support constraints threshold. An extensive performance study using both synthetic and real data sets is reported to verify the effectiveness and efficiency of proposed algorithms. Keywords:- item-set, frequent, preferable, data mining; I. INTRODUCTION Data flow analysis is an emerging topic extensively studied in recent decade. A data stream is a continuously ordered sequence of transactions that arrives sequentially in real-time manner. There are many applications of data stream mining, such as knowledge discovery from online e-business or transaction flows, analysis of network flows, monitoring of sensor data, and so on. Different from traditional databases, data streams have some special properties, namely continuous, unbounded, high speed and time-varying data distribution. Therefore, some limitations are posed in data stream mining as follows [1-9]. First, since the infinite transactions could not be stored, multi-scan algorithms are no more allowed. Second, in order to capture the information of the high speed data streams, the algorithms must be as fast as possible [1-9]. Otherwise, the accuracy of the mining results will be reduced. Third, the data distribution of data streams should be kept to avoid concept drifting problem. Fourth, incremental processes are needed for mining data streams in order to make processing with the old data as little as possible. Therefore, efficient one-pass methods and compact data structures for characterizing the data flow are needed [1-9]. In recent years, utility mining emerges as a new research issue, which is to find the itemset with high utilities, i.e., the itemset whose utility values are greater than or equal to the user-specified minimum utility threshold. There exist rich applications of utility mining, such as business promotion, webpage organization and catalog design. Unlike traditional association rule mining, utility mining can find profitable itemset which may not appear frequently in databases. By the advantages mentioned above, we can see that it is important to push utility mining into data stream mining, such as for the streaming data of the chain hypermarkets. In view of this, we aim at finding maximal utility itemset from data streams with the purpose of reducing the number of patterns in this paper. This is motivated by the observation that there exists no study that explores advanced utility itemset patterns like maximal utility itemset II. LITERATURE Frequent itemset mining as a research area came into being in the nineties. The seminal paper appeared in 1994 [10-20]. In the subsequent decade numerous papers were published. The basic problem in frequent itemset mining is, given a series of sets, to find all subsets that are contained in at least minsup of them; here minsup is a user-specified threshold. The problem of frequent itemset mining plays an important role in several data mining fields [10-20], such as association rules [10-20]warehousing [10-20], correlations [10-20] and classification [10-20]. The subject is also related to rough sets [10-20] and logical analysis of data [10-20]. Moreover, frequent item-sets have many application areas, amongst others customer relationship management, fraud detection, product assortment decisions [10-20] episode mining [10-20], functional dependency discovery [10-20], etc. In the literature on frequent item-sets the algorithms are usually studied from a practical viewpoint. Almost any paper focuses on speed. To achieve an optimal running time, specific implementation tricks are applied. In the current paper, the theory of frequent item-sets is discussed. III. SYSTEM STRUCTURE Skyline queries have been studied since 1960s in the theory field where skyline points are known as Pareto sets and admissible points or maximal vectors. However, earlier algorithms such as are inefficient when there are many data points in a high-dimensional space. Skyline queries in database were first studied by Borzsonyi in 2001[21]. After that, various techniques were proposed to accelerate the computation of skyline
  • 2. Profit And Preferable Itemsets- A Study 2 and its variations. Here, we briefly summarize some of them. Some representative methods include a bitmap method a nearest neighbor (NN) algorithm and a branch-and-bound skylines (BBS) method. Disadvantages of this are the existing systems take much time complexities. These systems directly cannot be applied for finding the profitable products and popular products. In this paper, we identify and tackle the problem of finding top-k preferable products, which has not been studied before. We study two instances of preferable products, namely profitable products and popular products. We propose methods to find top-k profitable products and top-k popular products efficiently. An extensive performance study using both synthetic and real data sets is reported to verify its effectiveness and efficiency. As future work, we will study other instances of the problem of finding top-k preferable products by setting the utility function to other meaningful objective functions. One promising utility function is the function which returns the sum of the unit profits of the selected products multiplied by the number of customers interested in these products and the structure is shown in Fig 1. Advantages of this are, the proposed approach takes lesser computational cost and directly can be applied for finding profitable and popular products. Fig 1 System structure IV. MODULES IMPLEMENTED The proposed system consists of set of stages, these are discussed here Popular Products Dataset: The dataset should be prepared based on the existing products in the market by various vendors. For example: For experimenting on car products, the existing cars, features, cost etc. has to be gathered from different vendors Frequent Feature Set: Frequent feature set identification on the dataset of popular products. The high utility mining refers to the feature set extraction, by satisfying certain conditions. The condition in the work is high profitability. Considering the high profitability, identify the feature set that occurred more frequently by the other vendors. Finding the Smaller Frequency Sets: The Frequency set provides various levels of sets. Example:level1: with single item, level2: with two combinations of items, level3: with three combinations of items. Have to build the least sets to build a product. Price Correlation: For the price Correlation among the optimal feature set „t‟ and the existing feature and their costs. Here NP-Hard with Greedy Approach is applied. Finding Top-k Popular Products: For this problem the build set „t‟ in the previous module is the input. The set is cross checked against the user interest „dataset2‟ which is collected from the home website log. Web Log Preprocessing: The access log can be obtained from the web server.The log consists of all the request sends by the users. The attribute values can be obtained from the URI of the web logs. Converting Features to Binary Access Table: in these modules the set„t‟ to be converted to „0‟ and „1‟. The frequent item set cross checking will be done in the Brute force approach with linear fashion. If „t‟ is popular then sustains in the market. The process is iterative until we get the top-k sets, which provide profitability.
  • 3. Profit And Preferable Itemsets- A Study 3 V. EXAMPLE OF ITEMSETS This section describes a sample execution of the some of the algorithm. The transaction data of the transaction database D are given in Table 1; the minimum support is 0.4; n=5 is the number of items, and m=5 is the number of transactions. Therefore, the minimum support number minsupsh=2. The transaction database D is transformed into the Boolean matrix A5*5 as shown in the Figure 2. Transaction data of the transaction database d TID Item sets T1 A,D T2 B,C,E T3 A,B,C,E T4 B,E T5 A,B,C A B C D E T1 1 0 0 1 0 T2 0 1 1 0 1 T3 1 1 1 0 1 T4 0 1 0 0 1 T5 1 1 1 0 0 Figure 2. The Boolean matrix A5*5 We compute the sum of the element values of each column in the Boolean matrix A5*5 and the set of frequent 1- itemset is: L1= {{A},{B},{C},{D}} The fourth column of the Boolean matrix A5*5 is deleted because the support number of item D is smaller than the minimum support number 2. We then compute the sum of the element values of each row in the Boolean matrix and delete all rows where the sum of the element values is smaller than 2. Finally, the Boolean matrix A4*4 is generated as shown in figure 3. A B C E T2 0 1 1 1 T3 1 1 1 1 T4 0 1 0 1 T5 1 1 1 0 Figure 3 The Boolean matrix A4*4 The operation of 2-supports is executed for the all columns of the Boolean matrix A4*4, and the set of frequent 2-itemset is: L2={{A,B},{A,C},{B,C},{B,E},{C,E}} In pruning the Boolean matrix A4*4 by the set of frequent 2-itemsets L2 , the third row of the Boolean matrix A4*4 is deleted because sum of its element values is smaller than 3. Finally, the Boolean matrix A3*4 is generated as shown in figure 3. A B C E T2 0 1 1 1 T3 1 1 1 1 T5 1 1 1 0 Figure 4The Boolean matrix A3*4 The operation of 3-supports is executed for all columns of the Boolean matrix A3*4, and the set of frequent 3-itemset is: L3= {{A,B,C},{B,C,E}} According to Proposition 3, the proposed algorithm is terminated because there are two frequent 3-itemsets in the set of frequent 3-itemset L3. VI. COMPARATIVE STUDY First, we performed our method as well as the baseline method on the literature repositories. We extracted 10 common topics from the six sequences. For each topic, 10 topical words with highest probability p(w|z). We can see that all topics extracted by our method (sync) were meaningful and easy to understand. For example, #7 includes research topics such as data mining, high-dimensional/multidimensional data, data warehouse, association rule, workflow, etc., while #10 includes sensor network, privacy preserving,
  • 4. Profit And Preferable Itemsets- A Study 4 classification, ontology, top-k query, etc. All of these topical words accurately suggest most important research topics in the database area. Comparing the topics extracted by our method to those by the baseline method (no sync), we can see that our method provided highly discriminative topics. As a contrast, the baseline method suffered from the synchronism in the sequences and extracted many duplicated topical words (see Fig. 4). In asynchronous sequences, documents related to different topics may be indexed by the same time stamp, and documents related to the same topic may appear at different time stamps. As a result, common topics discovered by the conventional method contain redundant information, whereas our method is able to fix the synchronism and discover highly discriminative topics. To further prove that our time synchronization technique helped to generate more discriminative topics, we computed the pair wise KL-divergence between topics as follows: Note that larger KL-divergence indicates that the two topics are more discriminative to each other and 0 divergence means that two topics are identical. We present the results, where darker blocks mean smaller KL- divergence values. We can see that our method extracted much more discriminative topics than those extracted by the baseline method. As discussed above, this was due to the fact that our method successfully fixed the synchronism in the data set. Note that we used p(t|z), which can be computed from p(z|t) VII. CONCLUSIONS Data mining can be used extensively in the enterprise based applications with business intelligence characteristics to provide a deeper kind of analysis while meeting strict requirements for administration management and security. Business intelligence is information about a company's past performance that is used to help predict the company's future performance. In this paper, study two instances of preferable products, namely profitable products and popular products. This study will help the researchers who are begun to study in the related areas. REFERENCES [1]. Chan, R., Yang, Q. and Shen, Y. Mining high utility item-sets. In Proc. of Third IEEE Int'l Conf. on Data Mining, Nov., 2003, 19-26. [2]. Chi, Y., Wang, H., Yu, P. S. and Muntz, R. R. Moment: maintaining closed frequent item-sets over a stream sliding window. In Proc. of the IEEE Int'l Conf. on Data Mining, 2004. [3]. Gaber, M. M., Zaslavsky, A. B. and Krishnaswamy, S. Mining data streams: a review. ACM SIGMOD Record, Vol.34, No.2, 2005, 18-26. [4]. Golab, L. and Ozsu, M. T. Issues in data stream management. In ACM SIGMOD Record, Vol. 32, No. 2, 2003. [5]. Gouda, K. and Zaki, M. J. efficiently mining maximal frequent item-sets. In Proc. of the IEEE Int'l Conf. on Data Mining, 2001. [6]. Han, J., Pei, J., and Yin, Y. Mining frequent patterns without candidate generation. In Proc. of the ACM-SIGMOD Int'l Conf. on Management of Data, 2000, 1-12. [7]. Jiang, N. and Gruenwald, L. CFI-Stream: mining closed frequent item-sets in data streams. In Proc. of the Utility-Based Data Mining Workshop, ACM KDD, 2006. [8]. Jiang, N. and Gruenwald, L. Research issues in data stream association rule mining. In ACM SIGMOD Record, Vol. 35, No. 1, 2006. [9]. S.Rajesh, “Research methodologies for data mining”, GJCAT, Vol 2 (3), 2012, 1131-1137 [10]. Lee, D. and Lee, W. Finding maximal frequent item-sets over online data streams adaptively. In Proc. of 5th IEEE Int'l Conf. on Data Mining, 2005. [11]. Agrawal, R. and R. Srikant (1994), Fast algorithms for mining association rules, in: Proceedings 20th International Conference on Very Large Data Bases (VLDB'94), Morgan Kaufmann. [12]. S.Rajesh, “Pattern discovery through web mining”, Global Journal of Computer Application and Technology, GJCAT, Vol 2 (3), 2012, 1093-1098, ISSN: 2249-1945. [13]. Tan, P.-N., M. Steinbach, and V. Kumar (2006), Introduction to data mining, Addison Wesley. [14]. Wu, C. (2006), Applying frequent itemset mining to identify a small itemset that satisfies a large percentage of orders in a warehouse, Computers and Operations Research 33, 3161-3170. [15]. Brin, S., R. Motwani, and C. Silverstein (1998), beyond market baskets: Generalizing association rules to correlations, Data Mining and Knowledge Discovery 2, 39-68. [16]. Hu, K., Y. Lu, L. Zhou, and C. Shi (1999), Integrating classification and association rule mining: A concept lattice framework, in: Proceedings of the Seventh International Workshop on New Directions in Rough Sets, Data Mining and Granular Soft Computing, Springer.
  • 5. Profit And Preferable Itemsets- A Study 5 [17]. Kosters, W. A., E. Marchiori, and A. J. Oerlemans (1999), Mining clusters with association rules, in: Proceedings of the Third International Symposium on Advances in Intelligent Data Analysis (IDA'99), Springer Lecture Notes in Computer Science 1642 [18]. Pawlak, Z. (2000), Rough Sets, Theoretical Aspects of Reasoning about Data, Morgan Kaufman. [19]. Boros, E., P. Hammer, T. Ibaraki, A. Kogan, E. Mayoraz, and I. Muchnik (2000), An Implementation of Logical Analysis of Data, IEEE Transactions on Knowledge and Data Engineering 12, 292-306. [20]. Brijs, T., G. Swinnen, K. Vanhoof, and G. Wets (1999), Using association rules for product assortment decisions: A case study, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 1999). [21]. Huhtala, Y., J. KÄarkkÄainen, P. Porkka, and H. Toivonen (1999), TANE: An efficient algorithm for discovering functional and approximate dependencies, The Computer Journal 42, 100-111. [22]. Mannila, H., H. Toivonen, and A. Verkamo (1995), discover frequent episodes in sequences, in: Proceedings of the First International Conference on Knowledge Discovery and Data Mining. [23]. S. Borzsonyi, D. Kossmann, and K. Stocker, The Skyline Operator. In ICDE 2001 ABOUT AUTHORS G. Rasmitha Reddy working in Nalla Narasimha Reddy Education Society‟s Group of Institutions as Assistant Professor, Computer Science Department. Having 3+ Years of experience in teaching and published papers in various Journals. L.Vandana working in Nalla Narasimha Reddy Education Society‟s Group of Institutions as Assistant Professor, Computer Science Department. Having 14 Years of experience in teaching and published papers in various Journals. S.RAJESH,, Dept of CSE, NNRESGI, Hyd, AP, INDIA. Had 6 years of experience in teaching and Attended and presented papers in international and national conferences like IEEE, Springer etc. And published many papers in peer-reviewed national and international journals. Has done from university college of engineering, osmania university. His research interests include cloud computing, data mining and data bases Mr Mahipal Reddy Pulyala, Post Graduated in Computer science and Engineering (M.Tech) , JNTUH , Hyderabad in 2011 and Post Graduated in Master of Computer Applications(MCA),JNTUH , Hyderabad in 2009, He is working as an Assistant Professor in Department of Computer Science & Engineering in Vaagdevi College Of Engineering, Warangal Dist, AP, and India. He has 4 years of Teaching Experience. His Research Interests Include Data Mining