Venkata Lavanya Korada, Avala Atchyuta Rao / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 6, November-December 2012, pp.142-147

Analyzing & Identifying CFDs using the Concepts of Data Mining

Venkata Lavanya Korada*1, Avala Atchyuta Rao*2
*1 M.Tech Student, Gokul Institute of Technology & Science, Bobilli, INDIA
*2 Asst. Professor, CSE Dept, Gokul Institute of Technology & Science, Bobilli, INDIA

Abstract
Conditional functional dependencies (CFDs) are a recent extension of functional dependencies (FDs). They apply to patterns of semantically related constants and can be used as rules for cleaning relational data. It is often unrealistic to rely completely on human experts to design CFDs via an expensive and long manual process. For CFD-based cleaning methods to be effective, it is necessary to have techniques in place that can automatically discover or learn CFDs from sample data. The discovery problem is already quite difficult for traditional FDs, and it is more difficult for CFDs, because mining the patterns in CFDs introduces new challenges. We provide three methods for CFD discovery. The first method, referred to as CFDMiner, is for constant CFD discovery. It explores the connection between minimal constant CFDs and closed and free patterns. The other two algorithms are developed for discovering general CFDs. Our second algorithm, referred to as CTANE, extends TANE to discover general CFDs. It is based on an attribute-set/pattern tuple lattice and explores minimal CFDs only. Our third algorithm, FastCFD, discovers general CFDs by applying a depth-first search strategy rather than the levelwise approach. Together, these algorithms provide a set of promising tools to help reduce manual effort in the design of data-quality rules, for users to choose for different applications. They help make CFD-based cleaning a practical data quality tool.

Keywords – Privacy, Privelets, Data Publishing, and Range count Queries.

I. INTRODUCTION
Functional dependencies are under active investigation, and conditional functional dependencies are a recent extension of them. This paper investigates the discovery of conditional functional dependencies (CFDs), which support patterns of semantically related constants and can be used as rules for cleaning relational data. However, finding CFDs is an expensive process that involves intensive manual effort. To effectively identify data cleaning rules, we develop techniques for discovering CFDs from sample relations. We provide three methods for CFD discovery. The first, referred to as CFDMiner, is based on techniques for mining closed itemsets, and is used to discover constant CFDs, namely, CFDs with constant patterns only. The other two algorithms are developed for discovering general CFDs. The first of these, referred to as CTANE, is a levelwise algorithm that extends TANE, a well-known algorithm for mining FDs. The other, referred to as FastCFD, is based on the depth-first approach used in FastFD, a method for discovering FDs. It leverages closed-itemset mining to reduce the search space. Our experimental results demonstrate the following.
(i) CFDMiner can be multiple orders of magnitude faster than CTANE and FastCFD for constant CFD discovery.
(ii) CTANE works well when a given sample relation is large, but it does not scale well with the arity of the relation.
(iii) FastCFD is far more efficient than CTANE when the arity of the relation is large.
As mentioned, constant CFDs are particularly important for object identification, and thus deserve a separate treatment. One wants efficient methods to discover constant CFDs alone, without paying the price of discovering all CFDs. Indeed, as will be seen later, constant CFD discovery is often several orders of magnitude faster than general CFD discovery. Levelwise algorithms may not perform well on sample relations of large arity, given their inherent exponential complexity. More effective methods have to be in place to deal with datasets with a large arity. A host of techniques have been developed for (non-redundant) association rule mining, and it is only natural to capitalize on these for CFD discovery. As we shall see, these techniques can not only be readily used in constant CFD discovery, but also significantly speed up general CFD discovery. To our knowledge, no previous work has considered these issues for CFD discovery.

II. PREVIOUS WORK
The discovery problem has been studied for FDs for two decades [1], [3] for database design, data archiving, OLAP and data mining. It was first
investigated in [2], which shows that the problem is inherently exponential in the arity |R| of the schema R of sample data r. One of the best-known methods for FD discovery is TANE [3], a levelwise algorithm [2] that searches an attribute-set containment lattice and derives FDs with k + 1 attributes from sets of k attributes, with pruning based on FDs generated in previous levels. TANE takes linear time in the size |r| of input sample r, and works well when the arity |R| is not very large. The algorithms of [6], [7], [8] follow a similar levelwise approach. However, the levelwise algorithms may take exponential time in |R| even if the output is not exponential in |R|. In light of this, another algorithm, referred to as FastFD [4], explores the connection between FD discovery and the problem of finding minimal covers of hypergraphs, and employs the depth-first strategy to search minimal covers. It takes (almost) linear time in the size of the output, i.e., in the size of the FD cover. It scales better than TANE when the arity is large, but it is more sensitive to the size |r|. Indeed, it is in O(|r|^2 log |r|) time, when considering data complexity (|R| is assumed constant). There has also been a bottom-up approach [5] based on techniques for learning general logical descriptions in a hypotheses space. As shown in [3], TANE outperforms the algorithm of [5]. Recently two sets of algorithms have been developed for discovering CFDs [1], [2]. For a fixed traditional FD fd, [1] showed that it is NP-complete to find useful patterns that, together with fd, make quality CFDs. They provide efficient heuristic algorithms for discovering patterns from samples w.r.t. a fixed FD. An algorithm for discovering CFDs, including both traditional FDs and their associated patterns, was presented in [2], which is an extension of TANE.
Constant CFD discovery is closely related to association rule mining (e.g., [2]) and in particular, closed and free itemsets mining (e.g., [3], [24]). With 100% confidence, an association rule (X, tp) ⇒ (A, a) is a constant CFD (X → A, (tp ‖ a)), where tp is a constant pattern over attributes X and a is a value in the domain of attribute A. Better still, there is an intimate connection between left-reduced constant CFDs and non-redundant association rules, which can be found by computing closed itemsets and free itemsets. The potential applications of CFDs in data cleaning highlight the need for further investigations of CFD discovery. As remarked earlier, constant CFDs are particularly important for object identification, and thus deserve a separate treatment. One wants efficient methods to discover constant CFDs alone, without paying the price of discovering all CFDs. Indeed, as will be seen later, constant CFD discovery is often several orders of magnitude faster than general CFD discovery.
Levelwise algorithms [2] may not perform well on sample relations of large arity, given their inherent exponential complexity. More effective methods have to be in place to deal with datasets with a large arity. A host of techniques have been developed for (non-redundant) association rule mining, and it is only natural to capitalize on these for CFD discovery. As we shall see, these techniques can not only be readily used in constant CFD discovery, but also significantly speed up general CFD discovery. To our knowledge, no previous work has considered these issues for CFD discovery.

III. SYSTEM ANALYSIS & DESCRIPTION
Levelwise algorithms may not perform well on sample relations of large arity, given their inherent exponential complexity. More effective methods have to be in place to deal with datasets with a large arity. A host of techniques have been developed for (non-redundant) association rule mining, and it is only natural to capitalize on these for CFD discovery. As we shall see, these techniques can not only be readily used in constant CFD discovery, but also significantly speed up general CFD discovery. To our knowledge, no previous work has considered these issues for CFD discovery.
In light of these considerations we provide the following modules for CFD discovery: one for discovering constant CFDs, and the other two for general CFDs.
(Module: 1) We propose a notion of minimal CFDs based on both the minimality of attributes and the minimality of patterns. Intuitively, minimal CFDs contain neither redundant attributes nor redundant patterns. Furthermore, we consider frequent CFDs that hold on a sample dataset r, namely, CFDs in which the pattern tuples have a support in r above a certain threshold. Frequent CFDs allow us to accommodate unreliable data with errors and noise. Our algorithms find minimal and frequent CFDs to help users identify quality cleaning rules from a possibly large set of CFDs that hold on the samples.
(Module: 2) Our first algorithm, referred to as CFDMiner, is for constant CFD discovery. We explore the connection between minimal constant CFDs and closed and free patterns. Based on this, CFDMiner finds constant CFDs by leveraging a recent mining technique, which mines closed itemsets and free itemsets in parallel following a depth-first search scheme.
(Module: 3) Our second algorithm, referred to as CTANE, extends TANE to discover general CFDs. It is based on an attribute-set/pattern tuple lattice, and mines CFDs at level k + 1 of the lattice (i.e., when each set at the level consists of k + 1 attributes) with pruning based on those at level k. CTANE discovers minimal CFDs only.
(Module: 4) Our third algorithm, referred to as FastCFD, discovers general CFDs by employing a depth-first search strategy instead of the levelwise approach. It is a nontrivial extension of FastFD
mentioned above, by mining pattern tuples. A novel pruning technique is introduced by FastCFD, by leveraging constant CFDs found by CFDMiner. As opposed to CTANE, FastCFD does not take exponential time in the arity of sample data when a canonical cover of CFDs is not exponentially large.
(Module: 5) Our fifth and final contribution is an experimental study of the effectiveness and efficiency of our algorithms, based on real-life data (Wisconsin breast cancer and chess datasets from UCI) and synthetic datasets generated from data scraped from the Web. We evaluate the scalability of these methods by varying the sample size, the arity of relation schema, the active domains of attributes, and the support threshold for frequent CFDs. We find that constant CFD discovery (using CFDMiner) is often 3 orders of magnitude faster than general CFD discovery (using CTANE or FastCFD). We also find that FastCFD scales well with the arity: it is up to 3 orders of magnitude faster than CTANE when the arity is between 10 and 15, and it performs well when the arity is greater than 30; in contrast, CTANE cannot run to completion when the arity is above 17. On the other hand, CTANE is more sensitive to the support threshold and outperforms FastCFD when the threshold is large and the arity is of a moderate size. We also find that our pruning techniques via itemset mining are effective: they improve the performance of FastCFD by a factor of 5 to 10 and make FastCFD scale well with the sample size. These results provide a guideline for when to use CFDMiner, CTANE or FastCFD in different applications.
These modules provide a set of promising tools to help reduce manual effort in the design of data-quality rules, for users to choose for different applications. They help make CFD-based cleaning a practical data quality tool.

IV. SYSTEM DESIGN & IMPLEMENTATION
The component design diagram helps to model the physical aspects of an object-oriented software system; for the proposed framework, it illustrates the architecture of the dependencies between service provider and consumer.
A sequence diagram shows, as parallel vertical lines (lifelines), different processes or objects that live simultaneously, and, as horizontal arrows, the messages exchanged between them, in the order in which they occur. This allows the specification of simple runtime scenarios in a graphical manner.

[Fig.1: Inter-operational Sequence Diagram for the Framework. Labels: System, Provider, EB Authority, Add User Details, Add Password, Search User Details, View User Details, Enter meter no or Area Code or Phone No.]

[Fig.2: Inter-operational Use Activity diagram for the framework. Labels: Provider Login, Check (Yes/No), Unauthorized Person, Add User Details, View Tables, Change Password, View User Full Details.]
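The relationship between association rules and constant CFDs described in Section II, together with the support threshold used for frequent CFDs in Module 1, can be illustrated with a small sketch. It is not the CFDMiner implementation. It only checks one candidate rule against a sample relation, and it assumes that "support" means the number of tuples matching both the pattern and the right-hand-side value. The names `rule_stats`, `is_frequent_constant_cfd`, and the `SAMPLE` relation with attributes CC, AC, CT are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def rule_stats(rows: List[Row], pattern: Dict[str, str],
               attr: str, value: str) -> Tuple[int, float]:
    """Return (support, confidence) of the rule (X, tp) => (A, a).

    `pattern` maps each attribute in X to its constant in tp.
    Support is assumed here to be the number of rows that match
    the pattern and also carry `value` in column `attr`.
    """
    matching = [r for r in rows
                if all(r[x] == v for x, v in pattern.items())]
    support = sum(1 for r in matching if r[attr] == value)
    confidence = support / len(matching) if matching else 0.0
    return support, confidence


def is_frequent_constant_cfd(rows: List[Row], pattern: Dict[str, str],
                             attr: str, value: str, k: int = 1) -> bool:
    """A rule with 100% confidence is a constant CFD; it is treated
    as k-frequent when its support is at least k."""
    support, confidence = rule_stats(rows, pattern, attr, value)
    return confidence == 1.0 and support >= k


# Hypothetical sample relation (country code, area code, city).
SAMPLE: List[Row] = [
    {"CC": "44", "AC": "131", "CT": "EDI"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "131", "CT": "NYC"},
]

if __name__ == "__main__":
    # Holds on every matching row, with support 2.
    print(is_frequent_constant_cfd(SAMPLE, {"CC": "44", "AC": "131"}, "CT", "EDI", k=2))
    # Dropping CC makes the rule fail on the fourth row.
    print(is_frequent_constant_cfd(SAMPLE, {"AC": "131"}, "CT", "EDI"))
```

In this sample, the pattern (CC = 44, AC = 131) determines CT = EDI on every matching tuple, while the shorter pattern (AC = 131) does not, so CC is not a redundant attribute here. That is the sense in which a minimal constant CFD contains no redundant attributes.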
V. RESULTS

[Fig.3: Inter-operational class diagram for Framework. Classes: EB Authority, Provider. Operations shown include View Users Details, Add Users Details, Change Password, Enter Meter Number (or Area Code or Phone Number), View User Details (table-wise), View User Full Details, getName(), setAttribute(), getAttribute(), Set Session(), Get Session(), toString().]

CTANE Algorithm
CTANE is a levelwise algorithm for discovering minimal, k-frequent (variable and constant) CFDs. It is an extension of algorithm TANE [3] for discovering FDs.

Fig.4: To add the user details
Fig.5: Welcome screen for service provider
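The levelwise idea that CTANE inherits from TANE can be sketched for plain FDs: candidate left-hand sides of size k are tested before those of size k + 1, and a candidate is pruned when a subset of it already determines the same attribute. The sketch below is a simplified illustration under those assumptions. It handles traditional FDs only, ignores pattern tuples and support thresholds, skips empty left-hand sides, and uses a naive validity check rather than the partition-based techniques of the real algorithms. The names `fd_holds` and `levelwise_minimal_fds` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]
FD = Tuple[Tuple[str, ...], str]  # (left-hand side attributes, right-hand side)


def fd_holds(rows: List[Row], lhs: Sequence[str], rhs: str) -> bool:
    """Check lhs -> rhs: rows that agree on lhs must agree on rhs."""
    seen: Dict[Tuple[str, ...], str] = {}
    for r in rows:
        key = tuple(r[a] for a in lhs)
        if seen.setdefault(key, r[rhs]) != r[rhs]:
            return False
    return True


def levelwise_minimal_fds(rows: List[Row], attrs: Sequence[str]) -> List[FD]:
    """Enumerate minimal non-trivial FDs level by level.

    Level k tests every left-hand side with k attributes. A candidate
    is skipped when a subset of it, found at an earlier level, already
    determines the same right-hand side, so only minimal FDs are kept.
    """
    found: List[FD] = []
    for k in range(1, len(attrs)):
        for lhs in combinations(attrs, k):
            for rhs in attrs:
                if rhs in lhs:
                    continue  # trivial dependency
                if any(r == rhs and set(l) <= set(lhs) for l, r in found):
                    continue  # pruned: not minimal
                if fd_holds(rows, lhs, rhs):
                    found.append((lhs, rhs))
    return found


if __name__ == "__main__":
    rows = [
        {"A": "1", "B": "x", "C": "p"},
        {"A": "1", "B": "x", "C": "q"},
        {"A": "2", "B": "y", "C": "p"},
        {"A": "3", "B": "y", "C": "q"},
    ]
    for lhs, rhs in levelwise_minimal_fds(rows, ["A", "B", "C"]):
        print(", ".join(lhs), "->", rhs)
```

The number of candidate left-hand sides grows exponentially with the number of attributes, which is consistent with the paper's observation that levelwise methods do not scale well with the arity of the relation.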
Fig.6: To find the information of the user
Fig.7: User complete information

CONCLUSION
We have developed and implemented three algorithms for discovering minimal CFDs: CFDMiner for mining minimal constant CFDs, a class of CFDs important for both data cleaning and data integration; CTANE for discovering general minimal CFDs based on the levelwise approach; and FastCFD for discovering general minimal CFDs based on a depth-first search strategy and a novel optimization technique via closed-itemset mining. As suggested by our experimental results, these provide a set of tools for users to choose for different applications. When only constant CFDs are needed, one can simply use CFDMiner without paying the price of mining general CFDs. When the arity of a sample dataset is large, one should opt for FastCFD. When k-frequent CFDs are needed for a large k, one could use CTANE.

REFERENCES
[1] J. Chomicki and J. Marcinkowski, “Minimal-change integrity maintenance using tuple deletions,” Information and Computation, vol. 197, no. 1-2, pp. 90–121, 2005.
[2] J. Wijsen, “Database repairing using updates,” TODS, vol. 30, no. 3, pp. 722–768, 2005.
[3] L. Bravo, W. Fan, and S. Ma, “Extending dependencies with conditions,” in VLDB, 2007.
[4] B. Goethals, W. L. Page, and H. Mannila, “Mining association rules of simple conjunctive queries,” in SDM, 2008.
[5] S. Lopes, J.-M. Petit, and L. Lakhal, “Efficient discovery of functional dependencies and Armstrong relations,” in EDBT, 2000.
[6] T. Calders, R. T. Ng, and J. Wijsen, “Searching for dependencies at multiple abstraction levels,” TODS, vol. 27, no. 3, pp. 229–260, 2003.
[7] R. S. King and J. J. Legendre, “Discovery of functional and approximate functional dependencies in relational databases,” JAMDS, vol. 7, no. 1, pp. 49–59, 2003.
[8] I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, and A. Aboulnaga, “Cords: Automatic discovery of correlations and soft functional dependencies,” in SIGMOD, 2004.
[9] H. Mannila and H. Toivonen, “Levelwise search and borders of theories in knowledge discovery,” Data Min. Knowl. Discov., vol. 1, no. 3, pp. 259–289, 1997.
[10] Gartner, “Forecast: Data quality tools, worldwide, 2006-2011,” 2007.
[11] B. Goethals, W. L. Page, and H. Mannila, “Mining association rules of simple conjunctive queries,” in SDM, 2008.
[12] R. Medina and N. Lhouari, “A unified hierarchy for functional dependencies, conditional functional dependencies and association rules,” in ICFCA, 2009.

Author List:
Venkata Lavanya Korada received a B.Tech in Computer Science and Engineering from Thandra Paparaya Institute of Science and Technology, affiliated to JNTUH, in 2005 and is pursuing an M.Tech in Computer Science from GOKUL Institute of Technology & Sciences, affiliated to JNTUK. Her research areas of interest are Data Mining and Computer Networks.
Avala Atchyuta Rao received a B.Tech in Computer Science and Engineering from Prakasam Engineering College, affiliated to JNTUH, in 2005 and an M.Tech in Neural Networks from GOKUL Institute of Technology & Sciences, affiliated to JNTUK, in 2010. He is a live student Member of CSR. His research area of interest is Software Engineering.
