Duplicate Detection of Records in Queries using Clustering
 

The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in a data warehouse. Many times, the same logical real-world entity has multiple representations in the data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. It is also important to detect and clean equivalence errors, because a single equivalence error may result in several duplicate tuples. Recent research efforts have focused on duplicate elimination in data warehouses. This entails matching inexact duplicate records, which are records that refer to the same real-world entity while not being syntactically equivalent. This paper focuses on efficient detection and elimination of duplicate data. The main objective of this research work is to detect exact and inexact duplicates by using duplicate detection and elimination rules, and thereby to improve the quality of the data.

Document Transcript

International Journal of Research in Computer Science, eISSN 2249-8265, Volume 2 Issue 2 (2012) pp. 29-32, © White Globe Publications, www.ijorcs.org

DUPLICATE DETECTION OF RECORDS IN QUERIES USING CLUSTERING

M. Anitha, A. Srinivas, T. P. Shekhar, D. Sagar
Sree Chaitanya College of Engineering (SCCE), Karimnagar, India

Abstract: The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in a data warehouse. Many times, the same logical real-world entity has multiple representations in the data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. It is also important to detect and clean equivalence errors, because a single equivalence error may result in several duplicate tuples. Recent research efforts have focused on duplicate elimination in data warehouses. This entails matching inexact duplicate records, which are records that refer to the same real-world entity while not being syntactically equivalent. This paper focuses on efficient detection and elimination of duplicate data. The main objective of this research work is to detect exact and inexact duplicates by using duplicate detection and elimination rules, and thereby to improve the quality of the data.

Keywords: Data Cleaning, Duplicate Data, Data Warehouse, Data Mining

I. INTRODUCTION

A data warehouse contains large amounts of data for data mining to analyze for the decision-making process. Data miners do not simply analyze data; they have to bring the data into a format and state that allows for this analysis. It has been estimated that the actual mining of data makes up only 10% of the time required for the complete knowledge discovery process [3]. According to Jiawei, the preceding, time-consuming step of preprocessing is of essential importance for data mining. It is more than a tedious necessity: the techniques used in the preprocessing step can deeply influence the results of the following step, the actual application of a data mining algorithm [6]. Hans-Peter stated that the impact of data preprocessing on data mining will gain steadily more interest over the coming years, and that preprocessing is one of the four future trends and major issues in data mining over the next years [7].

In a data warehouse, data is integrated or collected from multiple sources. While integrating data from multiple sources, the amount of data increases and data is duplicated. A data warehouse may hold terabytes of data for the mining process. The preprocessing of data is the initial and often crucial step of the data mining process. To increase the accuracy of the mining result, one has to perform data preprocessing, because roughly 80% of mining effort is often spent on data quality. So, data cleaning is very important in the data warehouse before the mining process; otherwise the result of the data mining process will not be accurate because of data duplication and poor data quality. There are many existing methods for duplicate data detection and elimination, but the cleaning process is slow and the time taken is high for large amounts of data. So, there is a need to reduce the time and increase the speed of the data cleaning process, as well as to improve the quality of the data.

There are two issues to be considered for duplicate detection: accuracy and speed. The measure of accuracy in duplicate detection depends on the number of false negatives (duplicates that were not classified as such) and false positives (non-duplicates that were classified as duplicates) [12].

In this research work, a duplicate detection and elimination rule is developed to handle any duplicate data in a data warehouse. In duplicate elimination it is very important to identify which duplicate to retain and which duplicate to remove. The main objective of this research work is to reduce the number of false positives, to speed up the data cleaning process, to reduce complexity, and to improve the quality of data. A high-quality, scalable duplicate elimination algorithm is used and evaluated on real datasets from an operational data warehouse to achieve this objective.

II. RECORD MATCHING OVER QUERY RESULTS

A. First Problem Definition

Our focus is to find the matching status among the records and to retain the non-duplicate records. The goal is then to cluster the matched records using fuzzy ontological document clustering.
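Concretely, the matching status of a record pair is judged from field-level similarities, which later sections collect into a similarity vector per pair. The sketch below is only a rough illustration of that idea, not the authors' implementation: it uses Python's difflib ratio as a stand-in similarity function, and the field names and sample records are hypothetical.

```python
# Illustrative sketch: build a field-level similarity vector for a record pair.
# difflib's ratio() is only a stand-in similarity function; the field names and
# the example records are hypothetical.
from difflib import SequenceMatcher

FIELDS = ["name", "address", "phone"]  # hypothetical matching fields


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values (1.0 = exact match)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def similarity_vector(rec1: dict, rec2: dict) -> list[float]:
    """One similarity score per matching field, forming the pair's vector."""
    return [field_similarity(rec1[f], rec2[f]) for f in FIELDS]


# Two inexact duplicates: the same real-world entity, different representations.
r1 = {"name": "M. Anitha", "address": "SCCE, Karimnagar", "phone": "040-1234567"}
r2 = {"name": "Anitha M", "address": "S.C.C.E Karimnagar", "phone": "0401234567"}
print(similarity_vector(r1, r2))  # per-field scores: high, but below 1.0 for this pair
```

An exact duplicate would score 1.0 on every field; the inexact pair above scores high but imperfectly, which is exactly the case the detection rules have to handle.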
B. Element Identification

Supervised learning methods use only some of the fields in a record for identification. This is the reason why query results obtained using supervised learning contain duplicate records. Unsupervised Duplicate Elimination (UDE) does not suffer from these types of user-reference problems. A preprocessing step called exact matching is used for matching relevant records. It requires the data format of the records to be the same, so the exact matching method is applicable only to records from the same data source. Element identification thus merges the records that are exactly the same in the relevant matching fields.

C. Ontology Matching

The term ontology is derived from the Greek words 'onto', which means being, and 'logia', which means written or spoken disclosure. In short, it refers to a specification of a conceptualization. An ontology basically refers to the set of concepts, such as things, events and relations, that are specified in some way in order to create an agreed-upon vocabulary for exchanging information. Ontologies can be represented in textual or graphical formats. Usually, graphical formats are preferred for easy understandability. Ontologies with a large knowledge base [5] can be represented in different forms such as hierarchical trees, expandable hierarchical trees, hyperbolic trees, etc. In the expandable hierarchical tree format, the user has the freedom to expand only the node of interest and leave the rest in a collapsed state [2]. If necessary, the entire tree can be expanded to get the complete knowledge base. This type of format can be used only when there are a large number of hierarchical relationships. Ontology matching is used for finding the matching status of record pairs by matching the record attributes.

III. SYSTEM METHODOLOGY

A. Unsupervised Duplicate Elimination

UDE employs a similarity function to find field similarity. A similarity vector is used to represent a pair of records.

Input: potential duplicate vector set P, non-duplicate vector set N
Output: duplicate vector set D
C1: a classification algorithm with adjustable parameters W that identifies duplicate vector pairs from P
C2: a supervised classifier, SVM

Algorithm:
1. D = φ
2. Set the parameters W of C1 according to N
3. Use C1 to get a set of duplicate vector pairs d1 and f from P and N
4. P = P - d1
5. while |d1| ≠ 0
6.   N' = N - f
7.   D = D + d1 + f
8.   Train C2 using D and N'
9.   Classify P using C2 and get a set of newly identified duplicate vector pairs d2
10.  P = P - d2
11.  D = D + d2
12.  Adjust the parameters W of C1 according to N' and D
13.  Use C1 to get a new set of duplicate vector pairs d1 and f from P and N
14.  N = N'
15. Return D

Figure 1: UDE Algorithm

B. Certainty Factor

In the existing method of duplicate data elimination [10], the certainty factor (CF) is calculated by classifying attributes by their distinct and missing values and by the type and size of the attribute. These attributes are identified manually based on the type of the data and the most important data in the data warehouse. For example, if the name, telephone and fax fields are used for matching, then a high value is assigned to the certainty factor. In this research work, the best attributes are identified in the early stages of data cleaning. The attributes are selected based on specific criteria and the quality of the data. An attribute threshold value is calculated based on the measurement type and size of the data. These selected attributes are well suited for the data cleaning process. The certainty factor is assigned based on the attribute types, as shown in the following table.

Table 1: Classification of attribute types

S. No   Key attribute   Distinct values   Missing values   Type of data   Size of data
  1          √                -                 -                √              √
  2          √                -                 -                -              √
  3          √                -                 -                √              -
  4          -                √                 √                √              √
  5          -                √                 √                -              √
  6          -                √                 √                √              -
  7          -                √                 -                √              √
  8          -                √                 √                -              -
  9          -                √                 -                √              -
 10          -                √                 -                -              √
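To make the control flow of Figure 1 concrete, here is a minimal Python sketch of the UDE loop. It assumes a simple weighted-threshold scorer as the adjustable classifier C1 and scikit-learn's SVC as the supervised classifier C2; both choices, the weighting scheme, and the vectors-as-tuples representation are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the UDE loop in Figure 1 (step numbers in comments).
# Assumptions for illustration only: C1 is a weighted-threshold scorer whose
# weights/threshold are derived from the non-duplicate vectors, and C2 is an
# SVM (scikit-learn SVC). Similarity vectors are represented as float tuples.
import numpy as np
from sklearn.svm import SVC


def fit_c1(non_dups, dups=()):
    """Set/adjust the parameters W of C1: fields that vary little among known
    non-duplicates weigh more; the threshold separates non-duplicate scores
    from duplicate scores when duplicates are available."""
    nd = np.asarray(list(non_dups), dtype=float)
    w = 1.0 / (nd.std(axis=0) + 1e-6)
    w /= w.sum()
    nd_scores = nd @ w
    if len(dups):
        d_scores = np.asarray(list(dups), dtype=float) @ w
        thresh = (nd_scores.mean() + d_scores.mean()) / 2.0
    else:
        thresh = nd_scores.mean() + nd_scores.std()
    return w, thresh


def run_c1(vectors, w, thresh):
    """Return the subset of vectors that C1 labels as duplicate pairs."""
    return {v for v in vectors if float(np.asarray(v) @ w) > thresh}


def ude(P, N):
    """Unsupervised Duplicate Elimination over similarity-vector sets P and N."""
    P, N = {tuple(p) for p in P}, {tuple(n) for n in N}
    D = set()                                        # 1. D = empty set
    w, th = fit_c1(N)                                # 2. set W of C1 according to N
    d1, f = run_c1(P, w, th), run_c1(N, w, th)       # 3. duplicates from P and N
    P -= d1                                          # 4. P = P - d1
    while d1:                                        # 5. while |d1| != 0
        N2 = N - f                                   # 6. N' = N - f
        D |= d1 | f                                  # 7. D = D + d1 + f
        if not N2:                                   # degenerate case: no negatives left for C2
            return D
        X = list(D) + list(N2)                       # 8. train C2 using D and N'
        y = [1] * len(D) + [0] * len(N2)
        c2 = SVC(kernel="rbf").fit(X, y)
        Plist = list(P)                              # 9. classify P using C2
        d2 = ({v for v, lab in zip(Plist, c2.predict(Plist)) if lab == 1}
              if Plist else set())
        P -= d2                                      # 10. P = P - d2
        D |= d2                                      # 11. D = D + d2
        w, th = fit_c1(N2, D)                        # 12. adjust W from N' and D
        d1, f = run_c1(P, w, th), run_c1(N, w, th)   # 13. new d1 and f from P and N
        P -= d1                                      # repeat of step 4 (stated once in
                                                     # Figure 1); keeps the loop finite
        N = N2                                       # 14. N = N'
    return D                                         # 15. return D
```

The set representation keeps the P, N and D bookkeeping close to the pseudocode's set algebra; any real implementation would also have to carry record identifiers alongside the similarity vectors.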
Rule 1: certainty factor 0.95 (No. 1 and No. 4)
 • Matching key field with high type and high size
 • And matching field with high distinct value, low missing value, high value data type and high range value

Rule 2: certainty factor 0.9 (No. 2 and No. 4)
 • Matching key field with high size
 • And matching field with high distinct value, low missing value and high range value

Rule 3: certainty factor 0.9 (No. 3 and No. 4)
 • Matching key field with high type
 • And matching field with high distinct value, low missing value, high value data type and high range value

Rule 4: certainty factor 0.85 (No. 1 and No. 5)
 • Matching key field with high type and high size
 • And matching field with high distinct value, low missing value and high range value

Rule 5: certainty factor 0.85 (No. 2 and No. 5)
 • Matching key field with high size
 • And matching field with high distinct value, low missing value and high range value

Rule 6: certainty factor 0.85 (No. 3 and No. 5)
 • Matching key field with high type
 • And matching field with high distinct value, low missing value and high range value

Rule 7: certainty factor 0.85 (No. 1 and No. 6)
 • Matching key field with high size and high type
 • And matching field with high distinct value, low missing value and high value data type

Rule 8: certainty factor 0.8 (No. 3 and No. 7)
 • Matching key field with high type
 • And matching field with high distinct value, high value data type and high range value

Rule 9: certainty factor 0.75 (No. 2 and No. 8)
 • Matching key field with high size
 • And matching field with high distinct value and low missing value

Rule 10: certainty factor 0.75 (No. 3 and No. 8)
 • Matching key field with high type
 • And matching field with high distinct value and low missing value

Rule 11: certainty factor 0.7 (No. 1 and No. 9)
 • Matching key field with high type and high size
 • And matching field with high distinct value and high value data type

Rule 12: certainty factor 0.7 (No. 2 and No. 9)
 • Matching key field with high size
 • And matching field with high distinct value and high value data type

Rule 13: certainty factor 0.7 (No. 3 and No. 9)
 • Matching key field with high type
 • And matching field with high distinct value and high value data type

Rule 14: certainty factor 0.7 (No. 1 and No. 10)
 • Matching key field with high type and high size
 • And matching field with high distinct value and high range value

Rule 15: certainty factor 0.7 (No. 2 and No. 10)
 • Matching key field with high size
 • And matching field with high distinct value and high range value

Rule 16: certainty factor 0.7 (No. 3 and No. 10)
 • Matching key field with high type
 • And matching field with high distinct value and high range value

S. No   Rules                                 Certainty Factor (CF)   Threshold value (TH)
  1     {TS}, {D, M, DT, DS}                  0.95                    0.75
  2     {T, S}, {D, M, DT, DS}                0.9                     0.80
  3     {TS, T, S}, {D, M, DT}, {D, M, DS}    0.85                    0.85
  4     {TS, T, S}, {D, DT, DS}               0.8                     0.9
  5     {TS, T, S}, {D, M}                    0.75                    0.95
  6     {TS, T, S}, {D, DT}, {D, DS}          0.7                     0.95

TS – type and size of key attribute
T – type of key attribute
S – size of key attribute
D – distinct value of attributes
M – missing value of attributes
DT – data type of attributes
DS – data size of attributes
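The table can be read as a lookup from which attribute properties matched to a certainty factor and threshold. The sketch below simply encodes it as data; the Python encoding, and in particular the way CF is compared with TH to label a pair as match or may-be-match, are illustrative assumptions, since the comparison is not spelled out above.

```python
# The CF/TH table encoded as data. TS/T/S describe the matching key attribute,
# D/M/DT/DS the other matching fields (see the legend above). How CF and TH
# combine into a match / may-be-match / no-match label is an assumption made
# purely for illustration.
RULES = [
    # (key-attribute alternatives, matching-field alternatives, CF, TH)
    ({"TS"},           [{"D", "M", "DT", "DS"}],               0.95, 0.75),
    ({"T", "S"},       [{"D", "M", "DT", "DS"}],               0.90, 0.80),
    ({"TS", "T", "S"}, [{"D", "M", "DT"}, {"D", "M", "DS"}],   0.85, 0.85),
    ({"TS", "T", "S"}, [{"D", "DT", "DS"}],                    0.80, 0.90),
    ({"TS", "T", "S"}, [{"D", "M"}],                           0.75, 0.95),
    ({"TS", "T", "S"}, [{"D", "DT"}, {"D", "DS"}],             0.70, 0.95),
]


def certainty_factor(key_props, field_props):
    """Return (CF, TH) of the first rule satisfied by the matched properties,
    or (0.0, 1.0) when no rule applies (treated as no-match)."""
    for keys, field_alts, cf, th in RULES:
        if keys & key_props and any(alt <= field_props for alt in field_alts):
            return cf, th
    return 0.0, 1.0


def label_pair(key_props, field_props):
    """Illustrative labelling: 'match' when CF clears the rule's threshold,
    'may be match' when a rule fires but CF falls short, else 'no-match'."""
    cf, th = certainty_factor(key_props, field_props)
    if cf == 0.0:
        return "no-match"
    return "match" if cf >= th else "may be match"


# Example: the key attribute matches on both type and size, and the other
# fields match on distinct value, missing value, data type and data size (Rule 1).
print(label_pair({"TS"}, {"D", "M", "DT", "DS"}))  # -> match (CF 0.95, TH 0.75)
```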
Duplicate records are identified in each cluster in order to find exact and inexact duplicates. The duplicate records can be categorized as match, may-be-match and no-match. Match and may-be-match records are passed to the duplicate data elimination rule, which assesses the quality of each duplicate record so that poor-quality duplicates can be eliminated.
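The elimination step itself is not given in code above, so the following is only a plausible sketch of what "assess the quality of each duplicate record and eliminate poor-quality duplicates" could look like: within each cluster of matched records, a simple completeness score (the fraction of non-empty fields) decides which record to retain. The scoring function, field layout and sample records are assumptions.

```python
# Hypothetical sketch of the duplicate data elimination rule: within a cluster
# of records judged to be duplicates, keep the "best" record and drop the rest.
# Scoring by completeness (fraction of non-empty fields) is an assumed proxy
# for record quality, not a measure prescribed by the paper.
def record_quality(record: dict) -> float:
    """Fraction of fields that carry a non-empty value."""
    values = list(record.values())
    return sum(1 for v in values if v not in (None, "", "NULL")) / len(values)


def eliminate_duplicates(cluster: list[dict]) -> dict:
    """Return the highest-quality record of a duplicate cluster."""
    return max(cluster, key=record_quality)


cluster = [
    {"name": "M. Anitha", "address": "SCCE, Karimnagar", "phone": ""},
    {"name": "Anitha M",  "address": "SCCE, Karimnagar", "phone": "0401234567"},
]
print(eliminate_duplicates(cluster))  # retains the record with the phone field filled in
```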
The matched records are then clustered using fuzzy ontological document clustering, which proceeds as follows:
 • Calculate the similarity of the documents matched in main concepts (Xmc) and the similarity of the documents matched in detailed descriptions (Xdd).
 • Evaluate Xmc and Xdd using the rules to derive the corresponding memberships.
 • Compare the memberships and select the minimum membership from these two sets to represent the membership of the corresponding concept (high similarity, medium similarity, and low similarity) for each rule.
 • Collect the memberships which represent the same concept in one set.
 • Derive the maximum membership for each set, and compute the final inference result.

C. Evaluation Metric

The overall performance is measured using precision and recall, where

Precision = (number of correctly identified duplicate pairs) / (number of all identified duplicate pairs)
Recall = (number of correctly identified duplicate pairs) / (number of true duplicate pairs)

The classification quality is evaluated using the F-measure, which is the harmonic mean of precision and recall:

F-measure = (2 × Precision × Recall) / (Precision + Recall)
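As a quick worked illustration of these metrics (the duplicate counts below are invented for the example, not experimental results from the paper):

```python
# Worked example of the evaluation metrics; the counts are made up purely to
# illustrate the formulas.
def evaluate(correct: int, identified: int, true_pairs: int) -> dict:
    """Precision, recall and F-measure for one duplicate-detection run."""
    precision = correct / identified
    recall = correct / true_pairs
    f_measure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# 90 of the 100 reported duplicate pairs are correct (precision 0.9), and they
# cover 90 of the 120 true duplicate pairs (recall 0.75).
print(evaluate(correct=90, identified=100, true_pairs=120))
# -> precision 0.9, recall 0.75, F-measure ~= 0.818
```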
IV. CONCLUSION

Deduplication and data linkage are important tasks in the preprocessing step of many data mining projects. It is important to improve data quality before data is loaded into the data warehouse. Locating approximate duplicates in a large data warehouse is an important part of data management and plays a critical role in the data cleaning process. In this research work, a framework is designed to clean duplicate data in order to improve data quality and to support any subject-oriented data.

In this research work, an efficient duplicate detection and elimination approach is developed that obtains good detection and elimination results by reducing false positives. The performance of this research work shows that time is saved significantly and duplicate results are improved compared with the existing approach.

The framework is mainly developed to increase the speed of the duplicate data detection and elimination process and to increase the quality of the data by identifying true duplicates while being strict enough to keep out false positives. The accuracy and efficiency of duplicate elimination strategies are improved by introducing the concept of a certainty factor for a rule.

Data cleansing is a complex and challenging problem. This rule-based strategy helps to manage that complexity, but does not remove it. The approach can be applied to any subject-oriented data warehouse in any domain.

V. REFERENCES

[1] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, "Robust and Efficient Fuzzy Match for Online Data Cleaning," Proc. ACM SIGMOD, pp. 313-324, 2003.
[2] Kuhanandha Mahalingam and Michael N. Huhns, "Representing and Using Ontologies," USC-CIT Technical Report 98-01.
[3] Weifeng Su, Jiying Wang, and Frederick H. Lochovsky, "Record Matching over Query Results from Multiple Web Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 4, 2010.
[4] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, "Eliminating Fuzzy Duplicates in Data Warehouses," Proc. VLDB, pp. 586-597, 2002.
[5] P. Tetlow, J. Pan, D. Oberle, E. Wallace, M. Uschold, and E. Kendall, "Ontology Driven Architectures and Potential Uses of the Semantic Web in Software Engineering," W3C, Semantic Web Best Practices and Deployment Working Group, Draft, 2006.
[6] Ji-Rong Wen, Fred Lochovsky, and Wei-Ying Ma, "Instance-based Schema Matching for Web Databases by Domain-specific Query Probing," Proc. 30th VLDB Conference, Toronto, Canada, 2004.
[7] Amy J. C. Trappey, Charles V. Trappey, Fu-Chiang Hsu, and David W. Hsiao, "A Fuzzy Ontological Knowledge Document Clustering Methodology," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 3, June 2009.