Slide 1: Data Cleaning Techniques
Shahid Rajaee Teacher Training University, Faculty of Computer Engineering
Presented by: Amir Masoud Sefidian
Slide 2: Today’s Lecture Content
• Introduction
• Enhanced Technique to Clean Data in the Data Warehouse
• DWCLEANSER: A Framework for Approximate Duplicate Detection
• Data Quality Mining
  • Data Quality Mining With Association Rules
  • Data Cleaning Using Functional Dependencies
Slide 4: Introduction
• Data quality is a central issue in information quality management.
• Data quality problems occur everywhere in information systems.
• These problems are addressed by data cleaning: a process that detects inaccurate, incomplete, or unreasonable data and then improves quality by correcting the detected errors => fewer errors and better data quality.
• Data cleaning can be a time-consuming and tedious process, but it cannot be ignored.
• Data quality criteria: accuracy, integrity, completeness, validity, consistency, schema conformance, uniqueness, ... .
Slide 6: An Enhanced Technique to Clean Data in the Data Warehouse
• Uses a new algorithm that detects and corrects most error types and expected problems, such as lexical errors, domain format errors, irregularities, integrity constraint violations, duplicates, and missing values.
• Presents a solution that works on quantitative data and any data with a limited value domain.
• Offers user interaction: the user selects the rules, the sources, and the desired targets.
• The algorithm is able to clean the data completely, addressing all mistakes and inconsistencies in the specified data or numerical values.
• The time taken to process huge data is less important than obtaining high-quality data, since a huge amount of data can be treated in one pass.
• The main focus is on achieving good data quality.
• The pace of implementation of this algorithm is adequate.
• It scales well to processing large amounts of data without significant degradation on most relative performance measures.
Slide 7: Flowchart of the Proposed Technique
The proposed model can easily be deployed in a data warehouse by the following algorithm:
Slide 8
The user selects the rules needed by the data cleaning system, along with the layout and descriptions of the data-set fields, which are used in implementing the algorithm.
Slide 9: Comparison of the Proposed Technique with Some Existing Techniques
Over 1,009 records containing many anomalies were examined before and after processing by different available methods (such as statistics and clustering). The large difference in the number of remaining anomalies confirms the effectiveness and quality of this algorithm.
Slide 11: DWCLEANSER: A Framework for Approximate Duplicate Detection
• A novel framework for detecting exact as well as approximate duplicates in a data warehouse.
• Reduces the complexity of previously designed frameworks by providing efficient data cleaning techniques.
• Provides comprehensive metadata support for the whole cleaning process.
• Provisions are also suggested for handling outliers and missing fields.
Slide 12: Existing Framework
Slide 13: Existing Framework
The previously designed framework is a sequential, token-based framework that offers the fundamental services of data cleaning in six steps:
1) Selection of attributes: Attributes are identified and selected for further processing in the following steps.
2) Formation of tokens: The selected attributes are used to form tokens for similarity computation.
3) Clustering/blocking of records: A blocking/clustering algorithm groups the records based on the calculated similarity and a block-token key.
4) Similarity computation for selected attributes: The Jaccard similarity method is used to compare token values of selected attributes in a field.
5) Detection and elimination of duplicate records: A rule-based detection and elimination approach detects and eliminates duplicates within one cluster or across many clusters.
6) Merge: The cleansed data is combined and stored.
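The token-based similarity of step 4 can be sketched in a few lines: tokenize each attribute value, then take the Jaccard ratio of the two token sets. A minimal illustration, assuming simple whitespace tokenization (the function names are mine, not from the paper):

```python
def tokenize(value):
    """Split a field value into a set of lowercase tokens."""
    return set(value.lower().split())

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token sets."""
    if not a and not b:
        return 1.0  # two empty values are treated as identical
    return len(a & b) / len(a | b)

# Two approximate-duplicate field values compared on their tokens:
s = jaccard(tokenize("John A. Smith"), tokenize("Smith John A."))
print(s)  # 1.0 -- same tokens, different order
```

Token order does not affect the score, which is exactly why token-based methods catch reordered names and addresses that plain string equality misses.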
Slide 14: Proposed Framework: DWCLEANSER
Slide 15: 1. Field Selection
• Records are decomposed into fields.
• Fields are analyzed to gather data about their type, relationships with other fields, key fields, and integrity constraints, so that there is enough metadata about the decomposed fields.
• Missing fields are stored in a separate temporary table and preserved in the repository along with their source record, relation name, data types, and integrity constraints.
• Missing fields are reviewed by the DBA to verify the reason for their existence: (1) if the data is missing, it can be recaptured; (2) if the value is not known, efforts can be made to gather the data to complete the record or to fill the missing field with a valid value. If no valid data can be collected, the value is preserved in the repository for further verification and is not used in the cleaning procedure.
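The quarantining of records with missing fields described above can be sketched as follows; a minimal illustration with records as dictionaries (the function name and the "missing means None or empty string" rule are my assumptions, not from the paper):

```python
def split_records(records, fields):
    """Separate complete records from records with missing fields,
    keeping enough context (source record, field names) for DBA review."""
    complete, quarantine = [], []
    for rec in records:
        missing = [f for f in fields if rec.get(f) in (None, "")]
        if missing:
            # preserved separately, not used in the cleaning procedure
            quarantine.append({"record": rec, "missing_fields": missing})
        else:
            complete.append(rec)
    return complete, quarantine

rows = [{"id": 1, "name": "Ann", "city": "Oslo"},
        {"id": 2, "name": "", "city": "Bonn"}]
ok, held = split_records(rows, ["id", "name", "city"])
print(len(ok), len(held))  # 1 1
```

In the real framework the quarantined entries would also carry relation name, data types, and integrity constraints, as the slide lists.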
Slide 16: 2. Computation of Rules
Certain rules are computed that will be used during the implementation of the cleaning process.
Threshold value: The threshold value is calculated based on experiments conducted in previous research. Values below the thresholds increase the number of false positives; values above the thresholds fail to detect all duplicates; values in between can be used to recognize approximate duplicates.
Rules for classification of fields: Selected fields are classified on the basis of their data types.
Rules for data quality attributes: The previous framework focused on only 3 quality attributes of data: completeness, accuracy, and consistency. The new framework proposes 2 further quality attributes: validity and integrity.
Slide 17: 3. Formation of Clusters
• Uses a recursive record-matching algorithm for initial cluster formation, with a slight modification: it matches fields rather than whole records.
• Clusters are stored in a priority queue.
• Priorities of clusters in the queue are assigned on the basis of their ability to detect duplicate data sets: the cluster that detected the most recent match is assigned the highest priority.
4. Match Score
Match scores are assigned by applying the Smith-Waterman algorithm (an edit-distance-based strategy). The calculations of this method are stored in a matrix.
5. Detection of Exact and Approximate Duplicates
When a new field is to be matched against any data set present in a cluster, a Union-Find structure is used; if that fails to detect a match, Smith-Waterman is employed.
6. Handling of Outliers and Missing Fields
Records that do not match any existing cluster are called outliers or singleton records. Singleton records may be stored in a separate file and kept in the repository for future analysis and comparisons.
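The Smith-Waterman match score of step 4 fills a dynamic-programming matrix and takes its maximum cell. A minimal character-level sketch; the scoring constants here are illustrative, not the framework's actual parameters:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local-alignment score of two strings: fill the DP matrix H and
    return its maximum cell (higher score = closer match)."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # negative scores are cut off at 0: alignments can restart anywhere
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("smith", "smyth"))  # 7: four matches, one mismatch
```

Because scores are clamped at zero, the algorithm rewards the best matching substring rather than penalizing the whole strings, which suits approximate-duplicate fields with local typos.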
Slide 18: 7. Updating Metadata/Repository
Metadata and repositories are an integral part of the proposed framework. Important components of the repositories:
1. Data dictionary: stores information about the relations, their sources, schema, etc.
2. Rules directory: stores all the calculated values of thresholds, quality attributes, matching scores, etc.
3. Log files: store information about the selected fields and their source records, and the classification of the fields by data type under 3 explicit categories: numeric, strings, and characters.
4. Outlier & missing-field files: store the outliers and missing fields with their related information, such as type and source relation.
Slide 19: Comparison of Existing and Proposed Framework
Slide 21: Data Quality Mining
The data mining process:
• Involves data collection, cleaning the data, building a model, and monitoring the models.
• Automatically extracts hidden and intrinsic information from collections of data.
• Offers various techniques that are suitable for data cleaning.
Some commonly used data mining techniques:
Association rule mining:
• Takes an input and induces rules as output; the outputs can be association rules.
• Association rules describe relationships among large data sets and the co-occurrence of items.
Functional dependency: shows the connection and association between attributes, i.e., how one specific combination of values on one set of attributes determines one specific combination of values on another set.
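The functional-dependency notion above can be checked mechanically: an FD lhs -> rhs holds exactly when no two tuples agree on the lhs attributes but differ on the rhs. A small sketch over rows represented as dictionaries (names are mine for illustration):

```python
def fd_holds(rows, lhs, rhs):
    """Return True iff the FD lhs -> rhs holds in the given rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False  # same lhs values, different rhs values: violation
    return True

data = [{"zip": "10115", "city": "Berlin"},
        {"zip": "10115", "city": "Berlin"},
        {"zip": "80331", "city": "Munich"}]
print(fd_holds(data, ["zip"], ["city"]))  # True: zip determines city here
```

A tuple that breaks such a check is precisely the kind of "suspicious tuple" that FD-based cleaning flags.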
Slide 23: Data Quality Mining With Association Rules
Objective: association rules are used here to detect, quantify, explain, and correct data quality deficiencies in very large databases. They find relationships among the items in a huge database and, in addition, improve data quality.
Association rule mining generates rules for all the transactions, which are checked via their confidence level. The strength of the rules is found by the following steps:
• Determine the transaction type.
• Generate the association rules.
• Assign a score to each transaction based on the generated rules.
Score: the sum of the confidence values of the rules the transaction violates. A rule violation occurs when a tuple satisfies the rule body but not its consequent.
Idea: transactions assigned high scores are suspected of deficiencies.
A minimal confidence threshold is suggested to restrict the rule set in order to improve the results.
Transactions are sorted according to their score values; based on the score, the system decides whether to accept the data, reject it, or issue a warning.
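The scoring step above can be sketched as follows: each rule is a (body, head, confidence) triple, and a transaction's score sums the confidences of the rules it violates, i.e. rules whose body the transaction contains but whose head it lacks. The rules and items here are illustrative, not from the source:

```python
def score(transaction, rules):
    """Sum the confidence of every rule the transaction violates."""
    t = set(transaction)
    total = 0.0
    for body, head, confidence in rules:
        if set(body) <= t and not set(head) <= t:
            total += confidence  # body satisfied, consequent missing
    return total

rules = [({"bread"}, {"butter"}, 0.9),
         ({"tea"}, {"sugar"}, 0.6)]
print(score({"bread", "tea", "sugar"}, rules))  # 0.9 -- violates only the first rule
```

Sorting transactions by this score then gives the ranking the slide describes: the highest-scoring transactions are the first candidates for rejection or a warning.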
Slide 24: Data Cleaning Using Functional Dependencies
A functional dependency (FD) is an important feature for describing the relationship between attributes and candidate keys in tuples.
FD discovery can find too many FDs; used directly in a cleaning process, this can push the process toward NP running time => degrading the performance of the data cleaning.
A cleaning engine is therefore developed by combining an FD discovery technique with a data cleaning technique, and by using a feature from query optimization called the selectivity value to decrease the number of FDs discovered (pruning unlikely FDs).
Slide 26: SYSTEM ARCHITECTURE
Slide 27: SYSTEM ARCHITECTURE
Data collector:
• Retrieves data from a relational database, improves some aspects of data quality (corrects basic typos, invalid domains, and invalid formats), and prepares the data for the next module (in a relational format).
FD engine:
• An FD-finding module.
• Dirty data usually contains some errors => the approximate FD technique is used to remove errors and find FDs.
• The selectivity value technique is applied to rank the candidates in the pruning step, and only the candidates with high and low ranks are selected from the FD computation step.
• At the same time, any errors detected by this modified FD engine are suspicious tuples for cleaning.
• The errors can be separated into 2 types:
  o Errors from finding non-candidate-key FDs indicate inconsistent data.
  o Errors from finding candidate-key FDs indicate potentially duplicated data.
• The discovered FDs, together with all suspicious error tuples, are sent to the next step.
Slide 28: SYSTEM ARCHITECTURE
Cleaning engine:
Receives:
• the suspicious error tuples
• the FDs selected by the FD engine
Then:
• Assigns a weight to the data (high error produces a high weight); tuples with low weights are used to repair the high-weight tuples.
• FD repairing technique: after updating the weights, the engine uses the FDs to clean the data with a cost-based algorithm (low-cost data repairs high-cost data).
• Duplicate elimination: the last step finds duplicate data by improving the sorted-neighborhood-method algorithm: the candidate-key FD from the FD engine is used to assign the key, and the data is sorted on the attributes on the left-hand side of the FDs.
Relational database: the other modules store and retrieve their data through this module.
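The sorted-neighborhood idea used in the duplicate-elimination step can be sketched as: sort the records on a key (here, the candidate-key FD's left-hand side would supply it), then compare each record only against its few neighbors in a sliding window instead of all pairs. A minimal sketch; the key, window size, and equality test are illustrative assumptions:

```python
def sorted_neighborhood(records, key, window=3, same=None):
    """Sort records by key, slide a window over the sorted order, and
    return pairs of original indices flagged as duplicates."""
    if same is None:
        same = lambda a, b: key(a) == key(b)  # naive match on the sort key
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:  # only nearby records compared
            if same(records[i], records[j]):
                pairs.append((i, j))
    return pairs

people = [{"name": "ann"}, {"name": "bob"}, {"name": "ann"}]
print(sorted_neighborhood(people, key=lambda r: r["name"]))  # [(0, 2)]
```

The sort brings likely duplicates next to each other, so the quadratic all-pairs comparison shrinks to roughly window * n comparisons.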
Slide 29: SELECTING THE FD
The selectivity value is applied to rank the candidates in order to find the appropriate FDs.
1. Selectivity value
The selectivity value measures distribution. If the selectivity value of an attribute
• is high => the attribute's values are highly distributed;
• is low => the attribute's values are likely to be largely uniform.
A highly distributed attribute is potentially a candidate key and can be used to eliminate duplicates. The least distributed attributes can be used to repair distorted attribute values in the cleaning engine.
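Selectivity here can be read, as in query optimization, as the ratio of distinct values to total rows; a small sketch (function name and sample data are mine):

```python
def selectivity(rows, attr):
    """Distinct-value count divided by row count: near 1.0 means highly
    distributed (candidate-key-like), near 0.0 means nearly constant."""
    values = [row[attr] for row in rows]
    return len(set(values)) / len(values)

data = [{"id": i, "country": "DE"} for i in range(4)]
print(selectivity(data, "id"))       # 1.0  -- unique per row, candidate-key-like
print(selectivity(data, "country"))  # 0.25 -- a single repeated value
```

Ranking attributes by this number and keeping only the extremes is exactly the high/low pruning the following slides describe.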
Slide 30: SELECTING THE FD
2. Ranking the candidates
After calculating the selectivity values to determine the candidates' ranks, the ranks are sorted in ascending order. To choose potentially good candidates, a low-ranking threshold and a high-ranking threshold are defined as pruning points. The selected candidates are those with either high-ranking or low-ranking values:
• A high-ranking candidate has high selectivity and is potentially a candidate key.
• A low-ranking candidate is potentially an invariant value, which can be functionally determined by some attribute in a trivial manner; thus it can be taken as a non-candidate key on the right-hand side.
• The middle ranks are not precise, so they are ignored.
Slide 31: SELECTING THE FD
3. Improving the pruning step
The pruning step generates the candidate set by computing the candidates from level 1.
[Figure: pruning lattice example]
Slide 32: Improved pruning method
• Begins the pruning by getting the set of candidates at level 1 and then checking the candidates.
• If they are not FDs and fall in either the high or the low accepted ranking => the StoreCandidate function is used to store a new candidate built from candidate_x and candidate_y at the current level.
• Other candidates, in neither the low nor the high ranking, are ignored.
Slide 33: Results
50,000 real customer tuples were used as the data source, separated into 3 datasets:
o the first dataset has 10% duplicates,
o the second dataset has 10% errors,
o the last dataset has 10% duplicates and errors.
Results showed that this work can identify duplicates and anomalies with high recall and a low false-positive rate.
PROBLEM: the combined solution is sensitive to data size:
• As data volume increases, the discovery algorithm slows down.
• As the number of attributes increases, discovery creates more FD candidates and generates too many FDs, including noisy ones.
Slide 34: Strengths and Limitations of Data Quality Mining Methods
Association rules:
• Strengths: reduces the number of rules to generate for a transaction; avoids a severe pitfall of association rule mining.
• Limitation: it is difficult to generate association rules for all transactions.
Functional dependency:
• Strengths: easily identifies suspicious tuples for cleaning; decreases the number of functional dependencies discovered.
• Limitation: not suitable for large databases, because it is difficult to sort all the records.
Slide 35: Main References
1. Hamad, Mortadha M., and Alaa Abdulkhar Jihad. "An Enhanced Technique to Clean Data in the Data Warehouse." 2011 Developments in E-systems Engineering (2011).
2. Thakur, G., Singh, M., Pahwa, P., and Tyagi, N. (2011). "DWCLEANSER: A Framework for Approximate Duplicate Detection." Advances in Computing and Information Technology, pp. 355-364.
3. Natarajan, K., Li, J., and Koronios, A. (2010). "Data Mining Techniques for Data Cleaning." Engineering Asset Lifecycle Management, Springer London, pp. 796-804.
4. Kaewbuadee, K., Temtanapat, Y., and Peachavanish, R. (2006). "Data Cleaning Using Functional Dependency from Data Mining Process." International Journal on Computer Science and Information System (IADIS), vol. 1, no. 2, pp. 117-131, ISSN 1646-3692.
Slide 36: Questions?