Data Cleaning Techniques
Shahid Rajaee Teacher Training University
Faculty of Computer Engineering
PRESENTED BY:
Amir Masoud Sefidian
Today’s Lecture Content
•Introduction
•Enhanced Technique to Clean Data in the Data Warehouse
• DWCLEANSER: A Framework for Approximate Duplicate Detection
•Data Quality Mining
• Data Quality Mining With Association Rules
• Data Cleaning Using Functional Dependencies
Today’s Lecture Content
•Introduction
•Enhanced Technique to Clean Data in the Data Warehouse
• DWCLEANSER: A Framework for Approximate Duplicate Detection
• Data Quality Mining
• Data Quality Mining With Association Rules
• Data Cleaning Using Functional Dependencies
Introduction
• Data quality is a main issue in quality information management.
• Data quality problems can occur anywhere in information systems.
• These problems are addressed by Data Cleaning:
• Data cleaning is a process used to detect inaccurate, incomplete, or unreasonable data and then improve quality by correcting the detected errors => it reduces errors and improves data quality.
• Data cleaning can be a time-consuming and tedious process, but it cannot be ignored.
• Data quality criteria: accuracy, integrity, completeness, validity, consistency, schema conformance, uniqueness, etc.
Today’s Lecture Content
•Introduction
•Enhanced Technique to Clean Data in the Data Warehouse
• DWCLEANSER: A Framework for Approximate Duplicate Detection
•Data Quality Mining
• Data Quality Mining With Association Rules
• Data Cleaning Using Functional Dependencies
An Enhanced Technique to Clean Data in the Data Warehouse
• Uses a new algorithm that detects and corrects most of the expected error types and problems, such as lexical errors, domain format errors, irregularities, integrity constraint violations, duplicates, and missing values.
• Presents a solution that works on quantitative data and on any data with a limited set of values.
• Offers user interaction: the user selects the rules, the sources, and the desired targets (an illustrative sketch follows this list).
• The algorithm is able to clean the data completely, addressing all the mistakes and inconsistencies in the specified data or numerical values.
• The time taken to process huge data sets is less important than obtaining high-quality data, since a huge amount of data can be treated in a single pass.
• The main focus is on achieving good quality of the data.
• The pace of the algorithm is nevertheless adequate.
• It scales well to large amounts of data without significant degradation on most relative performance measures.
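The paper's own code is not reproduced on the slides. Purely as a hedged illustration of the kind of user-selected cleaning rules described above (domain-format checks, limited-value domains, missing values), the following minimal sketch uses invented rule names and fields, not the paper's:

```python
import re

# Illustrative rules only; the fields and checks are invented, not the paper's.
rules = {
    "age":   lambda v: v is not None and str(v).isdigit() and 0 < int(v) < 120,
    "grade": lambda v: v in {"A", "B", "C", "D", "F"},  # limited-value domain
    "email": lambda v: v is not None
             and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", str(v)) is not None,
}

def violations(record):
    """Return the names of the selected rules that this record violates."""
    return [field for field, ok in rules.items() if not ok(record.get(field))]

print(violations({"age": "200", "grade": "B", "email": "x@y.z"}))  # ['age']
```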
Flowchart of proposed technique
The proposed model can easily be developed in a data warehouse using the following algorithm:
The user selects any rules needed in the data cleaning system. The layout and descriptions of the data set fields are used in implementing the algorithm.
COMPARISON OF THE PROPOSED TECHNIQUE WITH SOME EXISTING TECHNIQUES
1009 records containing many anomalies were examined before and after processing by different available methods (such as statistics and clustering); the large difference in the number of anomalies confirms the effectiveness and quality of this algorithm.
Today’s Lecture Content
•Introduction
•Enhanced Technique to Clean Data in the Data Warehouse
• DWCLEANSER: A Framework for Approximate Duplicate Detection
•Data Quality Mining
• Data Quality Mining With Association Rules
• Data Cleaning Using Functional Dependencies
DWCLEANSER: A Framework for Approximate Duplicate Detection
• A novel framework for detection of exact as well as approximate duplicates in a data
warehouse.
• Decreases the complexity involved in the previously designed frameworks by providing
efficient data cleaning techniques.
• Provides comprehensive metadata support for the whole cleaning process.
• Provisions have also been suggested to take care of outliers and missing fields.
Existing Framework
The previously designed framework is a sequential, token-based framework that offers the fundamental services of data cleaning in six steps:
1) Selection of attributes:
Attributes are identified and selected for further processing in the following steps.
2) Formation of tokens:
The selected attributes are utilized to form tokens for similarity computation.
3) Clustering/Blocking of records:
The blocking/clustering algorithm is used to group the records based on the calculated similarity and block-token key.
4) Similarity computation for selected attributes:
The Jaccard similarity method is used for comparing token values of selected attributes in a field (see the sketch after this list).
5) Detection and elimination of duplicate records:
A rule-based detection and elimination approach is employed for detecting and eliminating the duplicates in a cluster or across many clusters.
6) Merge:
The cleansed data is combined and stored.
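Step 4 relies on Jaccard similarity between token sets. A minimal sketch of that computation; the whitespace tokenization shown is a simple assumption, not the framework's exact scheme from step 2:

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: |A intersect B| / |A union B| over two token sets."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

# Tokens formed from selected attributes, e.g. by whitespace splitting.
t1 = "JOHN A SMITH 42 OAK ST".split()
t2 = "JON A SMITH 42 OAK STREET".split()
print(jaccard(t1, t2))  # 0.5 -> 4 shared tokens out of 8 distinct tokens
```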
Proposed Framework: DWCLEANSER
1. Field Selection
• Records are decomposed into fields:
• Fields are analyzed to gather data about their type, relationships with other fields, key fields, and integrity constraints, so that enough metadata is available about the decomposed fields.
• Missing fields are stored in a separate temporary table and preserved in the repository along with their source record, relation name, data types, and integrity constraints.
• Missing fields are reviewed by the DBA to verify the reason for their existence:
(1) if the data is missing, it can be recaptured;
(2) if the value is not known, efforts can be made to gather the data to complete the record or to fill the missing field with a valid value.
If no valid data can be collected, the value is preserved in the repository for further verification and is not used in the cleaning procedure.
2. Computation of Rules
Certain rules are computed that will be utilized during the implementation of the cleaning process.
Threshold value:
The threshold value is calculated based on the experiments conducted in previous research.
Values lower than the thresholds increase the number of false positives.
Values above the thresholds are not able to detect all duplicates.
Values in between can be used to recognize approximate duplicates (see the sketch below).
Rules for classification of fields:
Selected fields are classified on the basis of their data types.
Rules for data quality attributes:
The previous framework only focused on 3 quality attributes of data: completeness, accuracy, and consistency.
2 other quality attributes are proposed in the new framework:
Validity:
Integrity:
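One plausible reading of the threshold rules above, as a sketch; the numeric cutoffs are placeholders, not the values from the cited experiments:

```python
LOW_T, HIGH_T = 0.4, 0.9  # placeholder thresholds, not the paper's values

def classify(match_score):
    """Classify a record pair by its similarity score against the two thresholds."""
    if match_score >= HIGH_T:
        return "exact duplicate"
    if match_score >= LOW_T:
        return "approximate duplicate"
    return "distinct"

print([classify(s) for s in (0.95, 0.60, 0.20)])
# ['exact duplicate', 'approximate duplicate', 'distinct']
```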
3. Formation of Clusters
• A recursive record-matching algorithm is used for initial cluster formation, with a slight modification:
• It is used for matching fields rather than whole records.
• Clusters are stored in a priority queue.
• Priorities of clusters in the queue are assigned on the basis of their ability to detect duplicate data sets.
• The cluster that detected the most recent match is assigned the highest priority.
4. Match Score
Match scores are assigned by applying the Smith-Waterman algorithm (an edit-distance-based strategy); a minimal sketch appears after this list.
The calculations done in this method are stored in a matrix.
5. Detection of Exact and Approximate Duplicates
When a new field is to be matched against any data set present in a cluster, a Union-Find structure is used.
If it fails to detect any match, the Smith-Waterman algorithm is employed.
6. Handling of Outliers and Missing Fields
Records that do not match any of the present clusters are called outliers or singleton records.
Singleton records may be stored in a separate file and preserved in the repository for future analysis and comparisons.
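A compact sketch of the Smith-Waterman local-alignment score used for match scoring in step 4; the scoring constants below are conventional defaults (an assumption), and the DP matrix H is the matrix the slide mentions:

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local-alignment score between two strings.
    The dynamic-programming calculations are kept in the matrix H."""
    H = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("MEYER", "MEIER"))  # 7: four matches (+8), one mismatch (-1)
```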
7. Updating Metadata/Repository:
Metadata and repositories are an integral part of the proposed framework.
Important components of the repository:
1. Data dictionary: stores information about the relations, their sources, schemas, etc.
2. Rules directory: stores all the calculated values of thresholds, quality attributes, matching scores, etc.
3. Log files: used to store:
• information about the selected fields and their source records;
• the classification of the fields by data type, explicitly under 3 categories: numeric, string, and character.
4. Outlier & missing field files: store the outliers and missing fields along with related information such as type and source relation.
Comparison of Existing and Proposed Framework
Today’s Lecture Content
•Introduction
•Enhanced Technique to Clean Data in the Data Warehouse
• DWCLEANSER: A Framework for Approximate Duplicate Detection
•Data Quality Mining
• Data Quality Mining With Association Rules
• Data Cleaning Using Functional Dependencies
Data Quality Mining
The data mining process:
• Involves data collection, cleaning the data, building a model, and monitoring the models.
• Automatically extracts hidden and intrinsic information from collections of data.
• Offers various techniques that are suitable for data cleaning.
Some commonly used data mining techniques:
Association rule mining:
• Takes an input and induces rules as output; the outputs can be association rules.
• Association rules describe relationships among large data sets and the co-occurrence of items.
Functional dependency:
Shows the connection and association between attributes: how one specific combination of values on one set of attributes determines one specific combination of values on another set.
Today’s Lecture Content
•Introduction
•Enhanced Technique to Clean Data in the Data Warehouse
• DWCLEANSER: A Framework for Approximate Duplicate Detection
•Data Quality Mining
• Data Quality Mining With Association Rules
• Data Cleaning Using Functional Dependencies
Data Quality Mining With Association Rules
Objective:
Association rules are used here to detect, quantify, explain, and correct data quality deficiencies in very large databases:
they find relationships among the items in a huge database and, in addition, improve data quality.
Association rule mining generates rules for all the transactions, which are checked by their confidence level.
The strength of all rules is found by the following steps:
• Determine the transaction type.
• Generate the association rules.
• Assign a score to each transaction based on the generated rules.
Score: the sum of the confidence values of the rules the transaction violates (sketch below).
A rule violation occurs when a tuple satisfies the rule body but not its consequent.
Idea: transactions with high scores are suspected of deficiencies.
A minimum confidence threshold is suggested to restrict the rule set in order to improve the results.
Transactions are sorted according to their score values.
Based on the score, the system decides whether to accept or reject the data, or else issue a warning.
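A minimal sketch of this scoring step; the two rules below are hand-written stand-ins for mined association rules, and the items and confidence values are invented:

```python
# Each rule: (body, consequent, confidence). Invented stand-ins for mined rules.
rules = [
    ({"diapers"}, {"baby_food"}, 0.8),
    ({"bread", "butter"}, {"milk"}, 0.6),
]

def score(transaction):
    """Sum the confidences of all rules the transaction violates:
    the rule body holds but the consequent does not."""
    t = set(transaction)
    return sum(conf for body, cons, conf in rules if body <= t and not cons <= t)

txs = [{"diapers", "beer"}, {"bread", "butter", "milk"}, {"diapers", "baby_food"}]
for tx in sorted(txs, key=score, reverse=True):  # most suspicious first
    print(sorted(tx), score(tx))
# ['beer', 'diapers'] 0.8  <- violates the first rule, flagged as suspicious
```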
Data Cleaning Using Functional Dependencies
A functional dependency (FD) is an important feature describing the relationship between attributes and candidate keys in tuples (a small sketch follows below).
FD discovery can find too many FDs; if used directly in a cleaning process, this can push the process toward NP time =>
degrading the performance of the data cleaning.
A cleaning engine is developed by combining:
FD discovery technique + data cleaning technique
+
a query-optimization feature called the Selectivity Value to decrease the number of FDs discovered (prune unlikely FDs).
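To make the FD notion concrete, a small sketch that checks whether an exact FD X → Y holds in a table; the relation and attribute names are invented for the example:

```python
def fd_holds(rows, lhs, rhs):
    """Check X -> Y: each combination of lhs values must map to a single
    combination of rhs values across all tuples."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False  # same X with a different Y => the FD is violated
    return True

rows = [
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10115", "city": "Berlin"},
    {"zip": "80331", "city": "Munich"},
]
print(fd_holds(rows, ["zip"], ["city"]))  # True: zip -> city holds here
```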
Today’s Lecture Content
•Introduction
•Enhanced Technique to Clean Data in the Data Warehouse
• DWCLEANSER: A Framework for Approximate Duplicate Detection
•Data Quality Mining
• Data Quality Mining With Association Rules
• Data Cleaning Using Functional Dependencies
SYSTEM ARCHITECTURE
Data collector
• Retrieves data from the relational database, improves some aspects of data quality (corrects basic typos, invalid domains, and invalid formats), and prepares the data for the next module (in a relational format).
FD engine
• An FD-finding module.
• Dirty data usually contains errors => the Approximate FD technique is used to handle errors and find FDs.
• Applies the selectivity value technique to rank the candidates in its pruning step and selects only the candidates with high or low rank from the FD computation step.
• At the same time, any errors detected by this modified FD engine are suspicious tuples for cleaning.
• The errors can be separated into 2 types:
o Errors from finding non-candidate-key FDs indicate inconsistent data.
o Errors from finding candidate-key FDs indicate potentially duplicated data.
• The discovered FDs, together with all suspicious error tuples, are sent to the next step.
SYSTEM ARCHITECTURE
Cleaning Engine:
Receives:
• suspicious error tuples
• the FDs selected by the FD engine
Then:
Weights are assigned to the data (more errors produce a higher weight).
Tuples with low weights are used to repair the high-weight tuples.
FD repairing technique:
After updating the weights, the engine applies the FDs to clean the data using a cost-based algorithm (low-cost data is used to repair high-cost data).
Duplicate elimination:
The last step finds duplicate data with an improved sorted neighborhood method: the candidate-key FDs from the FD engine are used to assign keys, and the data is sorted on the attributes on the left-hand side of the FDs (see the sketch below).
Relational database:
The other modules store and retrieve their data through this module.
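A minimal sketch of the sorted-neighborhood idea as described: sort on the left-hand-side attribute of a candidate-key FD, then compare each record only with its neighbors in a sliding window. The window size, similarity test, and attribute name here are assumptions for illustration:

```python
from difflib import SequenceMatcher

def sorted_neighborhood(rows, key_attr, window=3, threshold=0.85):
    """Sort rows on the FD's left-hand-side attribute, then flag pairs inside
    the sliding window whose key values are nearly identical."""
    rows = sorted(rows, key=lambda r: r[key_attr])
    pairs = []
    for i, r in enumerate(rows):
        for s in rows[i + 1 : i + window]:
            if SequenceMatcher(None, r[key_attr], s[key_attr]).ratio() >= threshold:
                pairs.append((r[key_attr], s[key_attr]))
    return pairs

rows = [{"name": "Jon Smith"}, {"name": "John Smith"}, {"name": "Mary Jones"}]
print(sorted_neighborhood(rows, "name"))  # [('John Smith', 'Jon Smith')]
```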
SELECTING THE FD
The selectivity value is applied to rank the candidates in order to find the appropriate FDs.
1 Selectivity value
The selectivity value determines how distributed an attribute's values are (a sketch of the computation follows below).
If the selectivity value of an attribute
• is high => the attribute's values are highly distributed.
• is low => the attribute's values are largely repeated.
A highly distributed attribute is potentially a candidate key and can be used to eliminate duplicates.
The least distributed attributes can be applied to repair distorted attribute values in the cleaning engine.
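Assuming the common query-optimization definition (an assumption here, since the slides do not give the formula): selectivity = number of distinct values / number of rows. A sketch:

```python
def selectivity(rows, attr):
    """Distinct-value ratio: near 1 => highly distributed (candidate-key-like);
    near 0 => the values mostly repeat."""
    values = [row[attr] for row in rows]
    return len(set(values)) / len(values)

rows = [
    {"id": 1, "country": "DE"},
    {"id": 2, "country": "DE"},
    {"id": 3, "country": "FR"},
]
print(selectivity(rows, "id"))       # 1.0   -> potential candidate key
print(selectivity(rows, "country"))  # ~0.67 -> lower distribution
```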
SELECTING THE FD
2 Ranking the candidates
After calculating the selectivity values to determine the ranks of the candidates, these ranks are sorted in ascending order.
To choose potentially good candidates:
A low-ranking threshold and a high-ranking threshold are defined as pruning points.
The selected candidates are those with either high-ranking or low-ranking values.
A high-ranking candidate has high selectivity and is potentially a candidate key.
A low-ranking candidate is potentially an invariant value which can be functionally determined by some attribute in a trivial manner; thus, it can be treated as a non-candidate key on the right-hand side.
The middle ranks are not precise, so they are ignored (see the sketch below).
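A sketch of the threshold-based partition just described; the cutoff values are placeholders, not the thresholds used in the cited work:

```python
def partition_candidates(selectivities, low_cut=0.2, high_cut=0.8):
    """Keep high-ranking (candidate-key-like) and low-ranking (invariant-like)
    attributes; middle ranks are dropped as imprecise."""
    high = [a for a, s in selectivities.items() if s >= high_cut]
    low = [a for a, s in selectivities.items() if s <= low_cut]
    return high, low

sel = {"customer_id": 0.99, "name": 0.55, "country": 0.03}
print(partition_candidates(sel))  # (['customer_id'], ['country'])
```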
SELECTING THE FD
3 Improving the pruning step:
The pruning step generates the candidate set at each level by computing the candidates from the previous level (level − 1).
Pruning lattice example
Improved pruning method
• Begins the pruning by getting the set of candidates at the previous level (level − 1) and then checks the candidates.
• If they are not yet an FD and fall in either the high or the low accepted ranking => the StoreCandidate function stores a new candidate formed from candidate_x and candidate_y at the current level.
• Other candidates, which are in neither the low nor the high ranking, are ignored.
Results
50,000 real customer tuples are used as a data source.
The dataset is separated into 3 sets, as follows:
o the first dataset has 10% duplicates,
o the second dataset has 10% errors,
o the last dataset has 10% duplicates and errors.
Results showed that this work can identify duplicates and anomalies with high recall and a low false-positive rate.
PROBLEM:
The combined solution is sensitive to data size:
• As the data volume increases => the discovery algorithm's speed decreases.
• As the number of attributes increases => the discovery creates more FD candidates and generates too many FDs, including noisy ones.
Strengths and Limitations of Data Quality Mining Methods:
Association rules:
• Reduces the number of rules to generate for a transaction.
• Avoids a severe pitfall of association rule mining.
• Limitation: it is difficult to generate association rules for all transactions.
Functional dependency:
• Easily identifies suspicious tuples for cleaning.
• Decreases the number of functional dependencies discovered.
• Limitation: not suitable for large databases, because it is difficult to sort all the records.
Main References:
1. Hamad, Mortadha M., and Alaa Abdulkhar Jihad (2011). An Enhanced Technique to Clean Data in the Data Warehouse. 2011 Developments in E-systems Engineering.
2. Thakur, G., Singh, M., Pahwa, P. and Tyagi, N. (2011). DWCLEANSER: A Framework for Approximate Duplicate Detection. Advances in Computing and Information Technology, pp. 355-364.
3. Natarajan, K., Li, J. and Koronios, A. (2010). Data Mining Techniques for Data Cleaning. Engineering Asset Lifecycle Management, Springer London, pp. 796-804.
4. Kaewbuadee, K., Temtanapat, Y. and Peachavanish, R. (2006). Data Cleaning Using Functional Dependency from Data Mining Process. International Journal on Computer Science and Information Systems (IADIS), vol. 1, no. 2, pp. 117-131, ISSN: 1646-3692.
QUESTIONS?