Data Preprocessing
Why Preprocess the Data
Data have quality if they satisfy the requirements of the intended use.
There are many factors comprising data quality, including accuracy,
completeness, consistency, timeliness, believability, and interpretability.
Three of these elements define the classical view of data quality:
accuracy, completeness, and consistency.
Inaccurate, incomplete, and inconsistent data are commonplace
properties of large real-world databases and data warehouses, and there
are many possible reasons for inaccurate data (i.e., data having
incorrect attribute values).
In short, real-world data tend to be dirty, incomplete, and inconsistent.
 Data preprocessing techniques can improve data quality, thereby
helping to improve the accuracy and efficiency of the subsequent
mining process.
Data preprocessing is an important step in the knowledge
discovery process, because quality decisions must be based on
quality data.
Detecting data anomalies, rectifying them early, and reducing the
data to be analyzed can lead to huge payoffs for decision making.
Major Tasks in Data Preprocessing
The major steps involved in data preprocessing are data cleaning,
data integration, data reduction, and data transformation.
Data cleaning routines work to “clean” the data by filling in
missing values, smoothing noisy data, identifying or removing
outliers, and resolving inconsistencies.
If users believe the data are dirty, they are unlikely to trust the
results of any data mining that has been applied.
An analysis often requires data from multiple sources. This involves
integrating multiple databases, data cubes, or files (i.e., data
integration).
Data reduction obtains a reduced representation of the data set that is much
smaller in volume, yet produces the same (or almost the same) analytical results.
Data reduction strategies include
dimensionality reduction
numerosity reduction
In dimensionality reduction, data encoding schemes are applied so as to obtain a
reduced or “compressed” representation of the original data.
Examples include
 data compression techniques (e.g., wavelet transforms and principal
components analysis),
attribute subset selection (e.g., removing irrelevant attributes),
 attribute construction (e.g., where a small set of more useful attributes is
derived from the original set).
In numerosity reduction, the data are replaced by alternative,
smaller representations using
parametric models (e.g., regression or log-linear models) or
nonparametric models (e.g., histograms, clusters, sampling, or
data aggregation).
Normalization, data discretization, and concept hierarchy generation
are forms of data transformation.
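For illustration, here is a minimal NumPy sketch of two such transformations, min-max normalization and a simple equal-width discretization; the nine price values are an assumption made purely for the example:

```python
import numpy as np

# Hypothetical price values (assumed for illustration).
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

# Min-max normalization to [0, 1]: v' = (v - min) / (max - min).
normalized = (prices - prices.min()) / (prices.max() - prices.min())

# Discretization into 3 equal-width intervals: each raw price is
# replaced by a coarser interval label (a crude concept hierarchy).
edges = np.linspace(prices.min(), prices.max(), num=4)  # 4 edges -> 3 bins
labels = np.digitize(prices, edges[1:-1])               # interval index 0-2

print(normalized.round(3))
print(labels)
```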
Forms of data preprocessing
Data Cleaning
Data cleaning (or data cleansing) routines attempt to fill in missing
values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
Basic methods for data cleaning include
ways of handling missing values
data smoothing techniques
1) Missing values
The following methods can be used to fill in the missing values of an
attribute:
i) Ignore the tuple:
 This is usually done when the class label is missing (assuming the
mining task involves classification).
This method is not very effective, unless the tuple contains several
attributes with missing values.
It is especially poor when the percentage of missing values per
attribute varies considerably.
By ignoring the tuple, we do not make use of the remaining
attributes’ values in the tuple.
Such data could have been useful to the task at hand.
ii) Fill in the missing value manually:
In general, this approach is time consuming and may not be feasible
given a large data set with many missing values.
iii) Use a global constant to fill in the missing value:
Replace all missing attribute values by the same constant, such as a
label like “Unknown” or −∞.
iv) Use a measure of central tendency for the attribute
(e.g., the mean or median) to fill in the missing value.
v) Use the attribute mean or median for all samples belonging to
the same class as the given tuple:
For example, if classifying customers according to credit risk, we
may replace the missing value with the mean income value for
customers in the same credit risk category as that of the given tuple.
vi) Use the most probable value to fill in the missing value:
This may be determined with regression, inference-based tools using a
Bayesian formalism, or decision tree induction.
Methods iii through vi bias the data, since the filled-in value may not
be correct. Method vi, however, is a popular strategy.
In comparison to the other methods, it uses the most information
from the present data to predict missing values.
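Below is a minimal pandas sketch of methods iii through v; the credit-risk table and all of its values are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical customer table with missing income values.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [52000, np.nan, 31000, np.nan, 48000],
})

# Method iii: fill with a global constant (a label such as "Unknown"
# would be the analogue for a categorical attribute).
by_constant = df["income"].fillna(-1)

# Method iv: fill with a measure of central tendency (mean or median).
by_mean = df["income"].fillna(df["income"].mean())

# Method v: fill with the mean of tuples in the same class (credit_risk).
by_class_mean = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)
print(by_class_mean.tolist())  # each gap gets its own class's mean income
```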
2) Noisy Data
Noise is a random error or variance in a measured variable.
Methods of data visualization can be used to identify outliers, which
may represent noise.
The following are data smoothing techniques:
Binning:
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it.
The sorted values are distributed into a number of “buckets,” or
bins.
Because binning methods consult the neighborhood of values, they
perform local smoothing.
Some common binning techniques are
smoothing by bin means
smoothing by bin medians
smoothing by bin boundaries
In smoothing by bin means, each value in a bin is replaced by the
mean value of the bin.
Alternatively, smoothing by bin medians can be employed, in which each
bin value is replaced by the bin median.
In smoothing by bin boundaries, the minimum and maximum values in a
given bin are identified as the bin boundaries. Each bin value is then
replaced by the closest boundary value.
For example, the data for price can first be sorted and then
partitioned into equal-frequency bins of size 3 (i.e., each bin contains
three values).
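Below is a minimal NumPy sketch of smoothing by bin means and by bin boundaries; the nine sorted price values are an assumption, borrowed from the commonly used textbook example:

```python
import numpy as np

# Sorted price data (assumed), split into equal-frequency bins of size 3.
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)
bins = prices.reshape(3, 3)

# Smoothing by bin means: every value becomes its bin's mean.
by_means = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: every value snaps to the closer of the
# bin's minimum or maximum.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()

print(by_means)    # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(by_bounds)   # [ 4.  4. 15. 21. 21. 24. 25. 25. 34.]
```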
Regression: Data smoothing can also be done by regression, a
technique that conforms data values to a function.
Linear regression involves finding the “best” line to fit two
attributes (or variables) so that one attribute can be used to predict the
other.
Multiple linear regression is an extension of linear regression, where
more than two attributes are involved and the data are fit to a
multidimensional surface.
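Below is a minimal NumPy sketch of regression-based smoothing, fitting a least-squares line and conforming each value to it; the data points are hypothetical:

```python
import numpy as np

# Hypothetical noisy measurements, roughly y = 2x.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8, 12.3])

a, b = np.polyfit(x, y, deg=1)   # best-fit slope and intercept
smoothed = a * x + b             # each y conformed to the fitted line
print(smoothed.round(2))
```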
Outlier analysis:
Outliers may be detected by clustering, for example, where similar
values are organized into groups, or “clusters.” Intuitively, values that
fall outside of the set of clusters may be considered outliers.
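Below is a minimal scikit-learn sketch of this idea; the values, the choice of k, and the rule that singleton clusters count as outliers are all assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D values with one obvious outlier (200).
values = np.array([10, 12, 11, 50, 52, 51, 200], dtype=float).reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)

# Values in very small clusters fall "outside" the main groups: flag them.
labels, counts = np.unique(km.labels_, return_counts=True)
tiny = labels[counts == 1]                        # size threshold assumed
print(values.ravel()[np.isin(km.labels_, tiny)])  # expected: [200.]
```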
Many data smoothing methods are also used for data discretization
(a form of data transformation) and data reduction.
3) Data Cleaning as a Process
The first step in data cleaning as a process is discrepancy detection.
Discrepancies can be caused by several factors, including poorly
designed data entry forms that have many optional fields, human error
in data entry, deliberate errors, and data decay (e.g., outdated
addresses).
The data should also be examined regarding unique rules,
consecutive rules, and null rules.
A unique rule says that each value of the given attribute must be
different from all other values for that attribute.
A consecutive rule says that there can be no missing values between
the lowest and highest values for the attribute, and that all values must
also be unique (e.g., as in check numbers).
A null rule specifies the use of blanks, question marks, special
characters, or other strings that may indicate the null condition and
how such values should be handled.
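Below is a minimal pandas sketch of checking these three kinds of rules on a hypothetical check_no column:

```python
import pandas as pd

# Hypothetical check numbers; should be unique, consecutive, and non-null.
checks = pd.Series([101, 102, 104, 104, None], name="check_no")

present = checks.dropna().astype(int)
dupes = present[present.duplicated()]     # unique rule violations
has_nulls = checks.isna().any()           # null rule decides their handling
expected = set(range(int(checks.min()), int(checks.max()) + 1))
gaps = sorted(expected - set(present))    # consecutive rule violations

print(dupes.tolist())   # [104] appears twice
print(has_nulls)        # True
print(gaps)             # [103] is missing from the sequence
```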
There are a number of different commercial tools that can aid in the
discrepancy detection step.
Data scrubbing tools use simple domain knowledge to detect errors
and make corrections in the data. These tools rely on parsing and
fuzzy matching techniques when cleaning data from multiple sources.
Data auditing tools find discrepancies by analyzing the data to
discover rules and relationships, and detecting data that violate such
conditions.
Data Integration
Data integration merges data from multiple data stores.
Sources may include multiple databases, data cubes, or flat files.
Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set.
This can help improve the accuracy and speed of the subsequent data
mining process.
How can we match schema and objects from different sources?
This is the essence of the entity identification problem.
i) Entity Identification Problem
There are a number of issues to consider during data integration.
Schema integration and object matching can be tricky.
For example,
How can the data analyst or the computer be sure that customer_id in
one database and cust_number in another refer to the same attribute?
Metadata for each attribute include the name, meaning, data type,
and range of values permitted for the attribute, and null rules for
handling blank, zero, or null values.
Such metadata can be used to help avoid errors in schema
integration.
The metadata may also be used to help transform the data (e.g.,
where data codes for pay type in one database may be “H” and
“S” but 1 and 2 in another).
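Below is a minimal pandas sketch of such metadata-driven recoding before a merge; the tables and the canonical vocabulary are assumptions:

```python
import pandas as pd

# Hypothetical sources: pay_type coded "H"/"S" in one and 1/2 in the other.
db1 = pd.DataFrame({"emp": ["a", "b"], "pay_type": ["H", "S"]})
db2 = pd.DataFrame({"emp": ["c", "d"], "pay_type": [1, 2]})

# Metadata-driven recoding to one canonical vocabulary before merging.
canonical = {"H": "hourly", "S": "salaried", 1: "hourly", 2: "salaried"}
db1["pay_type"] = db1["pay_type"].map(canonical)
db2["pay_type"] = db2["pay_type"].map(canonical)

merged = pd.concat([db1, db2], ignore_index=True)
print(merged)
```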
ii) Redundancy and Correlation Analysis
Redundancy is another important issue in data integration.
An attribute (such as annual revenue) may be redundant if it can be
“derived” from another attribute or set of attributes.
Inconsistencies in attribute or dimension naming can also cause
redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis.
Given two attributes, such analysis can measure how strongly one
attribute implies the other, based on the available data.
For nominal data, we use the χ² (chi-square) test.
For numeric attributes, we can use the correlation coefficient and
covariance, both of which assess how one attribute’s values vary from
those of another.
Correlation Test for Nominal Data
The Chi-Square test of independence is used to determine if there is
a significant relationship between two nominal (categorical) variables.
 The frequency of each category for one nominal variable is
compared across the categories of the second nominal variable.
For nominal data, a correlation relationship between two attributes,
A and B, can be discovered by a χ² (chi-square) test. Suppose A has
c distinct values, namely a1, a2, …, ac, and B has r distinct values,
namely b1, b2, …, br.
The χ² value (also known as the Pearson χ² statistic) is computed as

χ² = Σᵢ Σⱼ (oᵢⱼ − eᵢⱼ)² / eᵢⱼ   (i = 1, …, c; j = 1, …, r)

where oᵢⱼ is the observed frequency (actual count) of the joint event
(Aᵢ, Bⱼ) and eᵢⱼ is the corresponding expected frequency.
The χ² statistic tests the hypothesis that A and B are independent,
that is, that there is no correlation between them.
The test is based on a significance level, with (r − 1) × (c − 1)
degrees of freedom.
If the hypothesis can be rejected, then we say that A and B are
statistically correlated.
For example, suppose the two attributes gender and preferred reading
are summarized in a 2×2 contingency table. The expected frequency of
the joint event (Aᵢ, Bⱼ) is calculated as

eᵢⱼ = count(A = aᵢ) × count(B = bⱼ) / n

where n is the number of data tuples. The χ² value is then computed
from the observed and expected frequencies using the formula above.
For this 2×2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1.
For 1 degree of freedom, the χ² value needed to reject the hypothesis
at the 0.001 significance level is 10.828.
Since our computed χ² value is above this, we can reject the hypothesis
that gender and preferred reading are independent and conclude that
the two attributes are (strongly) correlated for the given group of
people.
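As a concrete check, here is a minimal SciPy sketch of the test; the observed counts are an assumption, borrowed from the classic gender versus preferred-reading textbook example rather than stated in these slides:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed 2x2 contingency table (counts assumed from the classic
# gender vs. preferred-reading example; n = 1500).
#               fiction  non_fiction
observed = np.array([[250,   50],     # male
                     [200, 1000]])    # female

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2))  # ~507.93, far above the 10.828 cutoff at alpha=0.001
print(dof)             # 1 degree of freedom for a 2x2 table
print(expected)        # e.g. e11 = 300 * 450 / 1500 = 90
```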
Covariance and Correlation
Covariance is a statistical measure that shows whether two
variables are related by measuring how the variables change in relation
to each other.
Correlation, like covariance, is a measure of how two variables
change in relation to each other, but it goes one step further than
covariance in that correlation tells how strong the relationship is.
For example, consider looking for trends at an ice cream shop, relating
temperature (x) to the number of customers (y):

Temperature (x)    No. of customers (y)
      98                   15
      87                   12
      90                   10
      85                   10
      95                   16
      75                    7
The mean of x is (98 + 87 + 90 + 85 + 95 + 75) / 6 = 88.33.
The mean of y is (15 + 12 + 10 + 10 + 16 + 7) / 6 = 11.67.
Subtract the respective mean from each value, and multiply the two
deviations together for each (x, y) pair.
Add all the products together, which yields the value 125.66.
The final step is to divide by (n − 1) = 6 − 1 = 5.
Applying the formula, 125.66 / 5 = 25.132.
The covariance of this data set is therefore 25.132.
The number is positive, so we can state that the two variables have a
positive relationship: as temperature rises, the number of customers in
the store also rises.
To find the strength of the relationship, we continue on to correlation.
The correlation coefficient divides the covariance by the product of the
two standard deviations:

r = cov(x, y) / (σx · σy)

This formula results in a number between −1 and 1, with
−1 being a perfect inverse correlation (the variables move in opposite
directions),
0 indicating no relationship between the two variables, and
1 being a perfect positive correlation (the variables reliably and
consistently move in the same direction as each other).
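Below is a minimal NumPy sketch that reproduces the worked example and then computes the correlation coefficient:

```python
import numpy as np

# The ice cream shop data from the worked example above.
temp      = np.array([98, 87, 90, 85, 95, 75], dtype=float)
customers = np.array([15, 12, 10, 10, 16,  7], dtype=float)

# Sample covariance: sum of products of deviations, divided by n - 1.
dev_products = (temp - temp.mean()) * (customers - customers.mean())
cov = dev_products.sum() / (len(temp) - 1)
print(round(cov, 3))   # 25.133 (the example's 25.132, up to rounding)

# Correlation scales covariance by both standard deviations -> in [-1, 1].
r = np.corrcoef(temp, customers)[0, 1]
print(round(r, 3))     # ~0.912: a strong positive relationship
```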
iii) Tuple duplication
In addition to detecting redundancies between attributes, duplication
should be detected at the tuple level (e.g., where there are two or more
identical tuples for a given unique data entry case).
iv) Data value conflict detection and resolution
For the same real-world entity, attribute values from different sources
may differ in representation, scaling, or encoding (e.g., a weight
attribute stored in metric units in one system and British imperial
units in another); such conflicts must be detected and resolved.
Data Reduction
Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume, yet
closely maintains the integrity of the original data.
Overview of Data Reduction Strategies
Data reduction strategies include
i) dimensionality reduction
ii) numerosity reduction
iii) data compression
i) Dimensionality reduction
It is the process of reducing the number of random variables or
attributes under consideration.
Dimensionality reduction methods include wavelet transforms and
principal components analysis, which transform or project the original
data onto a smaller space.
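Below is a minimal scikit-learn sketch of dimensionality reduction with principal components analysis; the data are random and purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 4-D data with one deliberately correlated column.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 2] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)

# Project onto the 2 strongest principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)               # shape (100, 2)
print(pca.explained_variance_ratio_)  # variance retained per component
```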
Attribute subset selection is a method of dimensionality reduction in
which irrelevant, weakly relevant, or redundant attributes or
dimensions are detected and removed.
Such methods can help improve accuracy and understanding of structure
in high-dimensional data.
Basic heuristic methods of attribute subset selection include the
following greedy (heuristic) techniques:
Stepwise forward selection:
The procedure starts with an empty set of attributes as the
reduced set.
The best of the original attributes is determined and added to the
reduced set.
At each subsequent iteration or step, the best of the remaining
original attributes is added to the set.
Stepwise backward elimination:
The procedure starts with the full set of attributes. At each step,
it removes the worst attribute remaining in the set.
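Below is a minimal sketch of stepwise forward selection; the scoring function (negative least-squares error) and the synthetic data are assumptions made for illustration, not a prescribed method:

```python
import numpy as np

def forward_select(X, y, score, k):
    # Greedy stepwise forward selection: start from an empty reduced set
    # and repeatedly add the best remaining attribute.
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = max(remaining, key=lambda j: score(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

def lstsq_score(Xs, y):
    # Hypothetical subset score: negative squared error of a linear fit.
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return -float(((Xs @ coef - y) ** 2).sum())

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = 3 * X[:, 1] - 2 * X[:, 4] + 0.1 * rng.normal(size=50)
print(forward_select(X, y, lstsq_score, k=2))  # likely picks columns 1 and 4
```

Stepwise backward elimination would run the same loop in reverse, starting from all attributes and dropping the one whose removal hurts the score least.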
ii) Numerosity reduction
Numerosity reduction techniques replace the original data volume with
alternative, smaller forms of data representation.
These techniques may be parametric or nonparametric.
For parametric methods, a model is used to estimate the data, so
that typically only the data parameters need to be stored, instead of the
actual data. (Outliers may also be stored.)
Regression and log-linear models are examples.
Nonparametric methods for storing reduced representations of the
data include histograms, clustering, sampling and data cube
aggregation.
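Below is a minimal NumPy sketch of two nonparametric reductions, simple random sampling without replacement and a histogram summary; the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=100, size=10_000)   # hypothetical price data

# Nonparametric: keep only a simple random sample without replacement.
sample = rng.choice(data, size=500, replace=False)

# Nonparametric: a 20-bucket histogram summarizes 10,000 values as 20 counts.
counts, edges = np.histogram(data, bins=20)

print(sample.mean().round(2), data.mean().round(2))  # sample tracks the data
print(counts[:5])                                    # first few bucket counts
```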
iii) Data compression
In data compression, transformations are applied to obtain a reduced or
“compressed” representation of the original data.
If the original data can be reconstructed from the compressed data
without any information loss, the data reduction is called lossless.
If, instead, we can reconstruct only an approximation of the original
data, then the data reduction is called lossy.
There are several lossless algorithms for string compression;
however, they typically allow only limited data manipulation.
Dimensionality reduction and numerosity reduction techniques can also
be considered forms of data compression.
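For instance, here is a minimal sketch of lossless string compression using Python's standard zlib module; the input bytes are hypothetical:

```python
import zlib

# Lossless compression: the original bytes are recovered exactly.
original = b"AAAABBBCCDAA" * 100           # repetitive data compresses well
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "->", len(compressed))  # reduced representation
print(restored == original)                  # True: no information loss
```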