Data Preprocessing
Why Preprocess the Data?
Data have quality if they satisfy the requirements of the intended use.
There are many factors comprising data quality, including accuracy,
completeness, consistency, timeliness, believability, and interpretability.
Accuracy, completeness, and consistency are three of the core elements
defining data quality.
Inaccurate, incomplete, and inconsistent data are commonplace
properties of large real-world databases and data warehouses, and there
are many possible reasons for inaccurate data (i.e., incorrect attribute
values).
In short, real-world data tend to be dirty, incomplete, and inconsistent.
 Data preprocessing techniques can improve data quality, thereby
helping to improve the accuracy and efficiency of the subsequent
mining process.
Data preprocessing is an important step in the knowledge
discovery process, because quality decisions must be based on
quality data.
Detecting data anomalies, rectifying them early, and reducing the
data to be analyzed can lead to huge payoffs for decision making.
Major Tasks in Data Preprocessing
Major steps involved in data preprocessing, namely, data
cleaning, data integration, data reduction, and data transformation
Data cleaning routines work to “clean” the data by filling in
missing values, smoothing noisy data, identifying or removing
outliers, and resolving inconsistencies.
If users believe the data are dirty, they are unlikely to trust the
results of any data mining that has been applied.
You may want to include data from multiple sources in your analysis.
This would involve integrating multiple databases, data cubes, or files
(i.e., data integration).
Data reduction obtains a reduced representation of the data set that is much
smaller in volume, yet produces the same (or almost the same) analytical results.
Data reduction strategies include
dimensionality reduction
numerosity reduction
In dimensionality reduction, data encoding schemes are applied so as to obtain a
reduced or “compressed” representation of the original data.
Examples include
 data compression techniques (e.g., wavelet transforms and principal
components analysis),
attribute subset selection (e.g., removing irrelevant attributes),
 attribute construction (e.g., where a small set of more useful attributes is
derived from the original set).
In numerosity reduction, the data are replaced by alternative,
smaller representations using
parametric models (e.g., regression or log-linear models) or
nonparametric models (e.g., histograms, clusters, sampling, or
data aggregation).
Normalization, data discretization, and concept hierarchy generation
are forms of data transformation.
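As a quick illustration of such transformations, the sketch below applies min-max and z-score normalization to a handful of made-up attribute values (the numbers are hypothetical, chosen only for readability):

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # hypothetical attribute values

# Min-max normalization: rescale values into the range [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: center at 0 with unit standard deviation
z_score = (values - values.mean()) / values.std()

print(min_max)  # [0.    0.125 0.25  0.5   1.   ]
print(z_score)
```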
Forms of data preprocessing
Data Cleaning
Data cleaning (or data cleansing) routines attempt to fill in missing
values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
Basic methods for data cleaning include:
ways of handling missing values
data smoothing techniques
i) Missing values
The following methods can be used to fill in the missing values of an
attribute.
i) Ignore the tuple:
 This is usually done when the class label is missing (assuming the
mining task involves classification).
This method is not very effective, unless the tuple contains several
attributes with missing values.
It is especially poor when the percentage of missing values per
attribute varies considerably.
By ignoring the tuple, we do not make use of the remaining
attributes’ values in the tuple.
Such data could have been useful to the task at hand.
ii) Fill in the missing value manually:
In general, this approach is time consuming and may not be feasible
given a large data set with many missing values.
iii) Use a global constant to fill in the missing value:
Replace all missing attribute values by the same constant, such as a
label like “Unknown” or −∞.
iv) Use a measure of central tendency for the attribute
(e.g., the mean or median) to fill in the missing value.
v) Use the attribute mean or median for all samples belonging to
the same class as the given tuple:
For example, if classifying customers according to credit risk, we
may replace the missing value with the mean income value for
customers in the same credit risk category as that of the given tuple.
vi) Use the most probable value to fill in the missing value:
This may be determined with regression, inference-based tools using
a Bayesian formalism, or decision tree induction.
Methods 3 through 6 bias the data—the filled-in value may not be
correct. Method 6, however, is a popular strategy.
In comparison to the other methods, it uses the most information
from the present data to predict missing values.
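A minimal pandas sketch of methods iii) through v) on a hypothetical customer table (the column names, classes, and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "high"],
    "income":      [52000, np.nan, 31000, np.nan, 28000],
})

# iii) Global constant: replace every missing value with one constant
const_fill = df["income"].fillna(-1)  # for nominal data, a label like "Unknown"

# iv) Central tendency: fill with the overall mean (or median)
mean_fill = df["income"].fillna(df["income"].mean())

# v) Class-conditional mean: fill with the mean income of tuples in the
#    same credit-risk class as the given tuple
class_mean_fill = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)
print(class_mean_fill)
```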
ii) Noisy Data
Noise is a random error or variance in a measured variable.
Methods of data visualization can be used to identify outliers, which
may represent noise
The following are data smoothing techniques.
Binning:
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it.
The sorted values are distributed into a number of “buckets,” or
bins.
Because binning methods consult the neighborhood of values, they
perform local smoothing.
The following illustrates some common binning techniques:
smoothing by bin means
smoothing by bin medians
smoothing by bin boundaries
In smoothing by bin means, each value in a bin is replaced by the
mean value of the bin.
smoothing by bin medians can be employed, in which each bin
value is replaced by the bin median.
In smoothing by bin boundaries, the minimum and maximum
values in a given bin are identified as the bin boundaries. Each bin
value is then replaced by the closest boundary value
As an example, suppose the data for price are first sorted and then
partitioned into equal-frequency bins of size 3 (i.e., each bin contains
three values).
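A small numpy sketch of smoothing by bin means and by bin boundaries, assuming the often-used sample price list 4, 8, 15, 21, 21, 24, 25, 28, 34 partitioned into three equal-frequency bins:

```python
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])  # already sorted
bins = prices.reshape(3, 3)  # equal-frequency bins of size 3

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: every value snaps to the nearer of
# the bin minimum and the bin maximum
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()

print(by_means)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(by_bounds)  # [ 4  4 15 21 21 24 25 25 34]
```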
Regression: Data smoothing can also be done by regression, a
technique that conforms data values to a function.
Linear regression involves finding the “best” line to fit two
attributes (or variables) so that one attribute can be used to predict the
other.
Multiple linear regression is an extension of linear regression, where
more than two attributes are involved and the data are fit to a
multidimensional surface.
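A minimal sketch of smoothing by linear regression with numpy, using two hypothetical attributes where x is used to predict y:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor attribute
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # noisy attribute to smooth

# Fit the "best" line y = a*x + b by least squares
a, b = np.polyfit(x, y, deg=1)

# Smoothed values: each y is replaced by its value on the fitted line
y_smoothed = a * x + b
print(y_smoothed)
```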
Outlier analysis:
Outliers may be detected by clustering, for example, where similar
values are organized into groups, or “clusters.” Intuitively, values that
fall outside of the set of clusters may be considered outliers
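As one concrete (assumed) realization of this idea, DBSCAN from scikit-learn groups dense values into clusters and labels points that belong to no cluster as -1, which can be read as outliers:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 1-D measurements with one value far from the rest
values = np.array([21, 22, 22, 23, 24, 25, 26, 95]).reshape(-1, 1)

# Points that fall in no dense cluster are labeled -1 (noise/outliers)
labels = DBSCAN(eps=3, min_samples=2).fit_predict(values)
print(labels)                        # the last point is labeled -1
print(values[labels == -1].ravel())  # [95]
```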
Many data smoothing methods are also used for data discretization
(a form of data transformation) and data reduction.
iii) Data Cleaning as a Process
The first step in data cleaning as a process is discrepancy detection.
Discrepancies can be caused by several factors, including poorly
designed data entry forms that have many optional fields, human error
in data entry, deliberate errors and data decay (e.g., outdated
addresses)
The data should also be examined regarding unique rules,
consecutive rules, and null rules.
A unique rule says that each value of the given attribute must be
different from all other values for that attribute.
A consecutive rule says that there can be no missing values between
the lowest and highest values for the attribute, and that all values must
also be unique (e.g., as in check numbers).
A null rule specifies the use of blanks, question marks, special
characters, or other strings that may indicate the null condition and
how such values should be handled
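A small pandas sketch of checking these three kinds of rules on a hypothetical column of check numbers (names and data are assumptions):

```python
import pandas as pd

checks = pd.Series([1001, 1002, 1004, 1004, None], name="check_no")
present = checks.dropna().astype(int)

# Unique rule: every value must differ from all other values
duplicates = present[present.duplicated(keep=False)].unique()

# Consecutive rule: no gaps between the lowest and highest values
gaps = sorted(set(range(present.min(), present.max() + 1)) - set(present))

# Null rule: count (and later decide how to handle) null indicators
null_count = checks.isna().sum()

print(duplicates, gaps, null_count)  # [1004] [1003] 1
```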
There are a number of different commercial tools that can aid in the
discrepancy detection step.
Data scrubbing tools use simple domain knowledge to detect errors
and make corrections in the data. These tools rely on parsing and
fuzzy matching techniques when cleaning data from multiple sources.
Data auditing tools find discrepancies by analyzing the data to
discover rules and relationships, and detecting data that violate such
conditions.
Data Integration
Data integration merges data from multiple data stores.
Sources may include multiple databases, data cubes, or flat files.
Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set.
This can help improve the accuracy and speed of the subsequent data
mining process
How can we match schema and objects from different sources?
This is the essence of the entity identification problem
i) Entity Identification Problem
There are a number of issues to consider during data integration.
Schema integration and object matching can be tricky
For example,
How can the data analyst or the computer be sure that customer id in
one database and cust_number in another refer to the same attribute?
Metadata for each attribute include the name, meaning, data type,
and range of values permitted for the attribute, and null rules for
handling blank, zero, or null values.
Such metadata can be used to help avoid errors in schema
integration.
The metadata may also be used to help transform the data (e.g.,
where data codes for pay type in one database may be “H” and
“S” but 1 and 2 in another)
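A hedged pandas sketch of using such metadata during integration, with hypothetical sources where the same attribute is named customer_id vs. cust_number and pay_type is coded "H"/"S" vs. 1/2:

```python
import pandas as pd

source_a = pd.DataFrame({"customer_id": [1, 2], "pay_type": ["H", "S"]})
source_b = pd.DataFrame({"cust_number": [3, 4], "pay_type": [1, 2]})

# Metadata tells us 1 -> "H" and 2 -> "S"; bring both sources to one coding
source_b["pay_type"] = source_b["pay_type"].map({1: "H", 2: "S"})

# Metadata also tells us customer_id and cust_number are the same attribute
source_b = source_b.rename(columns={"cust_number": "customer_id"})

combined = pd.concat([source_a, source_b], ignore_index=True)
print(combined)
```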
ii) Redundancy and Correlation Analysis
Redundancy is another important issue in data integration.
An attribute (such as annual revenue) may be redundant if it can be
“derived” from another attribute or set of attributes.
Inconsistencies in attribute or dimension naming can also cause
redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis.
Given two attributes, such analysis can measure how strongly one
attribute implies the other, based on the available data.
For nominal data, we use the χ² (chi-square) test.
For numeric attributes, we can use the correlation coefficient and
covariance, both of which assess how one attribute’s values vary from
those of another.
Correlation Test for Nominal Data
The Chi-Square test of independence is used to determine if there is
a significant relationship between two nominal (categorical) variables.
 The frequency of each category for one nominal variable is
compared across the categories of the second nominal variable.
For nominal data, a correlation relationship between two attributes,
A and B, can be discovered by a χ² (chi-square) test. Suppose A has
c distinct values, a1, a2, …, ac, and B has r distinct values,
b1, b2, …, br.
The χ² value (also known as the Pearson χ² statistic) is computed as

χ² = Σ (i = 1..c) Σ (j = 1..r) (o_ij − e_ij)² / e_ij

where o_ij is the observed frequency (actual count) of the joint event
(A = a_i, B = b_j) and e_ij is the expected frequency of that event.
The χ² statistic tests the hypothesis that A and B are independent,
that is, that there is no correlation between them.
The test is based on a significance level, with (r − 1) × (c − 1) degrees
of freedom.
If the hypothesis can be rejected, then we say that A and B are
statistically correlated.
For example, suppose a group of 1,500 people is surveyed on gender and
preferred reading (fiction vs. non-fiction). The observed counts, with
expected frequencies in parentheses, are:

            fiction        non-fiction     total
male        250 (90)       50 (210)        300
female      200 (360)      1,000 (840)     1,200
total       450            1,050           1,500

Each expected frequency is calculated as
e_ij = count(A = a_i) × count(B = b_j) / n;
for instance, e(male, fiction) = (300 × 450) / 1500 = 90.
Using the χ² formula, the computation is:

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840
   = 284.44 + 121.90 + 71.11 + 30.48 = 507.93
For this 2×2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1.
For 1 degree of freedom, the χ² value needed to reject the hypothesis
at the 0.001 significance level is 10.828.
Since our computed value, 507.93, is well above this, we can reject the
hypothesis that gender and preferred reading are independent and
conclude that the two attributes are (strongly) correlated for the given
group of people.
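The same test can be run with scipy; a sketch assuming the observed counts above (correction=False disables the Yates continuity correction so the result matches the hand computation):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = (male, female), columns = (fiction, non-fiction)
observed = np.array([[250, 50],
                     [200, 1000]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2))  # 507.93, far above the 10.828 cutoff at alpha = 0.001
print(dof)             # 1
print(expected)        # [[ 90. 210.] [360. 840.]]
```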
Covariance and Correlation
Covariance is a statistical measure that shows whether two
variables are related by measuring how the variables change in relation
to each other.
Correlation, like covariance, is a measure of how two variables
change in relation to each other, but it goes one step further than
covariance in that correlation tells how strong the relationship is.
For example, suppose we are looking for a trend at an ice cream shop,
comparing the temperature (x) with the number of customers (y):

Temperature (x)    No. of customers (y)
98                 15
87                 12
90                 10
85                 10
95                 16
75                  7
The mean of x is (98 + 87 + 90 + 85 + 95 + 75)/6 = 88.33, and the mean
of y is (15 + 12 + 10 + 10 + 16 + 7)/6 = 11.67.
Next, subtract the respective mean from each value and multiply the
paired deviations together. Adding all the products together yields
125.66.
The final step is to divide by n − 1 = 6 − 1 = 5, giving
125.66/5 = 25.132.
The covariance of this data set is therefore 25.132.
The number is positive, so we can state that the two variables have a
positive relationship: as the temperature rises, the number of customers
in the store also rises.
To find the strength of the relationship, we continue on to correlation.
The correlation coefficient divides the covariance by the product of the
two standard deviations: r = cov(X, Y) / (s_X × s_Y).
This formula results in a number between −1 and 1, with:
−1 being a perfect inverse correlation (the variables move in opposite
directions),
0 indicating no relationship between the two variables, and
1 being a perfect positive correlation (the variables reliably and
consistently move in the same direction as each other).
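A numpy sketch of the ice cream example end to end (sample covariance with n − 1, then the correlation coefficient):

```python
import numpy as np

temp = np.array([98.0, 87.0, 90.0, 85.0, 95.0, 75.0])
cust = np.array([15.0, 12.0, 10.0, 10.0, 16.0, 7.0])

# Sample covariance: sum of paired deviation products divided by n - 1
cov = ((temp - temp.mean()) * (cust - cust.mean())).sum() / (len(temp) - 1)

# Correlation: covariance divided by the product of standard deviations
r = cov / (temp.std(ddof=1) * cust.std(ddof=1))

print(round(cov, 3))  # ~25.133 (25.132 above comes from rounding earlier)
print(round(r, 3))    # ~0.912, a strong positive correlation
```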
iii) Tuple duplication: duplication should also be detected at the tuple
level, where two or more identical tuples exist for a single real-world
record (a common side effect of using denormalized tables).
iv) Data value conflict detection and resolution: for the same real-world
entity, attribute values from different sources may differ because of
differences in representation, scaling, or encoding (e.g., a weight
attribute stored in metric units in one system and British imperial
units in another).
Data Reduction
Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume, yet
closely maintains the integrity of the original Data
Overview of Data Reduction Strategies
Data reduction strategies include
i) dimensionality reduction
ii) numerosity reduction
iii) data compression
i) Dimensionality reduction
It is the process of reducing the number of random variables or
attributes under consideration.
Dimensionality reduction methods include wavelet transforms and
principal components analysis which transform or project the original
data onto a smaller space.
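A short scikit-learn sketch of principal components analysis on synthetic data (the data and component count are assumptions): five correlated attributes are projected onto two components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))  # 2 hidden factors drive the data
X = base @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(100, 5))

# Project the 5 original attributes onto a smaller 2-D space
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # almost all variance is retained
```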
Attribute subset selection is a method of dimensionality reduction in
which irrelevant, weakly relevant, or redundant attributes or
dimensions are detected and removed.
Attribute construction, in which a small set of more useful attributes
is derived from the original set, can likewise help improve accuracy and
the understanding of structure in high-dimensional data.
Basic heuristic methods of attribute subset selection include the
following greedy techniques (a code sketch follows the list):
Stepwise forward selection:
The procedure starts with an empty set of attributes as the
reduced set.
The best of the original attributes is determined and added to the
reduced set.
At each subsequent iteration or step, the best of the remaining
original attributes is added to the set.
Stepwise backward elimination:
The procedure starts with the full set of attributes. At each step,
it removes the worst attribute remaining in the set.
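A sketch of stepwise forward selection, assuming cross-validated accuracy as the "best attribute" criterion and scikit-learn's iris data as a stand-in data set; backward elimination would be the mirror image, starting from the full set and repeatedly dropping the attribute whose removal hurts the score least:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:
    # Score each remaining attribute added to the current reduced set
    scores = {a: cross_val_score(model, X[:, selected + [a]], y, cv=5).mean()
              for a in remaining}
    attr, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:  # stop when no attribute improves the score
        break
    selected.append(attr)
    remaining.remove(attr)
    best_score = score

print(selected, round(best_score, 3))  # indices of the chosen attributes
```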
ii) Numerosity reduction techniques
These techniques replace the original data volume by alternative,
smaller forms of data representation.
These techniques may be parametric or nonparametric.
For parametric methods, a model is used to estimate the data, so
that typically only the data parameters need to be stored, instead of the
actual data. (Outliers may also be stored.)
Regression and log-linear models are examples.
Nonparametric methods for storing reduced representations of the
data include histograms, clustering, sampling and data cube
aggregation.
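A numpy sketch of two nonparametric reductions, a histogram and simple random sampling, applied to 100,000 hypothetical price values:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=50.0, size=100_000)  # hypothetical prices

# Histogram: 10 bucket counts (plus 11 edges) stand in for 100,000 values
counts, edges = np.histogram(data, bins=10)

# Sampling: a 1% simple random sample without replacement
sample = rng.choice(data, size=1_000, replace=False)

print(counts)
print(round(sample.mean(), 2), round(data.mean(), 2))  # sample ~ full data
```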
iii) Data compression
In data compression, transformations are applied to obtain a reduced or
“compressed” representation of the original data.
If the original data can be reconstructed from the compressed data
without any information loss, the data reduction is called lossless.
If, instead, we can reconstruct only an approximation of the original
data, then the data reduction is called lossy.
There are several lossless algorithms for string compression;
however, they typically allow only limited data manipulation.
 Dimensionality reduction and numerosity reduction techniques can
also be considered forms of data compression