Data Preprocessing
Presented by
P.Veeralakshmi
M.C.A
Why Data Preprocessing?
• Data in the real world is dirty
• Incomplete:
Missing attribute values, which may come from
human/hardware/software problems
e.g., occupation=“ ”
• Noisy:
Errors or outliers, which may come from faulty data collection instruments
e.g., Salary=“-10”
Cont….
• Inconsistent:
Functional dependency violation
e.g., Age=“42” Birthday=“03/07/1997”
Major Tasks in Data Preprocessing
• Data cleaning
Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
• Data integration
Integration of multiple databases, data cubes, or files
• Data transformation
Normalization and aggregation
• Data reduction
Obtains reduced representation in volume but
produces the same or similar analytical results
• Data discretization
Part of data reduction but with particular importance,
especially for numerical data
Descriptive Data Summarization
• Descriptive data summarization techniques can be used
to identify which data values should be treated as noise
or outliers.
• Measures of central tendency include
1. mean
2. median
3. mode
4. midrange
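A minimal sketch of the four measures above, using Python's standard statistics module. The salary sample and the midrange helper are assumptions made for illustration only; they are not part of the original slides.

```python
from statistics import mean, median, multimode


def midrange(values):
    """Average of the smallest and largest values in the data set."""
    return (min(values) + max(values)) / 2


# Hypothetical salary sample (in thousands); 110 is a likely outlier.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

summary = {
    "mean": mean(salaries),        # arithmetic average
    "median": median(salaries),    # middle value of the sorted data
    "mode": multimode(salaries),   # most frequent value(s); may be several
    "midrange": midrange(salaries),
}
print(summary)
```

Comparing the mean and midrange with the median gives a quick hint about outliers: in this sample the single large value pulls the midrange well above the median.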
Graphic Displays of Basic Descriptive
Data Summaries
• Aside from the bar charts, pie charts, and line
graphs used in most statistical or graphical
data presentation, other popular displays include:
• Histogram
• Quantile plots
• q-q plots
• scatter plots
• loess curves.
Figures (not reproduced): histogram, quantile plot, q-q plot, loess curve
Data Cleaning
• Data cleaning (or data cleansing) routines
attempt to
• fill in missing values
• smooth out noise
• identify outliers
• correct inconsistencies.
Missing Values
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing value
5. Use the attribute mean for all samples belonging
to the same class as the given tuple
6. Use the most probable value to fill in the missing
value
Method 6, however, is a popular strategy, because it uses the most
information from the present data to predict the missing values.
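Methods 4 and 5 can be sketched as follows. This is only an illustration under stated assumptions: rows are plain dictionaries, a missing value is represented by None, and the names fill_with_class_mean, income and risk are hypothetical.

```python
from statistics import mean


def fill_with_class_mean(rows, attr, class_attr):
    """Return copies of rows where a missing attr (None) is replaced by the
    mean of attr over rows of the same class (method 5). If a class has no
    observed value, fall back to the overall attribute mean (method 4).
    Assumes at least one row has an observed value for attr."""
    observed = {}
    for row in rows:
        if row[attr] is not None:
            observed.setdefault(row[class_attr], []).append(row[attr])
    class_means = {cls: mean(vals) for cls, vals in observed.items()}
    overall_mean = mean(v for vals in observed.values() for v in vals)

    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are not modified
        if new_row[attr] is None:
            new_row[attr] = class_means.get(new_row[class_attr], overall_mean)
        filled.append(new_row)
    return filled


# Hypothetical customer tuples; income is in thousands.
customers = [
    {"income": 40, "risk": "low"},
    {"income": 60, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 20, "risk": "high"},
    {"income": None, "risk": "high"},
]
filled = fill_with_class_mean(customers, "income", "risk")
print([row["income"] for row in filled])
```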
Noisy Data
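1. Binning:
• The sorted values are distributed into a
number of “buckets,” or bins
ex: Bin = 4, 8, 15
• Smoothing by bin means: each value is replaced by the mean of its bin
Bin = 9, 9, 9
• Smoothing by bin boundaries: each value is replaced by the closest bin boundary
Bin = 4, 4, 15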
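The two smoothing variants can be sketched in a few lines. The first bin (4, 8, 15) comes from the slide; the remaining prices, the function names, and the choice to send ties to the lower boundary are assumptions made for illustration.

```python
def partition_equal_frequency(values, bin_size):
    """Sort the values and split them into bins of bin_size values each."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum.
    Ties go to the lower boundary (an arbitrary choice)."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = partition_equal_frequency(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```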
2. Regression
• Data can be smoothed by fitting the data to a
function.
• Linear regression involves finding the “best” line
to fit two attributes, so that one attribute can be
used to predict the other.
• Multiple linear regression is an extension of linear
regression in which more than two attributes are involved
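A small sketch of smoothing by simple linear regression, taking “best” to mean the least-squares line. The attribute names and the sample values are hypothetical, and the code assumes the x values are not all identical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for two attributes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: years of experience and salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [32, 35, 41, 43, 49]

slope, intercept = fit_line(years, salary)
# Smoothed values: each salary is replaced by the value the line predicts.
smoothed = [slope * x + intercept for x in years]
print(slope, intercept, smoothed)
```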
3. Clustering
• Outliers may be detected by clustering, where
similar values are organized into groups, or
“clusters.”
• The values that fall outside of the set of
clusters may be considered outliers
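As a deliberately simple stand-in for a real clustering algorithm, the sketch below groups sorted one-dimensional values whenever neighbouring values are close together, and reports values in very small groups as outliers. The max_gap and min_size parameters, the function names, and the sample ages are all assumptions; the input is assumed to be non-empty.

```python
def cluster_1d(values, max_gap):
    """Group sorted values; start a new cluster when the gap to the
    previous value exceeds max_gap."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def find_outliers(values, max_gap, min_size):
    """Values that end up in clusters smaller than min_size."""
    return [v for c in cluster_1d(values, max_gap) if len(c) < min_size for v in c]


ages = [21, 22, 23, 25, 40, 41, 43, 44, 95]
print(cluster_1d(ages, max_gap=3))
outliers = find_outliers(ages, max_gap=3, min_size=2)
print(outliers)
```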
SOME RULES
• The data should also be examined regarding:
• unique rules - each value of the given attribute must be
different from all other values of that attribute
• consecutive rules - there can be no missing values between
the lowest and highest values of the attribute
• null rules - a null rule specifies the use of
blanks, question marks, special characters, or other strings that indicate the null condition
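The three rules can be checked with short helper functions. This is only a sketch: the function names, the set of null markers, and the sample columns are assumptions, and the consecutive-rule check assumes integer-valued attributes.

```python
def violates_unique_rule(values):
    """True if any value occurs more than once."""
    return len(set(values)) != len(values)


def missing_in_sequence(values):
    """Integers absent between the lowest and highest value
    (violations of a consecutive rule)."""
    present = set(values)
    return [v for v in range(min(present), max(present) + 1) if v not in present]


# Assumed null markers; a real null rule would list the ones used in the data.
NULL_MARKERS = {"", "?", "N/A"}


def count_nulls(values, markers=NULL_MARKERS):
    """Count entries that are None or match one of the null markers."""
    return sum(1 for v in values if v is None or str(v).strip() in markers)


# Hypothetical columns.
check_numbers = [101, 102, 104, 104, 106]
occupations = ["clerk", "", "?", "engineer", None]

print(violates_unique_rule(check_numbers))
print(missing_in_sequence(check_numbers))
print(count_nulls(occupations))
```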
Data Integration
• Data integration combines data from
multiple sources into a coherent data store.
• Data integration issues and techniques:
• Schema integration
• Redundancy
• Correlation analysis (used to detect redundant attributes)
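One common form of correlation analysis for numeric attributes is the Pearson correlation coefficient; a value close to +1 or -1 suggests that one attribute is largely redundant given the other. The sketch below assumes that choice of measure, and the attribute names and values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes.
    Assumes both lists have the same length and neither is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


annual_revenue = [10, 20, 30, 40, 50]
monthly_revenue = [x / 12 for x in annual_revenue]  # derived, so redundant
store_visits = [3, 1, 4, 1, 5]                      # weakly related attribute

r_redundant = pearson(annual_revenue, monthly_revenue)
r_weak = pearson(annual_revenue, store_visits)
print(r_redundant, r_weak)
```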
Data Transformation
• In data transformation, the data are transformed
or consolidated into forms appropriate for
mining.
• Data transformation can involve the following:
• Smoothing - to remove noise from the data.
• Aggregation - summary or aggregation operations
are applied to the data.
• Ex: the daily sales data may be aggregated so as
to compute monthly and annual total amounts.
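The daily-to-monthly and daily-to-annual aggregation from the example can be sketched as below. The sales records and the aggregate helper are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales records: (day, amount).
daily_sales = [
    (date(2023, 1, 5), 120.0),
    (date(2023, 1, 20), 80.0),
    (date(2023, 2, 3), 200.0),
    (date(2024, 1, 10), 50.0),
]


def aggregate(sales, key):
    """Sum the amounts of all records that share the same key(day)."""
    totals = defaultdict(float)
    for day, amount in sales:
        totals[key(day)] += amount
    return dict(totals)


monthly = aggregate(daily_sales, key=lambda d: (d.year, d.month))
annual = aggregate(daily_sales, key=lambda d: d.year)
print(monthly)
print(annual)
```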
Cont….
• Generalization - low-level or “primitive” (raw)
data are replaced by higher-level concepts
through the use of concept hierarchies.
• Normalization - the attribute data are scaled
so as to fall within a small specified range,
such as -1.0 to 1.0, or 0.0 to 1.0.
• Attribute construction - new attributes are
constructed and added from the given set of
attributes.
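Normalization into a specified range is often done with min-max scaling; the sketch below assumes that method. The function name and the income values are hypothetical, and the code assumes the attribute is not constant (max greater than min).

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that the smallest maps to new_min
    and the largest maps to new_max."""
    low, high = min(values), max(values)
    scale = (new_max - new_min) / (high - low)
    return [(v - low) * scale + new_min for v in values]


incomes = [12000, 73600, 98000]

unit_range = min_max(incomes)                  # range 0.0 to 1.0
symmetric_range = min_max(incomes, -1.0, 1.0)  # range -1.0 to 1.0
print(unit_range)
print(symmetric_range)
```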
Data Reduction
• Data reduction techniques can be applied to obtain a
reduced representation of the data set that is much
smaller in volume.
1. Data cube aggregation
where aggregation operations are applied to the data
in the construction of a data cube.
2. Attribute subset selection
where irrelevant, weakly relevant, or redundant
attributes or dimensions may be detected and removed.
Cont…
3. Dimensionality reduction
where encoding mechanisms are used to
reduce the data set size.
4. Numerosity reduction
where the data are replaced or estimated by
alternative, smaller data representations such as
• clustering
• sampling
• histograms.
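Of the numerosity reduction options above, sampling is the simplest to sketch. The example below draws a simple random sample without replacement using the standard random module; the function name, the fixed seed, and the transaction identifiers are assumptions made for illustration.

```python
import random


def simple_random_sample(rows, n, seed=0):
    """Pick n distinct rows uniformly at random (without replacement).
    A fixed seed makes the sample reproducible."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


# Hypothetical data set of 1000 transaction identifiers.
transactions = list(range(1, 1001))

sample = simple_random_sample(transactions, 50)
print(len(sample), sample[:5])
```

The sample can then be mined in place of the full data set, at the cost of some accuracy in the results.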
Data Discretization
• Data discretization techniques can be used to
reduce the number of values for a given
continuous attribute by dividing the range of the
attribute into intervals.
• Binning
• Histogram Analysis
• Entropy-Based Discretization
• Interval Merging by χ² Analysis
• Cluster Analysis
• Discretization by Intuitive Partitioning
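As a sketch of the first technique in the list, the code below discretizes a continuous attribute by equal-width binning. The number of intervals, the function names, and the sample ages are assumptions; the attribute is assumed not to be constant.

```python
def equal_width_intervals(values, k):
    """Divide the range of the attribute into k intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / k
    return [(low + i * width, low + (i + 1) * width) for i in range(k)]


def discretize(value, intervals):
    """Index of the interval [low, high) that contains value.
    The maximum value is assigned to the last interval."""
    for index, (low, high) in enumerate(intervals):
        if low <= value < high:
            return index
    return len(intervals) - 1


ages = [5, 12, 18, 33, 47, 65]
intervals = equal_width_intervals(ages, 3)
labels = [discretize(a, intervals) for a in ages]
print(intervals)
print(labels)
```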
THANK YOU
