Data Preprocessing
By
S.Dinesh Babu
II MCA
Definition
• Data preprocessing is a data mining technique
that transforms raw data into an
understandable format.
• Real-world data is often dirty: incomplete, noisy, and inconsistent.
Measures for data quality: a multidimensional view
◦ Accuracy: are the recorded values correct?
◦ Completeness: are values missing (not recorded, unavailable, …)?
◦ Consistency: do records agree with one another (some modified but others not, dangling references, …)?
◦ Timeliness: is the data updated on time?
◦ Believability: how much do users trust that the data are correct?
◦ Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
Data Cleaning: Incomplete Data
• Data is not always available
• Example: Age = " " (blank)
• Missing data may be due to
◦ equipment malfunction
◦ values that were inconsistent with other recorded data and were therefore deleted
◦ data not entered due to misunderstanding
◦ certain data not being considered important at the
time of entry
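A minimal sketch of how a blank Age value like the one above might be detected and filled. The records, the helper names, and the choice of mean imputation are all assumptions made for illustration; the slides do not prescribe a particular way to handle missing values.

```python
from statistics import mean

# Hypothetical records; None and blank strings both stand for a missing Age.
records = [
    {"name": "A", "age": 25},
    {"name": "B", "age": None},
    {"name": "C", "age": " "},
    {"name": "D", "age": 35},
]

def is_missing(value):
    """Treat None and blank strings as missing values."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def fill_missing_with_mean(rows, field):
    """Return new rows where missing `field` values are replaced by the mean
    of the known values (one common strategy, assumed here)."""
    known = [row[field] for row in rows if not is_missing(row[field])]
    fill = mean(known)
    return [
        {**row, field: fill if is_missing(row[field]) else row[field]}
        for row in rows
    ]

cleaned = fill_missing_with_mean(records, "age")
print([row["age"] for row in cleaned])
```

Other strategies, such as dropping the record or filling with a constant, fit the same shape: only the `fill` computation changes.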
Noisy Data
• Noisy data is corrupted or meaningless data, often unstructured.
• It increases the amount of storage space needed.
Causes:
• Hardware failure
• Programming errors
Data Cleaning as a Process
• Missing values, noise, and inconsistencies contribute to
inaccurate data.
• The first step in data cleaning as a process is
discrepancy detection.
• Discrepancies can be caused by several factors, including:
◦ poorly designed data entry forms
◦ human error in data entry
The data should also be examined against the following rules:
◦ Unique rule:
Each value of the given attribute must be different from all other values
of that attribute.
◦ Consecutive rule:
There must be no missing values between the lowest and highest values of the
attribute.
◦ Null rule:
Specifies the use of blanks, question marks, or special
characters to indicate a null (missing) value.
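The three rules can be checked mechanically during discrepancy detection. The sketch below is one possible reading, assuming integer-valued attributes for the consecutive rule and a hypothetical set of null markers; the function names are illustrative and do not come from any library.

```python
# Assumed null convention for this example: blanks, "?" and "N/A".
NULL_MARKERS = {"", "?", "N/A"}

def unique_rule_violations(values):
    """Return the set of values that occur more than once."""
    seen, duplicates = set(), set()
    for value in values:
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return duplicates

def consecutive_rule_violations(values):
    """Return the integers missing between the lowest and highest value."""
    present = set(values)
    return [n for n in range(min(present), max(present) + 1) if n not in present]

def null_positions(values):
    """Return the indexes whose value is None or matches a null marker."""
    return [
        i for i, value in enumerate(values)
        if value is None or str(value).strip() in NULL_MARKERS
    ]

check_numbers = [101, 102, 102, 105]
print(unique_rule_violations(check_numbers))       # duplicated values
print(consecutive_rule_violations(check_numbers))  # gaps in the sequence
print(null_positions(["a", "?", None, " "]))       # positions holding nulls
```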
Data Integration
• Data integration is the merging of data from multiple data stores.
• Careful integration can help reduce and avoid redundancies and
inconsistencies.
• This can improve the accuracy and speed of the subsequent
data mining process.
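As a rough illustration of merging two data stores while flagging inconsistencies, the sketch below joins records on a shared key and reports fields whose values disagree. The store contents, the `id` key, and the keep-the-first-value policy are assumptions chosen for the example.

```python
def integrate(store_a, store_b, key):
    """Merge rows from two stores by `key`.

    Returns (merged, conflicts), where conflicts lists
    (key value, field, kept value, rejected value) tuples.
    The first value seen is kept: an assumed policy for this sketch.
    """
    merged = {}
    conflicts = []
    for row in store_a + store_b:
        target = merged.setdefault(row[key], {})
        for field, value in row.items():
            if field in target and target[field] != value:
                conflicts.append((row[key], field, target[field], value))
            else:
                target[field] = value
    return merged, conflicts

# Hypothetical data stores describing the same customers.
store_a = [{"id": 1, "city": "Chennai"}, {"id": 2, "city": "Madurai"}]
store_b = [{"id": 1, "city": "Chennai", "age": 30}, {"id": 2, "city": "Salem"}]

merged, conflicts = integrate(store_a, store_b, "id")
print(merged)
print(conflicts)
```

Redundant fields collapse into a single entry per key, and the conflict list gives a starting point for resolving inconsistencies by hand.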
Data Reduction
• Data reduction obtains a reduced representation of the data set that is
much smaller in volume.
Strategies for data reduction include the following:
• Data cube aggregation, where aggregation operations
are applied to the data in the construction of a data cube.
• Attribute subset selection, where irrelevant, weakly
relevant, or redundant attributes or dimensions are
detected and removed.
• Dimensionality reduction, where encoding mechanisms are
used to reduce the data set size.
• Numerosity reduction, where the data are replaced or
estimated by alternative, smaller data representations such as
◦ parametric models
◦ nonparametric methods such as clustering, sampling,
and the use of histograms.
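Two of the nonparametric numerosity-reduction methods named above, sampling and histograms, can be sketched with the standard library alone. The price list, the bucket width, and the fixed seed are assumptions made for illustration.

```python
import random
from collections import Counter

def simple_random_sample(data, n, seed=0):
    """Draw n items without replacement; the seed is fixed for repeatability."""
    rng = random.Random(seed)
    return rng.sample(data, n)

def equal_width_histogram(data, bucket_width):
    """Map each bucket's lower bound to the number of values falling in it."""
    return Counter((x // bucket_width) * bucket_width for x in data)

# Hypothetical attribute values (e.g. item prices).
prices = [1, 1, 5, 5, 5, 8, 10, 10, 14, 15, 18, 20, 21, 25, 28, 30]

sample = simple_random_sample(prices, 4)        # 16 values reduced to 4
histogram = equal_width_histogram(prices, 10)   # 16 values reduced to bucket counts

print(sample)
print(dict(histogram))
```

In both cases the original values are replaced by a smaller representation: a subset of rows, or a handful of bucket counts.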
Data Transformation
• In data transformation, the data are transformed or
consolidated into forms appropriate for mining.
Data transformation can involve the following:
• Smoothing: remove noise from the data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: attribute values are scaled to fall within a small,
specified range
◦ e.g., min-max normalization
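Min-max normalization maps a value v to (v - min) / (max - min) * (new_max - new_min) + new_min. A minimal sketch follows; the income figures and the default target range [0.0, 1.0] are assumptions for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min
    and the largest maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to the lower bound
        # to avoid dividing by zero.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

# Hypothetical income values.
incomes = [12000, 73600, 98000]
normalized = min_max_normalize(incomes)
print(normalized)
```

The middle value lands at roughly 0.716, since (73600 - 12000) / (98000 - 12000) = 61600 / 86000.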
Data Discretization
• Discretization: divide the range of a continuous attribute
into intervals
◦ Interval labels can then be used to replace actual data
values
◦ Discretization reduces data size
◦ Methods either split (top-down) or merge (bottom-up) intervals
◦ Discretization can be performed recursively on an
attribute
◦ It prepares the data for further analysis, e.g., classification
• Three types of attributes
◦ Nominal: values from an unordered set, e.g., color, profession
◦ Ordinal: values from an ordered set, e.g., military or academic
rank
◦ Numeric: quantitative values, e.g., integers or real numbers
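The idea of replacing numeric values with interval labels can be sketched with equal-width binning, one simple top-down way to split a range. The slides do not name a specific discretization method, so the technique, the age values, and the label format are assumptions.

```python
def equal_width_bins(values, k):
    """Replace each numeric value with the label of one of k equal-width
    intervals covering [min, max]. The last interval is closed on the right
    so that the maximum value is included."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [f"[{lo:g}, {hi:g}]" for _ in values]
    width = (hi - lo) / k
    labels = []
    for v in values:
        index = min(int((v - lo) / width), k - 1)  # clamp the max into the last bin
        left = lo + index * width
        closing = "]" if index == k - 1 else ")"
        labels.append(f"[{left:g}, {left + width:g}{closing}")
    return labels

# Hypothetical values of a continuous attribute (age).
ages = [13, 22, 35, 47, 58, 73]
print(equal_width_bins(ages, 3))
```

Six distinct ages collapse into three interval labels, which a classifier can then treat as an ordinal attribute.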
