Data Preprocessing
By
S.Dinesh Babu
II MCA
Definition
• Data preprocessing is a data mining technique that transforms raw data into an understandable format.
• Real-world data is often dirty: incomplete, noisy, and inconsistent.
Measures of data quality: a multidimensional view
◦ Accuracy: are the values correct?
◦ Completeness: are values missing because they were not recorded or were unavailable?
◦ Consistency: do related values agree, or were some modified and others not, leaving dangling references?
◦ Timeliness: are the data updated in time?
◦ Believability: how much are the data trusted to be correct?
◦ Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
Data Cleaning: Incomplete Data
• Data is not always available.
• Example: Age = " " (blank value)
• Missing data may be due to:
◦ equipment malfunction
◦ values that were inconsistent with other recorded data and were therefore deleted
◦ data not entered because of a misunderstanding
◦ data not considered important at the time of entry
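A minimal sketch of how incomplete data might be detected, assuming records are held as Python dictionaries and that `None` or a blank string marks a value that was not recorded. The records and the helper names `is_missing` and `missing_report` are hypothetical and only illustrate the idea.

```python
# Hypothetical records; None and blank strings both stand for "not recorded".
records = [
    {"name": "A", "age": "34", "city": "Chennai"},
    {"name": "B", "age": " ", "city": "Madurai"},
    {"name": "C", "age": None, "city": ""},
    {"name": "D", "age": "41", "city": "Salem"},
]


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def missing_report(rows):
    """Return {attribute: fraction of rows in which the value is missing}."""
    attributes = sorted({key for row in rows for key in row})
    report = {}
    for attr in attributes:
        missing = sum(1 for row in rows if is_missing(row.get(attr)))
        report[attr] = missing / len(rows)
    return report


report = missing_report(records)
print(report)  # {'age': 0.5, 'city': 0.25, 'name': 0.0}
```

A report like this only measures completeness; what to do with the gaps (drop the record, fill in a value, or leave it) is a separate decision.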
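Noisy Data
• Noisy data is corrupt or meaningless data that programs cannot interpret correctly, such as unstructured text.
• It unnecessarily increases the amount of storage space required.
• Causes:
◦ Hardware failure
◦ Programming errors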
Data Cleaning as a Process
• Missing values, noise, and inconsistencies contribute to inaccurate data.
• The first step in data cleaning as a process is discrepancy detection.
• Discrepancies can be caused by several factors, including:
◦ poorly designed data entry forms
◦ human error in data entry
The data should also be examined against the following rules:
◦ Unique rule: each value of the given attribute must be different from all other values of that attribute.
◦ Consecutive rule: there can be no missing values between the lowest and highest values of the attribute.
◦ Null rule: specifies how blanks, question marks, special characters, or other strings are used to indicate a null (missing) value.
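The three rules above can be sketched as simple checks. This is an illustration under stated assumptions: the attribute values are held in Python lists, the consecutive rule is applied to a non-empty list of integers, and the set `NULL_MARKERS` is an assumed null convention, not one given in the slides. All function names are hypothetical.

```python
NULL_MARKERS = {"", "?", "N/A", "-"}  # assumed null conventions for this example


def violates_unique(values):
    """Unique rule: return the values that occur more than once."""
    seen, dupes = set(), set()
    for v in values:
        if v in seen:
            dupes.add(v)
        seen.add(v)
    return sorted(dupes)


def violates_consecutive(values):
    """Consecutive rule: return integers missing between the min and max."""
    present = set(values)
    return [n for n in range(min(present), max(present) + 1) if n not in present]


def null_positions(values):
    """Null rule: return the indexes whose value matches a null marker."""
    return [
        i for i, v in enumerate(values)
        if v is None or str(v).strip() in NULL_MARKERS
    ]


check_numbers = [101, 102, 104, 104, 106]
phones = ["555-0100", "?", "", "555-0101", None]

print(violates_unique(check_numbers))       # [104]
print(violates_consecutive(check_numbers))  # [103, 105]
print(null_positions(phones))               # [1, 2, 4]
```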
Data Integration
• Data integration is the merging of data from multiple data stores.
• Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set.
• This can improve the accuracy and speed of the subsequent data mining process.
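As a rough sketch of integration, the example below merges two hypothetical data stores (`crm` and `billing`, both keyed by a customer id) into one. Shared attributes are stored once to avoid redundancy, and disagreements are recorded as inconsistencies. Keeping the primary store's value on conflict is an assumption chosen for the example, not a rule from the slides.

```python
# Two hypothetical data stores describing overlapping sets of customers.
crm = {
    1: {"name": "Asha", "city": "Chennai"},
    2: {"name": "Ravi", "city": "Madurai"},
}
billing = {
    2: {"name": "Ravi", "city": "Salem", "balance": 120.0},
    3: {"name": "Meena", "city": "Erode", "balance": 0.0},
}


def integrate(primary, secondary):
    """Merge two keyed stores and return (merged, conflicts).

    Attributes present in both stores are kept once. When the stores
    disagree, the primary value wins and the disagreement is recorded
    as (key, attribute, primary_value, secondary_value).
    """
    merged, conflicts = {}, []
    for key in sorted(set(primary) | set(secondary)):
        row = dict(secondary.get(key, {}))
        for attr, value in primary.get(key, {}).items():
            if attr in row and row[attr] != value:
                conflicts.append((key, attr, value, row[attr]))
            row[attr] = value
        merged[key] = row
    return merged, conflicts


merged, conflicts = integrate(crm, billing)
print(merged[2])   # {'name': 'Ravi', 'city': 'Madurai', 'balance': 120.0}
print(conflicts)   # [(2, 'city', 'Madurai', 'Salem')]
```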
Data Reduction
• Data reduction obtains a reduced representation of the data set that is much smaller in volume while closely maintaining the integrity of the original data.
Strategies for data reduction include the following:
• Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
• Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.
• Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
• Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations, such as:
◦ parametric models
◦ nonparametric methods such as clustering, sampling, and histograms
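Two of the nonparametric numerosity reduction methods named above, sampling and histograms, can be sketched with the standard library. The price list is made-up illustration data, the function names are hypothetical, and the equal-width bucketing assumes the values are not all identical.

```python
import random


def simple_random_sample(values, n, seed=0):
    """Draw a simple random sample of n values without replacement."""
    return random.Random(seed).sample(values, n)


def equal_width_histogram(values, bins):
    """Summarize values as a list of (low, high, count) equal-width buckets."""
    low, high = min(values), max(values)
    width = (high - low) / bins  # assumes high > low
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bucket.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i]) for i in range(bins)]


prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 12, 14, 14, 15, 15,
          15, 18, 18, 20, 20, 21, 25, 25, 28, 30]

sample = simple_random_sample(prices, 5)      # 5 values stand in for 25
histogram = equal_width_histogram(prices, 3)  # 3 buckets stand in for 25 values
print([count for _, _, count in histogram])   # [10, 10, 5]
```

Either result is much smaller than the original list: the sample keeps a few actual values, while the histogram keeps only bucket boundaries and counts.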
Data Transformation
• In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Data transformation can involve the following:
• Smoothing: removing noise from the data
• Aggregation: summarization and data cube construction
• Generalization: replacing low-level values with higher-level concepts by climbing a concept hierarchy
• Normalization: scaling values to fall within a small, specified range
◦ e.g., min-max normalization
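Min-max normalization maps a value v from the original range [min, max] to a new range [new_min, new_max] using v' = (v - min) / (max - min) * (new_max - new_min) + new_min. The sketch below applies that formula; the income figures are made-up illustration data, and mapping a constant attribute to the lower bound is a convention chosen here to avoid division by zero.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values identical: map them to the lower bound (assumed convention).
        return [new_min for _ in values]
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


incomes = [12000, 73600, 98000]
normalized = min_max_normalize(incomes)
print([round(x, 3) for x in normalized])  # [0.0, 0.716, 1.0]
```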
Data Discretization
• Discretization: divide the range of a continuous attribute into intervals
◦ Interval labels can then be used to replace the actual data values
◦ Discretization reduces data size
◦ It can proceed by splitting (top-down) or merging (bottom-up)
◦ It can be performed recursively on an attribute
◦ It prepares the data for further analysis, e.g., classification
• Three types of attributes
◦ Nominal: values from an unordered set, e.g., color, profession
◦ Ordinal: values from an ordered set, e.g., military or academic rank
◦ Numeric: quantitative values, e.g., integers or real numbers
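To illustrate replacing the values of a numeric attribute with interval labels, here is a minimal sketch using equal-width binning. Equal-width binning is one simple choice made for this example; the slides do not prescribe a particular discretization method. The ages are made-up data and the function name is hypothetical.

```python
def equal_width_discretize(values, bins):
    """Replace each numeric value with the label of its equal-width interval."""
    low, high = min(values), max(values)
    width = (high - low) / bins  # assumes high > low
    edges = [low + i * width for i in range(bins + 1)]
    labels = []
    for v in values:
        # The maximum value is placed in the last (closed) interval.
        index = min(int((v - low) / width), bins - 1)
        closing = "]" if index == bins - 1 else ")"
        labels.append(f"[{edges[index]:g}, {edges[index + 1]:g}{closing}")
    return labels


ages = [13, 15, 22, 35, 40, 52, 70, 73]
age_labels = equal_width_discretize(ages, 3)
print(age_labels)
# ['[13, 33)', '[13, 33)', '[13, 33)', '[33, 53)', '[33, 53)', '[33, 53)',
#  '[53, 73]', '[53, 73]']
```

Eight distinct ages are reduced to three labels, which a classifier can then treat as an ordinal attribute.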
Thank You
