SlideShare a Scribd company logo
1 of 11
S.PARANIPUSHPA
DEPARTMENT OF CS AND IT
NADAR SARASWATHI COLLEGE OF ARTS AND
SCIENCE
 Invalid values: Some datasets have well-known
values, e.g. gender must only have “F” (Female)
and “M” (Male). In this case it’s easy to detect
wrong values.
 Formats: The most common issue. It’s possible to
get values in different formats like a name
written as “Name, Surname” or “Surname,
Name”.
 Attribute dependencies: When the value of a
feature depends on the value of another feature.
For example, if we have some school data, the
“number of students” is related to whether the
person “is teacher?”. If someone is not a teacher
he/she can’t have any students.
 Uniqueness: It’s possible to find repeated
data in features that only allow unique values.
For example, we can’t have two products with
the same identifier.
 Missing values: Some features in the dataset
may have blank or null values.
 Misspellings: Incorrectly written values.
 Misfielded values: When a feature contains the
values of another.
 Visualisation: Visualising all the values of each
feature, or taking a random sample to see if it’s
right.
 Outlier analysis: Analysing if data can be a
human error. E.g. a 300 year old person in the
“age” feature.
 Validation code: It’s possible to create a code
that checks if the data is right. For example, in
uniqueness, checking if the length of the data is
the same as the length of the vector of unique
values.
 We can apply many methods to fix the different
 Indicator variables: This technique converts
categorical data into boolean values by
creating indicator variables. If we have more
than two values (n) we have to create n-1
columns.
 Data Binning or Bucketing: A pre-processing
technique used to reduce the effects of
minor observation errors. The sample is
divided into intervals and replaced by
categorical values.
 Centering & Scaling: We can Centre the data
of one feature by substracting the mean to
all values. To scale the data, we
should divide the centered feature by the
standard deviation:

 Other techniques: For example, we can
group the outliers with the same value or
replace the value with the number of times
that it appears in the feature:
THANK
YOU

More Related Content

What's hot

Relational Databases 2
Relational Databases 2Relational Databases 2
Relational Databases 2
Jason Hando
 

What's hot (20)

E r diagram
E r diagramE r diagram
E r diagram
 
ER-Model-ER Diagram
ER-Model-ER DiagramER-Model-ER Diagram
ER-Model-ER Diagram
 
Entity Relationship Diagrams
Entity Relationship DiagramsEntity Relationship Diagrams
Entity Relationship Diagrams
 
Entity Relationship Diagram
Entity Relationship DiagramEntity Relationship Diagram
Entity Relationship Diagram
 
Entity relation(1)
Entity relation(1)Entity relation(1)
Entity relation(1)
 
ER MODEL
ER MODELER MODEL
ER MODEL
 
ER Model in DBMS
ER Model in DBMSER Model in DBMS
ER Model in DBMS
 
Entity relationship modelling
Entity relationship modellingEntity relationship modelling
Entity relationship modelling
 
Entity Relationship Modelling
Entity Relationship ModellingEntity Relationship Modelling
Entity Relationship Modelling
 
Erd examples
Erd examplesErd examples
Erd examples
 
Data Models
Data ModelsData Models
Data Models
 
Database - Entity Relationship Diagram (ERD)
Database - Entity Relationship Diagram (ERD)Database - Entity Relationship Diagram (ERD)
Database - Entity Relationship Diagram (ERD)
 
Presentation of saad on e-r diagram.
Presentation of saad on e-r diagram.Presentation of saad on e-r diagram.
Presentation of saad on e-r diagram.
 
DBMS UNIT1
DBMS UNIT1DBMS UNIT1
DBMS UNIT1
 
Erd1
Erd1Erd1
Erd1
 
Relational Databases 2
Relational Databases 2Relational Databases 2
Relational Databases 2
 
Symbol of e r diagram presentation
Symbol of e r diagram presentationSymbol of e r diagram presentation
Symbol of e r diagram presentation
 
RDBMS ERD
RDBMS ERDRDBMS ERD
RDBMS ERD
 
Er model
Er modelEr model
Er model
 
Ch 3 E R Model
Ch 3  E R  ModelCh 3  E R  Model
Ch 3 E R Model
 

Similar to Datacleaning.ppt

EXPLAIN SPSS-part1.pdf
EXPLAIN SPSS-part1.pdfEXPLAIN SPSS-part1.pdf
EXPLAIN SPSS-part1.pdf
ahmedMETWALLI12
 
Categorical DataCategorical data represents characteristics..docx
Categorical DataCategorical data represents characteristics..docxCategorical DataCategorical data represents characteristics..docx
Categorical DataCategorical data represents characteristics..docx
keturahhazelhurst
 
Exploratory Data Analysis Unit 1 ppt presentation.pptx
Exploratory Data Analysis Unit 1 ppt presentation.pptxExploratory Data Analysis Unit 1 ppt presentation.pptx
Exploratory Data Analysis Unit 1 ppt presentation.pptx
Mayura shelke
 
measurement and scaling techniques
measurement and scaling techniques measurement and scaling techniques
measurement and scaling techniques
Akanksha Gupta
 
Introduction to Data (1).pptx
Introduction to Data (1).pptxIntroduction to Data (1).pptx
Introduction to Data (1).pptx
SubhamitaKanungo
 
DATA MODEL PRESENTATION UNIT I-BCA I.pptx
DATA MODEL PRESENTATION UNIT I-BCA I.pptxDATA MODEL PRESENTATION UNIT I-BCA I.pptx
DATA MODEL PRESENTATION UNIT I-BCA I.pptx
JasmineMichael1
 
Data Mining DataLecture Notes for Chapter 2Introduc
Data Mining DataLecture Notes for Chapter 2IntroducData Mining DataLecture Notes for Chapter 2Introduc
Data Mining DataLecture Notes for Chapter 2Introduc
OllieShoresna
 

Similar to Datacleaning.ppt (20)

3. Chapter Three.pdf
3. Chapter Three.pdf3. Chapter Three.pdf
3. Chapter Three.pdf
 
EXPLAIN SPSS-part1.pdf
EXPLAIN SPSS-part1.pdfEXPLAIN SPSS-part1.pdf
EXPLAIN SPSS-part1.pdf
 
Categorical DataCategorical data represents characteristics..docx
Categorical DataCategorical data represents characteristics..docxCategorical DataCategorical data represents characteristics..docx
Categorical DataCategorical data represents characteristics..docx
 
Unit 2-Data Modeling.pdf
Unit 2-Data Modeling.pdfUnit 2-Data Modeling.pdf
Unit 2-Data Modeling.pdf
 
Exploratory Data Analysis Unit 1 ppt presentation.pptx
Exploratory Data Analysis Unit 1 ppt presentation.pptxExploratory Data Analysis Unit 1 ppt presentation.pptx
Exploratory Data Analysis Unit 1 ppt presentation.pptx
 
E_R-Diagram (2).pptx
E_R-Diagram (2).pptxE_R-Diagram (2).pptx
E_R-Diagram (2).pptx
 
measurement and scaling techniques
measurement and scaling techniques measurement and scaling techniques
measurement and scaling techniques
 
Entity Relationship Model
Entity Relationship ModelEntity Relationship Model
Entity Relationship Model
 
Introduction to Statistics and Arithmetic Mean
Introduction to Statistics and Arithmetic MeanIntroduction to Statistics and Arithmetic Mean
Introduction to Statistics and Arithmetic Mean
 
Data Types
Data TypesData Types
Data Types
 
er-models.pptx
er-models.pptxer-models.pptx
er-models.pptx
 
FDS PPT_Unit-5.pptx fundamentals of data science
FDS PPT_Unit-5.pptx fundamentals of data scienceFDS PPT_Unit-5.pptx fundamentals of data science
FDS PPT_Unit-5.pptx fundamentals of data science
 
Introduction to Data (1).pptx
Introduction to Data (1).pptxIntroduction to Data (1).pptx
Introduction to Data (1).pptx
 
Data modeling
Data modelingData modeling
Data modeling
 
data processing.pdf
data processing.pdfdata processing.pdf
data processing.pdf
 
DATA MODEL PRESENTATION UNIT I-BCA I.pptx
DATA MODEL PRESENTATION UNIT I-BCA I.pptxDATA MODEL PRESENTATION UNIT I-BCA I.pptx
DATA MODEL PRESENTATION UNIT I-BCA I.pptx
 
Data Models & Introduction to UML
Data Models & Introduction to UML Data Models & Introduction to UML
Data Models & Introduction to UML
 
Unit-1-DBMS-SUN-4 everything you need to know.pptx
Unit-1-DBMS-SUN-4 everything you need to know.pptxUnit-1-DBMS-SUN-4 everything you need to know.pptx
Unit-1-DBMS-SUN-4 everything you need to know.pptx
 
Data Mining DataLecture Notes for Chapter 2Introduc
Data Mining DataLecture Notes for Chapter 2IntroducData Mining DataLecture Notes for Chapter 2Introduc
Data Mining DataLecture Notes for Chapter 2Introduc
 
Chapter 2.pdf
Chapter 2.pdfChapter 2.pdf
Chapter 2.pdf
 

More from amuthadeepa (12)

Edgelinking
EdgelinkingEdgelinking
Edgelinking
 
Handover
HandoverHandover
Handover
 
Handover
HandoverHandover
Handover
 
Cookies
CookiesCookies
Cookies
 
Critical system
Critical systemCritical system
Critical system
 
Excellencein visualization
Excellencein visualizationExcellencein visualization
Excellencein visualization
 
Protocol.ppt
Protocol.pptProtocol.ppt
Protocol.ppt
 
Datacleaning.ppt
Datacleaning.pptDatacleaning.ppt
Datacleaning.ppt
 
Database.ppt
Database.pptDatabase.ppt
Database.ppt
 
Network security.ppt
Network security.pptNetwork security.ppt
Network security.ppt
 
Smart.ppt
Smart.pptSmart.ppt
Smart.ppt
 
Perceptron.ppt
Perceptron.pptPerceptron.ppt
Perceptron.ppt
 

Recently uploaded

Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 

Recently uploaded (20)

Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Role Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxRole Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural ResourcesEnergy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 

Datacleaning.ppt

  • 1. S.PARANIPUSHPA DEPARTMENT OF CS AND IT NADAR SARASWATHI COLLEGE OF ARTS AND SCIENCE
  • 2.  Invalid values: Some datasets have well-known values, e.g. gender must only have “F” (Female) and “M” (Male). In this case it’s easy to detect wrong values.  Formats: The most common issue. It’s possible to get values in different formats like a name written as “Name, Surname” or “Surname, Name”.  Attribute dependencies: When the value of a feature depends on the value of another feature. For example, if we have some school data, the “number of students” is related to whether the person “is teacher?”. If someone is not a teacher he/she can’t have any students.
  • 3.  Uniqueness: It’s possible to find repeated data in features that only allow unique values. For example, we can’t have two products with the same identifier.  Missing values: Some features in the dataset may have blank or null values.  Misspellings: Incorrectly written values.  Misfielded values: When a feature contains the values of another.
  • 4.
  • 5.  Visualisation: Visualising all the values of each feature, or taking a random sample to see if it’s right.  Outlier analysis: Analysing if data can be a human error. E.g. a 300 year old person in the “age” feature.  Validation code: It’s possible to create a code that checks if the data is right. For example, in uniqueness, checking if the length of the data is the same as the length of the vector of unique values.  We can apply many methods to fix the different
  • 6.  Indicator variables: This technique converts categorical data into boolean values by creating indicator variables. If we have more than two values (n) we have to create n-1 columns.
  • 7.  Data Binning or Bucketing: A pre-processing technique used to reduce the effects of minor observation errors. The sample is divided into intervals and replaced by categorical values.
  • 8.  Centering & Scaling: We can Centre the data of one feature by substracting the mean to all values. To scale the data, we should divide the centered feature by the standard deviation: 
  • 9.  Other techniques: For example, we can group the outliers with the same value or replace the value with the number of times that it appears in the feature:
  • 10.