• Data preprocessing is a data mining technique
that involves transforming raw data into an
understandable format.
• Real-world data is often incomplete,
inconsistent, and/or lacking in certain
behaviors or trends, and is likely to contain
many errors.
• Data preprocessing is a proven method of
resolving such issues.
• Data preprocessing prepares raw data for
further processing.
• Data preprocessing is used in database-driven
applications such as customer relationship
management and in rule-based applications (like
neural networks).
Main data preprocessing techniques
• Data cleaning
• Data integration
• Data transformation
• Data reduction
Data Preprocessing Techniques
• Data cleaning: can be applied to remove
noise and correct inconsistencies in the data.
• Data integration: merges data from multiple
sources into a coherent data store, such as a
data warehouse.
• Data transformation: operations such as
normalization may be applied.
• Data reduction: can reduce the data size by
aggregating, eliminating redundant features,
or clustering, for instance.
• Data cleaning routines work to “clean” the data
by filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving
inconsistencies.
• If users believe the data are dirty, they are
unlikely to trust the results of any data mining
that has been applied to them.
• Although most mining routines have some
procedures for dealing with incomplete or noisy
data, they are not always robust.
• Therefore, a useful preprocessing step is to
run your data through some data cleaning
routines.
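The cleaning steps named above (filling in missing values, smoothing noisy data, identifying outliers) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, missing values are filled with the mean, smoothing uses equal-size bin means, and the outlier rule (more than k standard deviations from the mean) is one simple choice among many.

```python
from statistics import mean, pstdev


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size items,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


def find_outliers(values, k=2.0):
    """Return values more than k population standard deviations
    from the mean (k=2.0 is an assumed, adjustable threshold)."""
    center = mean(values)
    spread = pstdev(values)
    return [v for v in values if abs(v - center) > k * spread]


# Made-up sample data, for illustration only.
print(fill_missing([10, None, 20, 30]))
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
print(find_outliers([10, 12, 11, 13, 12, 95]))
```

Whether a flagged value is removed or corrected is a separate decision; the sketch only identifies candidates.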
• Suppose you want to include data from multiple
sources in your analysis.
• This would involve integrating multiple
databases, data cubes, or files, that is, data
integration.
• Yet some attributes representing a given
concept may have different names in different
databases, causing inconsistencies and
redundancies.
• Having a large amount of redundant data may
slow down or confuse the knowledge
discovery process.
• Clearly, in addition to data cleaning, steps
must be taken to help avoid redundancies
during data integration.
• Typically, data cleaning and data integration
are performed as a preprocessing step when
preparing the data for a data warehouse.
• Additional data cleaning can be performed to
detect and remove redundancies that may
have resulted from data integration.
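As a minimal sketch of data integration, the example below merges records from two sources that name the same attribute differently, then collapses the redundant rows. The source names, the attribute names (`customer_id`, `cust_number`), and the helper functions are all hypothetical, chosen only to illustrate the naming and redundancy problems described above.

```python
# Hypothetical records from two sources; the same concept
# (the customer key) has a different name in each one.
crm_rows = [
    {"customer_id": 1, "city": "Chicago"},
    {"customer_id": 2, "city": "Toronto"},
]
billing_rows = [
    {"cust_number": 2, "city": "Toronto"},
    {"cust_number": 3, "city": "Vancouver"},
]

# Map each source-specific attribute name to one shared name.
RENAMES = {"cust_number": "customer_id"}


def unify_schema(row, renames):
    """Return a copy of row with source-specific keys renamed."""
    return {renames.get(key, key): value for key, value in row.items()}


def integrate(*sources, renames, key):
    """Merge all sources into one list with a single row per key value.
    When sources disagree on a value, the later source wins here."""
    merged = {}
    for source in sources:
        for row in source:
            unified = unify_schema(row, renames)
            merged.setdefault(unified[key], {}).update(unified)
    return list(merged.values())


combined = integrate(crm_rows, billing_rows, renames=RENAMES, key="customer_id")
for row in combined:
    print(row)
```

Letting the later source overwrite earlier values is a simplification; a real integration step would usually need an explicit rule for resolving conflicting values.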
• Getting back to your data, you have decided,
say, that you would like to use a distance-based
mining algorithm for your analysis, such
as neural networks, nearest-neighbor
classifiers, or clustering.
• Such methods provide better results if the data
to be analyzed have been normalized, that is,
scaled to a specific range such as [0.0, 1.0].
• You soon realize that data transformation
operations, such as normalization and
aggregation, are additional data preprocessing
procedures that would contribute toward the
success of the mining process.
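Scaling to a range such as [0.0, 1.0] is commonly done with min-max normalization: v' = (v - min) / (max - min) * (new_max - new_min) + new_min. The sketch below assumes that method; the function name and the sample income values are illustrative only.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min
    and the largest maps to new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are identical, so there is no spread to rescale.
        return [new_min for _ in values]
    span = old_max - old_min
    return [
        (v - old_min) / span * (new_max - new_min) + new_min
        for v in values
    ]


# Made-up income values, for illustration only.
incomes = [12000, 73600, 98000]
print(min_max_normalize(incomes))
```

Without this step, an attribute with a large range (such as income) would dominate a distance computation over one with a small range (such as age).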
• Data reduction obtains a reduced
representation of the data set that is much
smaller in volume, yet produces the same (or
almost the same) analytical results.
• There are a number of strategies for data
reduction.
• These include data aggregation, attribute
subset selection, dimensionality reduction, and
numerosity reduction.
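Two of these strategies, data aggregation and attribute subset selection, can be sketched as follows. The sales records and function names are hypothetical, and the attributes to keep are chosen by hand here, whereas attribute subset selection in practice would typically rely on a relevance measure.

```python
from collections import defaultdict

# Made-up quarterly sales records, for illustration only.
quarterly_sales = [
    {"year": 2008, "quarter": "Q1", "branch": "A", "sales": 100.0},
    {"year": 2008, "quarter": "Q2", "branch": "A", "sales": 200.0},
    {"year": 2008, "quarter": "Q3", "branch": "A", "sales": 300.0},
    {"year": 2008, "quarter": "Q4", "branch": "A", "sales": 400.0},
    {"year": 2009, "quarter": "Q1", "branch": "A", "sales": 150.0},
    {"year": 2009, "quarter": "Q2", "branch": "A", "sales": 250.0},
]


def aggregate_by(rows, group_key, value_key):
    """Data aggregation: sum value_key over all rows that share
    the same group_key value."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


def select_attributes(rows, keep):
    """Attribute subset selection: keep only the listed attributes."""
    return [{name: row[name] for name in keep} for row in rows]


annual_sales = aggregate_by(quarterly_sales, "year", "sales")
slim_rows = select_attributes(quarterly_sales, ["year", "sales"])
print(annual_sales)
print(slim_rows[0])
```

Six quarterly rows shrink to two annual totals, which is enough for any analysis that only needs yearly figures.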
DATA REDUCTION
• Data can also be “reduced” by generalization
with the use of concept hierarchies, where
low-level concepts, such as city for customer location,
are replaced with higher-level concepts, such as
region or province or state.
• A concept hierarchy organizes the concepts into
varying levels of abstraction.
• Data discretization is a form of data reduction
that is very useful for the automatic generation of
concept hierarchies from numerical data.
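Both ideas can be illustrated briefly: generalization replaces each city with its state or province through a lookup table, and discretization maps numeric values to a small number of intervals. The sketch assumes equal-width binning, which is only one of several discretization methods; the mapping table, function names, and sample ages are illustrative.

```python
# A tiny, hand-written concept hierarchy: city -> state or province.
CITY_TO_REGION = {
    "Chicago": "Illinois",
    "Toronto": "Ontario",
    "Vancouver": "British Columbia",
}


def generalize(values, hierarchy, default="unknown"):
    """Replace each low-level concept with its higher-level concept."""
    return [hierarchy.get(value, default) for value in values]


def equal_width_bins(values, n_bins):
    """Discretize numbers into n_bins intervals of equal width and
    return the bin index (0 .. n_bins - 1) of each value."""
    low, high = min(values), max(values)
    if high == low:
        return [0 for _ in values]
    width = (high - low) / n_bins
    # The maximum value falls exactly on the upper edge, so clamp it
    # into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


print(generalize(["Chicago", "Toronto", "Vancouver"], CITY_TO_REGION))

# Made-up ages, for illustration only.
ages = [13, 15, 22, 35, 40, 61, 73]
print(equal_width_bins(ages, 3))
```

The bin indices could then be given labels such as "young", "middle-aged", and "senior" to form one level of a concept hierarchy for age.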
Data preprocessing techniques for data cleaning and transformation
