Dr.Dipali Meher
mailtomeher@gmail.com
Data Preprocessing
Need, Objectives and Techniques of data preprocessing
Data Cleaning: Handling of Missing values and Noisy Data, Data cleaning as a process
Data Integration: Schema integration, Controlling redundancies using correlation.
Data Transformation: Smoothing, Aggregation, Generalization, Attribute construction,
Normalization
Data Reduction: Data Cube Aggregation; Attribute Subset Selection, Dimensionality Reduction,
Numerosity Reduction
Collected by : Dr. Dipali Meher
Agenda
Data Preprocessing
Data preprocessing is the process of
transforming raw data into an
understandable format.
It is also an important step in data
mining as we cannot work with raw
data.
The quality of the data should be
checked before applying machine
learning or data mining algorithms.
Source:intellspot.com
Need of Data Preprocessing
Preprocessing is needed mainly because the quality of real-world data is often poor.
•Real-world data are generally:
•Incomplete: Missing attribute values, missing certain attributes of importance, or containing only aggregate data
•Noisy: Containing errors or outliers
•Inconsistent: Containing discrepancies in codes or names
Data quality can be assessed along the following dimensions:
Accuracy: Whether the recorded data is correct.
Completeness: Whether all required data is available and recorded.
Consistency: Whether the same data matches in every place where it is stored.
Interpretability: How easily the data can be understood.
Believability: How much the data is trusted by its users.
Timeliness: Whether the data is updated on time.
Objectives of Data Preprocessing
• To transform the raw data into an understandable format
• To transform data into a usable format
• To eliminate inconsistencies in data
• To remove duplicates in data
• To provide more accurate data for mining
• To handle incorrect or missing values in data
• To reduce the dimensionality of the data
Accurate data, accurate results
Data Preprocessing
• Data Cleaning
• Data Integration
• Data Transformation: e.g. -2, 32, 100, 59, 48 becomes -0.02, 0.32, 1.00, 0.59, 0.48
• Data Reduction: e.g. attributes A1, A2, …, A200 reduced to A1, A2, …, A50
Data Cleaning: Handling of Missing values and Noisy Data
Data cleaning deals with two kinds of problems: missing data and noisy data.
Missing data: Some attributes or attribute values are absent from the data. This can be handled either by ignoring the affected tuples or by filling in the missing values.
Noisy data: Data that contains errors or has no meaning at all. Such data can lead to invalid results or disrupt the mining process itself. Noise can be reduced with binning, regression and clustering.
Data cleaning as a process
Missing data can be handled in the following ways:
• Ignore the tuple
• Fill in the missing values manually
• Use a global constant to fill in the missing value
• Use a measure of central tendency for the attribute (such as the mean or median) to fill in the missing value
• Use the attribute mean or median of all samples belonging to the same class as the given tuple
• Use the most probable value to fill in the missing value
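Several of the strategies for missing values can be sketched in a few lines of Python. This is only an illustration: the records, the attribute name (income), the class labels and the constant -1.0 are made up for the example and do not come from the slides.

```python
from statistics import mean

# Hypothetical records: (class_label, income); None marks a missing value.
rows = [("low", 20.0), ("low", None), ("high", 80.0), ("high", 100.0), ("high", None)]

def drop_missing(rows):
    """Ignore the tuple: keep only records whose value is present."""
    return [(c, v) for c, v in rows if v is not None]

def fill_constant(rows, constant=-1.0):
    """Replace every missing value with a global constant (assumed: -1.0)."""
    return [(c, constant if v is None else v) for c, v in rows]

def fill_mean(rows):
    """Replace every missing value with the mean of all known values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_class_mean(rows):
    """Replace a missing value with the mean of the known values in its class."""
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_means[c] if v is None else v) for c, v in rows]
```

Filling with the class mean usually distorts the data less than a single global mean, because each class keeps its own typical value.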
Data Integration
Data integration combines data from multiple homogeneous and heterogeneous sources into a coherent data store.
Data integration may produce redundancies and inconsistencies in the resulting data set.
There are two important tasks in data integration:
1. Detecting and resolving data value and schema conflicts
2. Handling redundancy
Schema Integration
Schema integration combines metadata from different sources.
The same attribute or object may have different names in different databases.
E.g. cust_id in one database may be the same attribute as cust_no or cno in another.
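One simple way to resolve such naming conflicts is a lookup table that maps every source-specific name to one canonical name. The sketch below assumes such a table is maintained by hand; the records and the extra attributes (name, city) are hypothetical.

```python
# Hypothetical mapping from source-specific attribute names to one canonical name.
CANONICAL = {"cust_id": "cust_id", "cust_no": "cust_id", "cno": "cust_id"}

def unify_schema(record):
    """Rename known synonym attributes; leave unknown attributes unchanged."""
    return {CANONICAL.get(name, name): value for name, value in record.items()}

# Two records describing the same customer in two different databases.
db_a = {"cust_id": 101, "name": "Asha"}
db_b = {"cno": 101, "city": "Pune"}

# After renaming, both records share the key cust_id and can be combined.
merged = {**unify_schema(db_a), **unify_schema(db_b)}
```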
Data Integration: Controlling redundancies using correlation.
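The body of this slide is not available here, so the following is only a sketch under the assumption that the correlation measure meant for numeric attributes is the Pearson correlation coefficient. A coefficient close to +1 or -1 suggests that one attribute can be derived from the other and may be redundant. The attribute names and values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical attributes: the same price in two currencies, and an unrelated score.
price_inr = [100.0, 200.0, 300.0, 400.0]
price_usd = [1.2, 2.4, 3.6, 4.8]
score = [3.0, 1.0, 4.0, 2.0]

r_redundant = pearson(price_inr, price_usd)  # close to 1: strongly correlated
r_unrelated = pearson(price_inr, score)      # close to 0: no linear relationship
```

The cut-off above which an attribute is treated as redundant is a design choice and is not given in the slides.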
Data Transformation
Smoothing
Aggregation
Discretization
Attribute Construction
Generalization
Normalization
Data Transformation
(a) Smoothing: This is the process of removing noise from the data, using techniques such as binning, regression and clustering.
(b) Aggregation: This is the process of applying summary operations to the data, for example aggregating daily sales into monthly or annual totals.
(c) Discretization: Large data sets are complex to handle. Discretization replaces the raw values of a numeric attribute with a small number of contiguous intervals or their labels.
(d) Attribute construction: To improve the efficiency of the mining process, new attributes are generated from the existing attributes.
(e) Generalization: This is the process of converting low-level attributes to high-level attributes using a concept hierarchy.
(f) Normalization: In the process of normalization, attribute values are scaled to fall within a specified range, such as 0.0 to 1.0.
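As an illustration of normalization, here is a minimal sketch of min-max normalization, one common way of scaling values into a range. It is not necessarily the scheme used in the deck's diagram, where the values appear to be divided by 100. The function name and the default range 0.0 to 1.0 are assumptions.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

# Sample values taken from the data transformation example in the deck.
values = [-2, 32, 100, 59, 48]
normalized = min_max(values)
```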
Data Aggregation
Data Reduction
Attribute Selection
Data Cube Aggregation
Numerosity Reduction
Dimensionality Reduction
(a) Attribute Selection: When data is collected from various sources, it may contain duplicate attributes, and some attributes may be irrelevant to the mining task. Attribute selection removes such redundant and irrelevant attributes from the data set, leaving a smaller data set that is easier to mine.
(b) Data Cube Aggregation: In this reduction method, aggregation operations are applied to the data to build a data cube, which presents the data in a summarized, much simpler form.
(c) Numerosity Reduction: In this reduction method, the actual data is replaced by a smaller representation, such as a mathematical model of the data, a histogram, clusters or a sample.
(d) Dimensionality Reduction: In this reduction method, the number of attributes is reduced, for example by transforming the data into a smaller set of derived attributes, to reduce the data size.
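Data cube aggregation can be illustrated by rolling quarterly figures up to annual totals. The sales figures below are invented for the example, and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical quarterly sales: (year, quarter, amount).
sales = [
    (2022, "Q1", 224), (2022, "Q2", 408), (2022, "Q3", 350), (2022, "Q4", 586),
    (2023, "Q1", 300), (2023, "Q2", 416), (2023, "Q3", 380), (2023, "Q4", 600),
]

def roll_up_by_year(rows):
    """Aggregate quarterly amounts into one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)

annual = roll_up_by_year(sales)
```

Eight quarterly records are reduced to two annual records, without losing the information needed for an analysis at the year level.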
Data Cube Aggregation
Unit 3: Data Preprocessing
Numerosity Reduction
Data Reduction: Sampling
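The body of this slide is not available here. As a sketch only, these are three sampling schemes commonly used for data reduction: simple random sampling without replacement, simple random sampling with replacement, and stratified sampling. The function names, the fixed seed and the sample data are assumptions made for the example.

```python
import random

def srswor(data, n, seed=0):
    """Simple random sample of n items without replacement."""
    return random.Random(seed).sample(data, n)

def srswr(data, n, seed=0):
    """Simple random sample of n items with replacement (items may repeat)."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(n)]

def stratified(data, key, fraction, seed=0):
    """Draw the same fraction from every stratum (group) defined by key."""
    rng = random.Random(seed)
    strata = {}
    for item in data:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for items in strata.values():
        k = max(1, round(len(items) * fraction))  # keep at least one item per stratum
        sample.extend(rng.sample(items, k))
    return sample

# Hypothetical tuples: (age_group, customer_number).
tuples = [("youth", i) for i in range(20)] + [("senior", i) for i in range(10)]
half = stratified(tuples, key=lambda t: t[0], fraction=0.5)
```

Stratified sampling keeps small groups represented, which a simple random sample of skewed data may not.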
Data Discretization
• Large data sets are complex to handle.
• Discretization is the process of dividing the range of a continuous attribute into a small number of intervals.
• This reduces the data size, while the intervals remain contiguous and keep their order.
• Each interval has its own label, and the actual data values can then be replaced by these interval labels.
• Most mining methods can work with such interval data.
Data Discretization
1. Top-down Discretization: If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting.
2. Bottom-up Discretization: If the process starts by considering all of the continuous values as potential split points, removes some by merging neighboring values to form intervals, and then applies this recursively to the resulting intervals, it is called bottom-up discretization or merging.
Concept Hierarchies
• Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for the attribute age) with higher-level concepts (such as youth, middle-aged, or senior).
• In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data from different perspectives.
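The age example can be sketched as a small mapping function. The cut-off ages (24 and 59) are assumptions chosen for illustration; the slides do not say where one concept ends and the next begins.

```python
def age_to_concept(age, youth_max=24, middle_max=59):
    """Map a numeric age to a higher-level concept; the cut-offs are assumed."""
    if age <= youth_max:
        return "youth"
    if age <= middle_max:
        return "middle-aged"
    return "senior"

# Hypothetical ages replaced by their higher-level concepts.
ages = [13, 22, 35, 47, 61, 70]
generalized = [age_to_concept(a) for a in ages]
```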
Data Discretization Methods
The following are some data discretization methods for numeric data:
1. Binning: This is a top-down unsupervised splitting technique based on a specified number of bins. In this method, the values found for an attribute are grouped into a number of equal-width or equal-frequency bins. The values in each bin are then smoothed using the bin mean or bin median. Applying this method recursively to the resulting partitions generates a concept hierarchy.
2. Histogram Analysis: A histogram partitions the observed values of an attribute into disjoint subsets, often called buckets or bins.
3. Cluster Analysis: Cluster analysis is a common form of data discretization. In this technique, a clustering algorithm is applied to discretize a numerical attribute by partitioning the values of that attribute into clusters or groups.
Binning
The sorted values are distributed into a number of buckets or bins, and each value is then replaced by the bin mean or median.
It is a top-down splitting technique based on a specified number of bins.
It is an unsupervised discretization technique because it does not use class information.
- Equal-width (distance) partitioning
- Equal-depth (frequency) partitioning
Binning: Equal width (distance) partitioning
Divides the range into N intervals of equal size (a uniform grid).
If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N.
- Outliers may dominate the partitioning
- Skewed data may not be handled well
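The formula W = (B - A)/N can be turned into a short sketch. The price values are a small sample chosen for illustration, and the function name is hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of width W = (B - A) / N."""
    a, b = min(values), max(values)
    width = (b - a) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        if width == 0:
            index = 0  # all values are equal, so everything goes into the first bin
        else:
            # The highest value B would land at index n_bins, so clamp it into the last bin.
            index = min(int((v - a) / width), n_bins - 1)
        bins[index].append(v)
    return width, bins

# Illustrative sample of prices.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
width, bins = equal_width_bins(prices, 3)
```

With A = 4, B = 34 and N = 3 the width is 10, giving the intervals 4 to 14, 14 to 24 and 24 to 34. The bins hold 2, 3 and 4 values, which shows that equal-width bins need not contain equal numbers of values.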
Binning: Equal-depth (frequency) partitioning
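The body of this slide is not available here. As a sketch, equal-depth partitioning puts (nearly) the same number of sorted values into each bin, after which the values can be smoothed by the bin means. The function names and the sample prices are assumptions made for the example.

```python
def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins bins holding (nearly) equal counts."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value when the count is not divisible.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

def smooth_by_bin_means(bins):
    """Replace every value in a non-empty bin by that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins if b]

# Illustrative sample of prices.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
depth_bins = equal_depth_bins(prices, 3)
smoothed = smooth_by_bin_means(depth_bins)
```

Here each bin holds three values, and smoothing replaces them by the bin means 9, 22 and 29.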
Histogram Analysis
Cluster Analysis
Clustering can be used to generate a concept hierarchy for an attribute A by following either a top-down splitting strategy or a bottom-up merging strategy, where each cluster forms a node of the concept hierarchy.
• In the top-down strategy, each initial cluster may be further decomposed into several subclusters, forming a lower level of the hierarchy.
• In the bottom-up strategy, clusters may be repeatedly grouped with neighboring clusters to form higher-level concepts.
References
Data Mining, Introduction and Advanced Topics
by Margaret H. Dunham and Sridhar
Pearson Education
ISBN 81-7758-785-4
Data Mining Concepts and Techniques
by Jiawei Han and Micheline Kamber
Morgan Kaufmann Publishers
ISBN 81-312-0535-5

More Related Content

What's hot

Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessingSalah Amean
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning Gopal Sakarkar
 
Data mining-2
Data mining-2Data mining-2
Data mining-2Nit Hik
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalitiesKrish_ver2
 
data mining
data miningdata mining
data mininguoitc
 
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Salah Amean
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingankur bhalla
 
4.3 multimedia datamining
4.3 multimedia datamining4.3 multimedia datamining
4.3 multimedia dataminingKrish_ver2
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CARTXueping Peng
 
Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & UnderfittingSOUMIT KAR
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reductionmrizwan969
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning pyingkodi maran
 

What's hot (20)

Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Data mining-2
Data mining-2Data mining-2
Data mining-2
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalities
 
data mining
data miningdata mining
data mining
 
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data Mining: Data Preprocessing
Data Mining: Data PreprocessingData Mining: Data Preprocessing
Data Mining: Data Preprocessing
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
4.3 multimedia datamining
4.3 multimedia datamining4.3 multimedia datamining
4.3 multimedia datamining
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 
Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & Underfitting
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
 
2. visualization in data mining
2. visualization in data mining2. visualization in data mining
2. visualization in data mining
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 

Similar to DataPreprocessing.pptx

Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data miningUjjawal
 
UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningNandakumar P
 
ML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.pptML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.pptbelay41
 
Intro to Data warehousing lecture 17
Intro to Data warehousing   lecture 17Intro to Data warehousing   lecture 17
Intro to Data warehousing lecture 17AnwarrChaudary
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ngsaranya12345
 
Pre-Processing and Data Preparation
Pre-Processing and Data PreparationPre-Processing and Data Preparation
Pre-Processing and Data PreparationUmair Shafique
 
Machine learning module 2
Machine learning module 2Machine learning module 2
Machine learning module 2Gokulks007
 
overview of_data_processing
overview of_data_processingoverview of_data_processing
overview of_data_processingFEG
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesSơn Còm Nhom
 

Similar to DataPreprocessing.pptx (20)

Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data mining
 
UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data Mining
 
ML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.pptML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.ppt
 
Intro to Data warehousing lecture 17
Intro to Data warehousing   lecture 17Intro to Data warehousing   lecture 17
Intro to Data warehousing lecture 17
 
Datapreprocess
DatapreprocessDatapreprocess
Datapreprocess
 
Data Preparation.pptx
Data Preparation.pptxData Preparation.pptx
Data Preparation.pptx
 
data mining.pptx
data mining.pptxdata mining.pptx
data mining.pptx
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Data1
Data1Data1
Data1
 
Data1
Data1Data1
Data1
 
Preprocess
PreprocessPreprocess
Preprocess
 
Pre-Processing and Data Preparation
Pre-Processing and Data PreparationPre-Processing and Data Preparation
Pre-Processing and Data Preparation
 
data mining
data miningdata mining
data mining
 
1234
12341234
1234
 
Machine learning module 2
Machine learning module 2Machine learning module 2
Machine learning module 2
 
Data Preparation.pptx
Data Preparation.pptxData Preparation.pptx
Data Preparation.pptx
 
overview of_data_processing
overview of_data_processingoverview of_data_processing
overview of_data_processing
 
Clutter Reduction in Multi-Dimensional Visualization by Using Dimension Reduc...
Clutter Reduction in Multi-Dimensional Visualization by Using Dimension Reduc...Clutter Reduction in Multi-Dimensional Visualization by Using Dimension Reduc...
Clutter Reduction in Multi-Dimensional Visualization by Using Dimension Reduc...
 
Data processing
Data processingData processing
Data processing
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and Techniques
 

More from Dr-Dipali Meher

More from Dr-Dipali Meher (16)

Database Security Methods, DAC, MAC,View
Database Security Methods, DAC, MAC,ViewDatabase Security Methods, DAC, MAC,View
Database Security Methods, DAC, MAC,View
 
Version Stamps in NOSQL Databases
Version Stamps in NOSQL DatabasesVersion Stamps in NOSQL Databases
Version Stamps in NOSQL Databases
 
Literature Review
Literature ReviewLiterature Review
Literature Review
 
Research Problem
Research ProblemResearch Problem
Research Problem
 
Formulation of Research Design
Formulation of Research DesignFormulation of Research Design
Formulation of Research Design
 
Types of Research
Types of ResearchTypes of Research
Types of Research
 
Research Methodology-Intorduction
Research Methodology-IntorductionResearch Methodology-Intorduction
Research Methodology-Intorduction
 
Introduction to Research
Introduction to ResearchIntroduction to Research
Introduction to Research
 
Neo4j session
Neo4j sessionNeo4j session
Neo4j session
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Consistency in NoSQL
Consistency in NoSQLConsistency in NoSQL
Consistency in NoSQL
 
Data models in NoSQL
Data models in NoSQLData models in NoSQL
Data models in NoSQL
 
Schema migrations in no sql
Schema migrations in no sqlSchema migrations in no sql
Schema migrations in no sql
 
Polyglot Persistence
Polyglot Persistence Polyglot Persistence
Polyglot Persistence
 
Naive bayesian classification
Naive bayesian classificationNaive bayesian classification
Naive bayesian classification
 
Function Pointer
Function PointerFunction Pointer
Function Pointer
 

Recently uploaded

Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerunnathinaik
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 

Recently uploaded (20)

Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 

DataPreprocessing.pptx

  • 2. Need, Objectives and Techniques of data preprocessing Data Cleaning: Handling of Missing values and Noisy Data, Data cleaning as a process Data Integration: Schema integration, Controlling redundancies using correlation. Data Transformation: Smoothing, Aggregation, Generalization, Attribute construction, Normalization Data Reduction: Data Cube Aggregation; Attribute Subset Selection, Dimensionality Reduction, Numerosity Reduction Collected by : Dr. Dipali Meher Agenda
  • 3. Data Preprocessing Collected by : Dr. Dipali Meher Data preprocessing is the process of transforming raw data into an understandable format. It is also an important step in data mining as we cannot work with raw data. The quality of the data should be checked before applying machine learning or data mining algorithms.
  • 4. Collected by : Dr. Dipali Meher Source:intellspot.com
  • 5. Need of Data Preprocessing Collected by : Dr. Dipali Meher Preprocessing of data is mainly to check the data quality. The quality can be checked by the following. •Real world data are generally: •Incomplete: Missing attribute values, missing certain attributes of importance, or having only aggregate data •Noisy: Containing errors or outliers •Inconsistent: Containing discrepancies in codes or names Accuracy: To check whether the data entered is correct or not. Completeness: To check whether the data is available or not recorded. Consistency: To check whether the same data is kept in all the places that do or do not match Interpretability: The understandability of the data. Believability: The data should be trustable. Timeliness: The data should be updated correctly.
  • 6. Objectives of Data Preprocessing  To transform the raw data into an understandable format  To transform data for its usable format  To eliminate inconsistencies in data  To remove duplicates in data  To give more accurate data for preprocessing  To give assurance for incorrect or missing values in data  To reduce dimensionalities in data Accurate data accurate results Collected by : Dr. Dipali Meher
  • 7. Data Preprocessing Data Integration Data Transformation -2,32,100,59,48 -0.02,0.32,1.00,0.59,0.48 Data Cleaning Data Reduction Attributes A1 A2 …. A200 Attributes A1 A2 ….A50 Collected by : Dr. Dipali Meher
  • 8. Data Cleaning: Handling of Missing values and Noisy Data Noisy data: This is data with error or data which has no meaning at all. This type of data can either lead to invalid results or can create the problem to the process of mining itself. The problem of noisy data can be solved with binning method, regression and clustering. Data Cleaning Missing Data Noisy Data Missing data: Missing data is the case wherein some of the attributes or attribute data is missing or the data is not normalized. This situation can be handled by either ignoring the values or filling the missing value. Collected by : Dr. Dipali Meher
  • 9. Data cleaning as a process. Missing data can be handled in the following ways: 1. Ignore the tuple. 2. Fill in the missing value manually. 3. Use a global constant to fill in the missing value. 4. Use a measure of central tendency for the attribute (such as the mean or median) to fill in the missing value. 5. Use the attribute mean or median of all samples belonging to the same class as the given tuple. 6. Use the most probable value to fill in the missing value.
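A minimal sketch of four of the strategies listed above (ignore the tuple, global constant, attribute mean, class-conditional mean). The record layout, the sample values and every function name are assumptions made for illustration; they do not come from the slides.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tuples of (class label, income); None marks a missing value.
records = [
    ("A", 30.0), ("A", None), ("A", 50.0),
    ("B", 10.0), ("B", 20.0), ("B", None),
]

def drop_missing(rows):
    """Ignore every tuple that contains a missing value."""
    return [(c, v) for c, v in rows if v is not None]

def fill_constant(rows, constant):
    """Replace each missing value with one global constant."""
    return [(c, constant if v is None else v) for c, v in rows]

def fill_mean(rows):
    """Replace each missing value with the mean of the known values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_class_mean(rows):
    """Replace each missing value with the mean of its own class."""
    known = defaultdict(list)
    for c, v in rows:
        if v is not None:
            known[c].append(v)
    class_means = {c: mean(vs) for c, vs in known.items()}
    return [(c, class_means[c] if v is None else v) for c, v in rows]
```

The class-conditional variant usually stays closer to the true value when classes differ a lot, as they do in this toy data; it assumes every class has at least one known value.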
  • 10. Data Integration: Combines data from multiple homogeneous and heterogeneous sources into a coherent store. Data integration may produce redundancies and inconsistencies in the resulting data set. There are two important tasks in data integration: 1. Detecting and resolving data value and schema conflicts. 2. Handling redundancy.
  • 11. Schema Integration: Integrate metadata from different sources. The same attribute or object may have different names in different databases, e.g. cust_id in one database may be the same attribute as cust_no or cno in another.
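As an illustration, resolving such naming conflicts can be sketched as a lookup that maps each source-specific name to one canonical name. The mapping table, the sample records and the function name are hypothetical; only the attribute names cust_id, cust_no and cno come from the slide.

```python
# Hypothetical mapping from source-specific names to one canonical name.
CANONICAL_NAMES = {"cust_no": "cust_id", "cno": "cust_id"}

def unify_schema(record):
    """Rename known aliases; leave all other attribute names unchanged."""
    return {CANONICAL_NAMES.get(name, name): value
            for name, value in record.items()}

# Two hypothetical sources that name the customer key differently.
sales = [{"cust_no": 101, "amount": 250}]
support = [{"cno": 101, "tickets": 2}]

integrated = [unify_schema(r) for r in sales + support]
```

In practice the mapping would have to come from metadata or from a domain expert, because matching names alone does not prove that two attributes mean the same thing.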
  • 12. Data Integration: Controlling redundancies using correlation.
  • 13. Data Integration: Controlling redundancies using correlation.
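The body of these two slides did not survive in the transcript. As a hedged illustration of the idea in the title, the sketch below uses the Pearson correlation coefficient to flag two numeric attributes as redundant. The attribute values, the 0.9 threshold and the function names are assumptions, not taken from the slides.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)  # assumes neither attribute is constant

def is_redundant(xs, ys, threshold=0.9):
    """Flag a pair whose absolute correlation exceeds a chosen threshold."""
    return abs(pearson(xs, ys)) > threshold

# Hypothetical attributes: the same price in two units, and an unrelated count.
price_usd = [10, 20, 30, 40]
price_cents = [1000, 2000, 3000, 4000]
units_sold = [3, 1, 4, 2]
```

Here price_cents is a rescaled copy of price_usd, so one of the two can be dropped after integration, while units_sold is uncorrelated with price in this toy data and should be kept.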
  • 15. Data Transformation (a) Smoothing: removes noise from the data, for example by binning, regression or clustering. (b) Aggregation: applies summary operations to the data, for example rolling daily sales up into monthly or annual totals. (c) Discretization: replaces the raw values of a numeric attribute with a small number of interval labels, which makes large data sets easier to handle. The intervals are contiguous and ordered. (d) Attribute construction: new attributes are derived from existing ones to make the mining process more effective. (e) Generalization: low-level attribute values are replaced with higher-level concepts using a concept hierarchy. (f) Normalization: attribute values are scaled to fall within a specified range, such as 0.0 to 1.0.
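A small sketch of normalization, using the values shown in the slide 7 figure (-2, 32, 100, 59, 48). Dividing by 100 reproduces the figure's output; min-max normalization is shown next to it as one common way to scale into a chosen range. Both function names are illustrative assumptions.

```python
def scale_by_constant(values, divisor):
    """Divide every value by the same constant."""
    return [v / divisor for v in values]

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values so the smallest becomes new_min, the largest new_max."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

values = [-2, 32, 100, 59, 48]
scaled = scale_by_constant(values, 100)   # matches the slide 7 figure
normalized = min_max(values)              # every result lies in [0.0, 1.0]
```

Min-max scaling guarantees the target range but is sensitive to outliers, because a single extreme value sets lo or hi for the whole attribute.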
  • 16. Data Aggregation
  • 17. Data Reduction: Attribute Selection, Data Cube Aggregation, Numerosity Reduction, Dimensionality Reduction
  • 18. Data Reduction (a) Attribute Selection: Data collected from various sources may contain duplicate attributes, and some attributes may be irrelevant to the mining task. Attribute selection removes such redundant and unnecessary attributes, leaving a smaller and cleaner data set. (b) Data Cube Aggregation: aggregation operations are applied to the data to obtain a smaller, summarized form. (c) Numerosity Reduction: the actual data is replaced with a smaller representation, such as a mathematical model of the data or a sample. (d) Dimensionality Reduction: the number of attributes is reduced, either by removing redundant attributes or by encoding the data in fewer dimensions.
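As a rough sketch of attribute selection, the function below drops attributes that duplicate an earlier attribute value for value, and attributes that are constant. Treating a constant attribute as irrelevant is an assumption made for this example, as are the table contents and the function name; real attribute subset selection usually relies on statistical relevance tests.

```python
def select_attributes(table):
    """Keep each attribute unless it is constant or duplicates an earlier one.

    `table` maps an attribute name to its list of values.
    """
    kept = {}
    seen_columns = set()
    for name, column in table.items():
        signature = tuple(column)
        if len(set(column)) <= 1:      # constant: carries no information
            continue
        if signature in seen_columns:  # exact duplicate of a kept attribute
            continue
        seen_columns.add(signature)
        kept[name] = column
    return kept

# Hypothetical integrated table with one duplicate and one constant attribute.
table = {
    "cust_id": [1, 2, 3],
    "cno": [1, 2, 3],
    "country": ["IN", "IN", "IN"],
    "age": [25, 40, 61],
}
reduced = select_attributes(table)
```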
  • 19. Data Cube Aggregation
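The content of this slide is not in the transcript. One way to picture data cube aggregation is a roll-up from quarterly to annual totals, sketched below. The sales figures, the branch name and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, branch, sales amount).
sales = [
    (2022, "Q1", "Pune", 224), (2022, "Q2", "Pune", 408),
    (2022, "Q3", "Pune", 350), (2022, "Q4", "Pune", 586),
    (2023, "Q1", "Pune", 300), (2023, "Q2", "Pune", 416),
]

def roll_up_to_year(rows):
    """Aggregate away the quarter dimension, keeping year and branch."""
    totals = defaultdict(int)
    for year, _quarter, branch, amount in rows:
        totals[(year, branch)] += amount
    return dict(totals)

annual = roll_up_to_year(sales)
```

Six quarterly rows shrink to two annual rows. If the analysis only needs yearly figures, nothing relevant is lost.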
  • 20. Numerosity Reduction
  • 21. Data Reduction: Sampling
  • 22. Data Reduction: Sampling
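The sampling slides carry only a title in this transcript. The sketch below shows three common schemes that the slides may or may not cover: simple random sampling without replacement, simple random sampling with replacement, and stratified sampling. The data set, the fixed seeds and the function names are illustrative assumptions.

```python
import random

def sample_without_replacement(data, n, seed=0):
    """Draw n distinct items; no item can be picked twice."""
    return random.Random(seed).sample(data, n)

def sample_with_replacement(data, n, seed=0):
    """Draw n items; the same item may be picked more than once."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(n)]

def stratified_sample(data, key, fraction, seed=0):
    """Sample the same fraction from every stratum, at least one item each."""
    rng = random.Random(seed)
    strata = {}
    for item in data:
        strata.setdefault(key(item), []).append(item)
    result = []
    for group in strata.values():
        size = max(1, round(len(group) * fraction))
        result.extend(rng.sample(group, size))
    return result

# Hypothetical customers: 8 in the "youth" stratum, 2 in the "senior" stratum.
customers = [("youth", i) for i in range(8)] + [("senior", i) for i in range(2)]
stratified = stratified_sample(customers, key=lambda c: c[0], fraction=0.5)
```

Stratified sampling keeps small groups represented. A plain random sample of five customers could easily miss the two seniors.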
  • 23. Data Discretization: • Large data sets are complex to handle. • Discretization divides the range of a continuous attribute into a small number of intervals. • This reduces the data size, while the intervals stay contiguous and keep their order. • Every interval has its own label, and these labels then replace the actual data values. • Interval labels are simple categorical values, so the result is easy for mining tools to handle.
  • 24. Data Discretization 1. Top-down Discretization: If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting. 2. Bottom-up Discretization: If the process starts by considering all of the continuous values as potential split points and removes some by merging neighboring values to form intervals, it is called bottom-up discretization or merging.
  • 25. Concept Hierarchies: • Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for the attribute age) with higher-level concepts (such as youth, middle-aged, or senior). • In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. This organization gives users the flexibility to view data from different perspectives.
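The age example above can be sketched as a simple mapping from numeric values to concept labels. The cut points 25 and 60 are arbitrary assumptions chosen for illustration; the slides do not say where one concept ends and the next begins.

```python
def age_to_concept(age):
    """Map a numeric age to a higher-level concept (hypothetical cut points)."""
    if age < 25:
        return "youth"
    if age < 60:
        return "middle-aged"
    return "senior"

ages = [13, 22, 35, 47, 61, 70]
concepts = [age_to_concept(a) for a in ages]
```

Six distinct numeric values collapse into three labels, which is the data reduction the slide describes.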
  • 26. Data Discretization Methods. The following are some discretization methods for numeric data: 1. Binning: a top-down, unsupervised splitting technique based on a specified number of bins. The values of an attribute are grouped into a number of equal-width or equal-frequency bins, and the values in each bin are then smoothed using the bin mean or bin median. Applying the method recursively to the resulting partitions generates a concept hierarchy. 2. Histogram Analysis: a histogram partitions the observed values of an attribute into disjoint subsets, often called buckets or bins. 3. Cluster Analysis: a common form of data discretization, in which a clustering algorithm discretizes a numerical attribute by partitioning its values into clusters or groups.
  • 27. Binning: The sorted values are distributed into a number of buckets or bins, and each value is then replaced by its bin mean or median. It is a top-down splitting technique based on a specified number of bins. It is an unsupervised discretization technique because it does not use class information. Two variants: - Equal-width (distance) partitioning - Equal-depth (frequency) partitioning
  • 28. Binning: Equal-width (distance) partitioning. Divides the range into N intervals of equal size (a uniform grid). If A and B are the lowest and highest values of the attribute, the width of each interval is W = (B - A) / N. Drawbacks: - Outliers may dominate the partitioning - Skewed data may not be handled well
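A minimal sketch of equal-width partitioning using the formula W = (B - A) / N from the slide. The price list and the function name are illustrative assumptions; the highest value is placed in the last bin so that it does not fall outside the range.

```python
def equal_width_bins(values, n):
    """Split values into n intervals of equal width W = (B - A) / N."""
    a, b = min(values), max(values)
    width = (b - a) / n  # assumes b > a
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        # The maximum would land at index n, so clamp it into the last bin.
        index = min(int((v - a) / width), n - 1)
        bins[index].append(v)
    return width, bins

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample values
width, bins = equal_width_bins(prices, 3)
```

With A = 4, B = 34 and N = 3 the width is 10, so the intervals are [4, 14), [14, 24) and [24, 34]. Note that the bins hold 2, 3 and 4 values respectively: equal width does not mean equal counts, which is why skewed data is handled poorly.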
  • 29. Binning: Equal-width (distance) partitioning
  • 30. Binning: Equal-depth (frequency) partitioning
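The body of this slide is missing from the transcript. As a sketch of the technique named in the title, the code below splits sorted values into bins holding roughly the same number of values, then smooths each bin by its mean. The price list and function names are illustrative assumptions.

```python
from statistics import mean

def equal_depth_bins(values, n):
    """Split sorted values into n bins with (nearly) equal counts."""
    ordered = sorted(values)
    total = len(ordered)
    return [ordered[i * total // n:(i + 1) * total // n] for i in range(n)]

def smooth_by_bin_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[mean(b)] * len(b) for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample values
bins = equal_depth_bins(prices, 3)
smoothed = smooth_by_bin_means(bins)
```

Each bin holds three values here, and after smoothing the nine prices are represented by just three distinct numbers (9, 22 and 29).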
  • 31. Histogram Analysis
  • 32. Cluster Analysis: Clustering can be used to generate a concept hierarchy for a numeric attribute A by following either a top-down splitting strategy or a bottom-up merging strategy, where each cluster forms a node of the concept hierarchy. • In the top-down strategy, each initial cluster may be further decomposed into several subclusters, forming a lower level of the hierarchy. • In the bottom-up strategy, clusters may be repeatedly grouped with neighboring clusters to form higher-level concepts.
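The bottom-up merging strategy can be sketched for a single numeric attribute as follows: start with one cluster per value and repeatedly merge the two neighboring clusters separated by the smallest gap. This is a simplified illustration, not a specific algorithm from the slides; the ages and the function name are assumptions.

```python
def bottom_up_clusters(values, k):
    """Merge neighboring clusters with the smallest gap until k remain."""
    clusters = [[v] for v in sorted(values)]
    while len(clusters) > k:
        # Gap between the end of each cluster and the start of the next one.
        gaps = [clusters[i + 1][0] - clusters[i][-1]
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters

ages = [22, 25, 27, 41, 44, 46, 68, 71]  # hypothetical values of attribute A
clusters = bottom_up_clusters(ages, 3)
```

Each of the three resulting clusters can serve as one interval of the discretized attribute. Stopping the merging at a larger k gives a lower, finer level of the same hierarchy.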
  • 33. References: Margaret H. Dunham and Sridhar, Data Mining: Introductory and Advanced Topics, Pearson Education, ISBN 81-7758-785-4. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, ISBN 81-312-0535-5.