Data Preprocessing
MS. T.K. ANUSUYA
ASSISTANT PROFESSOR
DEPARTMENT OF COMPUTER SCIENCE
BON SECOURS COLLEGE FOR WOMEN, THANJAVUR.
Why Data Pre-processing?
• Real-world data is often:
  • Noisy: contains errors or outliers
  • Incomplete: attribute values are missing, e.g., name=""
  • Duplicated: the same tuple appears more than once
  • Inconsistent, largely because real-world data sets are typically huge
• Low-quality data leads to low-quality mining results.
• Data comes from different sources
  • It must be extracted, cleaned, and transformed before mining
Data Pre-processing
Multidimensional Measures of Data Quality
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Interpretability
Data Pre-processing Techniques
• Data cleaning
• Data integration
• Data reduction
• Data transformation
Data Pre-processing Techniques
• Data cleaning
  • Handle missing values, noisy data, and outliers; resolve inconsistencies in dirty data
• Data integration
  • Integrate multiple databases, data cubes, or files
• Data transformation
  • Normalization and aggregation
• Data reduction
  • Reduce data size by compressing, aggregating, or eliminating redundant features
  • Dimensionality reduction: remove irrelevant attributes
  • Numerosity reduction: replace the data with a smaller alternative representation, either a parametric model (regression, log-linear models) or a non-parametric model (e.g., histograms, clusters, sampling, data aggregation)
Data Cleaning
Fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
• Missing Values
  • Ignore the tuple: usually done when the class label is missing
  • Fill in the missing value manually: tedious and often infeasible for large data sets
  • Use a global constant (e.g., "unknown") to fill in the missing value: the mining program may mistake the constant for a new class
  • Use a measure of central tendency (mean or median) for the attribute
  • Use the attribute mean or median of all samples belonging to the same class as the given tuple
  • Use the most probable value to fill in the missing value, inferred by regression, a Bayesian formula, or decision trees
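The class-wise mean strategy from the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `fill_with_class_mean`, the list-of-dicts record layout, and the sample incomes are invented for the example and do not come from the slides.

```python
from statistics import mean


def fill_with_class_mean(rows, attr, label):
    """Return copies of rows where a missing (None) value of `attr` is
    replaced by the mean of `attr` over rows with the same class label."""
    # Collect the observed (non-missing) values per class.
    observed = {}
    for row in rows:
        if row[attr] is not None:
            observed.setdefault(row[label], []).append(row[attr])
    class_means = {cls: mean(vals) for cls, vals in observed.items()}

    filled = []
    for row in rows:
        new_row = dict(row)  # do not mutate the caller's data
        if new_row[attr] is None:
            # Stays None if the class has no observed value at all.
            new_row[attr] = class_means.get(new_row[label])
        filled.append(new_row)
    return filled


# Hypothetical sample data (assumption, not from the slides).
customers = [
    {"income": 30000, "risk": "low"},
    {"income": 50000, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 90000, "risk": "high"},
    {"income": None, "risk": "high"},
]
filled = fill_with_class_mean(customers, "income", "risk")
```

Swapping `mean` for `statistics.median` gives the median variant, which is usually the safer choice when the attribute is skewed.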
Data Cleaning
• Noisy Data
  • Noise is a random error or variance in a measured variable.
  • Binning: sort the data, partition it into bins, then smooth by bin means, bin medians, or bin boundaries
  • Clustering: detect and remove outliers
  • Semi-automated inspection: the computer flags suspicious values and a human checks them
  • Regression: smooth the data by fitting it to a regression function
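As a rough sketch of the binning method, the snippet below sorts the values, partitions them into equal-frequency bins, and smooths each bin by its mean or by its nearest boundary. The function names and the nine sample prices are assumptions chosen for illustration; the slides do not give a data set.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of `bin_size`."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Hypothetical sorted prices (assumption, not from the slides).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)            # [[9.0]*3, [22.0]*3, [29.0]*3]
by_boundaries = smooth_by_boundaries(bins)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Ties in `smooth_by_boundaries` go to the lower boundary here; that is an arbitrary choice for the sketch.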
Data Integration
• Data integration
  • Merges data from multiple data stores
  • Careful integration reduces and avoids redundancies and inconsistencies
  • This improves the accuracy and speed of the subsequent mining process
• Entity identification problem
  • Identify the same real-world entity across multiple data sources, e.g., Bill Clinton = William Clinton
• Redundant attributes can often be detected by correlation analysis and covariance analysis
Correlation Analysis (Nominal Data)
• χ² (chi-square) test:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
  • The number of hospitals and the number of car thefts in a city are correlated
  • Both are causally linked to a third variable: population
Chi-square Calculation: Example

                            Play chess    Not play chess    Sum (row)
Like science fiction        250 (90)      200 (360)         450
Not like science fiction    50 (210)      1000 (840)        1050
Sum (col.)                  300           1200              1500

χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$

This shows that like_science_fiction and play_chess are correlated in the group.
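The calculation above can be reproduced with a short script. It is a sketch, not part of the original deck: the function name `chi_square` is an assumption, and the expected count of each cell is taken as row total × column total / grand total, which yields the parenthesized values in the table.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as a
    list of rows. Returns (statistic, expected_counts)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, obs in enumerate(row):
            # Expected count if the two attributes were independent.
            exp = row_totals[i] * col_totals[j] / grand_total
            expected_row.append(exp)
            statistic += (obs - exp) ** 2 / exp
        expected.append(expected_row)
    return statistic, expected


# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200], [50, 1000]]
statistic, expected = chi_square(observed)
print(round(statistic, 2))  # about 507.94, which the slide truncates to 507.93
```

Deciding whether the statistic is significant additionally needs the degrees of freedom, (rows − 1) × (columns − 1) = 1 for this table, and a chi-square critical value, which this sketch leaves out.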
Correlation Analysis (Numerical Data)
Correlation coefficient r_{A,B} (also called Pearson's product-moment coefficient).
In the formula for r_{A,B}, n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB cross-products.
• If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
• If r_{A,B} = 0, A and B have no linear correlation (this alone does not guarantee that they are independent).
• If r_{A,B} < 0, A and B are negatively correlated.
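The cross-product form of the coefficient, r = (Σ(AB) − n·Ā·B̄) / ((n − 1)·σ_A·σ_B), can be sketched directly. The function name `pearson_r` and the sample lists are assumptions for illustration. Because the denominator uses n − 1, the sketch also assumes σ_A and σ_B are sample standard deviations (divisor n − 1), which keeps r within [−1, 1].

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation of two equal-length numeric sequences,
    using the cross-product form with sample standard deviations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (divisor n - 1): an assumption, see lead-in.
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum(x * y for x, y in zip(a, b))  # sum of AB cross-products
    return (cross - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)


# Hypothetical attributes (assumption, not from the slides).
a = [1, 2, 3, 4, 5]
r_positive = pearson_r(a, [2, 4, 6, 8, 10])   # perfectly increasing with a
r_negative = pearson_r(a, [10, 8, 6, 4, 2])   # perfectly decreasing with a
```

A value near +1 or −1 for a pair of attributes suggests that one of them is redundant and is a candidate for removal during integration.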
$$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$
Data Reduction
• Obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the base data.
• Data cube aggregation
• Dimensionality reduction: reduce the number of random variables or attributes under consideration (e.g., wavelet transforms)
• Numerosity reduction: regression and log-linear models, histograms, clustering, sampling, data cube aggregation
• Data compression
Wavelet Transform
• Data are transformed so that the relative distance between objects is preserved at different levels of resolution
• Used for image compression
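The slide does not name a specific wavelet. As one possible illustration, the sketch below uses the Haar wavelet, the simplest case, with pairwise averaging and differencing. The function name `haar_transform` and the sample signal are assumptions, and the input length must be a power of two.

```python
def haar_transform(signal):
    """Unnormalized Haar wavelet transform by repeated pairwise
    averaging (coarser approximation) and differencing (detail)."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")

    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        level_details = [(x - y) / 2 for x, y in pairs]
        approx = [(x + y) / 2 for x, y in pairs]
        # Coarser-level details go in front of finer-level ones.
        details = level_details + details
    return approx + details


# Hypothetical signal (assumption, not from the slides).
coefficients = haar_transform([2, 2, 0, 2, 3, 5, 4, 4])
# coefficients == [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

The first coefficient is the overall average, and the rest describe detail at successively finer resolutions. Data reduction comes from storing only the largest coefficients and treating the others as zero.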
Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data representation
• Parametric methods (e.g., regression)
  • Assume the data fits a model, estimate the model parameters, and store only the parameters
  • Linear regression: the data is modeled to fit a straight line
  • Multiple regression: the response is modeled as a linear function of a multidimensional feature vector
  • Log-linear model: approximates discrete multidimensional probability distributions
• Non-parametric methods
  • Do not assume a model (histograms, clustering, sampling, ...)
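To make the parametric idea concrete, the sketch below fits a straight line by ordinary least squares, so that only two numbers (slope and intercept) need to be stored instead of every (x, y) pair. The function name `fit_line` and the sample points are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points lying exactly on y = 2x + 1 (assumption).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

# The stored parameters stand in for the data: approximate y for any x.
estimate = slope * 2.5 + intercept
```

With real data the fit is only approximate, so the reduction trades some accuracy for storage.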
Histograms
• A popular data reduction technique
• Divide the data into buckets (e.g., of equal width) and store a summary, such as the average, for each bucket
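A minimal sketch of an equal-width histogram follows; it keeps only the bucket range, the count, and the average for each bucket. The function name `equal_width_histogram`, the tuple layout of the summary, and the sample values are assumptions for illustration.

```python
def equal_width_histogram(values, num_buckets):
    """Summarize values as (start, end, count, average) per bucket.
    The average is None for an empty bucket."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are equal; no width to divide")
    width = (high - low) / num_buckets

    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        # The maximum value falls into the last bucket.
        index = min(int((v - low) / width), num_buckets - 1)
        buckets[index].append(v)

    summary = []
    for i, bucket in enumerate(buckets):
        start = low + i * width
        average = sum(bucket) / len(bucket) if bucket else None
        summary.append((start, start + width, len(bucket), average))
    return summary


# Hypothetical values (assumption, not from the slides).
summary = equal_width_histogram([0, 1, 2, 10, 11, 20, 30], 3)
```

Equal-frequency buckets, where each bucket holds roughly the same number of values, are a common alternative when the data is skewed.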
Data Cube Aggregation
• The lowest level of a data cube is the base cuboid
• The highest level of abstraction is the apex cuboid
• Data cubes hold multiple levels of aggregation
• Provide fast access to precomputed, summarized data
• Reduce the size of the data to be handled
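The levels of aggregation can be illustrated with a small roll-up sketch. The fact layout (a dimensions dict plus one numeric measure), the function name `roll_up`, and the sales figures are all assumptions made for the example.

```python
from collections import defaultdict


def roll_up(facts, keep_dims):
    """Sum the measure over all dimensions not listed in keep_dims."""
    totals = defaultdict(float)
    for dims, measure in facts:
        key = tuple(dims[d] for d in keep_dims)
        totals[key] += measure
    return dict(totals)


# Hypothetical quarterly sales: the base cuboid (assumption).
sales = [
    ({"year": 2008, "quarter": "Q1"}, 200.0),
    ({"year": 2008, "quarter": "Q2"}, 300.0),
    ({"year": 2008, "quarter": "Q3"}, 250.0),
    ({"year": 2008, "quarter": "Q4"}, 350.0),
    ({"year": 2009, "quarter": "Q1"}, 400.0),
    ({"year": 2009, "quarter": "Q2"}, 100.0),
]

by_year = roll_up(sales, ["year"])  # one row per year instead of per quarter
apex = roll_up(sales, [])           # apex cuboid: a single grand total
```

An analysis that only needs annual totals can work on `by_year`, which here has two rows instead of six.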
Data Transformation
• A pre-processing step
• Data are transformed or consolidated so that the resulting mining process may be more efficient and the patterns found may be easier to understand.
• Smoothing: remove noise from the data (binning, regression, and clustering)
• Attribute construction: new attributes are constructed from the given ones
• Aggregation: data are summarized, e.g., to build a data cube
• Normalization: scale values into a small range (min-max, z-score)
• Discretization: replace raw values with interval or concept labels (concept hierarchy climbing)
• Concept hierarchy generation for nominal data
Normalization
• Min-max normalization to [new_min_A, new_max_A]:

$$v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$$

  • Ex.: Let the income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 - 12,000) / (98,000 - 12,000) × (1.0 - 0.0) + 0.0 ≈ 0.709
• Z-score normalization (μ_A: mean, σ_A: standard deviation of A):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

• Normalization by decimal scaling, where j is the smallest integer such that max(|v'|) < 1:

$$v' = \frac{v}{10^{j}}$$
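The three normalization methods can be sketched as small functions. The function names are assumptions; the min-max call uses the income example from the slide, while the z-score and decimal-scaling inputs are invented for illustration.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mu, sigma):
    """Express v as the number of standard deviations from the mean."""
    return (v - mu) / sigma


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that
    max(|v'|) < 1. Returns (scaled_values, j)."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


# Income example from the slide: $73,000 within [$12,000, $98,000].
income_scaled = min_max(73000, 12000, 98000)  # about 0.709

# Hypothetical mean and standard deviation (assumption).
income_z = z_score(73600, 54000, 16000)

# Hypothetical attribute ranging from -986 to 917 (assumption).
scaled, j = decimal_scaling([-986, 917])
```

Min-max normalization needs the attribute's true minimum and maximum; a later value outside that range falls outside the target interval. Z-score normalization avoids that dependency, which helps when the extremes are unknown or dominated by outliers.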
Data Discretization
• Three types of attributes:
  • Nominal: values from an unordered set, e.g., color, profession
  • Ordinal: values from an ordered set, e.g., military or academic rank
  • Continuous: numeric values, e.g., integer or real numbers
• Discretization:
  • Divide the range of a continuous attribute into intervals
  • Some classification algorithms only accept categorical attributes
  • Reduce data size by discretization
Data Discretization
• Discretization
  • Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
  • Interval labels can then be used to replace actual data values
  • Supervised vs. unsupervised
  • Split (top-down) vs. merge (bottom-up)
  • Discretization can be performed recursively on an attribute
• Concept hierarchy formation
  • Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior)
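Replacing numeric ages with concept labels, as described above, can be sketched with the standard-library `bisect` module. The cut points 30 and 60 are assumptions chosen for the example; the slides name the labels but not the boundaries.

```python
from bisect import bisect_right


def discretize(value, cut_points, labels):
    """Return the label of the interval containing value.
    cut_points must be sorted and len(labels) == len(cut_points) + 1."""
    return labels[bisect_right(cut_points, value)]


# Hypothetical boundaries (assumption): <30 young, 30-59 middle-aged, 60+ senior.
cut_points = [30, 60]
labels = ["young", "middle-aged", "senior"]

ages = [25, 30, 59, 60, 72]
age_groups = [discretize(age, cut_points, labels) for age in ages]
# age_groups == ["young", "middle-aged", "middle-aged", "senior", "senior"]
```

Here the cut points are fixed by hand, which is an unsupervised choice. A supervised method would instead pick the boundaries that best separate the class labels.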
Data Discretization Methods
• Typical methods (all can be applied recursively):
  • Binning: top-down split, unsupervised
  • Histogram analysis: top-down split, unsupervised
  • Clustering analysis: unsupervised, top-down split or bottom-up merge
  • Decision-tree analysis: supervised, top-down split
  • Correlation (e.g., χ²) analysis: supervised, bottom-up merge