Data Preprocessing
Jun Du
The University of Western Ontario
jdu43@uwo.ca
Outline
• Data
• Data Preprocessing: An Overview
• Data Cleaning
• Data Transformation and Data Discretization
• Data Reduction
• Summary
1
What is Data?
• Collection of data objects
and their attributes
• Data objects → rows
• Attributes → columns
2
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
(Columns are attributes; rows are objects.)
Data Objects
• A data object represents an entity.
• Examples:
– Sales database: customers, store items, sales
– Medical database: patients, treatments
– University database: students, professors, courses
• Also called examples, instances, records, cases,
samples, data points, objects, etc.
• Data objects are described by attributes.
3
Attributes
• An attribute is a data field, representing a
characteristic or feature of a data object.
• Example:
– Customer data: customer_ID, name, gender, age, address,
phone number, etc.
– Product data: product_ID, price, quantity, manufacturer,
etc.
• Also called features, variables, fields, dimensions, etc.
4
Attribute Types (1)
• Nominal (Discrete) Attribute
– Has only a finite set of values (such as categories or states)
– E.g., Hair_color = {black, blond, brown, grey, red, white, …}
– E.g., marital status, zip codes
• Numeric (Continuous) Attribute
– Has real numbers as attribute values
– E.g., temperature, height, or weight.
• Question: what about student id, SIN, year of birth?
5
Attribute Types (2)
• Binary
– A special case of nominal attribute: with only 2 states (0
and 1)
– Gender = {male, female};
– Medical test = {positive, negative}
• Ordinal
– Usually a special case of nominal attribute: values have a
meaningful order (ranking)
– Size = {small, medium, large}
– Army rankings
6
Data Preprocessing
• Why preprocess the data?
– Data quality is poor in the real world.
– No quality data, no quality mining results!
• Measures for data quality
– Accuracy: noise, outliers, …
– Completeness: missing values, …
– Redundancy: duplicated data, irrelevant data, …
– Consistency: some modified but some not, …
– ……
8
Typical Tasks in Data Preprocessing
• Data Cleaning
– Handle missing values, noisy / outlier data, resolve
inconsistencies, …
• Data Transformation
– Aggregation
– Type Conversion
– Normalization
• Data Reduction
– Data Sampling
– Dimensionality Reduction
• ……
9
Data Cleaning
• Missing value: lacking attribute values
– E.g., Occupation = “ ”
• Noise (Error): modification of original values
– E.g., Salary = “−10”
• Outlier: considerably different from most of the
other data (not necessarily error)
– E.g., Salary = “2,100,000”
• Inconsistency: discrepancies in codes or names
– E.g., Age=“42”, Birthday=“03/07/2010”
– Was rating “1, 2, 3”, now rating “A, B, C”
• ……
11
Missing Values
• Reasons for missing values
– Information is not collected
• E.g., people decline to give their age and weight
– Attributes may not be applicable to all cases
• E.g., annual income is not applicable to children
– Human / Hardware / Software problems
• E.g., Birthdate information is accidentally deleted for all
people born in 1988.
– ……
12
How to Handle Missing Value?
• Eliminate / ignore missing values
– Eliminate / ignore the examples
– Eliminate / ignore the features
– Simple; not applicable when data is scarce
• Estimate missing value
– Global constant : e.g., “unknown”,
– Attribute mean (median, mode)
– Predict the value based on features (data imputation)
• Estimate gender based on first name (name gender)
• Estimate age based on first name (name popularity)
• Build a predictive model based on other features
– Missing value estimation depends on the missing reason!
13
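The simpler estimation strategies above (global constant, attribute mean, attribute mode) can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `impute` and the sample values are assumptions made for the example, not part of the original slides.

```python
from statistics import mean, mode

def impute(values, strategy="mean", constant="unknown"):
    """Return a copy of values with None entries replaced by an estimate."""
    observed = [v for v in values if v is not None]
    if strategy == "constant":
        fill = constant            # global constant, e.g. "unknown"
    elif strategy == "mean":
        fill = mean(observed)      # attribute mean (numeric attributes)
    elif strategy == "mode":
        fill = mode(observed)      # most frequent value (nominal attributes)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Made-up sample data with missing entries marked as None.
ages = [25, None, 35, 30, None]
status = ["Single", "Married", None, "Married"]

ages_filled = impute(ages, "mean")
status_filled = impute(status, "mode")
occupation_filled = impute(["clerk", None], "constant")
```

Which strategy is appropriate still depends on why the value is missing; the sketch only shows the mechanics.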
Demonstration
• ReplaceMissingValues
– Weka → Vote
– Replacing missing values for nominal and numeric
attributes
• More functions in RapidMiner
14
Noisy (Outlier) Data
• Noise: refers to modification of original values
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
15
How to Handle Noisy (Outlier) Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human
16
Binning
Sort data in ascending order: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into equal-frequency (equal-depth) bins:
– Bin 1: 4, 8, 9, 15
– Bin 2: 21, 21, 24, 25
– Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
– Bin 1: 9, 9, 9, 9
– Bin 2: 23, 23, 23, 23
– Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
– Bin 1: 4, 4, 4, 15
– Bin 2: 21, 21, 25, 25
– Bin 3: 26, 26, 26, 34
17
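The binning example above can be reproduced with a short sketch. The helper names are illustrative, the bin means are rounded to the nearest integer as on the slide, and the code assumes the number of values is divisible by the number of bins.

```python
def equal_frequency_bins(values, n_bins):
    """Sort values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by the rounded mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(data, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Ties between the two boundaries are resolved toward the lower boundary here; that tie-breaking rule is a choice made for the sketch.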
Regression
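[Figure: data points in the x-y plane with the fitted line y = x + 1; the observed value Y1 at X1 is replaced by the value Y1' on the line]
18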
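Smoothing by regression can be illustrated with an ordinary least-squares line fit. The sketch below is one possible way to do it in plain Python; the observations are made up to lie near y = x + 1, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy observations scattered around y = x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.2, 1.8, 3.1, 3.9, 5.0]

slope, intercept = fit_line(xs, ys)
# Each observed y is replaced by the value predicted by the fitted line.
smoothed = [slope * x + intercept for x in xs]
```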
Cluster Analysis
19
Data Transformation
• Aggregation:
– Attribute / example summarization
• Feature type conversion:
– Nominal → Numeric, …
• Normalization:
– Scaled to fall within a small, specified range
• Attribute/feature construction:
– New attributes constructed from the given ones
21
Aggregation
• Combining two or more attributes (examples) into a single
attribute (example)
• Combining two or more attribute values into a single attribute
value
• Purpose
– Change of scale
• Cities aggregated into regions, states, countries, etc.
– More “stable” data
• Aggregated data tends to have less variability
– More “predictive” data
• Aggregated data might have higher predictability
22
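A change of scale by aggregation might look like the following sketch, which rolls city-level records up to the province level. The city names and sales figures are invented for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical city-level sales records: (city, province, sales).
city_sales = [
    ("Toronto", "Ontario", 120),
    ("Ottawa", "Ontario", 80),
    ("London", "Ontario", 40),
    ("Montreal", "Quebec", 110),
    ("Quebec City", "Quebec", 50),
]

# Group the example-level values by the coarser attribute.
by_province = defaultdict(list)
for city, province, sales in city_sales:
    by_province[province].append(sales)

# Summarize each group into a single aggregated value.
total_sales = {p: sum(v) for p, v in by_province.items()}
mean_sales = {p: mean(v) for p, v in by_province.items()}
```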
Demonstration
• MergeTwoValues
– Weka → contact-lenses
– Merge class values “soft” and “hard”
• Effective aggregation in real-world application
23
Feature Type Conversion
• Some algorithms can only handle numeric features; some can
only handle nominal features. Only a few can handle both.
• Features have to be converted to satisfy the requirements of
learning algorithms.
– Numeric → Nominal (Discretization)
• E.g., Age Discretization: Young 18-29; Career 30-40; Mid-Life 41-55;
Empty-Nester 56-69; Senior 70+
– Nominal → Numeric
• Introduce multiple numeric features for one nominal feature
• Nominal → Binary (Numeric)
• E.g., size={L, M, S} → size_L: 0, 1; size_M: 0, 1; size_S: 0, 1
24
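Both conversions above can be sketched briefly. The age groups follow the slide; the "Under 18" label is an assumption added so that every age maps to some group, and the function names are illustrative.

```python
def discretize_age(age):
    """Map a numeric age to the nominal groups listed on the slide."""
    if age < 18:
        return "Under 18"      # not on the slide; assumed for completeness
    if age <= 29:
        return "Young"
    if age <= 40:
        return "Career"
    if age <= 55:
        return "Mid-Life"
    if age <= 69:
        return "Empty-Nester"
    return "Senior"

def one_hot(name, value, categories):
    """Encode one nominal value as 0/1 indicator features, one per category."""
    return {f"{name}_{c}": int(value == c) for c in categories}

groups = [discretize_age(a) for a in (23, 40, 41, 70)]
encoded = one_hot("size", "M", ["L", "M", "S"])
```

Introducing one binary feature per category avoids imposing an artificial order on nominal values, which a single integer code would do.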
Demonstration
• Discretize
– Weka → diabetes
– Discretize “age” (equal-width bins vs. equal-frequency bins)
• NumericToNominal
– Weka → diabetes
– Discretize “age” (vs “Discretize” method)
• NominalToBinary
– UCI → autos
– Convert “num-of-doors”
– Convert “drive-wheels”
25
Normalization
Scale the attribute values to a small specified range
• Min-max normalization: to [new_min_A, new_max_A]
$$ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A $$
– E.g., let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
Then $73,600 is mapped to
$$ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 $$
• Z-score normalization (μ: mean, σ: standard deviation):
$$ v' = \frac{v - \mu_A}{\sigma_A} $$
• ……
26
Demonstration
• Normalize
– Weka → diabetes
– Normalize “age”
• Standardize
– Weka → diabetes
– Standardize “age” (vs “Normalize” method)
27
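Min-max and z-score normalization can be checked numerically with a small sketch. It reuses the income figures from the worked formula (73,600 in the range 12,000 to 98,000); the z-score sample values are made up, and using the population standard deviation (`pstdev`) is an assumption, since the slides do not say which variant is meant.

```python
from statistics import mean, pstdev

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Rescale v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_scores(values):
    """Centre values on their mean and scale by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [(v - mu) / sigma for v in values]

income = min_max(73_600, 12_000, 98_000)
scores = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
```

Min-max normalization is sensitive to outliers because a single extreme value stretches the range; z-score normalization is usually less affected.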
Sampling
• Big data era: too expensive (or even infeasible) to
process the entire data set
• Sampling: obtaining a small sample to represent the
entire data set (undersampling)
• Oversampling is also required in some scenarios,
such as the class imbalance problem
– E.g., 1,000 HIV test results: 5 positive, 995 negative
29
Sampling Principle
Key principle for effective sampling:
• Using a sample will work almost as well as using the
entire data set, if the sample is representative
• A sample is representative if it has approximately the
same property (of interest) as the original set of data
30
Types of Sampling (1)
• Random sampling without replacement
– As each example is selected, it is removed from the population
• Random sampling with replacement
– Examples are not removed from the population after being selected
• The same example can be picked up more than once
31
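The difference between the two sampling schemes can be seen with Python's standard `random` module. The population of ten example IDs and the fixed seed are assumptions made so the sketch is repeatable.

```python
import random

population = list(range(1, 11))   # ten hypothetical example IDs
rng = random.Random(42)           # fixed seed so the sketch is repeatable

# Without replacement: each example can be chosen at most once.
without_replacement = rng.sample(population, k=5)

# With replacement: the same example may be chosen more than once.
with_replacement = rng.choices(population, k=5)
```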
Types of Sampling (2)
• Stratified sampling
– Split the data into several partitions; then draw random
samples from each partition
32
[Figure: raw data and the stratified sample drawn from it]
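Stratified sampling can be sketched as follows: partition the records by a key, then draw the same fraction at random from each partition. The helper name, the 20% fraction, and the imbalanced toy data are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, rng):
    """Draw the same fraction of records at random from each stratum."""
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for members in strata.values():
        # Keep at least one record per stratum so small classes survive.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical imbalanced data: 95 negative and 5 positive examples.
records = [(i, "negative") for i in range(95)]
records += [(i, "positive") for i in range(95, 100)]

sample = stratified_sample(records, key=lambda r: r[1],
                           fraction=0.2, rng=random.Random(0))
```

A plain random sample of 20 records could easily contain no positive example at all; sampling per stratum keeps the class proportions of the original data.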
Demonstration
• Resample
– UCI → waveform-5000
– Undersampling (with or without replacement)
33
Dimensionality Reduction
• Purpose:
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise
• Techniques
– Feature Selection
– Feature Extraction
34
Feature Selection
• Redundant features
– Duplicated information contained in different features
– E.g., “Age”, “Year of Birth”; “Purchase price”, “Sales tax”
• Irrelevant features
– Containing no information that is useful for the task
– E.g., students' ID is irrelevant to predicting GPA
• Goal:
– A minimum set of features containing all (most)
information
35
Heuristic Search in Feature Selection
• Given d features, there are 2^d possible feature
combinations
– Exhaustive search is infeasible
– Heuristics have to be applied
• Typical heuristic feature selection methods:
– Feature ranking
– Forward feature selection
– Backward feature elimination
– Bidirectional search (selection + elimination)
– Search based on evolutionary algorithms
– ……
36
Feature Ranking
• Steps:
1) Rank all the individual features according to certain criteria
(e.g., information gain, gain ratio, χ²)
2) Select / keep top N features
• Properties:
– Usually independent of the learning algorithm to be used
– Efficient (no search process)
– Hard to determine the threshold
– Unable to consider correlation between features
37
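Feature ranking by information gain can be sketched on the two nominal columns of the ten-row example table from the "What is Data?" slide, with "Cheat" as the class. The helper names are illustrative, and keeping only the top feature (N = 1) is an arbitrary choice for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Class entropy minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Nominal columns of the example table (Tid 1 to 10).
features = {
    "Refund": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Marital Status": ["Single", "Married", "Single", "Married", "Divorced",
                       "Married", "Divorced", "Single", "Married", "Single"],
}
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

# Step 1: rank features by the criterion; step 2: keep the top N.
ranking = sorted(features,
                 key=lambda name: information_gain(features[name], cheat),
                 reverse=True)
top_n = ranking[:1]
```

Each feature is scored on its own, which is why this approach cannot account for correlation between features.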
Forward Feature Selection
• Steps:
1) First select the best single feature (according to the learning
algorithm)
2) Repeat (until some stop criterion is met):
Select the next best feature, given the already picked features
• Properties:
– Usually learning algorithm dependent
– Feature correlation is considered
– More reliable
– Inefficient
38
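The greedy loop behind forward feature selection can be sketched as below. The `score` function stands in for "train the learning algorithm on this subset and measure its quality"; here it is a made-up toy (integer usefulness values with a penalty for a redundant pair), not a real learner, and all names are hypothetical.

```python
def forward_selection(features, score, max_features):
    """Greedily add the feature that most improves score(subset)."""
    selected = []
    best_score = float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Evaluate each candidate together with the already picked features.
        scored = [(score(selected + [f]), f) for f in candidates]
        new_score, best_feature = max(scored)
        if new_score <= best_score:   # stop criterion: no improvement
            break
        selected.append(best_feature)
        best_score = new_score
    return selected, best_score

# Toy stand-in for a learning algorithm (made-up values, in percent).
USEFULNESS = {"age": 30, "birth_year": 29, "income": 25, "student_id": 0}

def toy_score(subset):
    total = sum(USEFULNESS[f] for f in subset)
    if "age" in subset and "birth_year" in subset:
        total -= 29   # redundant pair: the second feature adds nothing
    return total

selected, final_score = forward_selection(list(USEFULNESS), toy_score,
                                          max_features=3)
```

Because every candidate is scored in the context of the features already chosen, the redundant `birth_year` is never added once `age` is in the set. The cost is one model evaluation per candidate per round.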
Backward Feature Elimination
• Steps:
1) First build a model based on all the features
2) Repeat (until some criterion is met):
Eliminate the feature that makes the least contribution.
• Properties:
– Usually learning algorithm dependent
– Feature correlation is considered
– More reliable
– Inefficient
39
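Backward feature elimination runs the same idea in reverse: start from all features and repeatedly drop the one whose removal hurts least. As before, `score` is a hypothetical stand-in for evaluating the learning algorithm, and the toy usefulness values are invented.

```python
def backward_elimination(features, score, min_features=1):
    """Greedily drop the feature whose removal costs the least."""
    selected = list(features)
    best_score = score(selected)
    while len(selected) > min_features:
        # Score every subset obtained by dropping exactly one feature.
        scored = [(score([g for g in selected if g != f]), f)
                  for f in selected]
        new_score, weakest = max(scored)
        if new_score < best_score:    # stop criterion: every removal hurts
            break
        selected.remove(weakest)
        best_score = new_score
    return selected, best_score

# Toy stand-in for a learning algorithm (made-up values, in percent).
USEFULNESS = {"age": 30, "birth_year": 29, "income": 25, "student_id": 0}

def toy_score(subset):
    total = sum(USEFULNESS[f] for f in subset)
    if "age" in subset and "birth_year" in subset:
        total -= 29   # redundant pair: the second feature adds nothing
    return total

kept, kept_score = backward_elimination(list(USEFULNESS), toy_score)
```

In this toy setting the irrelevant `student_id` and the redundant `birth_year` are removed without lowering the score, after which any further removal would reduce it.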
Filter vs Wrapper Model
• Filter model
– Separating feature selection from learning
– Relying on general characteristics of data (information, etc.)
– No bias toward any learning algorithm, fast
– Feature ranking usually falls into this category
• Wrapper model
– Relying on a predetermined learning algorithm
– Using predictive accuracy as goodness measure
– High accuracy, computationally expensive
– Forward feature selection (FFS) and backward feature elimination (BFE) usually fall into this category
40
Demonstration
• Feature ranking
– Weka → weather
– ChiSquared, InfoGain, GainRatio
• FFS & BFE
– Weka → diabetes
– ClassifierSubsetEval + GreedyStepwise
41
Feature Extraction
• Map original high-dimensional data onto a lower-
dimensional space
– Generate a (smaller) set of new features
– Preserve all (most) information from the original data
• Techniques
– Principal Component Analysis (PCA)
– Canonical Correlation Analysis (CCA)
– Linear Discriminant Analysis (LDA)
– Independent Component Analysis (ICA)
– Manifold Learning
– ……
42
Principal Component Analysis (PCA)
• Find a projection that captures the largest amount of variation
in data
• The original data are projected onto a much smaller space,
resulting in dimensionality reduction.
43
[Figure: data points in the (x1, x2) plane with the principal direction e]
Principal Component Analysis (Steps)
• Given n-dimensional data (n features), find k ≤ n new
features (principal components) that best represent the data
– Normalize input data so that each feature falls within the same range
– Compute k principal components (details omitted)
– Project each input data point into the new k-dimensional space
– The new features (principal components) are sorted in order of
decreasing “significance” or strength
– Eliminate weak components / features to reduce dimensionality.
• Works for numeric data only
44
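Since the slides omit how the components are computed, here is one possible sketch for the first principal component only, using power iteration on the covariance matrix in plain Python. It mean-centres the data but skips the range normalization step to stay short; the data points and the function name are assumptions, and a real analysis would more likely use a linear algebra library.

```python
from math import sqrt

def first_principal_component(rows, iterations=100):
    """Return (component, projections) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector of the largest eigenvalue,
    # i.e. the direction that captures the most variation.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every centred point onto the component (d features become 1).
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, projections

# Hypothetical 2-D points lying close to the line x2 = x1.
points = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
component, reduced = first_principal_component(points)
```

For these points the component comes out close to the diagonal direction, so a single new feature preserves most of the variation in the two original ones.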
PCA Demonstration
• UCI → breast-w
– Accuracy with all features
– PrincipalComponents (data transformation)
– Visualize/save transformed data (first two features, last
two features)
– Accuracy with all transformed features
– Accuracy with top 1 or 2 feature(s)
45
Summary
• Data (features and instances)
• Data Cleaning: missing values, noise / outliers
• Data Transformation: aggregation, type conversion,
normalization
• Data Reduction
– Sampling: random sampling with replacement, random
sampling without replacement, stratified sampling
– Dimensionality reduction:
• Feature Selection: Feature ranking, FFS, BFE
• Feature Extraction: PCA
47
Notes
• In real-world applications, data preprocessing usually
accounts for about 70% of the workload in a data mining task.
• Domain knowledge is usually required to do good
data preprocessing.
• To improve the predictive performance of a model:
– Improve learning algorithms (different algorithms,
different parameters)
• Most data mining research focuses here
– Improve data quality through data preprocessing
• Deserves more attention!
48


  • 1. Data Preprocessing Jun Du The University of Western Ontario jdu43@uwo.ca
  • 2. Outline • Data • Data Preprocessing: An Overview • Data Cleaning • Data Transformation and Data Discretization • Data Reduction • Summary
  • 3. What is Data? • Collection of data objects and their attributes • Data objects → rows • Attributes → columns
      Tid  Refund  Marital Status  Taxable Income  Cheat
      1    Yes     Single          125K            No
      2    No      Married         100K            No
      3    No      Single          70K             No
      4    Yes     Married         120K            No
      5    No      Divorced        95K             Yes
      6    No      Married         60K             No
      7    Yes     Divorced        220K            No
      8    No      Single          85K             Yes
      9    No      Married         75K             No
      10   No      Single          90K             Yes
  • 4. Data Objects • A data object represents an entity. • Examples: – Sales database: customers, store items, sales – Medical database: patients, treatments – University database: students, professors, courses • Also called examples, instances, records, cases, samples, data points, objects, etc. • Data objects are described by attributes.
  • 5. Attributes • An attribute is a data field, representing a characteristic or feature of a data object. • Example: – Customer data: customer_ID, name, gender, age, address, phone number, etc. – Product data: product_ID, price, quantity, manufacturer, etc. • Also called features, variables, fields, dimensions, etc.
  • 6. Attribute Types (1) • Nominal (Discrete) Attribute – Has only a finite set of values (such as categories, states, etc.) – E.g., Hair_color = {black, blond, brown, grey, red, white, …} – E.g., marital status, zip codes • Numeric (Continuous) Attribute – Has real numbers as attribute values – E.g., temperature, height, or weight. • Question: what about student id, SIN, year of birth?
  • 7. Attribute Types (2) • Binary – A special case of nominal attribute: with only 2 states (0 and 1) – Gender = {male, female} – Medical test = {positive, negative} • Ordinal – Usually a special case of nominal attribute: values have a meaningful order (ranking) – Size = {small, medium, large} – Army rankings
  • 8. Outline • Data • Data Preprocessing: An Overview • Data Cleaning • Data Transformation and Data Discretization • Data Reduction • Summary
  • 9. Data Preprocessing • Why preprocess the data? – Data quality is poor in the real world. – No quality data, no quality mining results! • Measures for data quality – Accuracy: noise, outliers, … – Completeness: missing values, … – Redundancy: duplicated data, irrelevant data, … – Consistency: some modified but some not, … – …
  • 10. Typical Tasks in Data Preprocessing • Data Cleaning – Handle missing values, noisy / outlier data, resolve inconsistencies, … • Data Transformation – Aggregation – Type Conversion – Normalization • Data Reduction – Data Sampling – Dimensionality Reduction • …
  • 11. Outline • Data • Data Preprocessing: An Overview • Data Cleaning • Data Transformation and Data Discretization • Data Reduction • Summary
  • 12. Data Cleaning • Missing value: lacking attribute values – E.g., Occupation = “ ” • Noise (Error): modification of original values – E.g., Salary = “−10” • Outlier: considerably different from most of the other data (not necessarily error) – E.g., Salary = “2,100,000” • Inconsistency: discrepancies in codes or names – E.g., Age=“42”, Birthday=“03/07/2010” – Was rating “1, 2, 3”, now rating “A, B, C” • …
  • 13. Missing Values • Reasons for missing values – Information is not collected • E.g., people decline to give their age and weight – Attributes may not be applicable to all cases • E.g., annual income is not applicable to children – Human / Hardware / Software problems • E.g., Birthdate information is accidentally deleted for all people born in 1988. – …
  • 14. How to Handle Missing Values? • Eliminate / ignore missing values – Eliminate / ignore the examples – Eliminate / ignore the features – Simple; not applicable when data is scarce • Estimate missing values – Global constant: e.g., “unknown” – Attribute mean (median, mode) – Predict the value based on features (data imputation) • Estimate gender based on first name (name gender) • Estimate age based on first name (name popularity) • Build a predictive model based on other features – Missing value estimation depends on the reason the value is missing!
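
The simpler estimation strategies on this slide (global constant, attribute mean, attribute mode) can be sketched in a few lines of Python. This is a minimal illustration rather than the Weka filter used in the demonstration; the function name `impute`, the use of `None` to mark a missing value, and the sample columns are all assumptions made for the example.

```python
from statistics import mean, mode

def impute(values, strategy="mean", constant="unknown"):
    """Return a copy of `values` with None entries replaced by an estimate."""
    observed = [v for v in values if v is not None]
    if strategy == "constant":
        fill = constant            # global constant, e.g. "unknown"
    elif strategy == "mean":
        fill = mean(observed)      # attribute mean (numeric attributes)
    elif strategy == "mode":
        fill = mode(observed)      # most frequent value (nominal attributes)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical columns with missing entries marked as None.
ages = [25, None, 35, 30, None]
status = ["Single", "Married", None, "Married"]

imputed_ages = impute(ages, "mean")
imputed_status = impute(status, "mode")
flagged_status = impute(status, "constant")
```

Which strategy is appropriate still depends on why the value is missing: a mean is a poor stand-in for an attribute that is simply not applicable to the example.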
  • 15. Demonstration • ReplaceMissingValues – Weka: vote – Replacing missing values for nominal and numeric attributes • More functions in RapidMiner
  • 16. Noisy (Outlier) Data • Noise: refers to modification of original values • Incorrect attribute values may be due to – faulty data collection instruments – data entry problems – data transmission problems – technology limitation – inconsistency in naming convention
  • 17. How to Handle Noisy (Outlier) Data? • Binning – first sort data and partition into (equal-frequency) bins – then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. • Regression – smooth by fitting the data to regression functions • Clustering – detect and remove outliers • Combined computer and human inspection – detect suspicious values and have a human check them
  • 18. Binning • Sort data in ascending order: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 • Partition into equal-frequency (equal-depth) bins: – Bin 1: 4, 8, 9, 15 – Bin 2: 21, 21, 24, 25 – Bin 3: 26, 28, 29, 34 • Smoothing by bin means (rounded to the nearest integer): – Bin 1: 9, 9, 9, 9 – Bin 2: 23, 23, 23, 23 – Bin 3: 29, 29, 29, 29 • Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum): – Bin 1: 4, 4, 4, 15 – Bin 2: 21, 21, 25, 25 – Bin 3: 26, 26, 26, 34
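
The binning example can be reproduced with a short Python sketch. The function names are illustrative, the code assumes the number of values divides evenly into the bins, and ties between the two boundaries are resolved toward the lower one (the slide does not say how ties are handled).

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(data, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

The exact bin means are 9, 22.75 and 29.25; the slide shows them rounded to 9, 23 and 29.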
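  • 19. Regression [Figure: data points in the x-y plane with the regression line y = x + 1; the observed value Y1 at X1 is replaced by the fitted value Y1′ on the line.]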
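
Smoothing by regression can be illustrated with an ordinary least-squares line fit: each observed y is replaced by the value the fitted line predicts for its x. The sketch below is an assumption-laden illustration, not the slide's own procedure; the data points are made up so that the fitted line comes out as the y = x + 1 shown on the slide.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def smooth(xs, ys):
    """Replace each observed y by the fitted value on the regression line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy observations scattered around y = x + 1.
xs = [0, 1, 2, 3]
ys = [0.9, 2.3, 2.7, 4.1]

slope, intercept = fit_line(xs, ys)
smoothed = smooth(xs, ys)
```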
  • 21. Outline • Data • Data Preprocessing: An Overview • Data Cleaning • Data Transformation and Data Discretization • Data Reduction • Summary
  • 22. Data Transformation • Aggregation: – Attribute / example summarization • Feature type conversion: – Nominal → Numeric, … • Normalization: – Scaled to fall within a small, specified range • Attribute/feature construction: – New attributes constructed from the given ones
  • 23. Aggregation • Combining two or more attributes (examples) into a single attribute (example) • Combining two or more attribute values into a single attribute value • Purpose – Change of scale • Cities aggregated into regions, states, countries, etc. – More “stable” data • Aggregated data tends to have less variability – More “predictive” data • Aggregated data might have higher predictability
  • 24. Demonstration • MergeTwoValues – Weka: contact-lenses – Merge class values “soft” and “hard” • Effective aggregation in real-world applications
  • 25. Feature Type Conversion • Some algorithms can only handle numeric features; some can only handle nominal features. Only a few can handle both. • Features have to be converted to satisfy the requirements of learning algorithms. – Numeric → Nominal (Discretization) • E.g., Age Discretization: Young 18-29; Career 30-40; Mid-Life 41-55; Empty-Nester 56-69; Senior 70+ – Nominal → Numeric • Introduce multiple numeric features for one nominal feature • Nominal → Binary (Numeric) • E.g., size={L, M, S} → size_L: 0, 1; size_M: 0, 1; size_S: 0, 1
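
Both conversions can be sketched in Python. The age bins and the `size_L` / `size_M` / `size_S` naming come from the slide; the function names, and the decision to reject ages below 18 (for which the slide defines no bin), are assumptions of this example.

```python
from bisect import bisect_right

# Lower edges of the bins that follow "Young": 30, 41, 56 and 70.
AGE_CUTS = [30, 41, 56, 70]
AGE_LABELS = ["Young", "Career", "Mid-Life", "Empty-Nester", "Senior"]

def discretize_age(age):
    """Numeric -> nominal: map an age to the slide's age groups."""
    if age < 18:
        raise ValueError("the slide defines no bin below age 18")
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]

def nominal_to_binary(name, value, categories):
    """Nominal -> binary: one 0/1 feature per category."""
    return {f"{name}_{c}": int(value == c) for c in categories}

group = discretize_age(35)
encoded = nominal_to_binary("size", "M", ["L", "M", "S"])
```

Here `group` is `"Career"` and `encoded` is `{"size_L": 0, "size_M": 1, "size_S": 0}`.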
  • 26. Demonstration • Discretize – Weka: diabetes – Discretize “age” (equal bins vs equal frequency) • NumericToNominal – Weka: diabetes – Discretize “age” (vs “Discretize” method) • NominalToBinary – UCI: autos – Convert “num-of-doors” – Convert “drive-wheels”
  • 27. Normalization • Scale the attribute values to a small specified range • Min-max normalization: to [new_min_A, new_max_A]: \( v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \) – E.g., let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to \( \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \) • Z-score normalization (μ: mean, σ: standard deviation): \( v' = \frac{v - \mu_A}{\sigma_A} \) • …
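
A short Python sketch of both normalizations follows. The min-max call uses the figures from the slide's worked computation (73,600 within the range 12,000 to 98,000); the z-score sample values are made up, and using the population standard deviation (`pstdev`) rather than the sample one is an assumption, since the slide does not say which is meant.

```python
from statistics import mean, pstdev

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_scores(values):
    """Z-score normalization: subtract the mean, divide by the std deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sigma for v in values]

scaled_income = min_max(73_600, 12_000, 98_000)   # about 0.716
standardized = z_scores([2, 4, 4, 4, 5, 5, 7, 9])  # mean 5, std deviation 2
```

Min-max scaling is sensitive to outliers because a single extreme value stretches the range; z-scores are less affected but do not map to a fixed interval.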
  • 28. Demonstration • Normalize – Weka: diabetes – Normalize “age” • Standardize – Weka: diabetes – Standardize “age” (vs “Normalize” method)
  • 29. Outline • Data • Data Preprocessing: An Overview • Data Cleaning • Data Transformation and Data Discretization • Data Reduction • Summary
  • 30. Sampling • Big data era: too expensive (or even infeasible) to process the entire data set • Sampling: obtaining a small sample to represent the entire data set (undersampling) • Oversampling is also required in some scenarios, such as the class imbalance problem – E.g., 1,000 HIV test results: 5 positive, 995 negative
  • 31. Sampling Principle • Key principle for effective sampling: – Using a sample will work almost as well as using the entire data set, if the sample is representative – A sample is representative if it has approximately the same property (of interest) as the original set of data
  • 32. Types of Sampling (1) • Random sampling without replacement – As each example is selected, it is removed from the population • Random sampling with replacement – Examples are not removed from the population after being selected • The same example can be picked more than once
  • 33. Types of Sampling (2) • Stratified sampling – Split the data into several partitions; then draw random samples from each partition
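
The three sampling schemes can be sketched with Python's standard `random` module. The function names, the toy data set (1,000 hypothetical records, 5 of them labelled positive) and the rule of keeping at least one example per partition are assumptions of this illustration.

```python
import random
from collections import defaultdict

def sample_without_replacement(data, k, rng):
    """Each selected example is removed from the pool: no repeats."""
    return rng.sample(data, k)

def sample_with_replacement(data, k, rng):
    """Selected examples stay in the pool: repeats are possible."""
    return rng.choices(data, k=k)

def stratified_sample(data, key, fraction, rng):
    """Partition by key(item), then sample the same fraction from each part."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # keep at least one per part
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical imbalanced data: (id, label) pairs, 5 positive and 995 negative.
records = [(i, "pos" if i < 5 else "neg") for i in range(1000)]
rng = random.Random(0)  # fixed seed so the run is repeatable

without = sample_without_replacement(records, 100, rng)
with_repl = sample_with_replacement(records, 100, rng)
stratified = stratified_sample(records, key=lambda r: r[1], fraction=0.2, rng=rng)
```

A plain random sample of 100 records could easily contain no positive example at all, whereas the stratified sample is guaranteed to keep the minority class represented.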
  • 34. Demonstration • Resample – UCI: waveform-5000 – Undersampling (with or without replacement)
  • 35. Dimensionality Reduction • Purpose: – Reduce amount of time and memory required by data mining algorithms – Allow data to be more easily visualized – May help to eliminate irrelevant features or reduce noise • Techniques – Feature Selection – Feature Extraction
  • 36. Feature Selection • Redundant features – Duplicated information contained in different features – E.g., “Age”, “Year of Birth”; “Purchase price”, “Sales tax” • Irrelevant features – Containing no information that is useful for the task – E.g., a student's ID is irrelevant to predicting GPA • Goal: – A minimum set of features containing all (most) information
  • 37. Heuristic Search in Feature Selection • Given d features, there are 2^d possible feature combinations – Exhaustive search won’t work – Heuristics have to be applied • Typical heuristic feature selection methods: – Feature ranking – Forward feature selection – Backward feature elimination – Bidirectional search (selection + elimination) – Search based on evolutionary algorithms – …
  • 38. Feature Ranking • Steps: 1) Rank all the individual features according to certain criteria (e.g., information gain, gain ratio, χ²) 2) Select / keep top N features • Properties: – Usually independent of the learning algorithm to be used – Efficient (no search process) – Hard to determine the threshold – Unable to consider correlation between features
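
As an illustration of the two steps, the sketch below ranks nominal features by information gain, one of the criteria named on the slide. The data are the Refund, Marital Status and Cheat columns of the ten-row table from the "What is Data?" slide; the function names are assumptions of this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def rank_features(columns, labels, top_n):
    """Step 1: score every feature on its own. Step 2: keep the top N."""
    scores = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

columns = {
    "Refund": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Marital Status": ["Single", "Married", "Single", "Married", "Divorced",
                       "Married", "Divorced", "Single", "Married", "Single"],
}
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

best = rank_features(columns, cheat, top_n=1)
```

On this table Marital Status scores about 0.28 bits and Refund about 0.19 bits, so `best` is `["Marital Status"]`. Each feature is scored in isolation, which is why ranking cannot account for correlation between features.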
  • 39. Forward Feature Selection • Steps: 1) First select the best single feature (according to the learning algorithm) 2) Repeat (until some stop criterion is met): Select the next best feature, given the already picked features • Properties: – Usually learning algorithm dependent – Feature correlation is considered – More reliable – Inefficient
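
The greedy loop can be sketched as follows. In a real wrapper setup `score` would train the chosen learning algorithm on a feature subset and return, say, its cross-validated accuracy; here `toy_score` is a hypothetical stand-in with made-up numbers, and "stop when no candidate improves the score" is one possible stop criterion among several.

```python
def forward_selection(features, score, max_features=None):
    """Greedily add the feature that most improves score(subset)."""
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Evaluate every remaining feature together with those already picked.
        candidate, candidate_score = max(
            ((f, score(selected + [f])) for f in remaining),
            key=lambda pair: pair[1],
        )
        if candidate_score <= best_score:
            break  # stop criterion: no candidate improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score

def toy_score(subset):
    """Hypothetical stand-in for model accuracy on a feature subset."""
    chosen = set(subset)
    value = 0.0
    if "age" in chosen:
        value += 0.6
    if "year_of_birth" in chosen and "age" not in chosen:
        value += 0.6   # redundant with age: adds nothing once age is present
    if "income" in chosen:
        value += 0.2
    return value - 0.01 * len(chosen)  # small cost per extra feature

selected, final_score = forward_selection(
    ["age", "year_of_birth", "income"], toy_score
)
```

With this toy scorer the search picks `age`, then `income`, and stops before adding the redundant `year_of_birth`, which shows how evaluating subsets takes feature correlation into account. Every step re-evaluates all remaining candidates, which is where the inefficiency comes from.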
  • 40. Backward Feature Elimination • Steps: 1) First build a model based on all the features 2) Repeat (until some criterion is met): Eliminate the feature that makes the least contribution. • Properties: – Usually learning algorithm dependent – Feature correlation is considered – More reliable – Inefficient
  • 41. Filter vs Wrapper Model • Filter model – Separating feature selection from learning – Relying on general characteristics of data (information, etc.) – No bias toward any learning algorithm, fast – Feature ranking usually falls into this category • Wrapper model – Relying on a predetermined learning algorithm – Using predictive accuracy as goodness measure – High accuracy, computationally expensive – FFS (forward feature selection) and BFE (backward feature elimination) usually fall into this category
  • 42. Demonstration • Feature ranking – Weka: weather – ChiSquared, InfoGain, GainRatio • FFS & BFE – Weka: diabetes – ClassifierSubsetEval + GreedyStepwise
  • 43. Feature Extraction • Map original high-dimensional data onto a lower-dimensional space – Generate a (smaller) set of new features – Preserve all (most) information from the original data • Techniques – Principal Component Analysis (PCA) – Canonical Correlation Analysis (CCA) – Linear Discriminant Analysis (LDA) – Independent Component Analysis (ICA) – Manifold Learning – …
  • 44. Principal Component Analysis (PCA) • Find a projection that captures the largest amount of variation in data • The original data are projected onto a much smaller space, resulting in dimensionality reduction. [Figure: data points in the x1-x2 plane with the principal direction e.]
  • 45. Principal Component Analysis (Steps) • Given n-dimensional data (n features), find k ≤ n new features (principal components) that best represent the data – Normalize input data: each feature falls within the same range – Compute k principal components (details omitted) – Each input example is projected into the new k-dimensional space – The new features (principal components) are sorted in order of decreasing “significance” or strength – Eliminate weak components / features to reduce dimensionality. • Works for numeric data only
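
The slide omits how the components are computed. One common way, sketched here under stated assumptions, is to take the eigenvectors of the covariance matrix of the centered data; the code finds only the first (strongest) component by power iteration and projects the data onto it. The function names and the four sample points are made up, and the features are only mean-centered, not rescaled, on the assumption that they already share a scale.

```python
from math import sqrt

def center(rows):
    """Subtract each feature's mean so the data are centered at the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue.
    Assumes the start vector is not orthogonal to that eigenvector."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(centered, component):
    """Coordinate of each example along the component (the new feature)."""
    return [sum(a * b for a, b in zip(r, component)) for r in centered]

# Hypothetical 2-D data lying on the line x2 = x1 + 1.
points = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]
centered = center(points)
pc1 = first_component(covariance(centered))
reduced = project(centered, pc1)  # 2 features reduced to 1
```

Because these points vary along a single direction, one component captures all of the variation and the second could be dropped without losing information. Further components could be obtained by removing the found direction from the covariance matrix and repeating, or by a full eigendecomposition from a numerical library.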
  • 46. PCA Demonstration • UCI: breast-w – Accuracy with all features – PrincipalComponents (data transformation) – Visualize/save transformed data (first two features, last two features) – Accuracy with all transformed features – Accuracy with top 1 or 2 feature(s)
  • 47. Outline • Data • Data Preprocessing: An Overview • Data Cleaning • Data Transformation and Data Discretization • Data Reduction • Summary
  • 48. Summary • Data (features and instances) • Data Cleaning: missing values, noise / outliers • Data Transformation: aggregation, type conversion, normalization • Data Reduction – Sampling: random sampling with replacement, random sampling without replacement, stratified sampling – Dimensionality reduction: • Feature Selection: Feature ranking, FFS, BFE • Feature Extraction: PCA
  • 49. Notes • In real-world applications, data preprocessing usually accounts for about 70% of the workload in a data mining task. • Domain knowledge is usually required to do good data preprocessing. • To improve the predictive performance of a model – Improve learning algorithms (different algorithms, different parameters) • Most data mining research focuses here – Improve data quality through data preprocessing • Deserves more attention!