Applied Machine Learning
Spring 2023
Prof. Rizwan Ali Naqvi
School of Intelligent Mechatronics Engineering,
Sejong University, Republic of Korea.
Lecture 2: Data Preprocessing
Agenda
Data Preprocessing
Data Cleaning
Data Integration and Transformation
Data Reduction
Discretization & Concept Hierarchy Generation
Summary
Data Preprocessing
• Data preprocessing is the process of taking raw data and transforming it into a format that can be understood and analyzed by computers and machine learning algorithms.
• Raw, real-world data in the form of text, images, video, etc., is messy.
– It may contain errors and inconsistencies, it is often incomplete,
– and it doesn’t have a regular, uniform design.
Why Data Preprocessing?
Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
• Quality decisions must be based on quality data
• A data warehouse needs consistent integration of quality data
A multi-dimensional measure of data quality:
• A well-accepted multi-dimensional view: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility
• Broad categories: intrinsic, contextual, representational, and accessibility
Data Preprocessing Importance
• When using data sets to train machine learning
models, you’ll often hear the phrase
– “garbage in, garbage out”
– This means that if you use bad or “dirty” data
to train your model,
– you’ll end up with a bad, improperly trained
model that won’t actually be relevant to your
analysis.
• Good, preprocessed data is even more
important than the most powerful algorithms, to
the point that machine learning models trained
with bad data could actually be harmful to the
analysis you’re trying to do – giving you
“garbage” results.
Major Tasks in Data Preprocessing
• Data Cleaning
– Fill in missing values, smooth noisy data, identify
or remove outliers, and resolve inconsistencies
• Data Integration
– Integration of multiple databases, data cubes, or files
• Data Transformation
– Normalization (scaling to a specific range)
– Aggregation
• Data Reduction
– Obtains reduced representation in volume but
produces the same or similar analytical results.
– Data discretization: with particular importance,
especially for numerical data.
– Data aggregation, dimensionality reduction, data
compression, generalization.
Forms of data preprocessing
[Figure: the forms of data preprocessing (cleaning, integration, transformation, and reduction)]
Data Cleaning
• Data cleaning tasks
– Handling Missing Values
– Managing Unwanted Outliers
– Fixing Structural Errors
– Removal of Unwanted Observation
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such as customer income
in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– history or changes of the data were not registered
• Missing data may need to be inferred
How to Handle Missing Data?
• Delete the Data
– The easiest method is simply to delete the whole training examples where one or several columns have null entries.
• Imputing Averages
– The next method is to assign some average value (mean, median, or mode) to the null entries.
• Assign New Category
– If missing values are large in number, a better way is to assign these NaN values their own category.
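A minimal pandas sketch of the three strategies; the DataFrame and its column names are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries
df = pd.DataFrame({"income": [50_000, np.nan, 62_000, np.nan],
                   "city": ["Seoul", "Busan", None, "Seoul"]})

# 1) Delete the data: drop every training example containing a null entry
dropped = df.dropna()

# 2) Impute averages: fill numeric nulls with the column mean (or median/mode)
df["income"] = df["income"].fillna(df["income"].mean())

# 3) Assign a new category: give the remaining NaN values their own label
df["city"] = df["city"].fillna("Unknown")
print(df)
```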
Noisy Data
What is noise?
• Random error in a measured variable.
Incorrect attribute values may be due to:
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitations
• inconsistency in naming conventions
Other data problems which require data cleaning:
• duplicate records
• incomplete data
• inconsistent data
How to Handle Noisy Data?
Binning method:
• first sort data and partition into (equi-depth) bins
• then one can smooth by bin means, by bin medians, by bin boundaries, etc.
• also used for discretization (discussed later)
Clustering:
• detect and remove outliers
Semi-automated method (combined computer and human inspection):
• detect suspicious values and check manually
Regression:
• smooth by fitting the data to regression functions
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
– It divides the range into N intervals of equal
size: uniform grid
– if A and B are the lowest and highest values
of the attribute, the width of intervals will be
W = (B-A)/N.
– The most straightforward
– But outliers may dominate the presentation
– Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
– It divides the range into N intervals, each containing approximately the same number of samples
– Good data scaling
– Managing categorical attributes can be
tricky.
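Both schemes map directly onto pandas, sketched here on the price data from the smoothing example that follows:

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals of width W = (B - A) / N
equal_width = pd.cut(prices, bins=3)

# Equal-depth: 3 intervals holding roughly the same number of samples
equal_depth = pd.qcut(prices, q=3)

print(equal_width.value_counts())
print(equal_depth.value_counts())
```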
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
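A short numpy sketch reproduces the slide's smoothing (bin means are rounded, as on the slide):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)   # three equi-depth bins of four sorted values

# Smooth by bin means: every value becomes its bin's (rounded) mean
by_means = np.repeat(np.round(bins.mean(axis=1, keepdims=True)), 4, axis=1)

# Smooth by bin boundaries: snap every value to the nearer bin edge
lo, hi = bins[:, [0]], bins[:, [-1]]
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)    # rows of 9s, 23s, 29s
print(by_bounds)   # rows: [4 4 4 15], [21 21 25 25], [26 26 26 34]
```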
Cluster Analysis
[Figure: data points grouped into clusters; values falling outside the clusters are outliers]
Regression
[Figure: data points (x, y) fitted with the regression line y = x + 1]
• Simple linear regression (best line to fit two variables)
• Multiple linear regression (more than two variables; fit a multidimensional surface)
How to Handle Inconsistent Data?
• Manual correction using external references
• Semi-automatic using various tools
– To detect violation of known functional dependencies and data constraints
– To correct redundant data
Data Integration
• Data integration:
– combines data from multiple sources into a
coherent store
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
– for the same real-world entity, attribute
values from different sources are different
– possible reasons: different representations,
different scales, e.g., metric vs. British units,
different currency
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
– Object identification: The same attribute or object may have different names in different
databases
– Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual
revenue
• Redundant attributes can often be detected by correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
Correlation Analysis (Nominal Data)
• Χ2 (chi-square) test
• The larger the Χ2 value, the more likely the variables are related
• The cells that contribute the most to the Χ2 value are those whose actual count is very
different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car thefts in a city are correlated
– Both are causally linked to a third variable: population
$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
Chi-Square Calculation: An Example
• Χ2 (chi-square) calculation (numbers in parentheses are expected counts calculated based on
• It shows that like_science_fiction and play_chess are correlated in the group
                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500
$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$
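As a check, the same statistic can be computed with scipy; correction=False disables the Yates continuity correction so the plain formula above is used:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = likes sci-fi or not, cols = plays chess or not
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ~507.93: far too large for independence
print(expected)  # [[ 90. 360.], [210. 840.]]
```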
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product moment coefficient)
$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum a_i b_i$ is the sum of the AB cross-product.
• If $r_{A,B} > 0$, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
• $r_{A,B} = 0$: uncorrelated (no linear relationship); $r_{A,B} < 0$: negatively correlated
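A quick numpy check on made-up values; np.corrcoef computes the same Pearson coefficient:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Population form (divide by n with population std); algebraically the same
# value as the (n-1) sample form in the formula above
r = ((a - a.mean()) * (b - b.mean())).sum() / (len(a) * a.std() * b.std())
print(r)                        # ~0.94: strongly positively correlated
print(np.corrcoef(a, b)[0, 1])  # same value from the library routine
```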
Visually Evaluating Correlation
[Figure: scatter plots showing correlation values ranging from –1 to 1]
Correlation (viewed as linear relationship)
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects, A and B, and then take their dot
product
$$a'_k = \frac{a_k - \mathrm{mean}(A)}{\mathrm{std}(A)}, \qquad b'_k = \frac{b_k - \mathrm{mean}(B)}{\mathrm{std}(B)}, \qquad \mathrm{correlation}(A,B) = A' \cdot B'$$
Covariance (Numeric Data)
• Covariance is similar to correlation:

$$\mathrm{Cov}(A,B) = E[(A - \bar{A})(B - \bar{B})] = \frac{1}{n}\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means (expected values) of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B.
• Correlation coefficient: $r_{A,B} = \mathrm{Cov}(A,B) / (\sigma_A \sigma_B)$
• Positive covariance: If $\mathrm{Cov}_{A,B} > 0$, then A and B both tend to be larger than their expected values.
• Negative covariance: If $\mathrm{Cov}_{A,B} < 0$, then if A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: if A and B are independent, $\mathrm{Cov}_{A,B} = 0$ (the converse does not hold in general)
Co-Variance: An Example
• It can be simplified in computation as $\mathrm{Cov}(A,B) = E(A \cdot B) - \bar{A}\,\bar{B}$
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6,
14).
• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
– E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
– E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
– Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
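A numpy check of this example; bias=True asks for the population covariance, matching the division by n above:

```python
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])     # stock A prices
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])  # stock B prices

# Simplified form: Cov(A, B) = E(A*B) - E(A)*E(B)
cov = (A * B).mean() - A.mean() * B.mean()
print(cov)                            # 4.0 -> the stocks rise together
print(np.cov(A, B, bias=True)[0, 1])  # same value from numpy
```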
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set of replacement
values such that each old value can be identified with one of the new values
• Smoothing: remove noise from data (binning, clustering, regression)
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction
– New attributes constructed from the given ones
Data Transformation: Normalization
• Particularly useful for classification (neural networks, distance measurements, clustering, etc.)
• min-max normalization:

$$v' = \frac{v - \min_A}{\max_A - \min_A}(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$

– Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$$

• z-score normalization:

$$v' = \frac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A}$$

– Ex. Let μ = 54,000, σ = 16,000. Then

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$

• normalization by decimal scaling:

$$v' = \frac{v}{10^j}$$

where j is the smallest integer such that $\max(|v'|) < 1$
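A numpy sketch reproducing all three results (the −986/917 values in the decimal-scaling part are made-up attribute values):

```python
import numpy as np

v = 73_600.0

# Min-max normalization of income to [0.0, 1.0]
min_a, max_a = 12_000.0, 98_000.0
print((v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0)  # ~0.716

# Z-score normalization with mean 54,000 and std 16,000
mu, sigma = 54_000.0, 16_000.0
print((v - mu) / sigma)                                   # 1.225

# Decimal scaling: smallest j such that max(|v / 10**j|) < 1
values = np.array([-986.0, 917.0])
j = 0
while np.abs(values).max() / 10**j >= 1:
    j += 1
print(j, values / 10**j)                                  # 3 [-0.986  0.917]
```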
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store terabytes of data. Complex data
analysis may take a very long time to run on the complete data set.
• Data reduction strategies
– Dimensionality reduction, e.g., removing unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
– Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
– Data compression
Data Reduction 1: Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
– Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• Dimensionality reduction
– Avoid the curse of dimensionality
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
• Dimensionality reduction techniques
– Wavelet transforms
– Principal Component Analysis
What Is Wavelet Transform?
• Decomposes a signal into different
frequency subbands
– Applicable to n-dimensional signals
• Data are transformed to preserve the
relative distance between objects at
different levels of resolution
• Allow natural clusters to become more
distinguishable
• Used for image compression
Wavelet Transforms
• Discrete wavelet transform (DWT):
– linear signal processing
• Compressed approximation: store only a small fraction of the strongest wavelet coefficients
• Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space
(conserves local details)
• Method (hierarchical pyramid algorithm):
– Length, L, must be an integer power of 2 (padding with 0s, when necessary)
– Each transform has 2 functions:
• smoothing (e.g., sum, weighted avg.), weighted difference
– Applies to pairs of data, resulting in two sets of data of length L/2
– Applies the two functions recursively, until it reaches the desired length
[Figure: Haar-2 and Daubechies-4 wavelet functions]
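A toy numpy sketch of the pyramid idea, using pairwise averages as the smoothing function and half-differences as the weighted difference (an un-normalized Haar-style transform; real applications would use a wavelet library such as PyWavelets):

```python
import numpy as np

def haar_pyramid(x):
    """Hierarchical pyramid: recursively average and difference pairs.

    len(x) must be a power of 2 (pad with zeros otherwise).
    """
    x = np.asarray(x, dtype=float)
    details = []
    while len(x) > 1:
        pairs = x.reshape(-1, 2)
        approx = pairs.mean(axis=1)               # smoothing: pairwise average
        detail = (pairs[:, 0] - pairs[:, 1]) / 2  # weighted difference
        details.append(detail)
        x = approx                                # recurse on half the length
    return x, details

approx, details = haar_pyramid([2, 2, 0, 2, 3, 5, 4, 4])
print(approx, details)   # overall average plus per-level detail coefficients
```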
Why Wavelet Transform?
• Uses hat-shaped filters
– Emphasize the region where points cluster
– Suppress weaker information in their boundaries
• Effective removal of outliers
– Insensitive to noise, insensitive to input order
• Multi-resolution
– Detect arbitrarily shaped clusters at different scales
• Efficient
– Complexity O(N)
• Only applicable to low-dimensional data
Principal Component Analysis (PCA)
[Figure: 2-D data points (x1, x2) and their principal direction e]
• Find a projection that captures the largest amount of variation in data
• The original data are projected onto a much smaller space, resulting in dimensionality reduction. We
find the eigenvectors of the covariance matrix, and these eigenvectors define the new space
Principal Component Analysis (Steps)
• Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal component vectors
– The principal components are sorted in order of decreasing “significance” or strength
– Since the components are sorted, the size of the data can be reduced by eliminating the weak
components, i.e., those with low variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
• Works for numeric data only
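A short scikit-learn sketch of these steps on synthetic data; the shapes and k = 2 are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # hypothetical data: N=100, n=5

X_norm = StandardScaler().fit_transform(X)  # normalize input data

pca = PCA(n_components=2)                   # keep the k=2 strongest components
X_reduced = pca.fit_transform(X_norm)

print(pca.explained_variance_ratio_)        # sorted by decreasing significance
print(X_reduced.shape)                      # (100, 2): reduced representation
```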
Data Reduction 2: Numerosity Reduction
• Reduce data volume by choosing an alternative, smaller form of data representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …
– but they may not achieve as high a degree of data reduction as parametric methods.
Parametric Data Reduction: Regression and Log-Linear Models
• Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple regression
– Allows a response variable Y to be modeled as a linear function of the
multidimensional feature vector
• Log-linear model
– Approximates multidimensional probability distributions
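A sketch of the parametric idea with numpy's least-squares line fit: only the two fitted parameters need to be stored (the data values are made up):

```python
import numpy as np

x = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
y = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Least-squares fit of y = w*x + b: keep (w, b), discard the raw points
w, b = np.polyfit(x, y, deg=1)
print(w, b)       # stored model parameters
print(w * x + b)  # approximate reconstruction of the original values
```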
Regression Analysis
• Regression analysis: A collective name for techniques for
the modeling and analysis of numerical data consisting of
values of a dependent variable (also called response
variable or measurement) and of one or more independent
variables (aka. explanatory variables or predictors)
• The parameters are estimated to give a "best fit" of the
data
• Most commonly the best fit is evaluated by using the least
squares method, but other criteria have also been used
• Used for prediction (including forecasting of time-series
data), inference, hypothesis testing, etc.
[Figure: data points (x, y) fitted with the regression line y = x + 1]
Histogram Analysis
• Divide data into buckets and store
the average (sum) for each bucket
• Partitioning rules:
– Equal-width: equal bucket range
– Equal-frequency (or equal-depth)
[Figure: equal-width histogram of prices, with buckets spanning 10,000 to 90,000]
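A numpy sketch of both partitioning rules on synthetic prices:

```python
import numpy as np

rng = np.random.default_rng(1)
prices = rng.integers(10_000, 90_000, size=1_000)  # hypothetical prices

# Equal-width: five buckets of equal range; store only counts and edges
counts, edges = np.histogram(prices, bins=5)
print(counts, edges)

# Equal-frequency (equal-depth): bucket edges placed at quantiles
depth_edges = np.quantile(prices, np.linspace(0, 1, 6))
print(depth_edges)
```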
Clustering
• Clustering techniques group similar objects from the data, so that objects in a cluster are similar to each other but dissimilar to objects in other clusters.
• The similarity between the objects inside a cluster can be calculated using a distance function: the more similar the objects, the closer together they appear in the cluster, and the quality of the cluster depends on its diameter.
• The cluster representation replaces the original data. This technique is more effective if the present data can be classified into distinct clusters.
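A scikit-learn sketch: after clustering, the centroids (plus per-point labels) can stand in for the raw points; the data here is synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(1_000, 2))    # hypothetical 2-D data

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# Reduced representation: 10 centroids instead of 1,000 raw points
print(km.cluster_centers_.shape)   # (10, 2)
print(np.bincount(km.labels_))     # how many points each centroid represents
```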
Sampling
• Sampling: obtaining a small sample s to represent the whole data set N
• Key principle: Choose a representative subset of the data
– Simple random sampling may have very poor performance in the presence of skew
– Develop adaptive sampling methods, e.g., stratified sampling
Types of Sampling
• Simple random sampling
– There is an equal probability of selecting any particular item
• Sampling without replacement
– Once an object is selected, it is removed from the population
• Sampling with replacement
– A selected object is not removed from the population
• Stratified sampling:
– Partition the data set, and draw samples from each partition (proportionally, i.e.,
approximately the same percentage of the data)
– This method is effective for skewed data.
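A pandas sketch of the sampling schemes on a made-up, skewed data set (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"value": range(1, 101),
                   "stratum": ["A"] * 80 + ["B"] * 20})  # skewed 80/20 split

srswor = df.sample(n=10, replace=False, random_state=0)  # without replacement
srswr = df.sample(n=10, replace=True, random_state=0)    # with replacement

# Stratified: draw ~10% from each stratum so 'B' is not under-represented
strat = df.groupby("stratum", group_keys=False).sample(frac=0.1, random_state=0)
print(strat["stratum"].value_counts())  # A: 8, B: 2
```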
Sampling: With or without Replacement
[Figure: raw data sampled with and without replacement]
Sampling: Stratified Sampling
[Figure: raw data partitioned into strata and sampled proportionally]
Data Cube Aggregation
• Data cube aggregation involves moving the
data from a detailed level to fewer
dimensions.
• The resulting data set is smaller in volume,
without loss of information necessary for the
analysis task.
• For example, suppose you have All Electronics sales per quarter for the years 2018 to 2022. If you want the annual sales, you just have to aggregate the sales per quarter for each year. In this way, aggregation provides you with the required data, which is much smaller in size, and thereby achieves data reduction without losing any information needed for the analysis.
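A minimal pandas sketch of this aggregation, with invented quarterly figures:

```python
import pandas as pd

# Hypothetical All Electronics quarterly sales (two years shown)
sales = pd.DataFrame({
    "year":    [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 310, 412, 390, 610],
})

# Aggregate away the quarter dimension: eight rows collapse to two
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
```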
Data Reduction 3: Data Compression
• Data compression employs modification,
encoding, or converting the structure of
data in a way that consumes less space.
• Data compression involves building a
compact representation of information by
removing redundancy and representing
data in binary form.
• Compression from which the original data can be restored exactly is called lossless compression.
• In contrast, when the original form cannot be recovered from the compressed form, the compression is lossy.
Data Discretization
• Data discretization converts a large number of data values into a smaller set, so that data evaluation and data management become much easier.
• For example, we have an attribute of age with the following values.
Age: 10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75
After discretization:
– 10, 11, 13, 14, 17, 19 → Young
– 30, 31, 32, 38, 40, 42 → Mature
– 70, 72, 73, 75 → Old
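A pandas sketch of this mapping; the cut points 20 and 50 (and the upper bound 120) are illustrative assumptions:

```python
import pandas as pd

age = pd.Series([10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75])

# Map raw ages onto three coarse labels
labels = pd.cut(age, bins=[0, 20, 50, 120], labels=["Young", "Mature", "Old"])
print(labels.value_counts())  # Young: 6, Mature: 6, Old: 4
```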
Cont.
• Another example is website visitor data, which can be discretized by country.
[Figure: website visitor data discretized into countries]
Concept Hierarchy Generation
• A concept hierarchy represents a sequence of mappings between a set of specialized, low-level concepts and more general, higher-level concepts; the mapping can go top-down or bottom-up.
• Let’s see an example of a concept hierarchy for the dimension location.
• Each city can be mapped with the country to which the given city belongs. For
example, Seoul can be mapped to South Korea and South Korea can be
mapped to Asia.
• Top-down mapping
– Top-down mapping starts from the top with general concepts and moves
to the bottom to the specialized concepts.
• Bottom-up mapping
– Bottom-up mapping starts from the Bottom with specialized concepts and
moves to the top to the generalized concepts.
Example hierarchy for location: street < city < province_or_state < country
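A tiny sketch of bottom-up mapping as plain dictionaries (the mappings are illustrative):

```python
# Hypothetical bottom-up mappings for the 'location' dimension
city_to_country = {"Seoul": "South Korea", "Busan": "South Korea", "Tokyo": "Japan"}
country_to_continent = {"South Korea": "Asia", "Japan": "Asia"}

def generalize(city: str) -> str:
    """Climb the concept hierarchy: city -> country -> continent."""
    return country_to_continent[city_to_country[city]]

print(generalize("Seoul"))  # Asia
```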
Summary
• Data preparation is a big issue for both warehousing and mining
• Data preparation includes
– Data cleaning and data integration
– Data reduction and feature selection
– Discretization
• Many methods have been developed, but this is still an active area of research
Sejong University
Week 1: Introduction Applied Machine Learning
Sejong University
Week 1: Introduction Applied Machine Learning

More Related Content

Similar to Lecture-2 Applied ML .pptx

Review of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionReview of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionIRJET Journal
 
Data Management Lab: Data mapping exercise instructions
Data Management Lab: Data mapping exercise instructionsData Management Lab: Data mapping exercise instructions
Data Management Lab: Data mapping exercise instructionsIUPUI
 
IRJET- Fault Detection and Prediction of Failure using Vibration Analysis
IRJET-	 Fault Detection and Prediction of Failure using Vibration AnalysisIRJET-	 Fault Detection and Prediction of Failure using Vibration Analysis
IRJET- Fault Detection and Prediction of Failure using Vibration AnalysisIRJET Journal
 
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data MiningIRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data MiningIRJET Journal
 
Educational Data Mining to Analyze Students Performance – Concept Plan
Educational Data Mining to Analyze Students Performance – Concept PlanEducational Data Mining to Analyze Students Performance – Concept Plan
Educational Data Mining to Analyze Students Performance – Concept PlanIRJET Journal
 
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...Egyptian Engineers Association
 
Machinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfMachinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfSaketBansal9
 
Activity Monitoring Using Wearable Sensors and Smart Phone
Activity Monitoring Using Wearable Sensors and Smart PhoneActivity Monitoring Using Wearable Sensors and Smart Phone
Activity Monitoring Using Wearable Sensors and Smart PhoneDrAhmedZoha
 
التنقيب في البيانات - Data Mining
التنقيب في البيانات -  Data Miningالتنقيب في البيانات -  Data Mining
التنقيب في البيانات - Data Miningnabil_alsharafi
 
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huwekineheshete
 
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...theijes
 
Hypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining AlgorithmsHypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining AlgorithmsIJERA Editor
 
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data AnalysisWorkshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data AnalysisOlga Scrivner
 
DATA preprocessing.pptx
DATA preprocessing.pptxDATA preprocessing.pptx
DATA preprocessing.pptxChandra Meena
 
H2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellH2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellSri Ambati
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupSri Ambati
 
Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxsumitkumar600840
 

Similar to Lecture-2 Applied ML .pptx (20)

Review of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionReview of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & Prediction
 
Data Management Lab: Data mapping exercise instructions
Data Management Lab: Data mapping exercise instructionsData Management Lab: Data mapping exercise instructions
Data Management Lab: Data mapping exercise instructions
 
IRJET- Fault Detection and Prediction of Failure using Vibration Analysis
IRJET-	 Fault Detection and Prediction of Failure using Vibration AnalysisIRJET-	 Fault Detection and Prediction of Failure using Vibration Analysis
IRJET- Fault Detection and Prediction of Failure using Vibration Analysis
 
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data MiningIRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data Mining
 
Educational Data Mining to Analyze Students Performance – Concept Plan
Educational Data Mining to Analyze Students Performance – Concept PlanEducational Data Mining to Analyze Students Performance – Concept Plan
Educational Data Mining to Analyze Students Performance – Concept Plan
 
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
 
Machinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfMachinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdf
 
Activity Monitoring Using Wearable Sensors and Smart Phone
Activity Monitoring Using Wearable Sensors and Smart PhoneActivity Monitoring Using Wearable Sensors and Smart Phone
Activity Monitoring Using Wearable Sensors and Smart Phone
 
التنقيب في البيانات - Data Mining
التنقيب في البيانات -  Data Miningالتنقيب في البيانات -  Data Mining
التنقيب في البيانات - Data Mining
 
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by hu
 
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
 
06522405
0652240506522405
06522405
 
Hypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining AlgorithmsHypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining Algorithms
 
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data AnalysisWorkshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis
 
DATA preprocessing.pptx
DATA preprocessing.pptxDATA preprocessing.pptx
DATA preprocessing.pptx
 
H2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellH2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin Ledell
 
Data Science and Analysis.pptx
Data Science and Analysis.pptxData Science and Analysis.pptx
Data Science and Analysis.pptx
 
02 Related Concepts
02 Related Concepts02 Related Concepts
02 Related Concepts
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User Group
 
Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptx
 

More from ZainULABIDIN496386

More from ZainULABIDIN496386 (6)

Lecture 04 Adaptive filter.pptx
Lecture 04 Adaptive filter.pptxLecture 04 Adaptive filter.pptx
Lecture 04 Adaptive filter.pptx
 
04 Deep CNN (Ch_01 to Ch_3).pptx
04 Deep CNN (Ch_01 to Ch_3).pptx04 Deep CNN (Ch_01 to Ch_3).pptx
04 Deep CNN (Ch_01 to Ch_3).pptx
 
Lecture-1 Applied ML.pptx
Lecture-1 Applied ML.pptxLecture-1 Applied ML.pptx
Lecture-1 Applied ML.pptx
 
Lecture-7 Applied ML.pptx
Lecture-7 Applied ML.pptxLecture-7 Applied ML.pptx
Lecture-7 Applied ML.pptx
 
Lecture-1 Applied ML.pptx
Lecture-1 Applied ML.pptxLecture-1 Applied ML.pptx
Lecture-1 Applied ML.pptx
 
CS583-supervised-learning.ppt
CS583-supervised-learning.pptCS583-supervised-learning.ppt
CS583-supervised-learning.ppt
 

Recently uploaded

(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...Call Girls in Nagpur High Profile
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 

Recently uploaded (20)

(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 

Lecture-2 Applied ML .pptx

  • 1. Sejong University Week 1: Introduction Applied Machine Learning Applied Machine Learning Spring 2023 Prof. Rizwan Ali Naqvi School of Intelligent Mechatronics Engineering, Sejong University, Republic of Korea. Lecture 2: Data Preprocessing
  • 2. Sejong University Week 1: Introduction Applied Machine Learning Agenda Data Preprocessing Data Cleaning Data Integration and Transformation Data Reduction Discretization & Concept Hierarchy Generation Summary
  • 3. Sejong University Week 1: Introduction Applied Machine Learning Data Preprocessing • Data preprocessing is a process that takes raw data and transforms it into a format that can be understood and analyzed by computers and machine learning. • Raw, real-world data in the form of text, images, video, etc., is messy. – It may contain errors and inconsistencies, but it is often incomplete, – and doesn’t have a regular, uniform design.
  • 4. Sejong University Week 1: Introduction Applied Machine Learning Why Data Preprocessing? • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • noisy: containing errors or outliers • inconsistent: containing discrepancies in codes or names Data in the real world is dirty • Quality decisions must be based on quality data • Data warehouse needs consistent integration of quality data No quality data, no quality mining results! • A well-accepted multi-dimensional view: • accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility • Broad categories: • intrinsic, contextual, representational, and accessibility. A multi-dimensional measure of data quality:
  • 5. Sejong University Week 1: Introduction Applied Machine Learning Data Preprocessing Importance • When using data sets to train machine learning models, you’ll often hear the phrase – “garbage in, garbage out” – This means that if you use bad or “dirty” data to train your model, – you’ll end up with a bad, improperly trained model that won’t actually be relevant to your analysis. • Good, preprocessed data is even more important than the most powerful algorithms, to the point that machine learning models trained with bad data could actually be harmful to the analysis you’re trying to do – giving you “garbage” results.
  • 6. Sejong University Week 1: Introduction Applied Machine Learning Major Tasks in Data Preprocessing • Data Cleaning – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data Integration – Integration of multiple databases, data cubes, files, or notes • Data Transformation – Normalization (scaling to a specific range) – Aggregation • Data Reduction – Obtains reduced representation in volume but produces the same or similar analytical results. – Data discretization: with particular importance, especially for numerical data. – Data aggregation, dimensionality reduction, data compression, generalization.
  • 7. Sejong University Week 1: Introduction Applied Machine Learning Forms of data preprocessing
  • 8. Sejong University Week 1: Introduction Applied Machine Learning Agenda Data Preprocessing Data Cleaning Data Integration and Transformation Data Reduction Discretization & Concept Hierarchy Generation Summary
  • 9. Sejong University Week 1: Introduction Applied Machine Learning Data Cleaning • Data cleaning tasks – Handling Missing Values – Managing Unwanted Outliers – Fixing Structural Errors – Removal of Unwanted Observation
  • 10. Sejong University Week 1: Introduction Applied Machine Learning Missing Data • Data is not always available – E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to – equipment malfunction – inconsistent with other recorded data and thus deleted – data not entered due to misunderstanding – certain data may not be considered important at the time of entry – not register history or changes of the data • Missing data may need to be inferred
  • 11. Sejong University Week 1: Introduction Applied Machine Learning How to Handle Missing Data? • Delete the Data – The easiest method is to just simply delete the whole training examples where one or several columns have null entries. • Imputing Averages – The next method is to assign some average value (mean, median, or mode) to the null entries. • Assign New Category – If missing is large in number, then a better way is to assign these NaN values their own category
  • 12. Sejong University Week 1: Introduction Applied Machine Learning Noisy Data • Random error in a measured variable. What is noise? • faulty data collection instruments • data entry problems • data transmission problems • technology limitation • inconsistency in naming convention Incorrect attribute values may be due to • duplicate records • incomplete data • inconsistent data Other data problems which requires data cleaning
  • 13. Sejong University Week 1: Introduction Applied Machine Learning How to Handle Noisy Data? • first sort data and partition into (equi-depth) bins • then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. • used also for discretization (discussed later) Binning method: • detect and remove outliers Clustering • detect suspicious values and check manually Semi-automated method: combined computer and human inspection • smooth by fitting the data into regression functions Regression
  • 14. Sejong University Week 1: Introduction Applied Machine Learning Simple Discretization Methods: Binning • Equal-width (distance) partitioning: – It divides the range into N intervals of equal size: uniform grid – if A and B are the lowest and highest values of the attribute, the width of intervals will be W = (B-A)/N. – The most straightforward – But outliers may dominate the presentation – Skewed data is not handled well. • Equal-depth (frequency) partitioning: – It divides the range into N intervals, each containing the approximately same number of samples – Good data scaling – Managing categorical attributes can be tricky.
  • 15. Sejong University Week 1: Introduction Applied Machine Learning Binning Methods for Data Smoothing * Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
  • 16. Sejong University Week 1: Introduction Applied Machine Learning Cluster Analysis
  • 17. Sejong University Week 1: Introduction Applied Machine Learning Regression x y y = x + 1 X1 Y1 Y1’ • Simple Linear regression (best line to fit two variables) • Multiple linear regression (more than two variables, fit a multidimensional surface
  • 18. Sejong University Week 1: Introduction Applied Machine Learning How to Handle Inconsistent Data? • Manual correction using external references • Semi-automatic using various tools – To detect violation of known functional dependencies and data constraints – To correct redundant data
  • 19. Sejong University Week 1: Introduction Applied Machine Learning Agenda Data Preprocessing Data Cleaning Data Integration and Transformation Data Reduction Discretization & Concept Hierarchy Generation Summary
  • 20. Sejong University Week 1: Introduction Applied Machine Learning Data Integration • Data integration: – combines data from multiple sources into a coherent store • Schema integration – integrate metadata from different sources – Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id B.cust-# • Detecting and resolving data value conflicts – for the same real-world entity, attribute values from different sources are different – possible reasons: different representations, different scales, e.g., metric vs. British units, different currency
  • 21. Sejong University Week 1: Introduction Applied Machine Learning 21 21 Handling Redundancy in Data Integration • Redundant data occur often when integration of multiple databases – Object identification: The same attribute or object may have different names in different databases – Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue • Redundant attributes may be able to be detected by correlation analysis and covariance analysis • Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
  • 22. Sejong University Week 1: Introduction Applied Machine Learning Correlation Analysis (Nominal Data) • Χ2 (chi-square) test • The larger the Χ2 value, the more likely the variables are related • The cells that contribute the most to the Χ2 value are those whose actual count is very different from the expected count • Correlation does not imply causality – # of hospitals and # of car-theft in a city are correlated – Both are causally linked to the third variable: population    Expected Expected Observed 2 2 ) ( 
  • 23. Sejong University Week 1: Introduction Applied Machine Learning Chi-Square Calculation: An Example • Χ2 (chi-square) calculation (numbers in parenthesis are expected counts calculated based on the data distribution in the two categories) • It shows that like_science_fiction and play_chess are correlated in the group Play chess Not play chess Sum (row) Like science fiction 250(90) 200(360) 450 Not like science fiction 50(210) 1000(840) 1050 Sum(col.) 300 1200 1500 93 . 507 840 ) 840 1000 ( 360 ) 360 200 ( 210 ) 210 50 ( 90 ) 90 250 ( 2 2 2 2 2          
  • 24. Sejong University Week 1: Introduction Applied Machine Learning Correlation Analysis (Numeric Data) • Correlation coefficient (also called Pearson’s product moment coefficient)  where n is the number of tuples, and are the respective means of A and B, σA and σB are the respective standard deviation of A and B, and Σ(aibi) is the sum of the AB cross- product. • If rA,B > 0, A and B are positively correlated (A’s values increase as B’s). The higher, the stronger correlation. • rA,B = 0: independent; rAB < 0: negatively correlated B A n i i i B A n i i i B A n B A n b a n B b A a r     ) 1 ( ) ( ) 1 ( ) )( ( 1 1 ,            A B
  • 25. Sejong University Week 1: Introduction Applied Machine Learning Visually Evaluating Correlation Scatter plots showing the similarity from –1 to 1.
  • 26. Sejong University Week 1: Introduction Applied Machine Learning 26 Correlation (viewed as linear relationship) • Correlation measures the linear relationship between objects • To compute correlation, we standardize data objects, A and B, and then take their dot product ) ( / )) ( ( ' A std A mean a a k k   ) ( / )) ( ( ' B std B mean b b k k   ' ' ) , ( B A B A n correlatio  
  • 27. Sejong University Week 1: Introduction Applied Machine Learning 27 Covariance (Numeric Data) • Covariance is similar to correlation where n is the number of tuples, and are the respective mean or expected values of A and B, σA and σB are the respective standard deviation of A and B. • Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected values. • Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely to be smaller than its expected value. • Independence: CovA,B = 0 A B Correlation coefficient:
  • 28. Sejong University Week 1: Introduction Applied Machine Learning Co-Variance: An Example • It can be simplified in computation as • Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14). • Question: If the stocks are affected by the same industry trends, will their prices rise or fall together? – E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4 – E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6 – Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4 • Thus, A and B rise together since Cov(A, B) > 0.
  • 29. Sejong University Week 1: Introduction Applied Machine Learning Data Transformation • A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values • Smoothing: remove noise from data (binning, clustering, regression) • Aggregation: summarization, data cube construction • Generalization: concept hierarchy climbing • Normalization: scaled to fall within a small, specified range – min-max normalization – z-score normalization – normalization by decimal scaling • Attribute/feature construction – New attributes constructed from the given ones
• 30. Sejong University Week 1: Introduction Applied Machine Learning Data Transformation: Normalization
• Particularly useful for classification (neural networks, distance measures, etc.)
– Min-max normalization:
$v' = \frac{v - \text{min}_A}{\text{max}_A - \text{min}_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$
– Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
– Z-score normalization:
$v' = \frac{v - \text{mean}_A}{\text{stand\_dev}_A}$
– Ex. Let μ = 54,000 and σ = 16,000. Then
$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
– Normalization by decimal scaling:
$v' = \frac{v}{10^j}$, where j is the smallest integer such that max(|v'|) < 1
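A minimal sketch of the three schemes, reproducing the income examples above (the helper names are mine, not a library API):

```python
# A minimal sketch of the three normalization schemes from the slide.
import numpy as np

def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    # Rescale v from [vmin, vmax] to [new_min, new_max]
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    # Center at the mean and scale by the standard deviation
    return (v - mean) / std

def decimal_scaling(values):
    # j is the smallest integer such that max(|v'|) < 1
    j = int(np.floor(np.log10(np.max(np.abs(values))))) + 1
    return values / 10.0 ** j

print(min_max(73_600, 12_000, 98_000))         # 0.716
print(z_score(73_600, 54_000, 16_000))         # 1.225
print(decimal_scaling(np.array([-986, 917])))  # [-0.986  0.917]
```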
  • 31. Sejong University Week 1: Introduction Applied Machine Learning Agenda Data Preprocessing Data Cleaning Data Integration and Transformation Data Reduction Discretization & Concept Hierarchy Generation Summary
• 32. Sejong University Week 1: Introduction Applied Machine Learning Data Reduction Strategies
• Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store terabytes of data; complex data analysis may take a very long time to run on the complete data set
• Data reduction strategies
– Dimensionality reduction, e.g., removing unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
– Numerosity reduction (which some simply call data reduction)
• Regression and log-linear models
• Histograms, clustering, sampling
• Data cube aggregation
– Data compression
• 33. Sejong University Week 1: Introduction Applied Machine Learning Data Reduction 1: Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
– Density and distances between points, which are critical to clustering and outlier analysis, become less meaningful
– The number of possible subspace combinations grows exponentially
• Dimensionality reduction
– Avoids the curse of dimensionality
– Helps eliminate irrelevant features and reduce noise
– Reduces the time and space required in data mining
– Allows easier visualization
• Dimensionality reduction techniques
– Wavelet transforms
– Principal Component Analysis
• 34. Sejong University Week 1: Introduction Applied Machine Learning What Is Wavelet Transform?
• Decomposes a signal into different frequency subbands
– Applicable to n-dimensional signals
• Data are transformed to preserve the relative distance between objects at different levels of resolution
• Allows natural clusters to become more distinguishable
• Used for image compression
• 35. Sejong University Week 1: Introduction Applied Machine Learning Wavelet Transforms
• Discrete wavelet transform (DWT): linear signal processing
• Compressed approximation: store only a small fraction of the strongest wavelet coefficients
• Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space (conserves local details)
• Method (hierarchical pyramid algorithm):
– Length, L, must be an integer power of 2 (pad with 0s when necessary)
– Each transform has 2 functions: smoothing (e.g., sum, weighted avg.) and weighted difference
– Applies to pairs of data, resulting in two sets of data of length L/2
– Applies the two functions recursively, until the desired length is reached
• Common wavelet families: Haar-2, Daubechies-4
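A minimal sketch of the pyramid algorithm using the Haar wavelet in one common unnormalized convention (averages and half-differences); libraries such as PyWavelets use an orthonormal scaling, so their coefficients differ by constant factors:

```python
# A minimal sketch of the hierarchical pyramid algorithm described above.
import numpy as np

def haar_dwt(data):
    """Recursively transform data whose length is a power of 2."""
    data = np.asarray(data, dtype=float)
    details = []
    while len(data) > 1:
        avg = (data[0::2] + data[1::2]) / 2   # smoothing function
        diff = (data[0::2] - data[1::2]) / 2  # weighted-difference function
        details.append(diff)                  # detail coefficients, length L/2
        data = avg                            # recurse on the averages
    return data, details[::-1]                # overall average + details, coarse to fine

approx, details = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print(approx)   # [2.75] -- the overall average of the signal
print(details)  # compression keeps only the largest-magnitude detail coefficients
```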
• 36. Sejong University Week 1: Introduction Applied Machine Learning Why Wavelet Transform?
• Uses hat-shaped filters
– Emphasizes regions where points cluster
– Suppresses weaker information at their boundaries
• Effective removal of outliers
– Insensitive to noise and to input order
• Multi-resolution
– Detects arbitrarily shaped clusters at different scales
• Efficient
– Complexity O(N)
• Only applicable to low-dimensional data
• 37. Sejong University Week 1: Introduction Applied Machine Learning Principal Component Analysis (PCA)
• Find a projection that captures the largest amount of variation in the data
• The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space
[Figure: 2-D data points in the (x1, x2) plane with their principal eigenvector e]
• 38. Sejong University Week 1: Introduction Applied Machine Learning Principal Component Analysis (Steps)
• Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
– Normalize the input data so each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., the principal components
– Each input data vector is a linear combination of the k principal component vectors
– The principal components are sorted in order of decreasing "significance" or strength
– Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
• Works for numeric data only
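A minimal sketch of these steps in NumPy via eigendecomposition of the covariance matrix; production code would typically use a library implementation such as scikit-learn's PCA:

```python
# A minimal sketch of the PCA steps listed above.
import numpy as np

def pca(X, k):
    """Project n-dimensional data X (N rows) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)         # normalize/center the input data
    cov = np.cov(X_centered, rowvar=False)  # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance ("strength")
    components = eigvecs[:, order[:k]]      # keep the k strongest components
    return X_centered @ components          # reduced representation (N x k)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # hypothetical numeric data
Z = pca(X, k=2)
print(Z.shape)                 # (100, 2): 5 attributes reduced to 2
```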
• 39. Sejong University Week 1: Introduction Applied Machine Learning Data Reduction 2: Numerosity Reduction
• Reduce data volume by choosing an alternative, smaller form of data representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …
– but they may not achieve as high a degree of data reduction as parametric methods
• 40. Sejong University Week 1: Introduction Applied Machine Learning Parametric Data Reduction: Regression and Log-Linear Models
• Linear regression
– Data modeled to fit a straight line
– Often uses the least-squares method to fit the line
• Multiple regression
– Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
• Log-linear model
– Approximates multidimensional probability distributions
• 41. Sejong University Week 1: Introduction Applied Machine Learning Regression Analysis
• Regression analysis: a collective name for techniques for modeling and analyzing numerical data consisting of values of a dependent variable (also called the response variable or measurement) and of one or more independent variables (a.k.a. explanatory variables or predictors)
• The parameters are estimated to give a "best fit" of the data
• Most commonly the best fit is evaluated using the least-squares method, but other criteria have also been used
• Used for prediction (including forecasting of time-series data), inference, hypothesis testing, etc.
[Figure: scatter of (x, y) points with fitted line y = x + 1; Y1' is the fitted value at X1]
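A minimal sketch of least-squares fitting on hypothetical data chosen to lie near the figure's line y = x + 1; the point of parametric reduction is that only the two fitted parameters need to be stored:

```python
# A minimal sketch of least-squares linear regression as numerosity reduction.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical predictor values
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])  # hypothetical responses

# Fit y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # ~0.99 and ~1.05, i.e., roughly y = x + 1

# Predictions come from the stored parameters alone -- the data can be discarded
print(slope * 6.0 + intercept)
```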
• 42. Sejong University Week 1: Introduction Applied Machine Learning Histogram Analysis
• Divide data into buckets and store the average (sum) for each bucket
• Partitioning rules:
– Equal-width: equal bucket range
– Equal-frequency (or equal-depth): each bucket holds roughly the same number of samples
[Figure: equal-width histogram of prices with bucket boundaries from 10,000 to 90,000]
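A minimal sketch contrasting the two partitioning rules on synthetic price data (the value range is hypothetical):

```python
# A minimal sketch of equal-width vs. equal-frequency bucketing.
import numpy as np

rng = np.random.default_rng(0)
prices = rng.integers(10_000, 100_000, size=1_000)  # hypothetical prices

# Equal-width: every bucket spans the same range of values
counts, edges = np.histogram(prices, bins=5)
print(edges.astype(int))  # evenly spaced bucket boundaries

# Equal-frequency (equal-depth): every bucket holds ~the same number of points
depth_edges = np.quantile(prices, np.linspace(0, 1, 6))
print(depth_edges.astype(int))  # boundaries adapt to the data density
```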
• 43. Sejong University Week 1: Introduction Applied Machine Learning Clustering
• Clustering techniques group similar objects so that the objects within a cluster are similar to each other but dissimilar to objects in other clusters
• The similarity between objects inside a cluster is calculated using a distance function
• The more similar the objects in a cluster are, the closer together they appear; the quality of a cluster can be judged by its diameter
• The cluster representation replaces the original data. This technique is most effective when the data can be partitioned into distinct clusters
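As a sketch of clustering-based reduction, assuming scikit-learn is available: the cluster centers stand in for the original points:

```python
# A minimal sketch: replace 1,000 points by 5 cluster representatives.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 2))  # hypothetical 2-D data

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_.shape)  # (5, 2): the reduced representation
print(kmeans.labels_[:10])            # which representative each point maps to
```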
• 44. Sejong University Week 1: Introduction Applied Machine Learning Sampling
• Sampling: obtaining a small sample s to represent the whole data set N
• Key principle: choose a representative subset of the data
– Simple random sampling may perform very poorly in the presence of skew
– Develop adaptive sampling methods instead, e.g., stratified sampling
• 45. Sejong University Week 1: Introduction Applied Machine Learning Types of Sampling
• Simple random sampling
– There is an equal probability of selecting any particular item
• Sampling without replacement
– Once an object is selected, it is removed from the population
• Sampling with replacement
– A selected object is not removed from the population, so it may be drawn again
• Stratified sampling
– Partition the data set and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
– This method is effective for skewed data
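A minimal sketch of the three schemes with pandas (column names and proportions are hypothetical):

```python
# A minimal sketch of simple random and stratified sampling.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.normal(size=1_000),
    "stratum": rng.choice(["A", "B", "C"], size=1_000, p=[0.7, 0.2, 0.1]),  # skewed
})

srswor = df.sample(n=100, replace=False)  # simple random sampling without replacement
srswr = df.sample(n=100, replace=True)    # with replacement: duplicates possible

# Stratified: draw ~10% from each partition, preserving the skewed proportions
stratified = df.groupby("stratum").sample(frac=0.1, random_state=0)
print(stratified["stratum"].value_counts())  # roughly a 70/20/10 split
```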
• 46. Sejong University Week 1: Introduction Applied Machine Learning Sampling: With or without Replacement
[Figure: samples drawn from the raw data with and without replacement]
• 47. Sejong University Week 1: Introduction Applied Machine Learning Sampling: Stratified Sampling
[Figure: raw data partitioned into strata, with a proportional stratified sample drawn from each]
• 48. Sejong University Week 1: Introduction Applied Machine Learning Data Cube Aggregation
• Data cube aggregation involves moving the data from a detailed level to fewer dimensions
• The resulting data set is smaller in volume, without loss of the information necessary for the analysis task
• For example, suppose you have AllElectronics sales data per quarter for the years 2018 through 2022. To get the annual sales, you simply aggregate the four quarterly figures for each year. The aggregated data answers the query, is much smaller in size, and thereby achieves data reduction without losing any information needed for the task
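A minimal sketch of the quarter-to-year roll-up with pandas (the sales figures are hypothetical):

```python
# A minimal sketch of the quarterly-to-annual aggregation described above.
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 300, 415, 380, 600],  # hypothetical figures
})

# Roll up from the quarter level to the year level: 8 rows become 2
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)
```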
• 49. Sejong University Week 1: Introduction Applied Machine Learning Data Reduction 3: Data Compression
• Data compression employs modification, encoding, or conversion of the structure of the data so that it consumes less space
• Data compression builds a compact representation of information by removing redundancy and representing the data in binary form
• Compression from which the original data can be fully reconstructed is called lossless compression
• In contrast, compression from which only an approximation of the original data can be recovered is called lossy compression
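A minimal sketch of lossless compression with Python's built-in zlib module; the round trip reconstructs the input exactly:

```python
# A minimal sketch of lossless compression using the standard library.
import zlib

data = b"AAAA BBBB AAAA BBBB " * 100  # highly redundant input
compressed = zlib.compress(data)

print(len(data), "->", len(compressed))     # far fewer bytes after compression
assert zlib.decompress(compressed) == data  # lossless: restored exactly
```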
  • 50. Sejong University Week 1: Introduction Applied Machine Learning Agenda Data Preprocessing Data Cleaning Data Integration and Transformation Data Reduction Discretization & Concept Hierarchy Generation Summary
• 51. Sejong University Week 1: Introduction Applied Machine Learning Data Discretization
• Data discretization converts a large number of data values into a smaller set of intervals or categories, so that data evaluation and data management become much easier
• For example, consider an age attribute with the following values:

Attribute: Age    10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75

After discretization:
  Young:  10, 11, 13, 14, 17, 19
  Mature: 30, 31, 32, 38, 40, 42
  Old:    70, 72, 73, 75
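A minimal sketch of this discretization with pandas.cut (the bin boundaries 20 and 50 are my reading of the slide's groups, not given in the source):

```python
# A minimal sketch of the age discretization above.
import pandas as pd

ages = pd.Series([10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75])

# Assumed boundaries: <=20 -> Young, 21-50 -> Mature, 51-120 -> Old
labels = pd.cut(ages, bins=[0, 20, 50, 120], labels=["Young", "Mature", "Old"])
print(labels.value_counts())  # Young: 6, Mature: 6, Old: 4
```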
• 52. Sejong University Week 1: Introduction Applied Machine Learning Cont.
• Another example is website visitor data: individual visitor records are discretized (grouped) by country
• 53. Sejong University Week 1: Introduction Applied Machine Learning Concept Hierarchy Generation
• A concept hierarchy represents a sequence of mappings from a set of low-level (specialized) concepts to higher-level, more general concepts; the mappings can be read top-down or bottom-up
• Consider an example concept hierarchy for the dimension location:
  street → city → province_or_state → country
• Each city can be mapped to the country to which it belongs. For example, Seoul can be mapped to South Korea, and South Korea can be mapped to Asia
• Top-down mapping
– Top-down mapping starts at the top with general concepts and moves down to the specialized concepts
• Bottom-up mapping
– Bottom-up mapping starts at the bottom with specialized concepts and moves up to the generalized concepts
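A minimal sketch of bottom-up concept climbing with plain dictionaries, extending the slide's Seoul example (the extra city entries are merely illustrative):

```python
# A minimal sketch: climb the location hierarchy city -> country -> continent.
city_to_country = {"Seoul": "South Korea", "Busan": "South Korea", "Tokyo": "Japan"}
country_to_continent = {"South Korea": "Asia", "Japan": "Asia"}

def generalize(city):
    """Bottom-up mapping: replace a specialized concept with more general ones."""
    country = city_to_country[city]
    return country, country_to_continent[country]

print(generalize("Seoul"))  # ('South Korea', 'Asia')
```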
  • 54. Sejong University Week 1: Introduction Applied Machine Learning Agenda Data Preprocessing Data Cleaning Data Integration and Transformation Data Reduction Discretization & Concept Hierarchy Generation Summary
• 55. Sejong University Week 1: Introduction Applied Machine Learning Summary
• Data preparation is a big issue for both warehousing and mining
• Data preparation includes
– Data cleaning and data integration
– Data reduction and feature selection
– Discretization
• Many methods have been developed, but this remains an active area of research