By
V. Sakthi Priya, M.Sc. (IT)
Department of CS & IT,
Nadar Saraswathi College of Arts and Science,
Theni.
Data Reduction
Data Reduction
1. Overview
2. The Curse of Dimensionality
3. Data Sampling
4. Binning and Reduction of Cardinality
Overview
• Data Reduction techniques are usually categorized into three main families:
• Dimensionality Reduction: reduces the number of attributes or random variables in the data set.
• Sample Numerosity Reduction: replaces the original data with an alternative, smaller data representation.
• Cardinality Reduction: applies transformations to obtain a reduced representation of the original data.
The Curse of Dimensionality
• Dimensionality becomes a serious obstacle to the efficiency of most DM algorithms.
• It has been estimated that as the number of dimensions increases, the sample size must increase exponentially in order to obtain an effective estimate of multivariate densities.
• Several dimension reducers have been
developed over the years.
• Linear methods:
–Factor analysis
–MultiDimensional Scaling
–Principal Components Analysis
• Nonlinear methods:
–Locally Linear Embedding
–ISOMAP
The Curse of Dimensionality
• Feature Selection methods aim to eliminate irrelevant and redundant features, reducing the number of variables in the model.
• The goal is to find an optimal subset B that solves the following optimization problem:

B = arg max { J(Z) : Z ⊆ A, |Z| = d }

where
J: criterion function
A: original set of features
Z: candidate subset of features
d: desired (minimum) number of features
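As an illustration, here is a minimal sketch of greedy forward selection against a generic criterion function J. The criterion passed in is a hypothetical stand-in for whatever J the model uses (e.g., cross-validated accuracy); it is not part of the original slides.

```python
def forward_selection(A, J, d):
    """Greedy forward selection (sketch): grow a subset B of features
    from A, adding at each step the feature that most improves the
    criterion J, until B reaches the desired size d.

    J is a hypothetical criterion: it takes a tuple of features and
    returns a score to maximize."""
    B = []
    remaining = list(A)
    while len(B) < d and remaining:
        # Pick the feature whose addition maximizes the criterion.
        best = max(remaining, key=lambda f: J(tuple(B) + (f,)))
        B.append(best)
        remaining.remove(best)
    return B
```

Greedy selection only approximates the optimal subset; evaluating J on every possible subset would be exponential in |A|.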
MultiDimensional Scaling:
• A method for placing a set of points in a low-dimensional space such that a classical distance measure (like the Euclidean distance) between each pair of points is preserved as closely as possible.
• We can compute the pairwise distances in the original high-dimensional space, then use this distance matrix as input; MDS projects the points into a lower-dimensional space so as to preserve these distances.
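A minimal sketch of this workflow, assuming scikit-learn is available: compute the pairwise distance matrix in the original space, then hand it to MDS as a precomputed dissimilarity.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

X = np.random.rand(100, 10)  # toy data: 100 points in 10 dimensions

# Distance matrix computed in the original high-dimensional space.
D = pairwise_distances(X, metric="euclidean")

# Embed into 2 dimensions while preserving pairwise distances.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
X_low = mds.fit_transform(D)
print(X_low.shape)  # (100, 2)
```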
Data Sampling
• To reduce the number of instances submitted to the DM algorithm.
• To support the selection of only those cases in which the response is relatively homogeneous.
• To assist with balancing the data and the occurrence of rare events.
• To divide a data set into three data sets (typically training, validation, and test) to carry out the subsequent analysis of DM algorithms.
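For instance, a minimal sketch of the three-way split with scikit-learn, stratifying on the class label so that rare classes stay proportionally represented in each subset (the 60/20/20 ratio is an arbitrary choice for this example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 5))          # toy feature matrix
y = rng.integers(0, 2, size=1000)  # toy binary labels

# First split off the test set, then split the rest into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)
# Result: 60% train, 20% validation, 20% test, each class-balanced.
```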
Data Sampling: Data Condensation
• Data condensation emerges from the fact that naive sampling methods, such as random sampling or stratified sampling, are not suitable for real-world problems with noisy data, since the performance of the algorithms may change unpredictably and significantly.
• Condensation methods instead attempt to obtain a minimal subset that still correctly classifies all the original examples.
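A classic method in this family is Hart's Condensed Nearest Neighbor rule. Below is a minimal sketch, assuming a 1-NN classifier from scikit-learn: it grows a kept subset until every original example is classified correctly by its nearest kept neighbor.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def condense(X, y):
    """CNN-style condensation (sketch): keep a small subset S such that
    1-NN over S classifies every example in (X, y) correctly.
    Quadratic in the worst case; fine for an illustration."""
    keep = [0]  # seed the kept set with the first example
    changed = True
    while changed:
        changed = False
        knn = KNeighborsClassifier(n_neighbors=1).fit(X[keep], y[keep])
        for i in range(len(X)):
            if knn.predict(X[i:i+1])[0] != y[i]:
                keep.append(i)  # absorb the misclassified example
                knn = KNeighborsClassifier(n_neighbors=1).fit(X[keep], y[keep])
                changed = True
    return X[keep], y[keep]
```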
Data Clustering
Clustering
• Partition the data set into clusters based on similarity, and store only a cluster representation (e.g., centroid and diameter)
• Can be very effective if the data is clustered, but not if the data is "smeared"
• Can use hierarchical clustering and be stored in multidimensional index tree structures
• There are many choices of clustering definitions and clustering algorithms; cluster analysis is studied in detail separately
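A minimal sketch of clustering-based reduction with scikit-learn's KMeans, replacing the full data set with one centroid, count, and (approximate) diameter per cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((10_000, 4))  # toy data

k = 50
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Reduced representation: centroid, count, and diameter per cluster.
centroids = km.cluster_centers_
counts = np.bincount(km.labels_, minlength=k)
# Approximate diameter: twice the farthest member's distance to the centroid.
diameters = np.array([
    2 * np.linalg.norm(X[km.labels_ == j] - centroids[j], axis=1).max()
    for j in range(k)
])
# 10,000 rows reduced to 50 (centroid, count, diameter) summaries.
```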
Data Reduction Strategies
• Dimensionality reduction, e.g., remove unimportant attributes
  – Wavelet transforms
  – Principal Components Analysis (PCA)
  – Feature subset selection, feature creation
• Numerosity reduction (some simply call it: data reduction)
  – Regression and log-linear models
  – Histograms, clustering, sampling
  – Data cube aggregation
• Data compression
Wavelet Transformation
• Discrete Wavelet Transform (DWT) for linear signal processing, multi-resolution analysis
• Compressed approximation: store only a small fraction of the strongest wavelet coefficients
• Similar to the Discrete Fourier Transform (DFT), but better lossy compression, localized in space
• Method:
  – The length, L, must be an integer power of 2 (padding with 0's when necessary)
  – Each transform has 2 functions: smoothing, difference
  – Applies to pairs of data, resulting in two sets of data of length L/2
  – Applies the two functions recursively until the desired length is reached
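A minimal sketch of this method using the Haar wavelet, with pairwise averages as the smoothing function and pairwise half-differences as the difference function (a simplified illustration, not a production DWT):

```python
import numpy as np

def haar_dwt(x):
    """Full Haar decomposition: recursively apply the smoothing function
    (pairwise averages) and the difference function (pairwise
    half-differences) until a single smoothed value remains."""
    x = np.asarray(x, dtype=float)
    L = 1 << (len(x) - 1).bit_length()  # next integer power of 2
    x = np.pad(x, (0, L - len(x)))      # pad with 0's when necessary
    details = []
    while len(x) > 1:
        smooth = (x[0::2] + x[1::2]) / 2  # smoothing applied to pairs
        detail = (x[0::2] - x[1::2]) / 2  # difference applied to pairs
        details.append(detail)
        x = smooth                        # recurse on the smoothed half
    return x[0], details                  # overall average + detail coeffs

avg, details = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
# Compressed approximation: keep only the strongest (largest-magnitude)
# detail coefficients and zero out the rest.
```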
• Emphasizes regions where points cluster – suppresses weaker information at their boundaries
• Effective removal of outliers – insensitive to noise, insensitive to input order
• Multi-resolution – detects arbitrarily shaped clusters at different scales
• Efficient – complexity O(N)
• Only applicable to low-dimensional data
Principal Component Analysis (Steps)
– Normalize the input data: each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data point (vector) is a linear combination of the k principal component vectors
– The principal components are sorted in order of decreasing "significance" or strength
– Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
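These steps translate almost directly into NumPy. A minimal sketch, assuming 10 attributes reduced to k = 2 components (standardization is used here as the normalization step):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 10))  # toy data: 500 points, 10 attributes

# Step 1: normalize so each attribute is on the same scale.
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# Steps 2-4: compute orthonormal components, sorted by decreasing variance.
cov = np.cov(Xn, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # re-sort by decreasing "strength"
components = eigvecs[:, order]

# Step 5: eliminate the weak components, keeping only the strongest k.
k = 2
X_reduced = Xn @ components[:, :k]      # 500 x 2 instead of 500 x 10
```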
Attribute Subset Selection
• Another way to reduce the dimensionality of data
• Redundant attributes – duplicate much or all of the information contained in one or more other attributes – e.g., the purchase price of a product and the amount of sales tax paid
• Irrelevant attributes – contain no information that is useful for the data mining task at hand – e.g., students' ID is often irrelevant to the task of predicting students' GPA
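As a simple illustration of detecting redundant attributes, a sketch with pandas that drops any column almost perfectly correlated with an earlier one (the 0.95 threshold and the sales-tax toy data are assumptions for this sketch):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
price = rng.uniform(10, 100, size=200)
df = pd.DataFrame({
    "price": price,
    "sales_tax": price * 0.08,  # redundant: fully determined by price
    "quantity": rng.integers(1, 20, size=200),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is examined once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]
df_reduced = df.drop(columns=redundant)
print(df_reduced.columns.tolist())  # ['price', 'quantity']
```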
Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data representation
• Parametric methods (e.g., regression) – assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers) – ex.: log-linear models obtain the value at a point in m-dimensional space as a product over appropriate marginal subspaces
• Non-parametric methods – do not assume models – major families: histograms, clustering, sampling
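For example, a histogram (a non-parametric method) replaces a large column of values with a handful of bucket boundaries and counts. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=100_000)  # toy attribute

# Reduced representation: 20 bucket counts plus 21 bin edges
# instead of 100,000 raw values.
counts, edges = np.histogram(values, bins=20)
print(counts.sum())  # 100000 -- every value falls in some bucket
```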
Parametric Data Reduction:
Regression and Log-Linear Models
• Linear regression – data modeled to fit a straight line – often uses the least-squares method to fit the line
• Multiple regression – allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
• Log-linear model – approximates discrete multidimensional probability distributions
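A minimal sketch of parametric reduction via least-squares linear regression: fit the line, store only the two parameters, and discard the data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50_000)
y = 3.0 * x + 7.0 + rng.normal(scale=0.5, size=x.size)  # toy linear data

# Least-squares fit: the whole data set collapses to two parameters.
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 2), round(intercept, 2))  # approx. 3.0 and 7.0

# Later, approximate y for any x from the stored parameters alone.
y_hat = slope * 4.2 + intercept
```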
