Data Preprocessing: Presentation Transcript

  • Data Preprocessing: A Brief Presentation on Data Mining, by Jason Rodrigues
    • Introduction
    • Why data preprocessing?
    • Data Cleaning
    • Data Integration and Transformation
    • Data Reduction
    • Discretization and concept hierarchy generation
    • Takeaways
  • Why Data Preprocessing?
    • Data in the real world is dirty
      • incomplete : lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
      • noisy : containing errors or outliers
      • inconsistent : containing discrepancies in codes or names
    • No quality data, no quality mining results!
      • Quality decisions must be based on quality data
      • Data warehouse needs consistent integration of quality data
    • A multi-dimensional measure of data quality
      • A well-accepted multi-dimensional view:
      • accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility
    • Broad categories
      • intrinsic, contextual, representational, and accessibility
  • Data Preprocessing Major Tasks of Data Preprocessing
    • Data cleaning
      • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
    • Data integration
      • Integration of multiple databases, data cubes, or files
    • Data transformation
      • Normalization (scaling to a specific range)
      • Aggregation
    • Data reduction
      • Obtains reduced representation in volume but produces the same or similar analytical results
      • Data discretization: with particular importance, especially for numerical data
      • Data aggregation, dimensionality reduction, data compression, generalization
  • Data Preprocessing: Major Tasks of Data Preprocessing [pipeline diagram: Databases → Data Cleaning → Data Integration → Data Warehouse → Task-relevant Data Selection → Data Mining → Pattern Evaluation → Knowledge]
  • Data Cleaning Tasks of Data Cleaning
      • Fill in missing values
      • Identify outliers and smooth noisy data
      • Correct inconsistent data
  • Data Cleaning Manage Missing Data
      • Ignore the tuple: usually done when class label is missing (assuming the task is classification — not effective in certain cases)
      • Fill in the missing value manually : tedious + infeasible?
      • Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
      • Use the attribute mean to fill in the missing value
      • Use the attribute mean for all samples of the same class to fill in the missing value: smarter
      • Use the most probable value to fill in the missing value: inference-based such as regression, Bayesian formula, decision tree
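The fill-in strategies above can be sketched in plain Python. This is a minimal illustration of two of them (global attribute mean vs. per-class attribute mean); the data and function names are hypothetical, not from the slides.

```python
def fill_with_mean(values):
    """Replace None entries with the mean of all observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace None entries with the mean of observed values belonging
    to the same class label (usually a smarter estimate)."""
    filled = []
    for v, lab in zip(values, labels):
        if v is None:
            same = [x for x, l in zip(values, labels)
                    if l == lab and x is not None]
            v = sum(same) / len(same)
        filled.append(v)
    return filled

# Hypothetical attribute with two missing values.
incomes = [30, None, 50, 90, None]
classes = ["low", "low", "low", "high", "high"]

print(fill_with_mean(incomes))                  # global mean fills both gaps
print(fill_with_class_mean(incomes, classes))   # [30, 40.0, 50, 90, 90.0]
```

Note how the class-conditional fill gives different values for the two gaps (40 for the "low" tuple, 90 for the "high" one), while the global mean fills both identically.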
  • Data Cleaning Manage Noisy Data
      • Binning Method:
      • first sort data and partition into (equi-depth) bins
      • then one can smooth by bin means, smooth by bin median, smooth by bin boundaries , etc
      • Clustering:
      • detect and remove outliers
      • Combined computer and human inspection
      • Semi-automated: detect suspicious values, then correct manually
      • Regression
      • Use regression functions
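The binning method above can be sketched as follows: sort, partition into equi-depth bins, then smooth by bin means or bin boundaries. The price data is an illustrative example, not from the slides.

```python
def smooth_by_bin_means(values, bin_size):
    """Equi-depth binning: sort, partition into bins of bin_size,
    then replace each value by its bin's mean."""
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_ = data[i:i + bin_size]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

def smooth_by_bin_boundaries(values, bin_size):
    """Replace each value by the closer of its bin's two boundaries
    (the bin minimum or the bin maximum)."""
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_ = data[i:i + bin_size]
        lo, hi = bin_[0], bin_[-1]
        smoothed.extend([lo if v - lo <= hi - v else hi for v in bin_])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))       # [9.0, 9.0, 9.0, 22.0, ...]
print(smooth_by_bin_boundaries(prices, 3))  # [4, 4, 15, 21, 21, 24, ...]
```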
  • Data Cleaning Cluster Analysis
  • Data Cleaning: Regression Analysis [plot: data fitted to the line y = x + 1; point (X1, Y1) smoothed to (X1, Y1′)]
    • Linear regression (best line to fit two variables)
    • Multiple linear regression (more than two variables, fit to a multidimensional surface)
    • Manual correction using external references
    • Semi-automatic using various tools
      • To detect violation of known functional dependencies and data constraints
      • To correct redundant data
    Data Cleaning Inconsistent Data
    • Data integration:
      • combines data from multiple sources into a coherent store
    • Schema integration
      • integrate metadata from different sources
      • Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
    • Detecting and resolving data value conflicts
      • for the same real world entity, attribute values from different sources are different
      • possible reasons: different representations, different scales, e.g., metric vs. British units, different currency
    Data integration and transformation Tasks of Data Integration and transformation
    • Redundant data occur often when integrating multiple DBs
      • The same attribute may have different names in different databases
      • One attribute may be a “derived” attribute in another table, e.g., annual revenue
    • Redundant data may be able to be detected by correlational analysis
    • Careful integration can help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
    Manage Data Integration Data integration and transformation
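The correlational analysis mentioned above can be illustrated with the Pearson correlation coefficient: values near ±1 suggest one attribute is (nearly) redundant given the other. The "derived annual revenue" data is a hypothetical example.

```python
def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two numeric attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical: one table stores monthly revenue, another a derived
# annual-revenue column (exactly 12x) -> perfectly correlated, redundant.
monthly = [10, 20, 30, 40]
annual = [120, 240, 360, 480]
print(pearson_correlation(monthly, annual))   # ~1.0
```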
    • Smoothing: remove noise from data (binning, clustering, regression)
    • Aggregation: summarization, data cube construction
    • Generalization: concept hierarchy climbing
    • Normalization: scaled to fall within a small, specified range
      • min-max normalization
      • z-score normalization
      • normalization by decimal scaling
    • Attribute/feature construction
      • New attributes constructed from the given ones
    Manage Data Transformation Data integration and transformation
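The three normalization schemes listed above can be sketched in a few lines each; the sample data is illustrative. Note the z-score here uses the population standard deviation (dividing by n).

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: scale linearly to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """z-score normalization: (v - mean) / population std."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by 10^j for the smallest j that brings all |v| below 1."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / 10 ** j for v in values]

data = [200, 300, 400, 600, 1000]
print(min_max(data))          # [0.0, 0.125, 0.25, 0.5, 1.0]
print(decimal_scaling(data))  # [0.02, 0.03, 0.04, 0.06, 0.1]
```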
    • Data reduction: reduced representation, while still retaining critical information
    • Data cube aggregation
    • Dimensionality reduction
    • Data compression
    • Numerosity reduction
    • Discretization and concept hierarchy generation
    Manage Data Reduction Data reduction
    • Multiple levels of aggregation in data cubes
      • Further reduce the size of data to deal with
    • Reference appropriate levels: use the smallest representation capable of solving the task
    Data Cube Aggregation Data reduction
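A minimal sketch of cube-style aggregation: quarterly sales rows are rolled up to yearly totals, so only the coarser level the task needs is kept. The sales figures are hypothetical.

```python
from collections import defaultdict

sales = [
    ("2009", "Q1", 224), ("2009", "Q2", 408),
    ("2009", "Q3", 350), ("2009", "Q4", 586),
    ("2010", "Q1", 310), ("2010", "Q2", 412),
]

def roll_up(rows):
    """Aggregate (year, quarter, amount) rows to yearly totals."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)

print(roll_up(sales))   # {'2009': 1568, '2010': 722}
```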
    • String compression
      • There are extensive theories and well-tuned algorithms
      • Typically lossless
      • But only limited manipulation is possible without expansion
    • Audio/video, image compression
      • Typically lossy compression, with progressive refinement
      • Sometimes small fragments of signal can be reconstructed without reconstructing the whole
    • Time sequences are not like audio
      • Typically short and varying slowly with time
    Data Compression Data reduction
  • Decision Tree Data reduction
    • Proximity refers to either similarity or dissimilarity, since the proximity between two objects is a function of the proximity between their corresponding attributes.
    • Similarity: Numeric measure of the degree to which the two objects are alike.
    • Dissimilarity: Numeric measure of the degree to which the two objects are different.
    Similarities and Dissimilarities Proximity
    • Similarity
      • Numerical measure of how alike two data objects are.
      • Is higher when objects are more alike.
      • Often falls in the range [0,1]
    • Dissimilarity
      • Numerical measure of how different are two data objects
      • Lower when objects are more alike
      • Minimum dissimilarity is often 0
      • Upper limit varies
    • Proximity refers to a similarity or dissimilarity
    Dissimilarities between Data Objects?
    • Euclidean Distance
      • dist(p, q) = ( Σ_{k=1}^{n} (p_k - q_k)^2 )^{1/2}
      • where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
    • Standardization is necessary if scales differ.
    Euclidean Distance
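The Euclidean distance defined above, as a one-liner:

```python
def euclidean(p, q):
    """Euclidean distance: sqrt of the sum of squared per-attribute
    differences between two n-dimensional points."""
    return sum((pk - qk) ** 2 for pk, qk in zip(p, q)) ** 0.5

print(euclidean((0, 0), (3, 4)))   # 5.0 (the classic 3-4-5 triangle)
```

As the slide notes, attributes should be standardized first (e.g., z-scored) when their scales differ, or the largest-scale attribute dominates the distance.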
    • Minkowski distance: dist(p, q) = ( Σ_{k=1}^{n} |p_k - q_k|^r )^{1/r}
    • r = 1: city block (Manhattan, taxicab, L1 norm) distance.
      • A common example of this is the Hamming distance, which is just the number of bits that differ between two binary vectors
    • r = 2: Euclidean distance
    • r → ∞: “supremum” (L_max norm, L_∞ norm) distance.
      • This is the maximum difference between any component of the vectors
      • Example: L_∞ of (1, 0, 2) and (6, 0, 3) = max(|1-6|, |0-0|, |2-3|) = 5
      • Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
    Minkowski Distance
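A single Minkowski function covers all three special cases above (r = 1 Manhattan, r = 2 Euclidean, r → ∞ supremum), including the slide's (1, 0, 2) vs. (6, 0, 3) example:

```python
def minkowski(p, q, r):
    """Minkowski distance: (sum of |p_k - q_k|^r)^(1/r).
    r=1 is Manhattan, r=2 Euclidean, r=float('inf') the L_inf norm."""
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if r == float("inf"):
        return max(diffs)                  # supremum distance
    return sum(d ** r for d in diffs) ** (1 / r)

p, q = (1, 0, 2), (6, 0, 3)
print(minkowski(p, q, 1))             # 6.0  (5 + 0 + 1)
print(minkowski(p, q, 2))             # sqrt(26) ~ 5.099
print(minkowski(p, q, float("inf")))  # 5    (max component difference)
```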
  • Minkowski Distance
    • Distances, such as the Euclidean distance, have some well known properties.
      • d(x, y)  0 for all x and y and d(x, y) = 0 only if x = y . (Positive definiteness)
      • d(x, y) = d(y, x) for all x and q . (Symmetry)
      • d (x, y)  d(x, y) + d(y, z) for all points x , y , and z . (Triangle Inequality)
    • where d(x, y) is the distance (dissimilarity) between points (data objects), x and y .
    • A distance that satisfies these properties is a metric, and a space is called a metric space
    Euclidean Distance Properties
    • Non-metric measures are often robust (resistant to outliers, errors in objects, etc.)
      • symmetry and especially the triangle inequality are often violated
    • They cannot be directly used with MAMs (metric access methods)
    Non-Metric Dissimilarities – Set Differences [diagrams: triangle inequality violated, a > b + c; asymmetry, d(a, b) ≠ d(b, a)]
    • various k-median distances
      • measure distance between the two (k-th) most similar portions in objects
      • back-propagation network with single output neuron serving as a distance, allows training
    • Dynamic Time Warping distance
      • sequence alignment technique
      • minimizes the sum of distances between sequence elements
    • fractional L p distances
      • generalization of Minkowski distances ( p <1 )
      • more robust to extreme differences in coordinates
    Non Metric Dissimilarities – Time
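The Dynamic Time Warping distance mentioned above can be sketched with the standard dynamic-programming recurrence; the sequences are illustrative. Using absolute difference as the element cost is an assumption here (squared difference is also common).

```python
def dtw(a, b):
    """Dynamic Time Warping: minimum cumulative |a_i - b_j| cost over
    all monotone alignments of the two sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # match step
    return cost[n][m]

# Same shape, shifted in time: DTW aligns them at zero cost,
# where plain Euclidean distance could not even compare the lengths.
print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))   # 0.0
```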
    • Recall: Jaccard coefficient is a commonly used measure of overlap of two sets A and B
    • jaccard (A,B) = |A ∩ B| / |A ∪ B|
    • jaccard (A,A) = 1
    • jaccard (A,B) = 0 if A ∩ B = ∅
    • A and B don’t have to be the same size.
    • JC always assigns a number between 0 and 1 .
    Jaccard Coefficient
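The Jaccard coefficient above maps directly onto Python's set operations; the sample sets are illustrative, and returning 1 for two empty sets is a convention this sketch assumes.

```python
def jaccard(a, b):
    """Jaccard coefficient |A intersect B| / |A union B| of two sets."""
    if not a and not b:
        return 1.0          # assumed convention: two empty sets match
    return len(a & b) / len(a | b)

A = {"data", "mining", "cleaning"}
B = {"data", "mining", "integration", "reduction"}
print(jaccard(A, B))   # 2 shared of 5 total -> 0.4
print(jaccard(A, A))   # 1.0, as the slide states
```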
  • Takeaways: Why Data Preprocessing? Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization and concept hierarchy generation