Data Preprocessing: Presentation Transcript

  • Data Preprocessing: A Brief Presentation on Data Mining, by Jason Rodrigues
    • Introduction
    • Why data preprocessing?
    • Data Cleaning
    • Data Integration and Transformation
    • Data Reduction
    • Discretization and concept hierarchy generation
    • Takeaways
    Agenda
  • Why Data Preprocessing?
    • Data in the real world is dirty
      • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
      • noisy: containing errors or outliers
      • inconsistent: containing discrepancies in codes or names
    • No quality data, no quality mining results!
      • Quality decisions must be based on quality data
      • Data warehouse needs consistent integration of quality data
    • A multi-dimensional measure of data quality
      • A well-accepted multi-dimensional view:
      • accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility
    • Broad categories
      • intrinsic, contextual, representational, and accessibility
  • Data Preprocessing: Major Tasks of Data Preprocessing
    • Data cleaning
      • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
    • Data integration
      • Integration of multiple databases, data cubes, or files
    • Data transformation
      • Normalization (scaling to a specific range)
      • Aggregation
    • Data reduction
      • Obtains a reduced representation that is smaller in volume but produces the same or similar analytical results
      • Data discretization: of particular importance, especially for numerical data
      • Data aggregation, dimensionality reduction, data compression, generalization
  • Data Preprocessing: Major Tasks of Data Preprocessing [Figure: the knowledge discovery process: data cleaning and data integration bring databases into a data warehouse; selection yields task-relevant data for data mining, then pattern evaluation, then knowledge]
  • Data Cleaning: Tasks of Data Cleaning
      • Fill in missing values
      • Identify outliers and smooth noisy data
      • Correct inconsistent data
  • Data Cleaning: Manage Missing Data
      • Ignore the tuple: usually done when class label is missing (assuming the task is classification — not effective in certain cases)
      • Fill in the missing value manually: tedious + infeasible?
      • Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
      • Use the attribute mean to fill in the missing value
      • Use the attribute mean for all samples of the same class to fill in the missing value: smarter
      • Use the most probable value to fill in the missing value: inference-based such as regression, Bayesian formula, decision tree
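To make the fill-in strategies concrete, here is a minimal sketch in Python/pandas (not from the original slides; the DataFrame and column names are invented for illustration):

```python
# Hypothetical example data: "income" has missing values, "class" is a label.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [30_000, np.nan, 52_000, np.nan, 41_000],
    "class":  ["low", "low", "high", "high", "mid"],
})

# Global constant: fill with a sentinel such as "unknown" / -1.
df["income_const"] = df["income"].fillna(-1)

# Attribute mean over all samples.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Attribute mean within the same class (smarter).
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
```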
  • Data Cleaning: Manage Noisy Data
      • Binning: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, bin boundaries, etc.
      • Clustering: detect and remove outliers
      • Semi-automated: combined computer and manual intervention
      • Regression: smooth by fitting the data to regression functions
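As an illustration of the binning method (a sketch, not from the slides; the price values are invented):

```python
# Equi-depth binning with smoothing by bin means and by bin boundaries.
import numpy as np

prices = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(prices, 3)  # three equi-depth bins of four values each

# Smooth by bin means: every value is replaced by its bin's mean.
by_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smooth by bin boundaries: every value moves to the nearer boundary.
by_bounds = np.concatenate(
    [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins]
)
```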
  • Data Cleaning: Cluster Analysis
  • Data Cleaning: Regression Analysis [Figure: scatter plot of x vs. y with fitted regression line y = x + 1]
    • Linear regression: find the best line to fit two variables
    • Multiple linear regression: more than two variables, fit to a multidimensional surface
    • Manual correction using external references
    • Semi-automatic correction using various tools
      • To detect violation of known functional dependencies and data constraints
      • To correct redundant data
    Data Cleaning: Inconsistent Data
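A minimal sketch of regression-based smoothing, echoing the y = x + 1 line in the figure above (the data points are invented):

```python
# Fit a least-squares line and replace noisy y values with fitted ones.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])  # roughly y = x + 1, with noise

w, b = np.polyfit(x, y, deg=1)  # slope and intercept of the best-fit line
y_smoothed = w * x + b          # smoothed values lie on the fitted line
```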
    • Data integration:
      • combines data from multiple sources into a coherent store
    • Schema integration
      • integrate metadata from different sources
      • Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
    • Detecting and resolving data value conflicts
      • for the same real world entity, attribute values from different sources are different
      • possible reasons: different representations, different scales, e.g., metric vs. British units, different currency
    Data Integration and Transformation: Tasks of Data Integration and Transformation
    • Redundant data often occur when integrating multiple databases
      • The same attribute may have different names in different databases
      • One attribute may be a “derived” attribute in another table, e.g., annual revenue
    • Redundant data can often be detected by correlation analysis
    • Careful integration can help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
    Data Integration and Transformation: Manage Data Integration
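A rough sketch of redundancy detection by correlation analysis (column names are hypothetical; a near-perfect correlation suggests one attribute is derivable from another, as with annual revenue above):

```python
import pandas as pd

df = pd.DataFrame({
    "monthly_revenue": [10, 12, 9, 14, 11],
    "annual_revenue":  [120, 144, 108, 168, 132],  # derived: 12 * monthly
    "employees":       [5, 6, 4, 9, 8],
})

corr = df.corr(method="pearson")
# Off-diagonal entries close to 1 flag likely-redundant attribute pairs.
redundant_pairs = corr.abs() > 0.95
```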
    • Smoothing: remove noise from data (binning, clustering, regression)
    • Aggregation: summarization, data cube construction
    • Generalization: concept hierarchy climbing
    • Normalization: scaled to fall within a small, specified range
      • min-max normalization
      • z-score normalization
      • normalization by decimal scaling
    • Attribute/feature construction
      • New attributes constructed from the given ones
    Data Integration and Transformation: Manage Data Transformation
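The three normalization methods above, as a minimal sketch (the values are illustrative):

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: rescale to [new_min, new_max].
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: zero mean, unit standard deviation.
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10**j, the smallest j with max(|v'|) < 1.
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
v_decimal = v / 10 ** j
```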
    • Data reduction: reduced representation, while still retaining critical information
    • Data cube aggregation
    • Dimensionality reduction
    • Data compression
    • Numerosity reduction
    • Discretization and concept hierarchy generation
    Data Reduction: Manage Data Reduction
    • Multiple levels of aggregation in data cubes
      • Further reduce the size of data to deal with
    • Reference appropriate levels: use the smallest representation capable of solving the task
    Data Reduction: Data Cube Aggregation
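For instance, quarterly sales can be rolled up to annual totals so that analysis runs on the smaller representation (a sketch with invented numbers):

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2008] * 4 + [2009] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "amount":  [224, 408, 350, 586, 311, 402, 390, 612],
})

# Aggregate to the coarser (annual) level: eight rows reduce to two.
annual = sales.groupby("year", as_index=False)["amount"].sum()
```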
    • String compression
      • There are extensive theories and well-tuned algorithms
      • Typically lossless
      • But only limited manipulation is possible without expansion
    • Audio/video, image compression
      • Typically lossy compression, with progressive refinement
      • Sometimes small fragments of signal can be reconstructed without reconstructing the whole
    • Time sequences are not audio
      • typically short, and they vary slowly with time
    Data Reduction: Data Compression
  • Data Reduction: Decision Tree
    • Proximity refers to either similarity or dissimilarity, since the proximity between two objects is a function of the proximity between the corresponding attributes of the two objects.
    • Similarity: Numeric measure of the degree to which the two objects are alike.
    • Dissimilarity: Numeric measure of the degree to which the two objects are different.
    Proximity: Similarities and Dissimilarities
    • Similarity
      • Numerical measure of how alike two data objects are.
      • Is higher when objects are more alike.
      • Often falls in the range [0,1]
    • Dissimilarity
      • Numerical measure of how different two data objects are
      • Lower when objects are more alike
      • Minimum dissimilarity is often 0
      • Upper limit varies
    • Proximity refers to a similarity or dissimilarity
    Dissimilarities between Data Objects?
    • Euclidean Distance: dist(p, q) = sqrt( (p_1 − q_1)² + (p_2 − q_2)² + … + (p_n − q_n)² )
      • where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the k-th attributes (components) of data objects p and q
    • Standardization is necessary if scales differ.
    Euclidean Distance
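A direct translation of the formula into Python (a minimal sketch):

```python
import numpy as np

def euclidean(p, q):
    """dist(p, q) = sqrt(sum over k of (p_k - q_k)**2)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))

print(euclidean([0, 2], [2, 0]))  # 2.828..., i.e. 2 * sqrt(2)
```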
    • Minkowski distance: dist(p, q) = ( |p_1 − q_1|^r + |p_2 − q_2|^r + … + |p_n − q_n|^r )^(1/r)
    • r = 1: city block (Manhattan, taxicab, L1 norm) distance.
      • A common example is the Hamming distance, which is just the number of bits that differ between two binary vectors
    • r = 2: Euclidean (L2 norm) distance
    • r → ∞: “supremum” (Lmax norm, L∞ norm) distance.
      • This is the maximum difference between any component of the vectors
      • Example: L∞ of (1, 0, 2) and (6, 0, 3) = max(|1 − 6|, |0 − 0|, |2 − 3|) = 5
      • Do not confuse r with n; all these distances are defined for any number of dimensions.
    Minkowski Distance
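A minimal sketch covering the three special cases, including the L∞ example above:

```python
import numpy as np

def minkowski(p, q, r):
    """dist(p, q) = (sum over k of |p_k - q_k|**r) ** (1/r); r=inf is the supremum."""
    diff = np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))
    if np.isinf(r):
        return float(diff.max())
    return float(np.sum(diff ** r) ** (1.0 / r))

p, q = (1, 0, 2), (6, 0, 3)
print(minkowski(p, q, 1))       # 6.0   (city block, L1)
print(minkowski(p, q, 2))       # 5.099 (Euclidean, L2)
print(minkowski(p, q, np.inf))  # 5.0   (supremum, L-infinity)
```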
    • Distances, such as the Euclidean distance, have some well-known properties:
      • d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y (positive definiteness)
      • d(x, y) = d(y, x) for all x and y (symmetry)
      • d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z (triangle inequality)
    • where d(x, y) is the distance (dissimilarity) between points (data objects) x and y
    • A distance that satisfies these properties is called a metric, and a space equipped with a metric is called a metric space
    Euclidean Distance Properties
    • non-metric measures are often robust (resistant to outliers, errors in objects, etc.)
      • symmetry and, especially, the triangle inequality are often violated
    • they cannot be used directly with metric access methods (MAMs)
    Non-Metric Dissimilarities – Set Differences [Figure: set-difference examples violating the triangle inequality (a > b + c) and symmetry (a ≠ b)]
    • various k-median distances
      • measure the distance between the k-th most similar portions of two objects
    • COSIMIR
      • a back-propagation network with a single output neuron serving as the distance; allows training
    • Dynamic Time Warping (DTW) distance
      • sequence alignment technique
      • minimizes the sum of distances between aligned sequence elements
    • fractional Lp distances
      • generalization of Minkowski distances (p < 1)
      • more robust to extreme differences in individual coordinates
    Non-Metric Dissimilarities – Time
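A minimal dynamic-programming sketch of the Dynamic Time Warping distance described above, using absolute difference as the element cost (illustrative, not an optimized implementation):

```python
import numpy as np

def dtw(a, b):
    """Minimum total element cost over all monotone alignments of a and b."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

print(dtw([1, 2, 3, 4], [1, 2, 2, 3, 4]))  # 0.0: the repeated 2 is absorbed
```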
    • Recall: Jaccard coefficient is a commonly used measure of overlap of two sets A and B
    • jaccard (A,B) = |A ∩ B| / |A ∪ B|
    • jaccard (A,A) = 1
    • jaccard (A,B) = 0 if A ∩ B = ∅
    • A and B don’t have to be the same size.
    • The Jaccard coefficient always assigns a number between 0 and 1.
    Jaccard Coefficient
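The coefficient translates directly to Python sets (a sketch; the empty-set convention is an assumption):

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; by convention here, two empty sets get 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
print(jaccard({1, 2}, {3, 4}))        # 0.0 (disjoint sets)
```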
  • Takeaways: Why Data Preprocessing?; Data Cleaning; Data Integration and Transformation; Data Reduction; Discretization and Concept Hierarchy Generation