Your SlideShare is downloading. ×
0
Data preprocessing
Data preprocessing
Data preprocessing
Data preprocessing
Data preprocessing
Data preprocessing
Data preprocessing
Data preprocessing
Data preprocessing
Data preprocessing
Data preprocessing
Data preprocessing
Data preprocessing
Data preprocessing
Data preprocessing
Data preprocessing
Data preprocessing
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Data preprocessing

187

Published on

Published in: Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
187
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
6
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Data Preprocessing BY: K.KOTTAISAMY II MCA
  • 2. Data Preprocessing Data cleaning Data integration Data transformation Data reduction Data discretization
  • 3. Major tasks in Data Preprocessing Data cleaning Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Data integration Integration of multiple databases, data cubes, or files Data transformation Normalization and aggregation
  • 4. Data reduction Reduced representation in volume but produces the same or similar analytical results Data discretization Part of data reduction but with particular importance, especially for numerical data
  • 5. Data Preprocessing
  • 6. Data Cleaning Data in the Real World Is Dirty: Lots of incorrect data, e.g., instrument faulty, human or computer error, transmission error incomplete: lacking attribute values, containing only aggregate data e.g., Occupation = “ ” (missing data) noisy: containing noise, errors, or outliers e.g., Salary = “−10” (an error)
  • 7. Data Integration Data integration: combines data from multiple sources Schema integration integrate metadata from different sources Detecting and resolving data value conflicts for the same real world entity, attribute values from different sources are different, e.g., different scales, metric vs. British units Removing duplicates and redundant data
  • 8. Data Transformation Smoothing: remove noise from data Aggregation: summarization, data cube construction Generalization: concept hierarchy climbing Normalization: scaled to fall within a small, specified range min-max normalization z-score normalization normalization by decimal scaling
  • 9. Data Transformation: Normalization Min-max normalization: to [new_minA, new_maxA] v' v minA (new _ maxA new _ minA) new _ minA maxA minA Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then $73,000 is mapped to 73,600 98,000 12 ,000 (1.0 0) 12 ,000 0 0.716
  • 10. Data Transformation: Normalization Z-score normalization (μ: mean, σ: standard deviation): v' v A A Ex. Let μ = 54,000, σ = 16,000. Then 73,600 54 ,000 16 ,000 1.225
  • 11. Data Transformation: Normalization Normalization by decimal scaling v v' 10 j Where j is the smallest integer such that Max(|ν’|) < 1
  • 12. Data reduction Why data reduction? A database/data warehouse may store terabytes of data Complex data analysis/mining may take a very long time to run on the complete data set Data reduction The data set that is much smaller in volume but yet produce the same analytical results
  • 13. Data reduction strategies Data cube aggregation Dimensionality reduction — e.g., remove unimportant attributes Data Compression Discretization and concept hierarchy generation
  • 14. Data cube aggregation The lowest level of a data cube (base cuboid) The aggregated data for an individual entity of interest E.g., a customer in a phone calling data warehouse Multiple levels of aggregation in data cubes Further reduce the size of data to deal with
  • 15. Dimensionality Reduction Feature selection (attribute subset selection): Select a minimum set of attributes that is sufficient for the data mining task. Heuristic methods step-wise forward selection step-wise backward elimination combining forward selection and backward elimination
  • 16. Data Discretization Three types of attributes: Nominal — values from an unordered set Ordinal — values from an ordered set Continuous — real numbers Discretization: Some classification algorithms only accept categorical attributes. Reduce data size by discretization Prepare for further analysis

×