Data Mining: Data processing
Published in: Technology

  • 1. Data Processing
  • 2. What is the need for Data Processing?
    Data processing is needed to extract the required information from large, incomplete, noisy, and inconsistent data sets.
  • 3. Steps in Data Processing
    Data Cleaning
    Data Integration
    Data Transformation
    Data Reduction
    Data Summarization
  • 4. What is Data Cleaning?
    Data cleaning is a procedure to “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
  • 5. What is Data Integration?
    Data integration is the process of combining multiple databases, data cubes, or files.
  • 6. What is Data Transformation?
    Data transformation operations, such as normalization and aggregation, are additional data preprocessing procedures that contribute to the success of the mining process.
  • 7. What is Data Reduction?
    Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
  • 8. What is Data Summarization?
    Data summarization is the process of representing the collected data in an accurate and compact way without losing essential information. It also involves deriving information from the collected data.
    Example: display the data as a graph and compute the mean, median, and mode.
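
The summary statistics named on this slide can be computed with Python's standard `statistics` module. This is only a minimal sketch: the `daily_sales` values are invented for illustration, and the graphing step is left out.

```python
from statistics import mean, median, mode

# Hypothetical sample: daily unit sales (values invented for illustration).
daily_sales = [12, 15, 15, 18, 20, 22, 15, 30]

# Collapse the raw values into a compact summary.
summary = {
    "mean": mean(daily_sales),      # arithmetic average
    "median": median(daily_sales),  # middle value of the sorted data
    "mode": mode(daily_sales),      # most frequent value
}
print(summary)
```
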
  • 9. How to Clean Data?
    Handling Missing Values
    Ignore the tuple
    Fill in the missing value manually
    Use a global constant to fill in the missing value
    Use the attribute mean to fill in the missing value
    Use the attribute mean for all samples belonging to the same class as the given tuple
    Use the most probable value to fill in the missing value
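
Three of the strategies above (ignoring the tuple, filling with the attribute mean, and filling with the class mean) might look like the following sketch. The `records` data, the `(class label, value)` layout, and all function names are assumptions made for this example, not part of the slides.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tuples: (class label, income); None marks a missing value.
records = [
    ("low", 20.0), ("low", None), ("low", 30.0),
    ("high", 80.0), ("high", 100.0), ("high", None),
]

def drop_incomplete(rows):
    """Ignore the tuple: keep only rows whose value is present."""
    return [(label, value) for label, value in rows if value is not None]

def fill_with_attribute_mean(rows):
    """Replace each missing value with the mean of all known values."""
    overall = mean(value for _, value in rows if value is not None)
    return [(label, overall if value is None else value) for label, value in rows]

def fill_with_class_mean(rows):
    """Replace each missing value with the mean of its own class.

    Assumes every class has at least one known value.
    """
    known_by_class = defaultdict(list)
    for label, value in rows:
        if value is not None:
            known_by_class[label].append(value)
    class_means = {label: mean(vals) for label, vals in known_by_class.items()}
    return [(label, class_means[label] if value is None else value)
            for label, value in rows]

print(drop_incomplete(records))
print(fill_with_attribute_mean(records))
print(fill_with_class_mean(records))
```

The class mean usually gives a closer estimate than the overall mean when the classes differ a lot, as they do in this invented sample.
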
  • 10. How to Clean Data?
    Handling Noisy Data
    Binning: Binning methods smooth a sorted data value by consulting its “neighborhood”, that is, the values around it.
    Regression: Data can be smoothed by fitting it to a function, as in regression.
    Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.”
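
One common form of binning is smoothing by bin means: sort the values, split them into equal-size bins, and replace each value with the mean of its bin. The sketch below assumes equal-frequency bins; the function name and the `prices` sample are illustrative only.

```python
from statistics import mean

def smooth_by_bin_means(values, bin_size):
    """Sort the values, partition them into bins of `bin_size`,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        # Every member of the bin takes the bin's mean.
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

# Hypothetical price data (illustrative values).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
```

Larger bins smooth more aggressively, at the cost of flattening genuine variation in the data.
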
  • 11. Data Integration
    Data integration combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. Issues that arise during data integration include schema integration and object matching. Redundancy is another important issue.
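
A small sketch of schema integration and object matching: two sources describe the same customers under different attribute names, so one schema is mapped onto the other before rows are merged on a shared key. All data, attribute names, and function names here are hypothetical.

```python
# Hypothetical sources with different schemas for the same entities.
crm_rows = [
    {"customer_id": 1, "name": "Ana", "city": "Lisbon"},
    {"customer_id": 2, "name": "Ben", "city": "Oslo"},
]
billing_rows = [
    {"cust_number": 2, "annual_revenue": 1200},
    {"cust_number": 3, "annual_revenue": 450},
]

# Schema integration: map source-specific attribute names to a shared schema.
BILLING_TO_SHARED = {"cust_number": "customer_id"}

def rename(row, mapping):
    """Return a copy of `row` with its keys renamed according to `mapping`."""
    return {mapping.get(key, key): value for key, value in row.items()}

def integrate(*sources):
    """Object matching: merge rows that share the same customer_id,
    so each entity appears once in the integrated result."""
    merged = {}
    for source in sources:
        for row in source:
            merged.setdefault(row["customer_id"], {}).update(row)
    return [merged[key] for key in sorted(merged)]

billing_shared = [rename(row, BILLING_TO_SHARED) for row in billing_rows]
integrated = integrate(crm_rows, billing_shared)
print(integrated)
```

Merging on the key also avoids storing the same customer twice, which is one simple way redundancy shows up after integration.
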
  • 12. Data Transformation
    Data transformation can be achieved in the following ways:
    Smoothing: which works to remove noise from the data
    Aggregation: where summary or aggregation operations are applied to the data. For example, daily sales data may be aggregated to compute weekly and annual totals.
    Generalization of the data: where low-level or “primitive” (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country.
    Normalization: where the attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0.
    Attribute construction: where new attributes are constructed from the given set of attributes and added to help the mining process.
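
Two of these transformations, aggregation and normalization, can be sketched as follows. The slide does not name a normalization formula, so min-max scaling is assumed here as one common choice; the function names and sample values are likewise assumptions.

```python
def weekly_totals(daily_values, days_per_week=7):
    """Aggregation: sum consecutive daily values into weekly totals."""
    return [sum(daily_values[i:i + days_per_week])
            for i in range(0, len(daily_values), days_per_week)]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Normalization: linearly rescale values into [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # Constant attribute: map everything to new_min (a convention chosen here).
        return [new_min for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (value - old_min) * scale for value in values]

# Hypothetical data: two weeks of daily sales and three income values.
daily_sales = [1, 2, 3, 4, 5, 6, 7, 2, 2, 2, 2, 2, 2, 2]
incomes = [10, 20, 30]

print(weekly_totals(daily_sales))
print(min_max_normalize(incomes))
print(min_max_normalize(incomes, -1.0, 1.0))
```

Min-max scaling preserves the relative spacing of the values, but a single extreme value can compress everything else into a narrow part of the range.
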
  • 13. Data Reduction techniques
    These are the techniques that can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
    Data cube aggregation
    Attribute subset selection
    Dimensionality reduction
    Numerosity reduction
    Discretization and concept hierarchy generation
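
As a rough illustration of two of these techniques, the sketch below shows numerosity reduction by simple random sampling without replacement and discretization into equal-width intervals. Both are only one possible instance of each technique; the function names, the fixed seed, and the `ages` sample are assumptions.

```python
import random

def sample_without_replacement(rows, n, seed=0):
    """Numerosity reduction: keep a simple random sample of n rows.
    A fixed seed makes the sample reproducible."""
    rng = random.Random(seed)
    return rng.sample(rows, n)

def equal_width_bins(values, k):
    """Discretization: map each value to the index (0..k-1) of one of
    k equal-width intervals spanning the observed range."""
    low, high = min(values), max(values)
    width = (high - low) / k
    if width == 0:
        return [0 for _ in values]  # constant attribute: a single interval
    # The maximum value would land on index k, so clamp it into the last bin.
    return [min(int((value - low) / width), k - 1) for value in values]

# Hypothetical attribute values.
ages = [13, 15, 22, 25, 33, 35, 46, 52, 70]

print(sample_without_replacement(ages, 4))
print(equal_width_bins(ages, 3))
```
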