What is the need for Data Processing? Data processing is needed to extract the required information from data sets that are huge, incomplete, noisy, and inconsistent.
Steps in Data Processing: Data Cleaning, Data Integration, Data Transformation, Data Reduction, Data Summarization.
What is Data Cleaning? Data cleaning is a procedure to “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
What is Data Integration? Data integration is the process of combining multiple databases, data cubes, or files.
What is Data Transformation? Data transformation operations, such as normalization and aggregation, are additional data preprocessing procedures that would contribute toward the success of the mining process.
What is Data Reduction? Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
What is Data Summarization? Data summarization is the process of representing the collected data in an accurate and compact way while preserving its essential information. It also involves deriving information from the collected data. Example: display the data as a graph and compute the mean, median, and mode.
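The summary statistics mentioned above can be sketched with Python's standard `statistics` module. The `ages` list is a hypothetical sample made up for illustration; it does not come from any real data set.

```python
import statistics

# Hypothetical sample: ages collected from a small survey (illustrative values only).
ages = [23, 25, 25, 29, 31, 35, 35, 35, 41]

# A compact description of the whole attribute in three numbers.
summary = {
    "mean": statistics.mean(ages),      # arithmetic average
    "median": statistics.median(ages),  # middle value of the sorted data
    "mode": statistics.mode(ages),      # most frequent value
}
print(summary)
```

Note that such a summary is compact precisely because it drops detail: the individual ages cannot be recovered from the three numbers.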
How to Clean Data? Handling missing values: ignore the tuple; fill in the missing value manually; use a global constant to fill in the missing value; use the attribute mean to fill in the missing value; use the attribute mean for all samples belonging to the same class as the given tuple; use the most probable value to fill in the missing value.
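Two of the strategies above, filling with the attribute mean and filling with the mean of the tuple's own class, might look like the following minimal sketch. The `records` list, the class labels, and both function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical records: (class_label, income); None marks a missing value.
records = [
    ("low_risk", 50.0),
    ("low_risk", None),
    ("low_risk", 70.0),
    ("high_risk", 20.0),
    ("high_risk", None),
    ("high_risk", 40.0),
]

def fill_with_attribute_mean(rows):
    """Replace each missing value with the mean of all known values."""
    known = [value for _, value in rows if value is not None]
    mean = sum(known) / len(known)
    return [(label, mean if value is None else value) for label, value in rows]

def fill_with_class_mean(rows):
    """Replace each missing value with the mean of the known values in its class.

    Assumes every class has at least one known value.
    """
    by_class = defaultdict(list)
    for label, value in rows:
        if value is not None:
            by_class[label].append(value)
    class_mean = {label: sum(vals) / len(vals) for label, vals in by_class.items()}
    return [(label, class_mean[label] if value is None else value)
            for label, value in rows]

global_filled = fill_with_attribute_mean(records)
class_filled = fill_with_class_mean(records)
print(global_filled)
print(class_filled)
```

The class-wise variant usually gives a closer estimate when the attribute differs noticeably between classes, as it does in this made-up sample.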
How to Clean Data? Handling noisy data. Binning: binning methods smooth a sorted data value by consulting its “neighborhood”, that is, the values around it. Regression: data can be smoothed by fitting the data to a function, such as with regression. Clustering: outliers may be detected by clustering, where similar values are organized into groups, or “clusters”.
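As one possible illustration of binning, the sketch below smooths by bin means: the values are sorted, split into bins of equal size, and each value is replaced by the mean of its bin. The function name, the bin size, and the `prices` values are assumptions made for this example; other variants (bin medians, bin boundaries) exist.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, partition them into consecutive bins of `bin_size`
    items, and replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Illustrative price data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(smoothed_prices)
```

If the number of values is not a multiple of `bin_size`, the last bin is simply smaller in this sketch.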
Data Integration: Data integration combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. Issues that arise during data integration include schema integration and object matching. Redundancy is another important issue.
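A small sketch of the schema integration and object matching problem: two hypothetical sources describe the same customers but name the key attribute differently (`customer_id` versus `cust_number`). All source names, field names, and values below are invented for illustration, and the sketch assumes the two keys really do identify the same real-world entity.

```python
# Hypothetical source 1: a CRM export keyed by "customer_id".
crm_rows = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Raj"},
]

# Hypothetical source 2: a billing export keyed by "cust_number".
billing_rows = [
    {"cust_number": 2, "balance": 120.0},
    {"cust_number": 3, "balance": 15.5},
]

def integrate(crm, billing):
    """Merge both sources into one record per customer under a unified schema.

    Attributes missing from a source are left as None.
    """
    merged = {}
    for row in crm:
        key = row["customer_id"]
        merged[key] = {"customer_id": key, "name": row["name"], "balance": None}
    for row in billing:
        key = row["cust_number"]  # same entity, different attribute name
        entry = merged.setdefault(
            key, {"customer_id": key, "name": None, "balance": None}
        )
        entry["balance"] = row["balance"]
    return [merged[key] for key in sorted(merged)]

customers = integrate(crm_rows, billing_rows)
for customer in customers:
    print(customer)
```

Storing each attribute once per customer in the unified record is also a simple guard against the redundancy that appears when both sources are kept side by side.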
Data Transformation: Data transformation can be achieved in the following ways. Smoothing: removes noise from the data. Aggregation: summary or aggregation operations are applied to the data; for example, daily sales data may be aggregated to compute weekly and annual totals. Generalization: low-level or “primitive” (raw) data are replaced by higher-level concepts through the use of concept hierarchies; for example, a categorical attribute such as street can be generalized to a higher-level concept such as city or country. Normalization: the attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0 or 0.0 to 1.0. Attribute construction: new attributes are constructed from the given set of attributes and added to help the mining process.
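Normalization and aggregation can be sketched as follows. Min-max scaling is used here as one common way to map an attribute into a target range; it is an assumed choice, not the only normalization method. The function names and all sample values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min
    and the largest maps to new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("cannot normalize a constant attribute")
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]

def aggregate_weekly(daily_sales, days_per_week=7):
    """Sum consecutive groups of daily figures into weekly totals."""
    return [
        sum(daily_sales[i:i + days_per_week])
        for i in range(0, len(daily_sales), days_per_week)
    ]

# Illustrative values only.
incomes = [20.0, 40.0, 60.0, 100.0]
unit_range = min_max_normalize(incomes)               # scaled to 0.0 .. 1.0
signed_range = min_max_normalize(incomes, -1.0, 1.0)  # scaled to -1.0 .. 1.0

daily_sales = [10, 12, 9, 11, 14, 20, 18, 8, 10, 11, 9, 13, 22, 19]
weekly_sales = aggregate_weekly(daily_sales)

print(unit_range)
print(signed_range)
print(weekly_sales)
```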
Data Reduction Techniques: These techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. They are data cube aggregation, attribute subset selection, dimensionality reduction, numerosity reduction, and discretization and concept hierarchy generation.
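To make one of these techniques concrete, the sketch below shows discretization: raw numeric values are replaced by a small number of interval labels. Equal-width intervals are an assumed choice for simplicity, and the function name, labels, and `ages` values are hypothetical.

```python
def discretize_equal_width(values, labels):
    """Split the value range into len(labels) equal-width intervals
    and replace each value by the label of its interval."""
    low, high = min(values), max(values)
    width = (high - low) / len(labels)
    result = []
    for v in values:
        index = int((v - low) / width) if width > 0 else 0
        index = min(index, len(labels) - 1)  # the maximum falls in the last interval
        result.append(labels[index])
    return result

# Illustrative values only.
ages = [18, 22, 35, 41, 58, 63, 78]
age_groups = discretize_equal_width(ages, ["young", "middle_aged", "senior"])
print(age_groups)
```

The reduced attribute has only three distinct values instead of seven, at the cost of the exact ages.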