Data processing involves cleaning, integrating, transforming, reducing, and summarizing data from various sources into a coherent and useful format. It aims to handle issues like missing values, noise, inconsistencies, and volume to produce an accurate and compact representation of the original data while preserving as much of its information as possible. Some key techniques involved are data cleaning through binning, regression, and clustering to smooth noise or detect outliers; data integration to combine multiple sources; data transformation through smoothing, aggregation, generalization, and normalization; and data reduction using cube aggregation, attribute selection, dimensionality reduction, and discretization.
What is the need for Data Processing?
Data processing is necessary to extract the required information from huge, incomplete, noisy, and inconsistent data sets.
3.
Steps in Data Processing
Data cleaning
Data integration
Data transformation
Data reduction
Data summarization
4.
What is Data Cleaning?
Data cleaning is a procedure to “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
5.
What is Data Integration?
Data integration is the process of combining multiple databases, data cubes, or files.
6.
What is Data Transformation?
Data transformation operations, such as normalization and aggregation, are additional data preprocessing procedures that contribute toward the success of the mining process.
7.
What is Data Reduction?
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
8.
What is Data Summarization?
Data summarization is the process of representing the collected data in an accurate and compact way while retaining its essential information. It also involves extracting information from the collected data.
Example: display the data as a graph and compute the mean, median, mode, etc.
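As a rough illustration of the summarization example above, the sketch below computes the mean, median, and mode with Python's standard `statistics` module. The `daily_sales` values and the `summarize` helper are made up for this illustration and are not part of the original slides.

```python
import statistics

# Hypothetical sample (an assumption for illustration): unit sales for one week.
daily_sales = [12, 15, 12, 18, 20, 12, 16]


def summarize(values):
    """Return a compact summary of a non-empty list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
    }


summary = summarize(daily_sales)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The summary is far smaller than the raw data, but the individual daily values cannot be recovered from it, which is why a summary is usually kept alongside the data rather than in place of it.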
9.
How to Clean Data?
Handling missing values:
Ignore the tuple.
Fill in the missing value manually.
Use a global constant to fill in the missing value.
Use the attribute mean to fill in the missing value.
Use the attribute mean for all samples belonging to the same class as the given tuple.
Use the most probable value to fill in the missing value.
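Four of the strategies listed above can be sketched in a few lines of plain Python. The records, class labels, and function names below are hypothetical, and `None` is assumed to mark a missing value. Filling in manually and predicting the most probable value are left out because they need human input or a trained model.

```python
from statistics import mean

# Hypothetical records (an assumption): (class label, income in thousands).
# None marks a missing value.
records = [
    ("low_risk", 50.0),
    ("low_risk", None),
    ("low_risk", 70.0),
    ("high_risk", 20.0),
    ("high_risk", None),
    ("high_risk", 40.0),
]


def drop_incomplete(rows):
    """Ignore the tuple: keep only rows whose value is present."""
    return [(label, v) for label, v in rows if v is not None]


def fill_with_constant(rows, constant):
    """Replace every missing value with one global constant."""
    return [(label, constant if v is None else v) for label, v in rows]


def fill_with_attribute_mean(rows):
    """Replace missing values with the mean of all known values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(label, overall if v is None else v) for label, v in rows]


def fill_with_class_mean(rows):
    """Replace missing values with the mean of known values in the same class.

    Assumes every class has at least one known value.
    """
    known = {}
    for label, v in rows:
        if v is not None:
            known.setdefault(label, []).append(v)
    class_means = {label: mean(vs) for label, vs in known.items()}
    return [(label, class_means[label] if v is None else v) for label, v in rows]


print(fill_with_attribute_mean(records))
print(fill_with_class_mean(records))
```

With this sample the attribute mean fills both gaps with 45.0, while the class mean fills them with 60.0 and 30.0, which shows how using the class label can give a less biased estimate.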
10.
How to Clean Data?
Handling noisy data:
Binning: binning methods smooth a sorted data value by consulting its “neighborhood”, that is, the values around it.
Regression: data can be smoothed by fitting the data to a function, such as with regression.
Clustering: outliers may be detected by clustering, where similar values are organized into groups, or “clusters”.
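One common form of binning is smoothing by bin means: the sorted values are split into equal-frequency bins and each value is replaced by the mean of its bin. The sketch below assumes that variant; the function name, bin size, and price list are illustrative choices, not something the slides specify.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    each value with the mean of its bin (the last bin may be shorter)."""
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


# Illustrative price data, split into three bins of three values each.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(smoothed_prices)
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Larger bins smooth more aggressively. Smoothing by bin medians or bin boundaries follows the same pattern with a different replacement value.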
11.
Data Integration
Data integration combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. Issues that arise during data integration include schema integration and object matching. Redundancy is another important issue.
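The sketch below shows schema integration and object matching on a toy scale. Two hypothetical sources use different field names for the same customer key, so one is renamed to match the other before the records are merged. All field names, values, and helper functions are assumptions for illustration. Real integration also has to resolve conflicting values and redundant attributes, which this sketch does not attempt.

```python
# Hypothetical source A (for example, a sales database) keyed by "customer_id".
source_a = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]

# Hypothetical source B (for example, a flat file) that calls the same key "cust_number".
source_b = [
    {"cust_number": 1, "city": "Pune"},
    {"cust_number": 3, "city": "Oslo"},
]

# Schema mapping, assumed to have been decided by a person who knows both sources.
B_TO_A = {"cust_number": "customer_id"}


def rename_fields(row, mapping):
    """Return a copy of row with field names translated through mapping."""
    return {mapping.get(field, field): value for field, value in row.items()}


def integrate(rows_a, rows_b, key):
    """Merge rows that share the same key value into one record.

    If both sources define the same non-key field, the value from rows_b wins.
    """
    merged = {}
    for row in rows_a + rows_b:
        merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


unified = integrate(
    source_a,
    [rename_fields(row, B_TO_A) for row in source_b],
    key="customer_id",
)
for record in unified:
    print(record)
```

Customer 1 ends up with both a name and a city, while customers 2 and 3 each keep only the fields their source provided, so the integrated store can itself introduce missing values that need cleaning.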
12.
Data Transformation
Data transformation can be achieved in the following ways:
Smoothing: removes noise from the data.
Aggregation: summary or aggregation operations are applied to the data. For example, daily sales data may be aggregated to compute weekly and annual totals.
Generalization: low-level or “primitive” (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, a categorical attribute such as street can be generalized to higher-level concepts such as city or country.
Normalization: the attribute data are scaled to fall within a small specified range, such as −1.0 to 1.0 or 0.0 to 1.0.
Attribute construction: new attributes are constructed from the given set of attributes and added to help the mining process.
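Two of the transformations above, normalization and aggregation, can be sketched as follows. Min-max scaling is assumed here as the normalization method; it maps a value v to new_min + (v - min) / (max - min) * (new_max - new_min). The function names and sample numbers are illustrative only.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("cannot normalize an attribute with a single distinct value")
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]


def aggregate_weekly(daily_values):
    """Sum consecutive groups of seven daily values into weekly totals.

    A trailing partial week is summed as-is.
    """
    return [sum(daily_values[i:i + 7]) for i in range(0, len(daily_values), 7)]


incomes = [10, 20, 30]
print(min_max_normalize(incomes))             # [0.0, 0.5, 1.0]
print(min_max_normalize(incomes, -1.0, 1.0))  # [-1.0, 0.0, 1.0]

daily_sales = [1] * 14                        # two weeks of one sale per day
print(aggregate_weekly(daily_sales))          # [7, 7]
```

Min-max scaling depends on the observed minimum and maximum, so a single extreme outlier compresses all other values into a narrow band. That is one reason to clean the data before transforming it.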
13.
Data Reduction Techniques
These are the techniques that can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data:
Data cube aggregation
Attribute subset selection
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation
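As one small example from the list above, discretization can replace a numeric attribute with a handful of interval labels. The sketch below assumes equal-width intervals, which is only one of several possible discretization schemes. The ages, labels, and function name are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value the index (0 to n_bins - 1) of its equal-width interval."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot discretize an attribute with a single distinct value")
    width = (hi - lo) / n_bins
    indices = []
    for v in values:
        index = int((v - lo) / width)
        # The maximum value would land one past the end, so clamp it to the last bin.
        indices.append(min(index, n_bins - 1))
    return indices


# Hypothetical ages and concept labels (assumptions for illustration).
ages = [10, 20, 30, 40, 50, 60, 70]
labels = ["young", "middle_aged", "senior"]

bin_indices = equal_width_bins(ages, n_bins=len(labels))
age_groups = [labels[i] for i in bin_indices]
print(bin_indices)  # [0, 0, 1, 1, 2, 2, 2]
print(age_groups)
```

Seven distinct ages are reduced to three concepts, which shrinks the number of distinct values a mining algorithm has to consider at the cost of some detail.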