1. Data preparation and processing
Mahmoud Rafeek Alfarra
http://mfarra.cst.ps
University College of Science & Technology - Khan Yonis
Development of computer systems
2016
Chapter 2 – Lecture 2
2. Outline
Introduction
Domain Expert
Goal identification and Data Understanding
Data Cleaning
Missing values
Noisy Data
Inconsistent Data
Data Integration
Data Transformation
Data Reduction
Feature Selection
Sampling
Discretization
4. Noisy Data
Noise is a random error in a measured variable.
Noisy data is meaningless data.
Any data that has been received, stored, or changed
in such a manner that it cannot be read or used by the
program that originally created it can be described as
noisy.
5. Noisy Data
Sources of noisy data:
1. Data entry problems.
2. Faulty data collection instruments.
3. Data transmission errors.
6. How to handle noisy data?
Binning method
Clustering
Combined computer and human inspection
Regression
7. How to handle noisy data?
Binning method:
1. Sort the data.
2. Partition it into equal-frequency groups (bins).
3. Smooth each group by its mean, its median, or its
boundaries (replace each value with the nearer of the
group's minimum and maximum).
8. How to handle noisy data?
Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equal-frequency) groups:
- G1: 4, 8, 9, 15
- G2: 21, 21, 24, 25
- G3: 26, 28, 29, 34
Smoothing by bin means (rounded to the nearest integer):
- G1: 9, 9, 9, 9
- G2: 23, 23, 23, 23
- G3: 29, 29, 29, 29
Smoothing by bin boundaries:
- G1: 4, 4, 4, 15
- G2: 21, 21, 25, 25
- G3: 26, 26, 26, 34
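The binning example above can be reproduced with a short script. This is a minimal sketch, not a reference implementation: the function names are made up for illustration, it assumes the number of values divides evenly into the number of bins, and it assumes that a value equally close to both boundaries goes to the lower one (no such tie occurs in the price data).

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins groups of equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumption: len(values) is divisible by n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the nearer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        # Assumption: ties go to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

The bin means for G2 and G3 are 22.75 and 29.25 before rounding, which is why the slide shows 23 and 29.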
9. How to handle noisy data?
Clustering: Outliers may be detected by clustering,
where similar values are organized into groups. Values
that fall outside the set of clusters may be considered
outliers.
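One simple way to illustrate outlier detection by clustering on one-dimensional data is to group sorted values whenever neighbours are close together, then treat very small groups as outliers. The sketch below uses this gap-based grouping, which is only one of many clustering techniques and is not necessarily the one the lecture has in mind. The function names, the `max_gap` and `min_cluster_size` parameters, and the extra value 210 appended to the price data are all assumptions made for illustration.

```python
def cluster_by_gap(values, max_gap):
    """Group sorted values; start a new cluster when the gap exceeds max_gap."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)   # close to the previous value: same cluster
        else:
            clusters.append([v])     # far from the previous value: new cluster
    return clusters


def find_outliers(values, max_gap, min_cluster_size):
    """Values in clusters smaller than min_cluster_size are reported as outliers."""
    outliers = []
    for cluster in cluster_by_gap(values, max_gap):
        if len(cluster) < min_cluster_size:
            outliers.extend(cluster)
    return outliers


# The price data from the binning example plus one made-up suspicious value (210).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 210]
outliers = find_outliers(prices, max_gap=10, min_cluster_size=2)
print(outliers)
```

The thresholds are a judgment call: a smaller `max_gap` splits the data into more clusters and flags more values.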
10. How to handle noisy data?
Combined computer and human inspection: The computer
detects suspicious values, and a human then checks them
to decide which are true outliers.
11. How to handle noisy data?
Regression: Data can be smoothed by fitting the
data to a function, such as a regression line.
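As a rough illustration of smoothing by regression, the sketch below fits a straight line by ordinary least squares and replaces each observed value with the value on the line. The choice of a linear function, the function names, and the sample data are assumptions for illustration. Other functions could be fitted in the same spirit.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each observed y with the fitted value on the regression line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Made-up noisy measurements that roughly follow a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
smoothed = smooth_by_regression(xs, ys)
print(smoothed)
```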
12. Inconsistent Data
Data that is inconsistent with our models should
be dealt with.
Common sense can also be used to detect this kind
of inconsistency:
The same name occurring differently in an application.
Different names that appear to be the same (Dennis vs.
Denis).
Inappropriate values (males being pregnant, or a
negative age).
A changed coding scheme (a rating that was “1, 2, 3” is now “A, B, C”).
Differences between duplicate records.
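Common-sense checks like these can be written down as explicit rules. The following is a minimal sketch under assumed conditions: the field names (`age`, `sex`, `pregnant`, `rating`), the expected 1 to 3 rating scale, and the two sample records are hypothetical and only serve to illustrate the idea.

```python
def find_inconsistencies(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    if record["age"] < 0:
        problems.append("negative age")
    if record["sex"] == "M" and record["pregnant"]:
        problems.append("male recorded as pregnant")
    if record["rating"] not in {1, 2, 3}:
        problems.append("rating outside the expected 1-3 scale")
    return problems


# Hypothetical records; the second one violates all three rules.
records = [
    {"name": "Dennis", "age": 34, "sex": "M", "pregnant": False, "rating": 2},
    {"name": "Denis", "age": -5, "sex": "M", "pregnant": True, "rating": "A"},
]
report = {r["name"]: find_inconsistencies(r) for r in records}
print(report)
```

Rules of this kind only flag records. Deciding whether "Dennis" and "Denis" are the same person still needs domain knowledge.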
13. Inconsistent Data
We want to transform all dates to the same format internally.
Some systems accept dates in many formats,
e.g. “Sep 24, 2003”, “9/24/03”, “24.09.03”, etc.
Dates are then transformed internally to a standard value.
Frequently, just the year (YYYY) is sufficient.
For more detail, we may need the month, the day, the hour,
etc.
Representing a date as YYYYMM or YYYYMMDD can be acceptable.
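The date standardisation described above might look like the sketch below, which tries a list of known input formats and emits YYYYMMDD. The list of formats and the function name are assumptions. In particular, it assumes that slash-separated dates are month-first and dot-separated dates are day-first, as in the slide's examples, and that month abbreviations are in English.

```python
from datetime import datetime

# Assumed input formats, matching the three examples on the slide.
KNOWN_FORMATS = ["%b %d, %Y", "%m/%d/%y", "%d.%m.%y"]


def to_yyyymmdd(text):
    """Parse a date written in one of the known formats and return YYYYMMDD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y%m%d")
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError(f"unrecognised date format: {text!r}")


standardised = [to_yyyymmdd(d) for d in ["Sep 24, 2003", "9/24/03", "24.09.03"]]
print(standardised)
```

A date such as 03/04/05 is genuinely ambiguous, so the order of the format list is a decision that has to be agreed with whoever supplies the data.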
15. Data Integration
Combines data from multiple sources into a coherent
store.
Increasingly, data mining projects require data
from more than one data source,
such as multiple databases, data warehouses, flat
files and historical data.
16. Data Integration
Data is stored in many systems across the enterprise
and outside the enterprise.
The sources of data fall into two categories:
Internal sources that are generated through enterprise
activities, such as databases, historical data, Web sites
and warehouses.
External sources, such as credit bureaus, phone
companies and demographic information.
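To make the idea of combining an internal and an external source concrete, the sketch below merges two lists of records on a shared key. Everything specific in it is hypothetical: the `integrate` function, the `customer_id` key, and the sample rows. Real integration work also has to resolve conflicting values and mismatched keys, which this sketch does not attempt.

```python
def integrate(internal_rows, external_rows, key):
    """Merge rows from two sources into one record per key value."""
    merged = {}
    for row in internal_rows:
        merged[row[key]] = dict(row)  # copy so the source rows stay untouched
    for row in external_rows:
        # Add external fields to an existing record, or create a new record.
        merged.setdefault(row[key], {key: row[key]}).update(row)
    return [merged[k] for k in sorted(merged)]


# Hypothetical internal source (enterprise database) and external source (credit bureau).
internal = [
    {"customer_id": 1, "name": "Dennis"},
    {"customer_id": 2, "name": "Sara"},
]
external = [
    {"customer_id": 2, "credit_score": 710},
    {"customer_id": 3, "credit_score": 640},
]
store = integrate(internal, external, "customer_id")
print(store)
```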
17. Data Integration
Data warehouse: a structure that links information
from two or more databases.
A data warehouse brings data from different data
sources into a central repository.
It performs some data integration, clean-up, and
summarization, and distributes the information to data
marts.