Data preparation and processing
Mahmoud Rafeek Alfarra
http://mfarra.cst.ps
University College of Science & Technology, Khan Younis
Development of computer systems
2016
Chapter 2 – Lecture 2
Outline
• Introduction
• Domain Expert
• Goal identification and Data Understanding
• Data Cleaning
• Missing values
• Noisy Data
• Inconsistent Data
• Data Integration
• Data Transformation
• Data Reduction
• Feature Selection
• Sampling
• Discretization
Introduction
Noisy Data
• Noise is a random error in a measured variable.
• Noisy data contains such errors, which obscure the true values and make the data hard to interpret.
• Any data that has been received, stored, or changed in such a manner that it cannot be read or used by the program that originally created it can also be described as noisy.
Noisy Data
• Sources of noisy data:
1. Data entry problems.
2. Faulty data collection instruments.
3. Data transmission errors.
How to handle noisy data?
• Binning method
• Clustering
• Combined computer and human inspection
• Regression
How to handle noisy data?
• Binning method:
1. Sort the data.
2. Partition it into equal-frequency groups (bins).
3. Smooth each group by its mean, its median, or its boundaries.
How to handle noisy data?
Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equal-frequency) groups:
-G1: 4, 8, 9, 15
-G2: 21, 21, 24, 25
-G3: 26, 28, 29, 34
Smoothing by bin means (each value is replaced by its group mean, rounded to the nearest integer):
-G1: 9, 9, 9, 9
-G2: 23, 23, 23, 23
-G3: 29, 29, 29, 29
Smoothing by bin boundaries (each value is replaced by the closer of its group's minimum and maximum):
-G1: 4, 4, 4, 15
-G2: 21, 21, 25, 25
-G3: 26, 26, 26, 34
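The binning steps and the price example above can be sketched in Python. This is a minimal sketch, not a reference implementation: the function names are illustrative, the data is assumed to divide evenly into the requested number of bins, and a value equally close to both boundaries is sent to the lower one (an assumption; the slides do not cover ties).

```python
from statistics import mean

def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins groups of equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value with its bin mean, rounded to the nearest integer."""
    return [[round(mean(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Run on the price data, both functions reproduce the smoothed groups listed on the slide.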
How to handle noisy data?
• Clustering: Outliers may be detected by clustering, where similar values are organized into groups. Values that fall outside the set of clusters may be considered outliers.
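One simple way to illustrate the idea on one-dimensional data is to group sorted values whenever neighbours are close together, then treat very small groups as outliers. This is only a sketch under stated assumptions: the gap-based grouping stands in for a real clustering algorithm, and the names `gap_clusters`, `max_gap`, and `min_size`, as well as the sample readings, are hypothetical.

```python
def gap_clusters(values, max_gap):
    """Group sorted values; a new cluster starts when the gap exceeds max_gap."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def outliers_from_clusters(clusters, min_size):
    """Values in clusters smaller than min_size are treated as outliers."""
    return [v for c in clusters if len(c) < min_size for v in c]

# Hypothetical readings, chosen only to illustrate the idea.
readings = [10, 11, 12, 13, 50, 51, 52, 200]
clusters = gap_clusters(readings, max_gap=3)
print(clusters)
print(outliers_from_clusters(clusters, min_size=2))
```

Both thresholds are judgment calls that depend on the data; in practice they would be tuned or replaced by a proper clustering method.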
How to handle noisy data?
• Combined computer and human inspection: Outliers may be identified by having the computer detect suspicious values, which a human then checks.
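The division of labour can be sketched as two steps: the program flags candidates, and a person decides what to keep. The flagging rule below (distance from the median greater than `k` times the median absolute deviation) is an assumption chosen for illustration, not the rule from the lecture; the function names and the sample ages are hypothetical.

```python
from statistics import median

def flag_suspicious(values, k=3.0):
    """Return (index, value) pairs far from the median, for human review."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    return [(i, v) for i, v in enumerate(values) if abs(v - med) > k * mad]

def apply_review(values, decisions):
    """Keep a value unless the reviewer marked its index False."""
    return [v for i, v in enumerate(values) if decisions.get(i, True)]

# Hypothetical ages containing two obviously wrong entries.
ages = [23, 25, 31, 35, 38, 41, 44, 52, -4, 230]
flagged = flag_suspicious(ages)
print(flagged)

# The reviewer's decisions are entered by hand; here both flagged values are rejected.
decisions = {index: False for index, _ in flagged}
print(apply_review(ages, decisions))
```

The program never deletes anything on its own; it only narrows down what the human has to look at.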
How to handle noisy data?
• Regression: Data can be smoothed by fitting it to a function (for example, a straight line) and replacing each observed value with the fitted value.
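As a minimal sketch, the function can be a straight line fitted by ordinary least squares. The choice of a linear model, the function names, and the sample measurements are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy measurements that roughly follow a line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(smooth(xs, ys))
```

The smoothed values lie exactly on the fitted line, so the random scatter around it is removed.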
Inconsistent Data
• Data that is inconsistent with our models should be dealt with.
• Common sense can also be used to detect this kind of inconsistency:
- The same name occurring differently within an application.
- Different names that appear to be the same (Dennis vs. Denis).
- Inappropriate values (males being pregnant, or a negative age).
- A changed coding scheme (a rating that was “1, 2, 3” is now “A, B, C”).
- Differences between duplicate records.
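Several of these common-sense checks can be written as simple rules. The sketch below is illustrative only: the record fields (`age`, `sex`, `pregnant`, `rating`), the expected rating scheme, and the similarity threshold are all assumptions, and `difflib.SequenceMatcher` is just one possible way to spot near-identical names.

```python
from difflib import SequenceMatcher

def similar_names(a, b, threshold=0.85):
    """True when two different spellings are close enough to be the same name."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return a != b and ratio >= threshold

def find_inconsistencies(record):
    """Return a list of rule violations found in one record (a dict)."""
    problems = []
    age = record.get("age")
    if age is not None and age < 0:
        problems.append("negative age")
    if record.get("sex") == "M" and record.get("pregnant"):
        problems.append("male recorded as pregnant")
    rating = record.get("rating")
    if rating is not None and rating not in {1, 2, 3}:
        problems.append("rating outside the 1-3 scheme")
    return problems

print(similar_names("Dennis", "Denis"))
print(find_inconsistencies({"age": -5, "sex": "M", "pregnant": True, "rating": "A"}))
```

Records that trigger a rule are candidates for correction, not automatic deletion.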
Inconsistent Data
• We want to transform all dates to the same internal format.
• Some systems accept dates in many formats,
• e.g. “Sep 24, 2003”, 9/24/03, 24.09.03, etc.
• Dates are transformed internally to a standard value.
• Frequently, just the year (YYYY) is sufficient.
• For more detail, we may need the month, the day, the hour, etc.
• Representing a date as YYYYMM or YYYYMMDD can be OK.
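A minimal sketch of this normalization, using Python's standard `datetime` module, is shown here. The list of accepted formats and its order are assumptions: in particular, slash-separated dates are read month-first and dot-separated dates day-first, matching the three examples above, and month names are parsed according to the current locale (English is assumed).

```python
from datetime import datetime

# Assumed input formats, tried in order: "Sep 24, 2003", "9/24/03", "24.09.03".
KNOWN_FORMATS = ["%b %d, %Y", "%m/%d/%y", "%d.%m.%y"]

def to_standard(text, out_format="%Y%m%d"):
    """Parse a date in any known format and return it as YYYYMMDD by default."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime(out_format)
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unrecognized date format: {text!r}")

for raw in ["Sep 24, 2003", "9/24/03", "24.09.03"]:
    print(raw, "->", to_standard(raw))

# When only year and month matter, ask for YYYYMM instead.
print(to_standard("9/24/03", out_format="%Y%m"))
```

All three inputs map to the same standard value, which is the point of storing dates in one internal format.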
Data Integration
[Diagram: Goal identification & Data Understanding → Data Cleaning → Data Integration → Data Transformation → Data Reduction]
Data Integration
• Combines data from multiple sources into a coherent store.
• Increasingly, data mining projects require data from more than one data source,
• such as multiple databases, data warehouses, flat files, and historical data.
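As a small illustration of combining sources into one coherent store, the sketch below merges rows from two sources on a shared key. Everything specific here is hypothetical: the key name `customer_id`, the sample rows, and the rule that a later source overwrites an earlier one when both supply the same field.

```python
def integrate(*sources, key="customer_id"):
    """Merge rows (dicts) from several sources into one row per key value."""
    merged = {}
    for source in sources:
        for row in source:
            # Later sources overwrite earlier ones on conflicting fields (an assumption).
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]

# Hypothetical rows from a database and from a flat file.
database_rows = [{"customer_id": 1, "name": "Ali"}, {"customer_id": 2, "name": "Sara"}]
flat_file_rows = [{"customer_id": 2, "city": "Gaza"}, {"customer_id": 3, "city": "Rafah"}]

for row in integrate(database_rows, flat_file_rows):
    print(row)
```

Real integration also has to reconcile differing field names, units, and duplicate entities, which this sketch leaves out.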
Data Integration
• Data is stored in many systems, both across the enterprise and outside it.
The sources of data fall into two categories:
• Internal sources, generated through enterprise activities, such as databases, historical data, Web sites, and warehouses.
• External sources, such as credit bureaus, phone companies, and demographic information.
Data Integration
• Data warehouse: a structure that links information from two or more databases.
• A data warehouse brings data from different data sources into a central repository.
• It performs some data integration, clean-up, and summarization, and distributes the information to data marts.
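The warehouse flow described above (integrate, clean up, summarize, distribute to data marts) can be mimicked in a few lines. This is a toy sketch, not how a real warehouse is built: the field names `department` and `amount`, the clean-up rule (drop rows with a missing amount), and the sample rows are all assumptions.

```python
from collections import defaultdict

def load_warehouse(*sources):
    """Integrate rows from all sources; clean up by dropping rows with no amount."""
    return [row for source in sources for row in source
            if row.get("amount") is not None]

def summarize(warehouse, by="department"):
    """Total the amounts per group."""
    totals = defaultdict(float)
    for row in warehouse:
        totals[row[by]] += row["amount"]
    return dict(totals)

def build_marts(warehouse, by="department"):
    """Distribute warehouse rows into one data mart per group."""
    marts = defaultdict(list)
    for row in warehouse:
        marts[row[by]].append(row)
    return dict(marts)

# Hypothetical rows from two operational databases.
sales_db = [{"department": "sales", "amount": 100.0},
            {"department": "sales", "amount": None}]
support_db = [{"department": "support", "amount": 40.0},
              {"department": "sales", "amount": 60.0}]

warehouse = load_warehouse(sales_db, support_db)
print(summarize(warehouse))
print(sorted(build_marts(warehouse)))
```

Each data mart then holds only the slice of the central repository that one group of users needs.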
Data Integration
Next:
Data Cleaning: Noisy Data
