Data preparation and processing
Mahmoud Rafeek Alfarra
http://mfarra.cst.ps
University College of Science & Technology- Khan yonis
Development of computer systems
2016
Chapter 2 – Lecture 1
Outline
 Introduction
 Domain Expert
 Goal identification and Data Understanding
 Data Cleaning
 Missing values
 Noisy Data
 Inconsistent Data
 Data Integration
 Data Transformation
 Data Reduction
 Feature Selection
 Sampling
 Discretization
Introduction
 Real-world databases used in data mining may have
millions of records and thousands of variables. They are
typically noisy and have missing and inconsistent values.
Data quality is a key issue in data mining, so data
preparation is a necessary step for serious, effective,
real-world data mining.
To increase the accuracy of the mining results, data
preprocessing has to be performed.
Otherwise: garbage in => garbage out.
Data preparation is estimated to take 70-80% of the
time and effort.
Domain Expertise
 Data quality expert: “We found these strange records
in your database after running sophisticated
algorithms!”
 Domain Experts: “Oh, those apples - we put them
in the same baskets as oranges because there are too
few apples to bother. Not a big deal. We knew that
already.”
Domain Expertise
Domain expertise is important for understanding the
data and the problem, and for interpreting the results. For example:
“The counter resets to 0 if the number of calls exceeds N.”
“The missing values are represented by 0, but the default billed
amount is 0 too.”
Insufficient domain expertise is a primary cause of
poor data quality: the data become unusable.
Goal Identification
 To obtain the highest benefit from data mining, there
must be a clear statement of the business objectives.
 The first and most important step in any
targeting-model project is to establish a clear goal and
develop a process to achieve that goal.
Goal Identification
 Examples of goals for a business:
 You want to attract new customers
 You want to avoid high-risk customers
 You want to understand the characteristics of your current customers
 You want to make your unprofitable customers more profitable
 You want to retain your profitable customers
 You want to win back your lost customers
 You want to improve customer satisfaction
 You want to increase sales
 You want to reduce expenses
Data Understanding
 Data understanding starts with an initial data collection
and proceeds with activities to get familiar with the data,
to identify data quality problems, and to discover first
insights into the data.
Data Understanding
Data Understanding: Relevance:
 What data is available for the task?
 Is this data relevant?
 Is additional relevant data available?
 How much historical data is available?
 Who is the data expert?
Data Understanding
Data Understanding: Quantity
 Number of instances (records)
    Rule of thumb: 5,000 or more desired
    If fewer, results are less reliable
 Number of attributes (fields)
    Rule of thumb: for each field, 10 or more instances
    If there are more fields, use feature reduction and selection
 Number of targets
    Rule of thumb: >100 for each class
    If very unbalanced, use stratified sampling
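The rules of thumb above can be checked mechanically before any mining starts. The sketch below is only an illustration, assuming each record is a Python dict and that one key holds the class label; the function names `quantity_warnings` and `stratified_sample`, and their parameters, are hypothetical and not part of the lecture. The stratified sampler draws the same fraction from every class, which is one common way to keep rare classes represented.

```python
import random
from collections import Counter, defaultdict


def quantity_warnings(records, target, min_instances=5000, per_field=10, per_class=100):
    """Compare a dataset against the three rules of thumb and list any shortfalls."""
    warnings = []
    n_instances = len(records)
    # Every key except the target counts as an attribute (field).
    n_fields = len(records[0]) - 1 if records else 0
    if n_instances < min_instances:
        warnings.append(f"{n_instances} instances, fewer than {min_instances}")
    if n_instances < per_field * n_fields:
        warnings.append(f"{n_fields} fields need at least {per_field * n_fields} instances")
    for label, count in sorted(Counter(r[target] for r in records).items()):
        if count <= per_class:
            warnings.append(f"class {label!r} has {count} instances, not more than {per_class}")
    return warnings


def stratified_sample(records, target, fraction, seed=0):
    """Draw the same fraction of records from every class (at least one per class)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_class = defaultdict(list)
    for r in records:
        by_class[r[target]].append(r)
    sample = []
    for rows in by_class.values():
        k = max(1, round(fraction * len(rows)))
        sample.extend(rng.sample(rows, k))
    return sample


# Toy data: 1,000 records, one field "x", and a 10% minority class.
records = [{"x": i, "cheat": "Yes" if i % 10 == 0 else "No"} for i in range(1000)]
for line in quantity_warnings(records, "cheat"):
    print(line)
sample = stratified_sample(records, "cheat", 0.1)
print(Counter(r["cheat"] for r in sample))
```

The thresholds are passed as parameters because they are rules of thumb, not fixed limits.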
Data Cleaning
[Figure: data preparation steps] Goal identification & Data Understanding => Data Cleaning => Data Integration => Data Transformation => Data Reduction
Data Cleaning
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     (empty)         125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        -95k            Yes
6    No      Married         60K             No
7    Yes     (empty)         220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Columns are attributes; rows are objects.
Data Cleaning
 Real-world data tends to be incomplete, noisy and
inconsistent.
 Data Cleaning Steps
 Missing values
 Noisy Data
 Inconsistent Data
Missing values
 A missing value (Mv) is an empty cell in the table
that represents a dataset.
[Figure: a data table whose rows are instances and whose columns are attributes; a missing cell is marked "?"]
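To make the definition concrete, the sketch below encodes the table from the Data Cleaning slide and locates its empty cells. It is an illustration only: representing an empty cell as `None`, storing income as a number of thousands, and the names `COLUMNS`, `ROWS`, `missing_cells` and `missing_per_attribute` are all assumptions made for this example.

```python
# The table from the Data Cleaning slide. None marks an empty cell.
# Taxable income is in thousands; -95 is copied as it appears on the slide.
COLUMNS = ("tid", "refund", "marital_status", "taxable_income", "cheat")
ROWS = [
    (1, "Yes", None, 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", -95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", None, 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]


def missing_cells(rows, columns):
    """Return (row_index, column_name) for every empty cell."""
    return [
        (i, columns[j])
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if value is None
    ]


def missing_per_attribute(rows, columns):
    """Count the empty cells in each column."""
    counts = {name: 0 for name in columns}
    for _, name in missing_cells(rows, columns):
        counts[name] += 1
    return counts


print(missing_cells(ROWS, COLUMNS))
print(missing_per_attribute(ROWS, COLUMNS))
```

Counting missing values per attribute is a useful first look, since the right treatment often depends on how many values an attribute is missing.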
Dealing with missing values
1. Ignore records with missing values:
 This is usually done when the class label is missing.
 This method is not very effective unless the record
contains several attributes with missing values.
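A minimal sketch of this first option, assuming records are dicts with `None` for an empty cell. The function name `drop_incomplete` and the threshold parameter `max_missing` are hypothetical; the slide does not say how many missing attributes count as "several".

```python
def drop_incomplete(records, class_attr, max_missing=1):
    """Keep records that have a class label and at most max_missing empty attributes."""
    kept = []
    for r in records:
        # Count empty cells among the non-class attributes.
        n_missing = sum(1 for key, value in r.items() if key != class_attr and value is None)
        if r[class_attr] is not None and n_missing <= max_missing:
            kept.append(r)
    return kept


records = [
    {"refund": "Yes", "status": None, "income": 125, "cheat": "No"},     # one gap: kept
    {"refund": None, "status": None, "income": 100, "cheat": "No"},      # two gaps: dropped
    {"refund": "No", "status": "Single", "income": 70, "cheat": None},   # no label: dropped
]
print(drop_incomplete(records, "cheat"))
```

Dropping records shrinks the dataset, so it is worth comparing the number of records before and after.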
Dealing with missing values
2. Fill in the missing value manually:
In general, this approach is time-consuming and may not be
feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value:
Replace all missing values with the same constant, such as
“unknown”. Although this method is simple, it is not
recommended, because results involving “unknown” values are
not interesting.
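Filling with a constant takes only a few lines. The sketch below assumes records are dicts with `None` for an empty cell; the function name `fill_with_constant` is hypothetical.

```python
def fill_with_constant(records, attribute, constant="unknown"):
    """Return copies of the records with empty cells in one attribute set to a constant."""
    return [
        {**r, attribute: constant if r[attribute] is None else r[attribute]}
        for r in records
    ]


records = [
    {"tid": 1, "marital_status": None},
    {"tid": 2, "marital_status": "Married"},
]
print(fill_with_constant(records, "marital_status"))
```

After this step every filled record shares the value "unknown", so any pattern a mining algorithm finds on that value reflects the filling, not the data.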
Dealing with missing values
4. Use the attribute mean to fill in missing values:
For example, if the mean of the attribute income is 28,000,
use this value to replace the missing income values.
5. Use the attribute mean for all samples belonging to the
same class:
For example, if classifying customers according to credit risk,
replace the missing value with the mean income of the
customers in the same credit risk category as the given
record.
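Options 4 and 5 differ only in which records the mean is taken over. The sketch below shows both, assuming records are dicts with `None` for an empty cell; the function names and the toy incomes are made up for illustration. Falling back to the overall mean when a class has no observed value is an added assumption, not something the slide specifies.

```python
from collections import defaultdict
from statistics import mean


def fill_with_mean(records, attribute):
    """Option 4: replace empty cells with the mean of the observed values."""
    overall = mean(r[attribute] for r in records if r[attribute] is not None)
    return [{**r, attribute: overall if r[attribute] is None else r[attribute]} for r in records]


def fill_with_class_mean(records, attribute, class_attr):
    """Option 5: replace empty cells with the mean of the record's own class."""
    observed = defaultdict(list)
    for r in records:
        if r[attribute] is not None:
            observed[r[class_attr]].append(r[attribute])
    class_means = {label: mean(values) for label, values in observed.items()}
    # Assumption: use the overall mean if a class has no observed value at all.
    overall = mean(v for values in observed.values() for v in values)
    return [
        {**r, attribute: class_means.get(r[class_attr], overall)} if r[attribute] is None else dict(r)
        for r in records
    ]


customers = [
    {"income": 20000, "risk": "low"},
    {"income": 30000, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 40000, "risk": "high"},
    {"income": None, "risk": "high"},
]
print(fill_with_mean(customers, "income")[2])                # income 30000
print(fill_with_class_mean(customers, "income", "risk")[2])  # income 25000
```

In this toy data the overall mean is 30,000, while the "low" risk class mean is 25,000, so the two options fill the same cell differently.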
Dealing with missing values
6. Use an advanced method,
such as k-nearest neighbors or a decision tree, to
predict the missing value from the other attribute values.
Dealing with missing values
k nearest neighbors Approach
Compute the k nearest neighbors and assign a value
from them.
Dealing with missing values
k nearest neighbors Approach
 For nominal values, use the most common value
among all neighbors.
 For numerical values, use the average value.
 This requires a proximity measure between
instances, such as Euclidean distance.
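The k nearest neighbors approach can be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes records are dicts, that the attributes used for the distance are numeric and complete, and that `None` marks the cell to fill. The names `euclidean` and `knn_impute` and the toy data are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b, features):
    """Euclidean distance between two records over the given numeric attributes."""
    return math.sqrt(sum((a[f] - b[f]) ** 2 for f in features))


def knn_impute(records, target, features, k=3):
    """Fill empty cells of `target` from the k nearest records that have a value."""
    complete = [r for r in records if r[target] is not None]
    filled = []
    for r in records:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        neighbors = sorted(complete, key=lambda c: euclidean(r, c, features))[:k]
        values = [n[target] for n in neighbors]
        if all(isinstance(v, (int, float)) for v in values):
            value = sum(values) / len(values)             # numerical: average
        else:
            value = Counter(values).most_common(1)[0][0]  # nominal: most common
        filled.append({**r, target: value})
    return filled


people = [
    {"age": 25, "income": 30.0, "status": "Single"},
    {"age": 27, "income": 34.0, "status": "Single"},
    {"age": 50, "income": 80.0, "status": "Married"},
    {"age": 52, "income": 90.0, "status": "Married"},
    {"age": 26, "income": None, "status": None},
]
print(knn_impute(people, "income", ["age"], k=2)[-1]["income"])  # 32.0
print(knn_impute(people, "status", ["age"], k=2)[-1]["status"])  # Single
```

When several attributes with different scales enter the distance, they would normally be normalized first so that no single attribute dominates; that step is left out here to keep the sketch short.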
Next:
Data Cleaning: Noisy Data
