Data preparation and processing
Mahmoud Rafeek Alfarra
http://mfarra.cst.ps
University College of Science & Technology- Khan Younis
Development of computer systems
2016
Chapter 2 – Lecture 1
Outline
 Introduction
 Domain Expert
 Goal identification and Data Understanding
 Data Cleaning
 Missing values
 Noisy Data
 Inconsistent Data
 Data Integration
 Data Transformation
 Data Reduction
 Feature Selection
 Sampling
 Discretization
Introduction
 Real-world databases typically used in data mining may have millions of records and thousands of variables. They are noisy and have missing and inconsistent values.
 Data quality is a key issue in data mining, so data preparation is a necessary step for serious, effective, real-world data mining.
Introduction
 To increase the accuracy of mining, we have to perform data preprocessing. Otherwise: garbage in => garbage out.
 Data preparation is estimated to take 70-80% of the time and effort.
Domain Expertise
 Data quality expert: “We found these strange records
in your database after running sophisticated
algorithms!”
 Domain Experts: “Oh, those apples - we put them
in the same baskets as oranges because there are too
few apples to bother. Not a big deal. We knew that
already.”
Domain Expertise
Domain expertise is important for understanding the data and the problem, and for interpreting the results. For example:
“The counter resets to 0 if the number of calls exceeds N.”
“The missing values are represented by 0, but the default billed amount is 0 too.”
Insufficient domain expertise is a primary cause of poor data quality: the data become unusable.
Goal Identification
 To obtain the highest benefit from data mining, there
must be a clear statement of the business objectives.
 The first and most important step in any targeting-
model project is to establish a clear goal and develop a
process to achieve that goal.
Goal Identification
 Examples of goals for a business company:
 You want to attract new customers.
 You want to avoid high-risk customers.
 You want to understand the characteristics of your current customers.
 You want to make your unprofitable customers more profitable.
 You want to retain your profitable customers.
 You want to win back your lost customers.
 You want to improve customer satisfaction.
 You want to increase sales.
 You want to reduce expenses.
Data Understanding
 Starts with an initial data collection and proceeds with activities that aim to get familiar with the data, to identify data quality problems, and to discover first insights into the data.
Data Understanding
Data Understanding: Relevance
 What data is available for the task?
 Is this data relevant?
 Is additional relevant data available?
 How much historical data is available?
 Who is the data expert?
Data Understanding
Data Understanding: Quantity
 Number of instances (records)
 Rule of thumb: 5,000 or more desired; if fewer, results are less reliable.
 Number of attributes (fields)
 Rule of thumb: 10 or more instances for each field; if there are more fields, use feature reduction and selection.
 Number of targets
 Rule of thumb: >100 for each class; if very unbalanced, use stratified sampling.
Data Cleaning
[Process: Goal Identification & Data Understanding → Data Cleaning → Data Integration → Data Transformation → Data Reduction]
Data Cleaning
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     ?               125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        -95K            Yes
6    No      Married         60K             No
7    Yes     ?               220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
(Rows are objects; columns are attributes. Note the missing Marital Status in rows 1 and 7, and the noisy income value -95K in row 5.)
Data Cleaning
 Real-world data tends to be incomplete, noisy and
inconsistent.
 Data Cleaning Steps
 Missing values
 Noisy Data
 Inconsistent Data
Missing values
 A missing value (MV) is an empty cell in the table that represents a dataset.
[Figure: a dataset table with “?” marking an empty cell; rows are instances, columns are attributes.]
Dealing with missing values
1. Ignore records with missing values:
 This is usually done when the class label is missing.
 This method is not effective unless the record contains several attributes with missing values.
Dealing with missing values
2. Fill in the missing value manually:
In general, this approach is time-consuming and may not be feasible for a large data set with many missing values.
3. Use a global constant to fill in the missing value:
Replace all missing values with the same constant, such as “unknown”. Although this method is simple, it is not recommended, because results built on “unknown” values are not interesting.
Dealing with missing values
4. Use the attribute mean to fill missing values:
For example, for the income attribute, if the mean income is 28,000, use this value to replace the missing values.
5. Use the attribute mean for all samples belonging to the same class:
For example, if classifying customers according to credit risk, replace the missing value with the mean income of customers in the same credit risk category as the given record.
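As a concrete illustration of methods 4 and 5, here is a minimal sketch using pandas; the column names and values are hypothetical, not from the lecture's data set.

```python
import numpy as np
import pandas as pd

# Hypothetical customer records; NaN marks a missing income.
df = pd.DataFrame({
    "income": [28000.0, np.nan, 31000.0, np.nan, 26000.0, 45000.0],
    "credit_risk": ["low", "low", "low", "high", "high", "high"],
})

# Method 4: fill missing values with the overall attribute mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Method 5: fill missing values with the mean of the same class.
df["income_class_mean"] = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean()))
print(df)
```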
Dealing with missing values
6. Use an advanced method:
Use a method such as the k-nearest neighbors formalism or a decision tree to predict the missing value from the other values.
Dealing with missing values
k nearest neighbors Approach
Compute the k nearest neighbors and assign a value
from them.
Dealing with missing values
k nearest neighbors Approach
 For nominal values, use the most common value among all neighbors.
 For numerical values, use the average value.
 Indeed, we need to define a proximity measure between instances, such as the Euclidean distance.
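The following sketch makes the idea concrete with plain NumPy, so the Euclidean proximity measure is explicit; the small matrix and the choice k=2 are illustrative assumptions.

```python
import numpy as np

def knn_impute(X, row, col, k=2):
    """Fill X[row, col] with the mean of that column over the k rows
    nearest to `row`, measured by Euclidean distance on the other columns."""
    donors = [i for i in range(len(X)) if i != row and not np.isnan(X[i, col])]
    feats = [j for j in range(X.shape[1]) if j != col]
    dists = [np.linalg.norm(X[i, feats] - X[row, feats]) for i in donors]
    nearest = [donors[i] for i in np.argsort(dists)[:k]]
    return X[nearest, col].mean()

X = np.array([[1.00, 2.00, 10.0],
              [1.10, 2.10, 12.0],
              [5.00, 5.00, 50.0],
              [1.05, 2.05, np.nan]])   # last row has a missing numeric value
X[3, 2] = knn_impute(X, row=3, col=2)
print(X[3, 2])   # 11.0, the mean of the two closest rows' values
```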
Next:
Data Cleaning: Noisy Data
Data preparation and processing
Mahmoud Rafeek Alfarra
http://mfarra.cst.ps
University College of Science & Technology- Khan Younis
Development of computer systems
2016
Chapter 2 – Lecture 2
Outline
 Introduction
 Domain Expert
 Goal identification and Data Understanding
 Data Cleaning
 Missing values
 Noisy Data
 Inconsistent Data
 Data Integration
 Data Transformation
 Data Reduction
 Feature Selection
 Sampling
 Discretization
Introduction
 Noise is a random error in a measured variable.
 Noisy data is meaningless data.
 Any data that has been received, stored, or changed in such a manner that it cannot be read or used by the program that originally created it can be described as noisy.
Noisy Data
 Sources of noisy data:
1. Data entry problems.
2. Faulty data collection instruments.
3. Data transmission errors.
How to handle noisy data?
 Binning
 Clustering
 Combined computer and human inspection
 Regression
How to handle noisy data?
 Binning method:
1. Sort the data.
2. Partition it into equal-frequency groups.
3. Smooth by group means, by group medians, or by group boundaries, etc.
How to handle noisy data?
Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into equal-frequency groups:
-G1: 4, 8, 9, 15
-G2: 21, 21, 24, 25
-G3: 26, 28, 29, 34
Smoothing by bin means:
-G1: 9, 9, 9, 9
-G2: 23, 23, 23, 23
-G3: 29, 29, 29, 29
Smoothing by bin boundaries:
-G1: 4, 4, 4, 15
-G2: 21, 21, 25, 25
-G3: 26, 26, 26, 34
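The same smoothing can be reproduced in a few lines of Python; this is a sketch under the assumptions of the example above (equal-frequency groups of four values).

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
groups = [prices[i:i + 4] for i in range(0, len(prices), 4)]  # equal frequency

# Smoothing by means: every value becomes its group's (rounded) mean.
by_means = [[round(sum(g) / len(g))] * len(g) for g in groups]

# Smoothing by boundaries: each value snaps to the nearer group boundary.
by_bounds = [[g[0] if v - g[0] <= g[-1] - v else g[-1] for v in g] for g in groups]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```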
How to handle noisy data?
 Clustering: Outliers may be detected by clustering, where similar values are organized into groups; values that fall outside the set of clusters may be considered outliers.
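One way to sketch this (one possible approach, not the lecture's prescribed one) is to cluster the values with k-means and treat very small clusters as outlier groups; the data, the number of clusters, and the 20% size cutoff are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

values = np.array([[4.], [8.], [9.], [15.], [21.], [24.], [25.], [26.], [28.], [95.]])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)

# Clusters that captured very few points fall "outside the set of clusters".
sizes = np.bincount(km.labels_)
outlier_clusters = np.where(sizes < 0.2 * len(values))[0]
print(values[np.isin(km.labels_, outlier_clusters)].ravel())  # expect [95.]
```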
How to handle noisy data?
 Combined computer and human inspection: Suspicious values may be detected by the computer and then checked by a human.
How to handle noisy data?
 Regression: Data can be smoothed by fitting the
data to a function.
Inconsistent Data
 Data which is inconsistent with our models should be dealt with.
 Common sense can also be used to detect such kinds of inconsistency:
The same name occurring differently in an application.
Different names appearing to be the same (Dennis vs. Denis).
Inappropriate values (males being pregnant, or a negative age).
Codes changing over time (was rating “1, 2, 3”, now rating “A, B, C”).
Differences between duplicate records.
Inconsistent Data
 We want to transform all dates to the same format internally.
 Some systems accept dates in many formats
 e.g. “Sep 24, 2003”, 9/24/03, 24.09.03, etc.
 Dates are transformed internally to a standard value.
 Frequently, just the year (YYYY) is sufficient.
 For more details, we may need the month, the day, the hour, etc.
 Representing the date as YYYYMM or YYYYMMDD can be OK.
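A small sketch of such internal standardization: each accepted input format is tried in turn, and the date is stored as YYYYMMDD. The list of accepted formats is an assumption for illustration.

```python
from datetime import datetime

FORMATS = ["%b %d, %Y", "%m/%d/%y", "%d.%m.%y"]  # "Sep 24, 2003", 9/24/03, 24.09.03

def to_yyyymmdd(text: str) -> str:
    """Parse a date in any accepted format; return the standard YYYYMMDD form."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

for raw in ["Sep 24, 2003", "9/24/03", "24.09.03"]:
    print(to_yyyymmdd(raw))  # all three print 20030924
```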
Data Integration
[Process: Goal Identification & Data Understanding → Data Cleaning → Data Integration → Data Transformation → Data Reduction]
Data Integration
 Data integration combines data from multiple sources into a coherent store.
 Increasingly, data mining projects require data from more than one source, such as multiple databases, data warehouses, flat files, and historical data.
Data Integration
 Data is stored in many systems across the enterprise and outside it.
The sources of data fall into two categories:
 Internal sources that are generated through enterprise activities, such as databases, historical data, Web sites, and warehouses.
 External sources, such as credit bureaus, phone companies, and demographic information.
Data Integration
 Data warehouse: a structure that links information from two or more databases.
 A data warehouse brings data from different data sources into a central repository.
 It performs some data integration, clean-up, and summarization, and distributes the information to data marts.
Next:
Data Transformation
Data preparation and processing
Mahmoud Rafeek Alfarra
http://mfarra.cst.ps
University College of Science & Technology- Khan Younis
Development of computer systems
2016
Chapter 2 – Lecture 3
Outline
 Introduction
 Domain Expert
 Goal identification and Data Understanding
 Data Cleaning
 Missing values
 Noisy Data
 Inconsistent Data
 Data Integration
 Data Transformation
 Data Reduction
 Feature Selection
 Sampling
 Discretization
Data Transformation
 Definition 1: Transform the data into a form appropriate for the given data mining method.
 Definition 2: Data transformation is the process of converting data or information from one format to another, usually from the format of a source system into the required format of a new destination system.
Data Transformation
 Methods include:
 Smoothing
 Aggregation
 Generalization
 Normalization (min-max)
Data Transformation
Methods of Data Transformation
 Normalization: the attributes are scaled so as to fall within a small specified range, such as -1.0 to 1.0.
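For min-max normalization, each value v of an attribute is mapped to v' = (v - min) / (max - min) * (new_max - new_min) + new_min. A minimal sketch, with made-up income values:

```python
def min_max(values, new_min=-1.0, new_max=1.0):
    """Scale values linearly so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

incomes = [12000, 28000, 45000, 98000]
print(min_max(incomes))  # the smallest maps to -1.0, the largest to 1.0
```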
Next:
Data Reduction
Data preparation and processing
Mahmoud Rafeek Alfarra
http://mfarra.cst.ps
University College of Science & Technology- Khan Younis
Development of computer systems
2016
Chapter 2 – Lecture 4
Outline
 Introduction
 Domain Expert
 Goal identification and Data Understanding
 Data Cleaning
 Missing values
 Noisy Data
 Inconsistent Data
 Data Integration
 Data Transformation
 Data Reduction
 Feature Selection
 Sampling
 Discretization
Introduction
[Process: Goal Identification & Data Understanding → Data Cleaning → Data Integration → Data Transformation → Data Reduction]
Data Reduction (Selection)
 A warehouse may store terabytes of data, so complex data analysis/mining may take a very long time to run on the complete data set.
 Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Data Reduction
 The choice of data representation, and the selection, reduction, or transformation of features, is probably the most important issue determining the quality of a data-mining solution.
Data Reduction
 The three basic operations in a data-reduction
process are:
 Delete a column (feature selection).
 Delete a row (sampling).
 Reduce the number of values in a column (discretization).
Feature Selection
 We want to choose features (attributes) that are
relevant to our data-mining application in order to
achieve maximum performance with the minimum
measurement and processing effort.
Feature Selection
1. Redundant features
 Duplicate much or all of the information contained in
one or more other attributes
 E.g., purchase price of a product and the amount of
sales tax paid.
Feature Selection
2. Irrelevant features
 Contain no information that is useful for the data
mining task at hand.
E.g., students' ID is often irrelevant to the task of
predicting students' GPA.
Feature Selection
3. Selecting the most relevant fields
 If there are too many fields, select a subset that is most relevant.
 The top N fields can be selected using some computed relevance measure.
 What is a good N?
 Rule of thumb: keep the top 50 fields.
Feature Selection
 Two types of feature selection:
 Unsupervised: reduce fields without knowing the class label.
 Supervised: select fields with respect to the class label.
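As an illustration of the supervised flavor, the sketch below uses scikit-learn's SelectKBest to score each field against the class label and keep the top N; the synthetic data and the choice N = 2 are assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                          # class label
informative = y[:, None] + rng.normal(0, 0.3, (200, 2))   # fields related to y
noise = rng.normal(size=(200, 3))                         # irrelevant fields
X = np.hstack([informative, noise])

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(selector.get_support())  # True for the two informative columns
```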
Sampling
 Sampling: obtaining a small sample s to represent the whole data set N.
 Sampling allows a mining algorithm to run with a complexity that is potentially sub-linear in the size of the data.
Sampling
 Key principle: choose a representative subset of the data.
 Simple random sampling may perform very poorly in the presence of skew.
 Hence, adaptive sampling methods, e.g., stratified sampling, have been developed.
Sampling
[Figure: the same data set drawn at sample sizes of 8000, 2000, and 500 points.]
Types of Sampling
 Sampling without replacement:
 Once an object is selected, it is removed from the population.
 Sampling with replacement:
 A selected object is not removed from the population.
 Stratified sampling:
 Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the data)
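A sketch of the three schemes using only Python's standard library; the data and the 10% sampling fraction are illustrative assumptions.

```python
import random
from collections import defaultdict

data = ([("young", i) for i in range(40)]
        + [("middle", i) for i in range(70)]
        + [("senior", i) for i in range(20)])

no_repl = random.sample(data, k=13)      # without replacement: no duplicates
with_repl = random.choices(data, k=13)   # with replacement: duplicates possible

# Stratified: partition by age group, draw ~10% from each partition.
strata = defaultdict(list)
for record in data:
    strata[record[0]].append(record)
stratified = [r for group in strata.values()
              for r in random.sample(group, k=max(1, len(group) // 10))]
print(len(no_repl), len(with_repl), len(stratified))  # 13 13 13
```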
Types of Sampling
[Figures: raw data sampled without replacement; raw data sampled with replacement; raw data versus a cluster/stratified sample.]
Types of Sampling (stratified sampling example)
Original data (Age): Young ×4, Middle-age ×7, Senior ×2
Stratified sample (about half of each group): Young ×2, Middle-age ×4, Senior ×1
Discretization
 Discretization is very useful for generating a summary of data; it is also called “binning”.
 It does not use the class information.
 Suppose we have the following set of values for the attribute AGE: 0, 4, 12, 16, 16, 18, 24, 26, 28.
 Two possible ways in which binning can be applied are equi-width binning and equi-frequency binning.
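A sketch applying both schemes to the AGE values above; the choice of three bins is an assumption.

```python
ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]   # already sorted
k = 3

# Equi-width: split the range [0, 28] into k intervals of equal width.
lo, hi = min(ages), max(ages)
width = (hi - lo) / k
equi_width = [[a for a in ages
               if lo + i * width <= a and (a < lo + (i + 1) * width or i == k - 1)]
              for i in range(k)]

# Equi-frequency: each bin receives the same number of values.
n = len(ages) // k
equi_freq = [ages[i * n:(i + 1) * n] for i in range(k)]

print(equi_width)  # [[0, 4], [12, 16, 16, 18], [24, 26, 28]]
print(equi_freq)   # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]
```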
Next:
Practical Part
