Data Preparation
Data Collection for Mining
• Data mining requires collecting a great amount of data
(available in warehouses or databases) to achieve the
intended objective.
– Data mining starts by understanding the business or problem
domain in order to gain business knowledge
• Business knowledge guides the process towards useful
results and enables the recognition of which results are useful.
– Based on the business knowledge, data related to the business
problem are identified in the database/data warehouse for
mining.
• Before feeding data to DM we have to ensure the
quality of the data.
Data Quality Measures
• Well-accepted multidimensional data quality measures
include the following:
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Interpretability
• Most real-world data is of poor quality, i.e.:
– Incomplete, Inconsistent, Noisy, Invalid, Redundant, …
Data is often of low quality
• Collecting the required data is challenging
– In addition to being heterogeneous and distributed,
real-world data is dirty and of low quality.
• Why?
– You didn’t collect it yourself!
– It was probably created for some other use, and then
you came along wanting to integrate it
– People make mistakes (typos)
– People are busy (“this is good enough”) and do not
systematically organize data using structured formats
Types of problems with data
• Some data have problems on their own that need to be
cleaned:
– Outliers: misleading data that do not fit most of the data/facts
– Missing data: attribute values may be absent and need to be
replaced with estimates
– Irrelevant data: attributes in the database that may not be of
interest to the DM task being developed
– Noisy data: attribute values that may be invalid or incorrect, e.g.,
typographical errors
– Inconsistent data, duplicate data, etc.
• Other data are problematic only when we want to integrate them
– Everyone had their own way of structuring and formatting data,
based on what was convenient for them
– How do we integrate data organized in different formats, following
different conventions?
Major Tasks in Data Preprocessing
• Data cleansing: to get rid of bad data
– Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
• Data integration
– Integration of data from multiple sources, such as databases,
data warehouses, or files
• Data reduction: obtains a reduced representation of the data
set that is much smaller in volume, yet produces almost the
same results.
– Dimensionality reduction
– Numerosity/size reduction
– Data compression
• Data transformation
– Normalization
– Discretization and/or Concept hierarchy generation
Case study: Government Agency Data
What we want:
ID | Name                       | City        | State
1  | Ministry of Transportation | Addis Ababa | Addis Ababa
2  | Ministry of Finance        | Addis Ababa | Addis Ababa
3  | Office of Foreign Affairs  | Addis Ababa | Addis Ababa
Data Cleaning: Redundancy
• Duplicate or redundant data are data problems that
require data cleaning
• What’s wrong here?
• How to clean it: manually or automatically?
ID | Name                       | City        | State
1  | Ministry of Transportation | Addis Ababa | Addis Ababa
2  | Ministry of Finance        | Addis Ababa | Addis Ababa
3  | Ministry of Finance        | Addis Ababa | Addis Ababa
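A minimal sketch of the automatic option using pandas (the DataFrame contents mirror the toy table above; rows with the same Name/City/State are treated as duplicates regardless of ID):

```python
import pandas as pd

# Toy version of the agency table above (hypothetical data).
df = pd.DataFrame({
    "ID":    [1, 2, 3],
    "Name":  ["Ministry of Transportation", "Ministry of Finance", "Ministry of Finance"],
    "City":  ["Addis Ababa"] * 3,
    "State": ["Addis Ababa"] * 3,
})

# Drop rows that repeat the same Name/City/State, keeping the first occurrence.
deduped = df.drop_duplicates(subset=["Name", "City", "State"], keep="first")
print(deduped)
```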
Data Cleaning: Incomplete (Missing) Data
• Incomplete data:
– lacking certain attributes of interest
– containing only aggregate data
• e.g., a traffic police accident report that records only how
many accidents occurred on a given day in a given sub-city
– Data is not always available; attribute values may be missing,
e.g., Occupation=“ ”
• many tuples have no recorded value for several attributes,
such as customer income in sales data
• What’s wrong here? A missing required field
ID | Name                       | City        | State
1  | Ministry of Transportation | Addis Ababa | Addis Ababa
2  | Ministry of Finance        | ?           | Addis Ababa
3  | Office of Foreign Affairs  | Addis Ababa | Addis Ababa
Data Cleaning: Incomplete (Missing) Data
• Missing data may be due to
– values that were inconsistent with other recorded data and
thus deleted
– data not entered due to misunderstanding, or not considered
important at the time of entry
– history or changes of the data not being registered
• How to handle missing data? Missing data may need to be
inferred
– Ignore the missing value: not effective when the percentage
of missing values per attribute varies considerably
– Fill in the missing value manually: tedious + infeasible?
– Fill in automatically
• e.g., calculate the most probable value using the
Expectation Maximization (EM) algorithm
Predict missing value using EM
• Solves estimation with incomplete data.
– Obtain an initial estimate for the parameter (e.g., using the mean value).
– Use the estimate to calculate values for the missing data, &
– Iterate until convergence (|μi − μi+1| ≤ θ).
• E.g.: out of six data items, given the known values {1, 5, 10, 4},
estimate the two missing data items.
– Let EM converge when two successive estimates differ by at most
0.05, and let our initial guess for the two missing values be 3.
– Each iteration fills the missing items with the current mean and
then recomputes it: (1+5+10+4+3+3)/6 = 4.33, then 4.78, 4.93, ≈4.97.
• The algorithm stops since the last two estimates are only 0.05 apart.
• Thus, our estimate for the two missing items is ≈4.97.
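A minimal Python sketch of this iterative mean imputation, a one-dimensional special case of EM, using the example's values (θ = 0.05, initial guess 3):

```python
# Iterative mean imputation: a simple one-dimensional special case of EM.
known = [1, 5, 10, 4]   # observed values
n_missing = 2           # number of missing items
estimate = 3.0          # initial guess for each missing value
theta = 0.05            # convergence threshold

while True:
    # Fill the missing items with the current estimate, then
    # re-estimate the mean over all six items.
    new_estimate = (sum(known) + n_missing * estimate) / (len(known) + n_missing)
    converged = abs(new_estimate - estimate) <= theta
    estimate = new_estimate
    if converged:
        break

print(estimate)  # ≈ 4.975, i.e., about 4.97 as on the slide
```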
Data Cleaning: Noisy Data
• Noisy: containing noise, errors, or outliers
– e.g., Salary=“−10” (an error)
• Typographical errors are errors that corrupt data
• Say ‘green’ is written as ‘rgeen’
• Incorrect attribute values may be due to
– faulty data collection instruments (e.g., OCR)
– data entry problems
– data transmission problems
– technology limitations
– inconsistency in naming conventions
Data Cleaning: How to catch Noisy Data
• Manually check all data: tedious + infeasible?
• Sort data by frequency
– ‘green’ is more frequent than ‘rgeen’
– Works well for categorical data
• Use, say, numerical constraints to catch corrupt data
– Weight can’t be negative
– People can’t have more than 2 parents
– Salary can’t be less than Birr 300
• Use statistical techniques to catch corrupt data
– Check for outliers (the case of the 8-meter man)
– Check for correlated outliers using n-grams (“pregnant male”)
• People can be male
• People can be pregnant
• People can’t be male AND pregnant
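A minimal Python sketch of the frequency and constraint checks above (the column names and thresholds are illustrative assumptions):

```python
import pandas as pd

# Hypothetical records with a typo, an impossible weight, and a too-low salary.
df = pd.DataFrame({
    "color":  ["green", "green", "rgeen", "green"],
    "weight": [62.0, -5.0, 70.0, 81.0],
    "salary": [1200, 250, 900, 4000],
})

# Sort categorical values by frequency: rare spellings stand out as likely typos.
print(df["color"].value_counts())

# Numerical constraints: flag rows that violate domain rules.
suspect = df[(df["weight"] < 0) | (df["salary"] < 300)]
print(suspect)
```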
Data Integration
• Data integration combines data from multiple sources
(database, data warehouse, files & sometimes from
non-electronic sources) into a coherent store
• Because of the use of different sources, data that is
fine on its own may become problematic when we
want to integrate it.
• Some of the issues are:
– Different formats and structures
– Conflicting and redundant data
– Data at different levels
Data Integration: Formats
• Not everyone uses the same format. Do you agree?
– Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Dates are especially problematic:
– 12/19/97
– 19/12/97
– 19/12/1997
– 19-12-97
– Dec 19, 1997
– 19 December 1997
– 19th Dec. 1997
• Are you frequently writing money as:
– Birr 200, Br. 200, 200 Birr, …
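A minimal Python sketch of normalizing such date strings with the standard datetime module (the format list is an illustrative assumption; purely numeric dates such as 12/11/97 stay ambiguous unless the source's convention is known):

```python
from datetime import datetime

# Candidate formats matching the variants listed above.
FORMATS = ["%m/%d/%y", "%d/%m/%y", "%d/%m/%Y", "%d-%m-%y",
           "%b %d, %Y", "%d %B %Y"]

def to_iso(raw):
    """Try each known format and return an unambiguous ISO date string."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError("Unrecognized date format: " + raw)

print(to_iso("Dec 19, 1997"))      # 1997-12-19
print(to_iso("19 December 1997"))  # 1997-12-19
```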
Data Integration: Inconsistent
• Inconsistent data: containing discrepancies in codes or
names, which is also the problem of lack of
standardization / naming conventions. e.g.,
– Age=“26” vs. Birthday=“03/07/1986”
– Some use “1,2,3” for rating; others “A, B, C”
• Discrepancy between duplicate records
ID | Name                       | City        | State
1  | Ministry of Transportation | Addis Ababa | Addis Ababa region
2  | Ministry of Finance        | Addis Ababa | Addis Ababa administration
3  | Office of Foreign Affairs  | Addis Ababa | Addis Ababa regional administration
Data Integration: different structure
What’s wrong here? No data type constraints
ID   | Name                       | City        | State
1234 | Ministry of Transportation | Addis Ababa | AA

ID    | Name                | City        | State
GCR34 | Ministry of Finance | Addis Ababa | AA

Name                      | ID    | City        | State
Office of Foreign Affairs | GCR34 | Addis Ababa | AA
Data Integration: Data that Moves
• Be careful of taking snapshots of a moving target
• Example: Let’s say you want to store the price of a shoe
in France, and the price of a shoe in Italy. Can we use the
same currency (say, US$) or each country’s own currency?
– You can’t store it all in the same currency (say, US$) because
the exchange rate changes
– The price in the foreign currency stays the same
– Must keep the data in the foreign currency and use the current
exchange rate to convert
• The same needs to be done for ‘Age’
– It is better to store ‘Date of Birth’ than ‘Age’
Data at different level of detail than needed
• If it is at a finer level of detail, you can sometimes bin it
• Example
– I need age ranges of 20-30, 30-40, 40-50, etc.
– Imported data contains birth date
– No problem! Divide data into the appropriate categories
• Sometimes you cannot bin it
• Example
– I need age ranges 20-30, 30-40, 40-50, etc.
– Data is given in age ranges 25-35, 35-45, etc.
– What to do?
• Ignore the age ranges because you aren’t sure
• Make an educated guess based on the imported data (e.g.,
assume that the # of people aged 25-35 is the average of the
# aged 20-30 and the # aged 30-40)
Data Integration: Conflicting Data
• Detecting and resolving data value conflicts
–For the same real world entity, attribute values from different
sources are different
–Possible reasons: different representations, different scales, e.g.,
American vs. British units
• weight measurement: KG or pound
• Height measurement: meter or inch
• Information source #1 says that Alex lives in Bahirdar
• Information source #2 says that Alex lives in Mekele
• What to do?
– Use both (He lives in both places)
– Use the most recently updated piece of information
– Use the “most trusted” information
– Flag row to be investigated further by hand
– Use neither (We’d rather be incomplete than wrong)
Handling Redundancy in Data Integration
• Redundant data occur often when integrating multiple
databases
– Object identification: The same attribute or object may have
different names in different databases
– Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue, age
• Redundant attributes may be detected by correlation
analysis and covariance analysis
• Careful integration of data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
Covariance
• Covariance is similar to correlation:
  Cov(p, q) = E[(p − p̄)(q − q̄)] = Σi (pi − p̄)(qi − q̄) / n
  where n is the number of tuples, p̄ and q̄ are the respective means
  of p and q, and σp and σq are the respective standard deviations of
  p and q (used to normalize covariance into correlation:
  Corr(p, q) = Cov(p, q) / (σp σq)).
• It can be simplified in computation as
  Cov(p, q) = E(p·q) − p̄·q̄
• Positive covariance: If Cov(p, q) > 0, then p and q tend to be
  directly related.
• Negative covariance: If Cov(p, q) < 0, then p and q are inversely
  related.
• Independence: if p and q are independent then Cov(p, q) = 0 (but
  zero covariance alone does not imply independence).
Example: Covariance
• Suppose two stocks A and B have the following values in
one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry
trends, will their prices rise or fall together?
– E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
– E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
– Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
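A minimal Python sketch that reproduces this arithmetic with the simplified formula Cov(p, q) = E(p·q) − p̄·q̄:

```python
# Weekly prices of stocks A and B from the example.
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
n = len(A)

mean_a = sum(A) / n                              # E(A) = 4
mean_b = sum(B) / n                              # E(B) = 9.6
mean_ab = sum(a * b for a, b in zip(A, B)) / n   # E(A·B) = 42.4

cov = mean_ab - mean_a * mean_b                  # 42.4 - 38.4 = 4.0
print(cov)  # > 0, so A and B tend to rise together
```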
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
• Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time
to run on the complete data set.
• Data reduction strategies
– Dimensionality reduction,
• Select best attributes or remove unimportant attributes
– Numerosity reduction
• Reduce data volume by choosing alternative, smaller forms of
data representation
– Data compression
• Reduces the size of large files so that the smaller files
take less storage space and are faster to transfer over a
network or the Internet
Data Reduction: Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
• Dimensionality reduction
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
• Method: attribute subset selection
– One of the methods to reduce the dimensionality of data is to
select the best attributes
– Helps to avoid redundant attributes: those that duplicate
information contained in one or more other attributes
• E.g., purchase price of a product & the amount of sales tax paid
– Helps to avoid irrelevant attributes: those that contain no
information useful for the data mining task at hand
• E.g., is a student's ID relevant to predicting the student's GPA?
Heuristic Search in Attribute Selection
• Given M attributes, there are 2^M possible attribute
combinations
• Commonly used heuristic attribute selection methods
(sketched in code below):
– Best step-wise feature selection:
• The best single attribute is picked first
• Then the next best attribute conditioned on the first, ...
• The process continues until the performance of the
combined attributes starts to decline
– Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
– Best combined attribute selection and elimination
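A minimal sketch of best step-wise (greedy forward) selection in Python, assuming a caller-supplied score(features) function such as cross-validated accuracy:

```python
def forward_select(all_features, score):
    """Best step-wise selection: greedily add the attribute that most
    improves score(features); stop when performance starts to decline."""
    selected = []
    remaining = list(all_features)
    best_score = float("-inf")
    while remaining:
        # Evaluate each remaining attribute combined with those selected.
        trial = [(score(selected + [f]), f) for f in remaining]
        new_score, best_f = max(trial, key=lambda t: t[0])
        if new_score <= best_score:  # no improvement: stop
            break
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = new_score
    return selected
```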
Data Reduction: Numerosity Reduction
• Different methods can be used, including Clustering and
sampling
• Clustering
– Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
– There are many choices of clustering definitions and clustering
algorithms
• Sampling
– obtaining a small sample s to represent the whole data set N
– Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
– Key principle: Choose a representative subset of the data using
suitable sampling technique
Types of Sampling
• Simple random sampling
– There is an equal probability of selecting any particular item
– Simple random sampling may have very poor performance in
the presence of skew
• Stratified sampling:
– An adaptive sampling method that partitions the data set and
draws samples from each partition (proportionally, i.e.,
approximately the same percentage of the data)
– Used in conjunction with skewed data
• Sampling without replacement
– Once an object is selected, it is removed from the population
• Sampling with replacement
– A selected object is not removed from the population
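A minimal Python sketch of these sampling variants using the standard random module (the population and sample sizes are illustrative):

```python
import random

population = list(range(100))  # hypothetical data set N
random.seed(0)

# Simple random sampling without replacement: each item chosen at most once.
without = random.sample(population, k=10)

# Simple random sampling with replacement: items may repeat.
with_repl = random.choices(population, k=10)

# Stratified sampling: sample proportionally from each stratum (here, parity).
strata = {0: [x for x in population if x % 2 == 0],
          1: [x for x in population if x % 2 == 1]}
stratified = [x for group in strata.values()
              for x in random.sample(group, k=len(group) // 10)]
print(without, with_repl, stratified, sep="\n")
```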
Sampling: With or without Replacement
[Figure: raw data sampled with and without replacement]
Sampling: Cluster or Stratified Sampling
[Figure: raw data vs. a cluster/stratified sample]
Data Transformation
• A function that maps the entire set of values of a given attribute to
a new set of replacement values such that each old value can be
identified with one of the new values
• Methods for data transformation
– Normalization: Scaled to fall within a smaller, specified range of
values
• min-max normalization
• z-score normalization
– Discretization: Reduce data size by dividing the range of a
continuous attribute into intervals. Interval labels can then be
used to replace actual data values
• Discretization can be performed recursively on an attribute
using methods such as
– Binning: divide values into intervals
– Concept hierarchy climbing: organizes concepts (i.e.,
attribute values) hierarchically
Normalization
• Min-max normalization:
  v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
– Ex. Let income range from $12,000 to $98,000, normalized to
[0.0, 1.0]. Then $73,600 is mapped to
  (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
• Z-score normalization (μ: mean, σ: standard deviation):
  v' = (v − μA) / σA
– Ex. Let μ = 54,000, σ = 16,000. Then
  (73,600 − 54,000) / 16,000 = 1.225
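A minimal Python sketch of both normalizations, reproducing the numbers above:

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    """Min-max normalization into [new_min, new_max]."""
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization."""
    return (v - mu) / sigma

print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))  # 1.225
```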
Simple Discretization
• Equal-width (distance) partitioning
– Divides the range into N intervals of equal size: a uniform
grid
– If A and B are the lowest and highest values of the
attribute, the width of the intervals will be W = (B − A)/N
– This is the most straightforward approach, but outliers may
dominate the presentation
– Skewed data is not handled well
• Equal-depth (frequency) partitioning
– Divides the range into N intervals, each containing
approximately the same number of samples
– Good data scaling
– Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
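A minimal Python sketch of equal-depth binning with both smoothing variants, reproducing the bins above:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

# Partition into equal-frequency bins of depth 4.
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: replace every value by its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: snap each value to the nearer bin edge.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```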
Concept Hierarchy Generation
• Concept hierarchy organizes concepts (i.e.,
attribute values) hierarchically and is usually
associated with each dimension in a data
warehouse
– Concept hierarchy formation: Recursively
reduce the data by collecting and replacing
low level concepts (such as numeric values
for age) by higher level concepts (such as
youth, adult, or senior)
• Concept hierarchies can be explicitly
specified by domain experts and/or data
warehouse designers
Example hierarchy (top to bottom): Country → Region or State →
City → Sub city → Kebele
• Concept hierarchy can be automatically formed by the analysis of
the number of distinct values. E.g., for a set of attributes: {Kebele,
city, state, country}
For numeric data, use discretization methods.
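A minimal Python sketch of the distinct-value heuristic (the records are a hypothetical toy example; attributes with fewer distinct values are placed higher in the hierarchy):

```python
# Toy records over the attribute set {Kebele, city, state, country}.
records = [
    {"Kebele": "K01", "city": "Addis Ababa", "state": "Addis Ababa", "country": "Ethiopia"},
    {"Kebele": "K02", "city": "Addis Ababa", "state": "Addis Ababa", "country": "Ethiopia"},
    {"Kebele": "K07", "city": "Bahirdar", "state": "Amhara", "country": "Ethiopia"},
]

# Count distinct values per attribute; fewer distinct values => higher level.
distinct = {attr: len({r[attr] for r in records}) for attr in records[0]}
hierarchy = sorted(distinct, key=distinct.get)  # top of hierarchy first
print(hierarchy)  # ['country', 'city', 'state', 'Kebele'] (city/state tie here)
```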
Data sets preparation for learning
• A standard machine learning technique is to divide the
dataset into a training set and a test set.
– Training dataset is used for learning the parameters of
the model in order to produce hypotheses.
• A training set is a set of problem instances (described as a set
of properties and their values), together with a classification
of the instance.
– Test dataset, which is never seen during the
hypothesis forming stage, is used to get a final,
unbiased estimate of how well the model works.
• Test set evaluates the accuracy of the model/hypothesis in
predicting the categorization of unseen examples.
• A set of instances and their classifications used to test the
accuracy of a learned hypothesis.
Classification: Train, Validation, Test split
[Diagram: data with known results is split into a training set and a test
dataset; a model builder learns from the training set, its predictions on
the test dataset are evaluated, and the resulting final model is assessed
once more on a separate final test set for the final evaluation.]
Divide the dataset into training & test
• There are various ways to separate the data into training
and test sets
• There are established ways of using the two sets to assess
the effectiveness and the predictive/descriptive accuracy of
a machine learning technique on unseen examples:
– The holdout method
• Repeated holdout method
– Cross-validation
– The bootstrap
The holdout method
• The holdout method reserves a certain amount for
testing and uses the remainder for training
– Usually: one third for testing, the rest for training
• For small or “unbalanced” datasets, samples might not
be representative
– Few or none instances of some classes
• Stratified sample: advanced version of balancing the
data
– Make sure that each class is represented with approximately
equal proportions in both subsets
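A minimal sketch of a stratified holdout split, assuming scikit-learn is available (X and y are an illustrative feature matrix and class labels):

```python
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X and class labels y.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
y = [0, 0, 0, 1, 1, 1]

# Hold out one third for testing; stratify so each class keeps
# approximately equal proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42)
print(y_train, y_test)
```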
Repeated Holdout Method
• Holdout estimate can be made more reliable by repeating
the process with different subsamples, called the
repeated holdout method
– Random subsampling: performs K data splits of the dataset
• In each iteration, a certain proportion is randomly selected
for training (possibly with stratification) without
replacement
• For each data split we retrain the classifier from scratch with
the training examples and estimate accuracy with the test
examples
• The error rates on the different iterations are averaged to
yield an overall error rate
• Still not optimum: the different test sets overlap
– Can we prevent overlapping?
Cross-validation
• Cross-validation avoids overlapping test sets
– First step: the data is randomly split into k subsets of equal
size. A partition of a set is a collection of subsets whose
pairwise intersections are empty; that is, no element of one
subset is an element of another subset in the partition.
– Second step: each subset in turn is used for testing and the
remainder for training
• This is called k-fold cross-validation
– Often the subsets are stratified before the cross-validation is
performed
• The error estimates are averaged to yield an overall error
estimate
Cross-validation example:
— Break up the data into groups of the same size
— Hold aside one group for testing and use the rest to build the model
— Repeat, holding aside each group in turn
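A minimal sketch of stratified k-fold cross-validation, assuming scikit-learn is available (the dataset and classifier are illustrative placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation with stratified (class-balanced) folds:
# each subset is used once for testing while the rest train the model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Average the per-fold accuracies to get the overall estimate.
print(scores.mean())
```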