Prof. Pier Luca Lanzi
Data Preparation
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Why Data Preprocessing?
•  Data in the real world are “dirty”
•  They are usually incomplete
§ Missing attribute values
§ Missing attributes of interest, or containing only aggregate data
•  They are usually noisy, containing errors or outliers
§ e.g., Salary=“-10”
•  They are usually inconsistent and contain discrepancies
§ e.g., Age=“42” Birthday=“03/07/1997”
§ e.g., Was rating “1,2,3”, now rating “A, B, C”
§ e.g., discrepancy between duplicate records
Why Is Data Dirty?
•  Incomplete data may come from
§ “Not applicable” data value when collected
§ Different considerations between the time when the data was
collected and when it is analyzed.
§ Human/hardware/software problems
•  Noisy data (incorrect values) may come from
§ Faulty data collection instruments
§ Human or computer error at data entry
§ Errors in data transmission
•  Inconsistent data may come from
§ Different data sources
§ Functional dependency violation (e.g., modify some linked data)
•  Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
•  No quality in data, no quality in the mining results!
(trash in, trash out)
•  Quality decisions need quality in data
•  E.g., duplicate or missing data may cause incorrect or even
misleading statistics.
•  Data warehouses need consistent integration of quality data
•  Data extraction, cleaning, and transformation make up the
majority of the work of building a data warehouse
Major Tasks in Data Preprocessing
•  Data cleaning
§ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
•  Data integration
§ Integration of multiple databases, data cubes, or files
•  Data reduction
§ Dimensionality reduction
§ Numerosity reduction
§ Data compression
•  Data transformation and data discretization
§ Normalization
§ Concept hierarchy generation
Outliers
•  Outliers are data objects with characteristics that are considerably
different than most of the other data objects in the data set
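The slide does not prescribe a detection method. As a minimal sketch, assuming a single numeric attribute, the interquartile-range rule flags values far from the bulk of the data; the rule itself, the 1.5 multiplier, and the salary figures are illustrative assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical salaries (in thousands); -10 echoes the noisy Salary value
# used as an example earlier in the slides.
salaries = [30, 32, 35, 31, 33, 34, 36, 250, -10]
flagged = iqr_outliers(salaries)
```

Whether a flagged value is an error to remove or a genuine rare case to keep still depends on the domain.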
Missing Values
•  Reasons for missing values
§ Information is not collected 
(e.g., people decline to give their age and weight)
§ Attributes may not be applicable to all cases 
(e.g., annual income is not applicable to children)
•  Handling missing values
§ Eliminate Data Objects
§ Estimate Missing Values
§ Ignore the Missing Value During Analysis
§ Replace with all possible values (weighted by their
probabilities)
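The first three strategies can be sketched in a few lines. The records, attribute names, and the choice of the mean as the estimate are assumptions made for illustration; other estimators (median, model-based) fit the same pattern.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
people = [
    {"name": "a", "age": 25, "weight": 70.0},
    {"name": "b", "age": None, "weight": 82.5},
    {"name": "c", "age": 41, "weight": None},
    {"name": "d", "age": 33, "weight": 64.0},
]

def drop_incomplete(rows):
    """Eliminate data objects that have any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows, attr):
    """Estimate missing values of one attribute with the mean of the observed ones."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

def mean_ignoring_missing(rows, attr):
    """Ignore missing values during the analysis itself."""
    return mean(r[attr] for r in rows if r[attr] is not None)

complete = drop_incomplete(people)
imputed = impute_mean(people, "age")
avg_weight = mean_ignoring_missing(people, "weight")
```

Dropping objects is the simplest option but here it discards half of the data, which is why estimation is often preferred when many records are incomplete.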
Duplicate Data
•  Data set may include data objects that are duplicates, or almost
duplicates of one another
•  Major issue when merging data from heterogeneous sources
•  Examples: same person with multiple email addresses
•  Data cleaning: process of dealing with duplicate data issues
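A minimal sketch of merging near-duplicates, assuming that two records describe the same person when their normalized name and birthday match. The matching key, field names, and records are hypothetical; real deduplication usually needs fuzzier matching.

```python
def normalize(name):
    """Lowercase and collapse whitespace so near-identical names compare equal."""
    return " ".join(name.lower().split())

def merge_duplicates(records):
    """Group records sharing a normalized (name, birthday) key and merge their emails."""
    merged = {}
    for rec in records:
        key = (normalize(rec["name"]), rec["birthday"])
        entry = merged.setdefault(
            key, {"name": rec["name"], "birthday": rec["birthday"], "emails": []}
        )
        if rec["email"] not in entry["emails"]:
            entry["emails"].append(rec["email"])
    return list(merged.values())

# Hypothetical records: the first two are the same person with two addresses.
records = [
    {"name": "Ada Rossi", "birthday": "1990-03-07", "email": "ada@example.com"},
    {"name": "ada  rossi", "birthday": "1990-03-07", "email": "a.rossi@example.org"},
    {"name": "Bruno Bianchi", "birthday": "1985-11-21", "email": "bruno@example.com"},
]
people = merge_duplicates(records)
```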
Data Preprocessing
•  Aggregation
•  Sampling
•  Dimensionality Reduction
•  Feature subset selection
•  Feature creation
•  Discretization and Binarization
Aggregation
•  Combining two or more attributes (or objects) 
into a single attribute (or object)
•  Data reduction
§ Reduce the number of attributes or objects
•  Change of scale
§ Cities aggregated into regions
§ States, countries, etc
•  More “stable” data
§ Aggregated data tends to have less variability
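Both effects can be illustrated with a small sketch. The city figures, the city-to-region mapping, and the synthetic daily series are made up for the example.

```python
import random
import statistics
from collections import defaultdict

# Change of scale: hypothetical city-level sales rolled up to regions.
city_sales = {"Milano": 120, "Bergamo": 45, "Torino": 90, "Asti": 15}
city_region = {"Milano": "Lombardia", "Bergamo": "Lombardia",
               "Torino": "Piemonte", "Asti": "Piemonte"}
region_sales = defaultdict(int)
for city, amount in city_sales.items():
    region_sales[city_region[city]] += amount

# Stability: weekly averages vary less than the daily values they summarize.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
daily = [rng.gauss(100, 20) for _ in range(364)]
weekly = [statistics.mean(daily[i:i + 7]) for i in range(0, len(daily), 7)]
daily_sd = statistics.pstdev(daily)
weekly_sd = statistics.pstdev(weekly)
```

Four city objects become two region objects, and the spread of the weekly averages is noticeably smaller than that of the daily values.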
Sampling
•  Sampling is the main technique employed for data selection
•  It is often used for both the preliminary investigation of the data
and the final data analysis.
•  Statisticians sample because obtaining the entire set of data of
interest is too expensive or time consuming
•  Sampling is used in data mining because processing the entire set
of data of interest is too expensive or time consuming
•  Using a sample will work almost as well as using the entire data
set, if the sample is representative
•  A sample is representative if it has approximately the same
property (of interest) as the original set of data
Types of Sampling
•  Simple Random Sampling
§ There is an equal probability of selecting any particular item
•  Sampling without replacement
§ As each item is selected, it is removed from the population
•  Sampling with replacement
§ Objects are not removed from the population as they are
selected for the sample.
§ In sampling with replacement, the same object can 
be picked up more than once
•  Stratified sampling
§ Split the data into several partitions
§ Then draw random samples from each partition
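The sampling schemes above map directly onto Python's standard `random` module. The population, the "rare"/"common" grouping, and the 20% fraction are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = [{"id": i, "group": "rare" if i % 10 == 0 else "common"}
              for i in range(100)]

# Simple random sampling without replacement: no item can appear twice.
without_repl = rng.sample(population, k=20)

# Sampling with replacement: the same item can be picked more than once.
with_repl = rng.choices(population, k=20)

def stratified_sample(items, key, fraction, rng):
    """Split items into partitions by key, then sample each partition separately."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # keep at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

stratified = stratified_sample(population, lambda r: r["group"], 0.2, rng)
```

Stratification guarantees that the rare group is represented, whereas a simple random sample of the same size may miss it entirely.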
[Figure: sampling illustrated with 8000 points, 2000 points, and 500 points]
Sample Size
•  What sample size is necessary to get at least 
one object from each of 10 groups?
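One way to explore the question numerically is to compute the probability that a sample of n objects covers every group. The sketch assumes equally sized groups and independent, uniform draws (sampling with replacement, or a very large population); neither assumption is stated on the slide.

```python
from math import comb

def prob_all_groups(n, groups=10):
    """P(a sample of n objects contains every group), by inclusion-exclusion."""
    return sum((-1) ** j * comb(groups, j) * (1 - j / groups) ** n
               for j in range(groups + 1))

def min_sample_size(target, groups=10):
    """Smallest n whose coverage probability reaches the target."""
    n = groups  # fewer than `groups` objects can never cover all groups
    while prob_all_groups(n, groups) < target:
        n += 1
    return n

p10 = prob_all_groups(10)      # exactly 10 objects: covering all groups is very unlikely
n_90 = min_sample_size(0.90)   # sample size needed for a 90% chance
```

With only 10 objects the coverage probability is 10!/10^10, well below 0.1%, so the required sample size is several times the number of groups.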
Dimensionality Reduction
•  Purpose
§ Avoid curse of dimensionality
§ Reduce amount of time and memory required by data mining
algorithms
§ Allow data to be more easily visualized
§ May help to eliminate irrelevant features or reduce noise
•  Techniques
§ Principal Component Analysis
§ Singular Value Decomposition
§ Others: supervised and non-linear techniques
Dimensionality Reduction: 
Principal Component Analysis (PCA)
•  Given N data vectors in n dimensions, find k ≤ n orthogonal
vectors (principal components) that best represent the data
•  Works for numeric data only
•  Used when the number of dimensions is large
Dimensionality Reduction: 
Principal Component Analysis (PCA)
•  Normalize input data
•  Compute k orthonormal (unit) vectors, i.e., principal components
•  Each input data vector is then expressed as a linear combination of
the k principal component vectors
•  The principal components are sorted in order of decreasing
“significance” or strength
•  Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance: using only the strongest principal components, it is
possible to reconstruct a good approximation of the original data
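The steps above can be sketched without external libraries. This is an illustration under stated assumptions, not a reference implementation: normalization is reduced to mean-centering, only the first component is computed, and power iteration stands in for a full eigendecomposition (it assumes the start vector is not orthogonal to the leading component). The points are hypothetical.

```python
import math

def center(data):
    """Subtract the per-attribute mean (the normalization step, centering only)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]

def covariance(centered):
    """Sample covariance matrix of already centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Leading eigenvector of cov by power iteration, returned as a unit vector."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical 2-D points lying close to the line y = x.
points = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
centered = center(points)
cov = covariance(centered)
pc1 = first_component(cov)
# Projecting onto pc1 reduces each 2-D point to a single coordinate.
scores = [sum(a * b for a, b in zip(row, pc1)) for row in centered]
# Share of the total variance kept by the single retained component.
kept = (sum(s * s for s in scores) / (len(scores) - 1)) / (cov[0][0] + cov[1][1])
```

Because the points are almost collinear, the single retained component keeps nearly all of the variance, which is the situation in which dropping the weak component costs little.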
Dimensionality Reduction: 
Principal Component Analysis (PCA)
•  Goal is to find a projection that captures the largest amount of
variation in data
[Figure: data points plotted on the original axes, with the first and second principal components overlaid]
Singular Value Decomposition
•  Theorem (Press 1992). It is always possible to decompose a matrix
A into A = U Λ V^T, where U, Λ, V are essentially unique
•  U, V are column orthonormal, i.e., their columns are unit vectors,
orthogonal to each other, so that U^T U = I and V^T V = I
•  Λi,i are the singular values; they are non-negative and sorted in
decreasing order
Singular Value Decomposition
•  Data are viewed as a matrix A with n examples described by m
numerical attributes

A[n x m] = U[n x r] Λ[r x r] (V[m x r])^T
•  A stores the original data
•  U represents the n examples using r new concepts/attributes
•  Λ represents the strength of each ‘concept’ (r is the rank of A)
•  V relates the m attributes to the r concepts
•  Λi,i are called the singular values of A; if A is singular, some of the
Λi,i will be 0; in general, rank(A) = number of nonzero Λi,i
•  SVD is mostly unique (up to permutation of singular values, or if
some Λi,i are equal)
Singular Value Decomposition
for Dimensionality Reduction
•  Suppose you want to find the best rank-k approximation to A
•  Set all but the largest k singular values to zero
•  Can form a compact representation by eliminating the columns of U
and V corresponding to the zeroed Λi,i
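In symbols, with u_i and v_i denoting the i-th columns of U and V (names introduced here only for illustration), the rank-k approximation and its squared Frobenius-norm error can be written as:

```latex
A_k = \sum_{i=1}^{k} \Lambda_{i,i}\, u_i v_i^{T},
\qquad
\lVert A - A_k \rVert_F^2 = \sum_{i=k+1}^{r} \Lambda_{i,i}^2
```

The discarded singular values therefore measure exactly how much of A is lost by the truncation.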
A [7 x 5] = U [7 x 2] x Λ [2 x 2] x V^T [2 x 5]

A =
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1

U =
0.18 0
0.36 0
0.18 0
0.90 0
0    0.53
0    0.80
0    0.27

Λ =
9.64 0
0    5.29

V^T =
0.58 0.58 0.58 0    0
0    0    0    0.71 0.71
Singular Value Decomposition
for Dimensionality Reduction
A ≈ U [7 x 1] x Λ [1 x 1] x V^T [1 x 5] (largest singular value only)

A =
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1

U = [0.18 0.36 0.18 0.90 0 0 0]^T
Λ = [9.64]
V^T = [0.58 0.58 0.58 0 0]
Singular Value Decomposition
for Dimensionality Reduction
A =
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1

Rank-1 approximation of A ≈
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
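The rank-1 result can be checked with a short dependency-free sketch. Power iteration on A^T A is used here as a stand-in for a full SVD routine (in practice a library call such as `numpy.linalg.svd` would be the usual choice); the helper names are introduced for this example only.

```python
import math

# The 7 x 5 example matrix from the slides.
A = [
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
]

def matvec(M, x):
    """Matrix-vector product."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

def leading_singular_triple(M, iterations=100):
    """Power iteration on M^T M: returns (u, sigma, v) for the largest singular value."""
    Mt = transpose(M)
    v = [1.0 / math.sqrt(len(M[0]))] * len(M[0])
    for _ in range(iterations):
        w = matvec(Mt, matvec(M, v))
        length = norm(w)
        v = [x / length for x in w]
    Mv = matvec(M, v)
    sigma = norm(Mv)
    u = [x / sigma for x in Mv]
    return u, sigma, v

u, sigma, v = leading_singular_triple(A)
# Rank-1 approximation: sigma * u * v^T
A1 = [[sigma * ui * vj for vj in v] for ui in u]
```

The leading singular value is the square root of 93, which rounds to the 9.64 shown above, and `A1` reproduces the first four rows of A while zeroing the last three.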
Feature Subset Selection
•  Another way to reduce dimensionality of data
•  Redundant features
§ duplicate much or all of the information contained in one or
more other attributes
§ Example: purchase price of a product and the amount of sales
tax paid
•  Irrelevant features
§ contain no information that is useful for the data mining task
at hand
§ Example: students' ID is often irrelevant to the task of
predicting students' GPA
Feature Subset Selection
•  Brute-force approach
§ Try all possible feature subsets as input to data mining algorithm
•  Embedded approaches
§ Feature selection occurs naturally as part of 
the data mining algorithm
•  Filter approaches
§ Features are selected using a procedure that is 
independent from a specific data mining algorithm
§ E.g., attributes are selected based on correlation measures
•  Wrapper approaches
§ Use a data mining algorithm as a black box 
to find best subset of attributes
§ E.g., combine a genetic algorithm with a decision tree learner to
find the best set of features for a decision tree
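A filter approach based on correlation can be sketched as follows. The student data, feature names, and the use of absolute Pearson correlation as the score are assumptions for illustration; the same function also exposes a redundant pair such as price and sales tax.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features(columns, target):
    """Filter approach: score each feature by |correlation| with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical student data: study hours should matter, the ID should not.
columns = {
    "student_id": [101, 102, 103, 104, 105, 106],
    "study_hours": [2, 9, 4, 10, 1, 7],
}
gpa = [2.1, 3.8, 2.6, 3.9, 1.9, 3.3]
ranking = rank_features(columns, gpa)

# Redundancy check: a flat-rate sales tax duplicates the information in the price.
price = [10.0, 25.0, 40.0, 80.0]
sales_tax = [0.22 * p for p in price]  # hypothetical 22% rate
redundancy = pearson(price, sales_tax)
```

The ranking is computed without running any mining algorithm, which is what distinguishes a filter from a wrapper.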
Example of using the information gain to rank the importance of attributes.
Decision trees applied to all the attributes.
Decision trees applied to the breast dataset without the four lowest-ranked attributes.
Wrapper approach using exhaustive search and decision trees.
Feature Creation
•  Create new attributes that can capture the important information
in a data set much more efficiently than the original attributes
•  E.g., given the birthday, create the attribute age
•  Three general methodologies:
§ Feature Extraction: domain-specific
§ Mapping Data to New Space
§ Feature Construction: combining features
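The birthday-to-age example can be sketched as below. The reference date is fixed so the result is reproducible, and reading the earlier 03/07/1997 birthday as 3 July 1997 is an assumption about the date format.

```python
from datetime import date

def age_on(birthday, today):
    """Whole years elapsed between birthday and today."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)

# Hypothetical reference date; a real pipeline would typically use the date
# on which the data was collected.
reference = date(2015, 12, 1)
age = age_on(date(1997, 7, 3), reference)
```

A derived attribute like this also makes inconsistencies visible: a record with this birthday and Age=42 cannot be correct.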
Discretization
•  Three types of attributes
§ Nominal: values from an unordered set, e.g., color
§ Ordinal: values from an ordered set, 
e.g., military or academic rank
§ Continuous: numeric values, 
e.g., integer or real numbers
•  Discretization
§ Divide the range of a continuous attribute into intervals
§ Some classification algorithms only accept categorical
attributes
§ Reduce data size by discretization
§ Prepare for further analysis
Discretization Approaches
•  Supervised
§ Attributes are discretized using the class information
§ Generates intervals that try to minimize the loss of
information about the class
•  Unsupervised
§ Attributes are discretized solely based on their values
Supervised Discretization
•  Entropy based approach
[Figure panels: 3 categories for both x and y; 5 categories for both x and y]
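A single step of an entropy-based approach can be sketched as a search for the cut point that leaves the purest intervals. Only one binary split is shown; repeating the search inside each interval, and deciding when to stop, are left out. The attribute values and labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Cut point that minimizes the weighted class entropy of the two intervals."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score

# Hypothetical attribute values with their class labels.
ages = [18, 22, 25, 31, 40, 47, 52, 60]
classes = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]
cut, score = best_split(ages, classes)
```

Here the classes separate cleanly, so the chosen cut yields two pure intervals with zero weighted entropy.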
Discretization Without Using Class Labels
[Figure panels: Data; Equal interval width; Equal frequency; K-means]
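Equal-width and equal-frequency binning can be compared on skewed data with a short sketch (K-means is omitted). The data values and helper names are assumptions for illustration.

```python
def equal_width_edges(values, bins):
    """Interior cut points that split [min, max] into intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]

def equal_frequency_edges(values, bins):
    """Interior cut points chosen so each interval holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bins] for i in range(1, bins)]

def assign_bin(value, edges):
    """Index of the interval containing value (each interval includes its lower edge)."""
    return sum(value >= e for e in edges)

# Hypothetical skewed data: most values are small, a few are large.
data = [1, 2, 2, 3, 3, 4, 5, 6, 20, 40, 80, 100]
width_edges = equal_width_edges(data, 4)
freq_edges = equal_frequency_edges(data, 4)
width_counts = [[assign_bin(v, width_edges) for v in data].count(b) for b in range(4)]
freq_counts = [[assign_bin(v, freq_edges) for v in data].count(b) for b in range(4)]
```

Equal width crowds nine of the twelve values into the first interval and leaves one interval empty, while equal frequency puts three values in each; with many tied values the equal-frequency counts would be less even.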
Segmentation by Natural Partitioning
•  A simple 3-4-5 rule can be used to segment numeric data into
relatively uniform, “natural” intervals.
•  If an interval covers 3, 6, 7, or 9 distinct values at the most
significant digit, partition the range into 3 intervals (3
equal-width intervals for 3, 6, and 9; 3 intervals in the
grouping 2-3-2 for 7)
•  If it covers 2, 4, or 8 distinct values at the most significant digit,
partition the range into 4 intervals
•  If it covers 5 or 10 distinct values at the most significant digit,
partition the range into 5 intervals
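One level of the rule can be sketched as follows. The sketch assumes integer bounds that are already rounded to the most significant digit of the range, uses equal-width intervals in the 4- and 5-interval cases, and treats a count of 1 like the 5-or-10 case; all three are simplifying assumptions rather than part of the slide.

```python
def natural_partition(low, high):
    """Split [low, high] with the 3-4-5 rule; returns a list of (start, end) intervals."""
    span = high - low
    if span <= 0:
        raise ValueError("high must be greater than low")
    unit = 10 ** (len(str(span)) - 1)   # place value of the most significant digit
    if span % unit != 0:
        raise ValueError("bounds must be rounded to the most significant digit")
    count = span // unit                # distinct values at that digit
    if count == 7:
        widths = [2 * unit, 3 * unit, 2 * unit]   # the 2-3-2 grouping
    elif count in (3, 6, 9):
        widths = [span // 3] * 3
    elif count in (2, 4, 8):
        widths = [span // 4] * 4
    else:                               # 1 or 5 (a span such as 10000 shows up as 1)
        widths = [span // 5] * 5
    if sum(widths) != span:
        raise ValueError("range too small to split into integer intervals")
    edges = [low]
    for w in widths:
        edges.append(edges[-1] + w)
    return list(zip(edges[:-1], edges[1:]))

three = natural_partition(-1000, 2000)   # 3 distinct values -> 3 intervals
seven = natural_partition(0, 7000)       # 7 distinct values -> 2-3-2 grouping
```

The same function can be applied again to each resulting interval to build a deeper hierarchy.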
Summary
•  Data exploration and preparation, or preprocessing, is a big issue
for both data warehousing and data mining
•  Descriptive data summarization is needed for quality data
preprocessing
•  Data preparation includes
§ Data cleaning and data integration
§ Data reduction and feature selection
§ Discretization
•  Many methods have been developed, but data preprocessing is still
an active area of research
38

More Related Content

What's hot

DMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringDMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to Clustering
Pier Luca Lanzi
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparation
Pier Luca Lanzi
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text mining
Pier Luca Lanzi
 
DMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationDMTM Lecture 05 Data representation
DMTM Lecture 05 Data representation
Pier Luca Lanzi
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethods
Pier Luca Lanzi
 
DMTM 2015 - 09 Density Based Clustering
DMTM 2015 - 09 Density Based ClusteringDMTM 2015 - 09 Density Based Clustering
DMTM 2015 - 09 Density Based Clustering
Pier Luca Lanzi
 
DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data mining
Pier Luca Lanzi
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 Classification
Pier Luca Lanzi
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
Pier Luca Lanzi
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data exploration
Pier Luca Lanzi
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensembles
Pier Luca Lanzi
 
DMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision TreesDMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision Trees
Pier Luca Lanzi
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
Pier Luca Lanzi
 
DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision trees
Pier Luca Lanzi
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 Regression
Pier Luca Lanzi
 
DMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesDMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification Ensembles
Pier Luca Lanzi
 
DMTM 2015 - 14 Evaluation of Classification Models
DMTM 2015 - 14 Evaluation of Classification ModelsDMTM 2015 - 14 Evaluation of Classification Models
DMTM 2015 - 14 Evaluation of Classification Models
Pier Luca Lanzi
 
DMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based ClusteringDMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based Clustering
Pier Luca Lanzi
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluation
Pier Luca Lanzi
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier
ananth
 

What's hot (20)

DMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringDMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to Clustering
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparation
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text mining
 
DMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationDMTM Lecture 05 Data representation
DMTM Lecture 05 Data representation
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethods
 
DMTM 2015 - 09 Density Based Clustering
DMTM 2015 - 09 Density Based ClusteringDMTM 2015 - 09 Density Based Clustering
DMTM 2015 - 09 Density Based Clustering
 
DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data mining
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 Classification
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data exploration
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensembles
 
DMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision TreesDMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision Trees
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
 
DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision trees
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 Regression
 
DMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesDMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification Ensembles
 
DMTM 2015 - 14 Evaluation of Classification Models
DMTM 2015 - 14 Evaluation of Classification ModelsDMTM 2015 - 14 Evaluation of Classification Models
DMTM 2015 - 14 Evaluation of Classification Models
 
DMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based ClusteringDMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based Clustering
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluation
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier
 

Viewers also liked

Focus Junior - 14 Maggio 2016
Focus Junior - 14 Maggio 2016Focus Junior - 14 Maggio 2016
Focus Junior - 14 Maggio 2016
Pier Luca Lanzi
 
DMTM 2015 - 12 Classification Rules
DMTM 2015 - 12 Classification RulesDMTM 2015 - 12 Classification Rules
DMTM 2015 - 12 Classification Rules
Pier Luca Lanzi
 
DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical Clustering
Pier Luca Lanzi
 
Course Introduction
Course IntroductionCourse Introduction
Course Introduction
Pier Luca Lanzi
 
DMTM 2015 - 02 Data Mining
DMTM 2015 - 02 Data MiningDMTM 2015 - 02 Data Mining
DMTM 2015 - 02 Data Mining
Pier Luca Lanzi
 
DMTM 2015 - 01 Course Introduction
DMTM 2015 - 01 Course IntroductionDMTM 2015 - 01 Course Introduction
DMTM 2015 - 01 Course Introduction
Pier Luca Lanzi
 
Course Organization
Course OrganizationCourse Organization
Course Organization
Pier Luca Lanzi
 
DMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association RulesDMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association Rules
Pier Luca Lanzi
 
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification Rules
Pier Luca Lanzi
 
Transparency in Game Mechanics
Transparency in Game MechanicsTransparency in Game Mechanics
Transparency in Game Mechanics
Pier Luca Lanzi
 
Idea Generation and Conceptualization
Idea Generation and ConceptualizationIdea Generation and Conceptualization
Idea Generation and Conceptualization
Pier Luca Lanzi
 
Game Balancing
Game BalancingGame Balancing
Game Balancing
Pier Luca Lanzi
 
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014
Pier Luca Lanzi
 
Working with Formal Elements
Working with Formal ElementsWorking with Formal Elements
Working with Formal Elements
Pier Luca Lanzi
 
The Structure of Games
The Structure of GamesThe Structure of Games
The Structure of Games
Pier Luca Lanzi
 
The Design Document
The Design DocumentThe Design Document
The Design Document
Pier Luca Lanzi
 
Elements for the Theory of Fun
Elements for the Theory of FunElements for the Theory of Fun
Elements for the Theory of Fun
Pier Luca Lanzi
 

Viewers also liked (17)

Focus Junior - 14 Maggio 2016
Focus Junior - 14 Maggio 2016Focus Junior - 14 Maggio 2016
Focus Junior - 14 Maggio 2016
 
DMTM 2015 - 12 Classification Rules
DMTM 2015 - 12 Classification RulesDMTM 2015 - 12 Classification Rules
DMTM 2015 - 12 Classification Rules
 
DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical Clustering
 
Course Introduction
Course IntroductionCourse Introduction
Course Introduction
 
DMTM 2015 - 02 Data Mining
DMTM 2015 - 02 Data MiningDMTM 2015 - 02 Data Mining
DMTM 2015 - 02 Data Mining
 
DMTM 2015 - 01 Course Introduction
DMTM 2015 - 01 Course IntroductionDMTM 2015 - 01 Course Introduction
DMTM 2015 - 01 Course Introduction
 
Course Organization
Course OrganizationCourse Organization
Course Organization
 
DMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association RulesDMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association Rules
 
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification Rules
 
Transparency in Game Mechanics
Transparency in Game MechanicsTransparency in Game Mechanics
Transparency in Game Mechanics
 
Idea Generation and Conceptualization
Idea Generation and ConceptualizationIdea Generation and Conceptualization
Idea Generation and Conceptualization
 
Game Balancing
Game BalancingGame Balancing
Game Balancing
 
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014
 
Working with Formal Elements
Working with Formal ElementsWorking with Formal Elements
Working with Formal Elements
 
The Structure of Games
The Structure of GamesThe Structure of Games
The Structure of Games
 
The Design Document
The Design DocumentThe Design Document
The Design Document
 
Elements for the Theory of Fun
Elements for the Theory of FunElements for the Theory of Fun
Elements for the Theory of Fun
 

Similar to DMTM 2015 - 16 Data Preparation

L5. Data Transformation and Feature Engineering
L5. Data Transformation and Feature EngineeringL5. Data Transformation and Feature Engineering
L5. Data Transformation and Feature Engineering
Machine Learning Valencia
 
CS194Lec0hbh6EDA.pptx
CS194Lec0hbh6EDA.pptxCS194Lec0hbh6EDA.pptx
CS194Lec0hbh6EDA.pptx
PrudhvirajEluri1
 
Icse15 Tech-briefing Data Science
Icse15 Tech-briefing Data ScienceIcse15 Tech-briefing Data Science
Icse15 Tech-briefing Data Science
CS, NcState
 
OpenLSH - a framework for locality sensitive hashing
OpenLSH  - a framework for locality sensitive hashingOpenLSH  - a framework for locality sensitive hashing
OpenLSH - a framework for locality sensitive hashing
J Singh
 
Debugging machine-learning
Debugging machine-learningDebugging machine-learning
Debugging machine-learning
Michał Łopuszyński
 
Predict the Oscars with Data Science
Predict the Oscars with Data SciencePredict the Oscars with Data Science
Predict the Oscars with Data Science
Thinkful
 
Deductive databases
Deductive databasesDeductive databases
Deductive databases
John Popoola
 
Predict oscars (4:17)
Predict oscars (4:17)Predict oscars (4:17)
Predict oscars (4:17)
Thinkful
 
data mining
data miningdata mining
data mining
nehaanand123
 
Part1
Part1Part1
UNIT-1 Data pre-processing-Data cleaning, Transformation, Reduction, Integrat...
UNIT-1 Data pre-processing-Data cleaning, Transformation, Reduction, Integrat...UNIT-1 Data pre-processing-Data cleaning, Transformation, Reduction, Integrat...
UNIT-1 Data pre-processing-Data cleaning, Transformation, Reduction, Integrat...
subhashchandra197
 
overview of_data_processing
overview of_data_processingoverview of_data_processing
overview of_data_processing
FEG
 
Text Mining
Text MiningText Mining
Text Mining
sathish sak
 
Data_Preparation.pptx
Data_Preparation.pptxData_Preparation.pptx
Data_Preparation.pptx
ImXaib
 
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and SparkReproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
Adaryl "Bob" Wakefield, MBA
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
Krishna Sankar
 
Predict oscars (5:11)
Predict oscars (5:11)Predict oscars (5:11)
Predict oscars (5:11)
Thinkful
 
Core Methods In Educational Data Mining
Core Methods In Educational Data MiningCore Methods In Educational Data Mining
Core Methods In Educational Data Mining
ebelani
 
Data pre processing
Data pre processingData pre processing
Data pre processingpommurajopt
 

Similar to DMTM 2015 - 16 Data Preparation (20)

L5. Data Transformation and Feature Engineering
L5. Data Transformation and Feature EngineeringL5. Data Transformation and Feature Engineering
L5. Data Transformation and Feature Engineering
 
CS194Lec0hbh6EDA.pptx
CS194Lec0hbh6EDA.pptxCS194Lec0hbh6EDA.pptx
CS194Lec0hbh6EDA.pptx
 
Icse15 Tech-briefing Data Science
Icse15 Tech-briefing Data ScienceIcse15 Tech-briefing Data Science
Icse15 Tech-briefing Data Science
 
OpenLSH - a framework for locality sensitive hashing
OpenLSH  - a framework for locality sensitive hashingOpenLSH  - a framework for locality sensitive hashing
OpenLSH - a framework for locality sensitive hashing
 
Dmblog
DmblogDmblog
Dmblog
 
Debugging machine-learning
Debugging machine-learningDebugging machine-learning
Debugging machine-learning
 
Predict the Oscars with Data Science
Predict the Oscars with Data SciencePredict the Oscars with Data Science
Predict the Oscars with Data Science
 
Deductive databases
Deductive databasesDeductive databases
Deductive databases
 
Predict oscars (4:17)
Predict oscars (4:17)Predict oscars (4:17)
Predict oscars (4:17)
 
data mining
data miningdata mining
data mining
 
Part1
Part1Part1
Part1
 
UNIT-1 Data pre-processing-Data cleaning, Transformation, Reduction, Integrat...
UNIT-1 Data pre-processing-Data cleaning, Transformation, Reduction, Integrat...UNIT-1 Data pre-processing-Data cleaning, Transformation, Reduction, Integrat...
UNIT-1 Data pre-processing-Data cleaning, Transformation, Reduction, Integrat...
 
overview of_data_processing
overview of_data_processingoverview of_data_processing
overview of_data_processing
 
Text Mining
Text MiningText Mining
Text Mining
 
Data_Preparation.pptx
Data_Preparation.pptxData_Preparation.pptx
Data_Preparation.pptx
 
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and SparkReproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
 
Predict oscars (5:11)
Predict oscars (5:11)Predict oscars (5:11)
Predict oscars (5:11)
 
Core Methods In Educational Data Mining
Core Methods In Educational Data MiningCore Methods In Educational Data Mining
Core Methods In Educational Data Mining
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 

More from Pier Luca Lanzi

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi
Pier Luca Lanzi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei Videogiochi
Pier Luca Lanzi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning Welcome
Pier Luca Lanzi
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018
Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di apertura
Pier Luca Lanzi
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018
Pier Luca Lanzi
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph mining
Pier Luca Lanzi
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clustering
Pier Luca Lanzi
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clustering
Pier Luca Lanzi
 
DMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringDMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clustering
Pier Luca Lanzi
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rules
Pier Luca Lanzi
 
DMTM Lecture 01 Introduction
DMTM Lecture 01 IntroductionDMTM Lecture 01 Introduction
DMTM Lecture 01 Introduction
Pier Luca Lanzi
 
VDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelineVDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipeline
Pier Luca Lanzi
 
VDP2016 - Lecture 15 PCG with Unity
VDP2016 - Lecture 15 PCG with UnityVDP2016 - Lecture 15 PCG with Unity
VDP2016 - Lecture 15 PCG with Unity
Pier Luca Lanzi
 
VDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generationVDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generation
Pier Luca Lanzi
 

DMTM 2015 - 16 Data Preparation

  • 1. Prof. Pier Luca Lanzi. Data Preparation. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
  • 2. Why Data Preprocessing?
      • Data in the real world are “dirty”
      • They are usually incomplete
          § Missing attribute values
          § Missing attributes of interest, or containing only aggregate data
      • They are usually noisy, containing errors or outliers
          § e.g., Salary=“-10”
      • They are usually inconsistent and contain discrepancies
          § e.g., Age=“42” Birthday=“03/07/1997”
          § e.g., Was rating “1,2,3”, now rating “A, B, C”
          § e.g., discrepancy between duplicate records
  • 3. Why Is Data Dirty?
      • Incomplete data may come from
          § “Not applicable” data values at collection time
          § Different considerations between the time when the data was collected and when it is analyzed
          § Human/hardware/software problems
      • Noisy data (incorrect values) may come from
          § Faulty data collection instruments
          § Human or computer error at data entry
          § Errors in data transmission
      • Inconsistent data may come from
          § Different data sources
          § Functional dependency violations (e.g., modifying some linked data)
      • Duplicate records also need data cleaning
  • 4. Why Is Data Preprocessing Important?
      • No quality in data, no quality in the mining results! (trash in, trash out)
      • Quality decisions need quality data
      • E.g., duplicate or missing data may cause incorrect or even misleading statistics
      • Data warehouses need consistent integration of quality data
      • Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
  • 5. Major Tasks in Data Preprocessing
      • Data cleaning
          § Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
      • Data integration
          § Integration of multiple databases, data cubes, or files
      • Data reduction
          § Dimensionality reduction
          § Numerosity reduction
          § Data compression
      • Data transformation and data discretization
          § Normalization
          § Concept hierarchy generation
  • 6. Outliers
      • Outliers are data objects whose characteristics are considerably different from those of most other data objects in the data set
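The slide does not name a detection method. As one possible illustration, the sketch below flags outliers with the interquartile-range (IQR) rule, a common heuristic swapped in here as an assumption; the function name, the factor `k=1.5`, and the salary values are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    The IQR rule is an assumption for illustration; the slides only
    define outliers informally.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Toy salaries, including the impossible value -10 used on slide 2.
salaries = [1200, 1350, 1280, 1400, 1310, 1290, -10, 9800]
flagged = iqr_outliers(salaries)
print(flagged)
```

Whether a flagged value is an error (such as a negative salary) or a genuine but rare observation still has to be decided from domain knowledge.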
  • 7. Missing Values
      • Reasons for missing values
          § Information is not collected (e.g., people decline to give their age and weight)
          § Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
      • Handling missing values
          § Eliminate data objects
          § Estimate missing values
          § Ignore the missing value during analysis
          § Replace with all possible values (weighted by their probabilities)
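The first three handling strategies can be sketched in a few lines. This is a minimal illustration, assuming records stored as dictionaries with `None` marking a missing value and using the mean as the estimate; the field names and numbers are made up.

```python
from statistics import mean

# Toy records; None marks a missing value (an assumption of this sketch).
records = [
    {"age": 25, "income": 30000},
    {"age": None, "income": 42000},
    {"age": 40, "income": None},
    {"age": 31, "income": 38000},
]

def drop_incomplete(rows):
    """Eliminate data objects that have at least one missing attribute."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows, attr):
    """Estimate missing values of `attr` with the mean of the observed ones."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

def mean_ignoring_missing(rows, attr):
    """Ignore missing values during the analysis (here, computing a mean)."""
    return mean(r[attr] for r in rows if r[attr] is not None)

complete = drop_incomplete(records)
imputed = impute_mean(records, "age")
avg_income = mean_ignoring_missing(records, "income")
```

Dropping objects is the simplest option but discards the observed attributes of every incomplete record, which is why estimation is often preferred when many records have gaps.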
  • 8. Duplicate Data
      • A data set may include data objects that are duplicates, or almost duplicates, of one another
      • Major issue when merging data from heterogeneous sources
      • Example: the same person with multiple email addresses
      • Data cleaning: the process of dealing with duplicate data issues
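As a rough sketch of the email example, the code below merges records that refer to the same person. It assumes that a normalized name plus a birth date identifies a person, which is a simplification; the field names and records are hypothetical.

```python
def normalize(name):
    """Lower-case the name and collapse whitespace so variants compare equal."""
    return " ".join(name.lower().split())

def merge_duplicates(rows):
    """Merge rows sharing the same (normalized name, birth date) key."""
    merged = {}
    for row in rows:
        key = (normalize(row["name"]), row["birthdate"])
        entry = merged.setdefault(
            key,
            {"name": normalize(row["name"]), "birthdate": row["birthdate"], "emails": []},
        )
        if row["email"] not in entry["emails"]:
            entry["emails"].append(row["email"])
    return list(merged.values())

# Toy records: the first two describe the same person.
people = [
    {"name": "Mario Rossi", "birthdate": "1990-01-15", "email": "mario@example.org"},
    {"name": "  mario  ROSSI", "birthdate": "1990-01-15", "email": "m.rossi@example.com"},
    {"name": "Anna Bianchi", "birthdate": "1985-07-02", "email": "anna@example.org"},
]
unique_people = merge_duplicates(people)
```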
  • 9. Data Preprocessing
      • Aggregation
      • Sampling
      • Dimensionality reduction
      • Feature subset selection
      • Feature creation
      • Discretization and binarization
  • 10. Aggregation
      • Combining two or more attributes (or objects) into a single attribute (or object)
      • Data reduction
          § Reduce the number of attributes or objects
      • Change of scale
          § Cities aggregated into regions
          § States, countries, etc.
      • More “stable” data
          § Aggregated data tends to have less variability
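The change of scale from cities to regions can be sketched as a group-and-sum. The tuples below are made-up sales figures used only for illustration.

```python
from collections import defaultdict

# (city, region, sales) rows; the figures are invented for this sketch.
city_sales = [
    ("Milano", "Lombardia", 120),
    ("Bergamo", "Lombardia", 45),
    ("Torino", "Piemonte", 80),
    ("Novara", "Piemonte", 20),
]

def aggregate_by_region(rows):
    """Combine city-level objects into one object per region."""
    totals = defaultdict(int)
    for _city, region, amount in rows:
        totals[region] += amount
    return dict(totals)

region_sales = aggregate_by_region(city_sales)
print(region_sales)
```

Four city-level objects become two region-level objects, which is the data reduction effect named on the slide.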
  • 11. Sampling
      • Sampling is the main technique employed for data selection
      • It is often used for both the preliminary investigation of the data and the final data analysis
      • Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming
      • Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming
      • Using a sample will work almost as well as using the entire data set, if the sample is representative
      • A sample is representative if it has approximately the same property (of interest) as the original set of data
  • 12. Types of Sampling
      • Simple random sampling
          § There is an equal probability of selecting any particular item
      • Sampling without replacement
          § As each item is selected, it is removed from the population
      • Sampling with replacement
          § Objects are not removed from the population as they are selected for the sample
          § In sampling with replacement, the same object can be picked more than once
      • Stratified sampling
          § Split the data into several partitions
          § Then draw random samples from each partition
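The sampling schemes above map directly onto Python's standard `random` module. This is a minimal sketch; the helper names, the proportional allocation in the stratified case, and the toy population are assumptions.

```python
import random
from collections import defaultdict

def sample_without_replacement(items, n, rng):
    """Each selected item is removed from the population (no repeats)."""
    return rng.sample(items, n)

def sample_with_replacement(items, n, rng):
    """Items stay in the population, so the same one can be drawn again."""
    return rng.choices(items, k=n)

def stratified_sample(items, key, fraction, rng):
    """Partition by `key`, then draw the same fraction from each partition."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]

without = sample_without_replacement(population, 20, rng)
with_repl = sample_with_replacement(population, 20, rng)
stratified = stratified_sample(population, key=lambda item: item[0], fraction=0.1, rng=rng)
```

With a 90/10 split between groups, a simple random sample of 10 items can miss group "b" entirely, whereas the stratified sample always contains it.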
  • 13. [Figure: the same data set shown with 8000 points, 2000 points, and 500 points]
  • 14. Sample Size
      • What sample size is necessary to get at least one object from each of 10 groups?
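One way to approach the question is to compute the probability that a sample of size n covers all groups. The sketch below assumes equal-sized groups and sampling with replacement (or a population large enough that removal hardly matters), and applies inclusion-exclusion; neither assumption is stated on the slide.

```python
from math import comb

def prob_all_groups(n, groups=10):
    """P(a sample of size n contains at least one object from every group).

    Inclusion-exclusion over the number j of groups that are missed,
    assuming each draw falls in any group with probability 1/groups.
    """
    return sum(
        (-1) ** j * comb(groups, j) * (1 - j / groups) ** n
        for j in range(groups + 1)
    )

def min_sample_size(target, groups=10):
    """Smallest n whose coverage probability reaches `target`."""
    n = groups  # fewer draws than groups can never cover them all
    while prob_all_groups(n, groups) < target:
        n += 1
    return n

n_needed = min_sample_size(0.9)
print(n_needed, prob_all_groups(n_needed))
```

With exactly 10 draws the probability is 10!/10^10, about 0.00036, so a sample several times larger than the number of groups is needed before coverage becomes likely.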
  • 15. Dimensionality Reduction
      • Purpose
          § Avoid the curse of dimensionality
          § Reduce the amount of time and memory required by data mining algorithms
          § Allow data to be more easily visualized
          § May help to eliminate irrelevant features or reduce noise
      • Techniques
          § Principal Component Analysis
          § Singular Value Decomposition
          § Others: supervised and non-linear techniques
  • 16. Dimensionality Reduction: Principal Component Analysis (PCA)
      • Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that best represent the data
      • Works for numeric data only
      • Used when the number of dimensions is large
  • 17. Dimensionality Reduction: Principal Component Analysis (PCA)
      • Normalize the input data
      • Compute k orthonormal (unit) vectors, i.e., the principal components
      • Each input data vector is a linear combination of the k principal component vectors
      • The principal components are sorted in order of decreasing “significance” or strength
      • Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using only the strongest principal components, it is possible to reconstruct a good approximation of the original data)
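The steps above can be sketched with NumPy (an assumed dependency). Here "normalize" is taken to mean centering each attribute on its mean; scaling to unit variance is another common choice. The components are obtained from the eigenvectors of the covariance matrix, and the toy data is invented.

```python
import numpy as np

def pca(X, k):
    """Project X onto its k strongest principal components.

    Returns (components, variances, projected, reconstructed).
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                              # normalization: centering only
    cov = np.cov(Xc, rowvar=False)             # attribute covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
    components = eigvecs[:, order[:k]]         # orthonormal columns
    projected = Xc @ components                # coordinates in the new space
    reconstructed = projected @ components.T + mean
    return components, eigvals[order], projected, reconstructed

# Toy data lying exactly on the line y = 2x: one component is enough.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
components, variances, projected, reconstructed = pca(X, k=1)
```

Because the second component carries zero variance in this toy case, dropping it loses nothing and the reconstruction matches the original data; on real data the reconstruction is only approximate.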
  • 18. Dimensionality Reduction: Principal Component Analysis (PCA)
      • Goal is to find a projection that captures the largest amount of variation in data
      [Figure: data points plotted on the original axes, with the first and second principal components overlaid]
  • 21. Singular Value Decomposition
      • Theorem (Press 1992). It is always possible to decompose a matrix A into A = U Λ V^T, where U, Λ, V are unique
      • U, V are column orthonormal, i.e., their columns are unit vectors, orthogonal to each other, so that U^T U = I and V^T V = I
      • Λi,i are the singular values; they are positive and sorted in decreasing order
  • 22. Singular Value Decomposition
      • Data are viewed as a matrix A with n examples described by m numerical attributes: A[n x m] = U[n x r] Λ[r x r] (V[m x r])^T
      • A stores the original data
      • U represents the n examples using r new concepts/attributes
      • Λ represents the strength of each “concept” (r is the rank of A)
      • V relates the m original attributes (terms) to the r concepts
      • Λi,i are called the singular values of A; if A is singular, some of the Λi,i will be 0; in general, rank(A) = number of nonzero Λi,i
      • SVD is mostly unique (up to permutation of the singular values, or if some Λi,i are equal)
  • 23. Singular Value Decomposition for Dimensionality Reduction
      • Suppose you want to find the best rank-k approximation to A
      • Set all but the largest k singular values to zero
      • Can form a compact representation by eliminating the columns of U and V corresponding to the zeroed Λi,i
      Example, A = U x Λ x V^T:
      A (7 x 5) =
          1 1 1 0 0
          2 2 2 0 0
          1 1 1 0 0
          5 5 5 0 0
          0 0 0 2 2
          0 0 0 3 3
          0 0 0 1 1
      U (7 x 2) =
          0.18 0
          0.36 0
          0.18 0
          0.90 0
          0    0.53
          0    0.80
          0    0.27
      Λ (2 x 2) =
          9.64 0
          0    5.29
      V^T (2 x 5) =
          0.58 0.58 0.58 0    0
          0    0    0    0.71 0.71
  • 24. Singular Value Decomposition for Dimensionality Reduction
      Keeping only the largest singular value, A ~ U1 x 9.64 x V1^T:
      A (7 x 5) =
          1 1 1 0 0
          2 2 2 0 0
          1 1 1 0 0
          5 5 5 0 0
          0 0 0 2 2
          0 0 0 3 3
          0 0 0 1 1
      U1 (7 x 1) =
          0.18
          0.36
          0.18
          0.90
          0
          0
          0
      V1^T (1 x 5) =
          0.58 0.58 0.58 0 0
  • 25. Singular Value Decomposition for Dimensionality Reduction
      The resulting rank-1 approximation of A:
      A (7 x 5) =
          1 1 1 0 0
          2 2 2 0 0
          1 1 1 0 0
          5 5 5 0 0
          0 0 0 2 2
          0 0 0 3 3
          0 0 0 1 1
      ~
          1 1 1 0 0
          2 2 2 0 0
          1 1 1 0 0
          5 5 5 0 0
          0 0 0 0 0
          0 0 0 0 0
          0 0 0 0 0
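The worked example on slides 23 to 25 can be checked numerically. This sketch assumes NumPy is available and uses `numpy.linalg.svd`; the helper name `rank_k_approximation` is made up for illustration.

```python
import numpy as np

# The 7 x 5 data matrix from the slides.
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

def rank_k_approximation(A, k):
    """Keep the k largest singular values and zero out the rest."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted descending
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return approx, s

A1, singular_values = rank_k_approximation(A, k=1)
A2, _ = rank_k_approximation(A, k=2)
print(np.round(singular_values, 2))
print(np.round(A1, 2))
```

The two nonzero singular values match the 9.64 and 5.29 shown on slide 23, the rank-1 approximation keeps only the first block of rows as on slide 25, and the rank-2 approximation reproduces A because A has rank 2. The signs of individual columns of U and V returned by the library may differ from the slides, which does not affect the product.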
  • 26. Feature Subset Selection
      • Another way to reduce the dimensionality of data
      • Redundant features
          § Duplicate much or all of the information contained in one or more other attributes
          § Example: purchase price of a product and the amount of sales tax paid
      • Irrelevant features
          § Contain no information that is useful for the data mining task at hand
          § Example: a student's ID is often irrelevant to the task of predicting the student's GPA
  • 27. Feature Subset Selection
      • Brute-force approach
          § Try all possible feature subsets as input to the data mining algorithm
      • Embedded approaches
          § Feature selection occurs naturally as part of the data mining algorithm
      • Filter approaches
          § Features are selected using a procedure that is independent of the specific data mining algorithm
          § E.g., attributes are selected based on correlation measures
      • Wrapper approaches
          § Use a data mining algorithm as a black box to find the best subset of attributes
          § E.g., apply a genetic algorithm and a decision tree algorithm to find the best set of features for a decision tree
  • 28. Example of using the information gain to rank the importance of attributes.
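The original slide showed the ranking as a screenshot that is not recoverable here. As a stand-in, the sketch below computes information gain for nominal attributes and ranks them; the toy rows and attribute names are hypothetical and are not the dataset used in the lecture.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Entropy of the class minus its expected entropy after splitting on attr."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def rank_attributes(rows, attrs, target):
    """Order attributes from most to least informative about the class."""
    return sorted(attrs, key=lambda a: information_gain(rows, a, target), reverse=True)

# Toy data: "outlook" determines the class, "windy" says nothing about it.
rows = [
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "rainy", "windy": True, "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
]
ranking = rank_attributes(rows, ["windy", "outlook"], "play")
```

This is a filter approach in the sense of slide 27: the ranking is computed without running the mining algorithm that will later use the selected attributes.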
  • 29. Decision trees applied to all the attributes.
  • 30. Decision trees applied to the breast dataset without the four lowest-ranked attributes.
  • 31. Wrapper approach using exhaustive search and decision trees.
  • 32. Feature Creation
      • Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
      • E.g., given the birthday, create the attribute age
      • Three general methodologies:
          § Feature extraction: domain-specific
          § Mapping data to a new space
          § Feature construction: combining features
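The birthday-to-age example can be written as a small derived-attribute function. This is a minimal sketch; the function name and the explicit reference date (passed in rather than read from the clock, to keep results repeatable) are choices made for illustration.

```python
from datetime import date

def age_on(birthday, today):
    """Age in completed years on the reference date `today`."""
    # Subtract one year if this year's birthday has not happened yet.
    before_birthday = (today.month, today.day) < (birthday.month, birthday.day)
    return today.year - birthday.year - before_birthday

birthday = date(1997, 7, 3)
print(age_on(birthday, date(2016, 3, 1)))
print(age_on(birthday, date(2016, 7, 3)))
```

A derived age also makes inconsistencies such as the Age/Birthday mismatch on slide 2 easy to detect, since the stored age can be compared with the computed one.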
  • 33. Discretization
      • Three types of attributes
          § Nominal: values from an unordered set, e.g., color
          § Ordinal: values from an ordered set, e.g., military or academic rank
          § Continuous: numeric values, e.g., integer or real numbers
      • Discretization
          § Divide the range of a continuous attribute into intervals
          § Some classification algorithms only accept categorical attributes
          § Reduce data size by discretization
          § Prepare for further analysis
  • 34. Discretization Approaches
      • Supervised
          § Attributes are discretized using the class information
          § Generate intervals that try to minimize the loss of information about the class
      • Unsupervised
          § Attributes are discretized solely based on their values
  • 35. Supervised Discretization
      • Entropy-based approach
      [Figure: two panels, "3 categories for both x and y" and "5 categories for both x and y"]
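The slide gives only the name of the approach. As a rough sketch of the idea, the code below finds the single cut point on one attribute that minimizes the weighted class entropy of the two resulting intervals; a full discretizer would apply this recursively with a stopping criterion, which is omitted here. The function names and data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (cut, weighted_entropy) for the best binary split of the attribute."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            # Place the cut halfway between the two neighbouring values.
            best_cut, best_score = (pairs[i - 1][0] + pairs[i][0]) / 2, score
    return best_cut, best_score

cut, score = best_split([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
```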
  • 36. Discretization Without Using Class Labels
      [Figure: four panels, "Data", "Equal interval width", "Equal frequency", and "K-means"]
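Two of the unsupervised schemes named in the figure, equal interval width and equal frequency, can be sketched as follows. The function names and values are made up, and the equal-width version assumes the attribute takes at least two distinct values.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    # The maximum falls on the upper boundary, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign values to k bins holding (nearly) the same number of objects."""
    order = sorted(range(len(values)), key=values.__getitem__)
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * k // len(values)
    return bins

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
width_bins = equal_width_bins(values, 3)
freq_bins = equal_frequency_bins(values, 3)
print(width_bins)
print(freq_bins)
```

The single large value 100 stretches the equal-width intervals so that all other values land in the first bin, while equal-frequency binning still spreads the objects evenly; this sensitivity to outliers is one reason to compare both.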
  • 37. Segmentation by Natural Partitioning
      • A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals
      • If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 intervals (3 equal-width intervals for 3, 6, and 9; 3 intervals in the grouping 2-3-2 for 7)
      • If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
      • If it covers 5 or 10 distinct values at the most significant digit, partition the range into 5 intervals
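One level of the rule can be sketched as below. The sketch assumes the caller has already rounded `low` and `high` to the most significant digit and passes that digit's magnitude as `unit`; the function name and the choice to raise an error for counts the slide does not cover are assumptions.

```python
def natural_partition(low, high, unit):
    """Split [low, high] into 3, 4, or 5 intervals following the 3-4-5 rule.

    `unit` is the magnitude of the most significant digit (e.g. 1000), so
    (high - low) / unit is the number of distinct values at that digit.
    """
    distinct = round((high - low) / unit)
    if distinct == 7:
        # Grouping 2-3-2: widths of 2, 3, and 2 units.
        cuts = [low, low + 2 * unit, low + 5 * unit, high]
    else:
        if distinct in (3, 6, 9):
            parts = 3
        elif distinct in (2, 4, 8):
            parts = 4
        elif distinct in (5, 10):
            parts = 5
        else:
            raise ValueError(f"3-4-5 rule does not cover {distinct} distinct values")
        width = (high - low) / parts
        cuts = [low + i * width for i in range(parts + 1)]
    return list(zip(cuts[:-1], cuts[1:]))

print(natural_partition(-1000, 2000, 1000))
print(natural_partition(0, 700, 100))
```

Applying the same function again to each resulting interval, with a smaller `unit`, gives the next level of a concept hierarchy.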
  • 38. Summary
      • Data exploration and preparation, or preprocessing, is a big issue for both data warehousing and data mining
      • Descriptive data summarization is needed for quality data preprocessing
      • Data preparation includes
          § Data cleaning and data integration
          § Data reduction and feature selection
          § Discretization
      • A lot of methods have been developed, but data preprocessing is still an active area of research