I/O Data Engineering
“Garbage in, garbage out”
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 1
Preamble
• This work is a fusion of ideas and work (slides, text, images etc.) I found on the internet
or had and wrote on my own regarding the area of input/output data engineering in data
mining and machine learning.
• Attribution:
• The slides are based on the PowerPoint accompanying slides of the Data Mining, Practical
Machine Learning Tools, Witten et al., 4th ed., 2017 and in particular Chapter 8, available at:
http://www.cs.waikato.ac.nz/ml/weka/book.html
• Slides from the Machine Learning MOOC by Prof. Andrew Ng: http://ml-class.org (PCA parts)
• Slides from Learning from Data MOOC by Prof. Yaser S. Abu-Mustafa its support site:
http://work.caltech.edu/telecourse.html (The digits dataset and the non-linear transformation)
• Slides from the Pattern Recognition class by Prof. Andreas L. Symeonidis, ECE department,
Aristotle University of Thessaloniki
• A tutorial on Principal Component Analysis by Lindsay I. Smith, February 2002
(http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf)
• Introduction to Data Mining, Tan et al., 2006: http://www-users.cs.umn.edu/~kumar/dmbook/
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 2
Successful data mining: Just apply a learner!
No?
• Select the learning algorithm
• Scheme/parameter selection/tuning
• Treat selection process as part of the learning process to avoid optimistic
performance estimates
• Estimate the expected true performance of a learning scheme
• Split
• Cross-validation
• Data Engineering
• Engineering the input data into a form suitable for the learning scheme chosen
• Data engineering to make learning possible or easier
• Engineering the output to make it more effective
• Converting multi-class problems into two-class ones
• Re-calibrating probability estimates
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 3
Data Transformations – Outline
Topics covered:
1. Attribute Selection
2. Discretizing Numeric Attributes
3. Data Projection
4. Data Cleansing
5. Transforming Multiple Classes to Binary Ones
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 4
It is a jungle out there!
Data Transformations
Feature Selection
Feature Engineering
Data Engineering Dimensionality Reduction
Principal Components Analysis
Pre- and post-processing
Data Cleansing
ETL
Feature Learning
Wrapper methods
Filter methods
Independent Component Analysis
Outlier Detection
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 5
Attribute/Feature Selection
Removing attributes that are not useful to the task at hand
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 6
Motivation
• Experiments have shown that adding useless attributes causes the performance of learning
schemes (decision tree and rule learners, linear regression, instance-based learners) to deteriorate
• Effects of adding a random binary attribute:
• Divide-and-conquer tree learners and separate-and-conquer rule learners
• At depths where only a small amount of data is available for picking a split, the random attribute will look good by
chance
• C4.5: performance deteriorates by 5-10% when a single random attribute is added
• Instance-based learners
• Susceptible as well, because they work in local neighborhoods
• The number of training instances needed to produce a predetermined level of performance for instance-based learning
increases exponentially with the number of irrelevant attributes present
• Naive Bayes
• Not susceptible
• It assumes by design that all attributes are independent of one another, an assumption that is just right for random “distracter”
attributes
• On the other hand: pays a heavy price in other ways because its operation is damaged by adding redundant attributes
• Independence “thrown out of the window”
• Conclusion: Relevant attributes can also be harmful if they mislead the learning algorithm
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 7
Advantages of Attribute Selection
• Improves performance of learning algorithms
• Speeds them up
• Although this may be outweighed by the computation involved in the attribute selection itself
• Yields a more compact, more easily interpretable representation
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 8
Attribute Selection Types
• Manually
• The best way
• Requires deep understanding of the learning problem and what the attributes
actually mean
• Filter-method – Scheme-Independent Attribute Selection
• Make an independent assessment based on general characteristics of the
data
• Wrapper method – Scheme-Dependent Attribute Selection
• Evaluate the subset using the machine learning algorithm that will ultimately
be employed for learning
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 9
Scheme-Independent Attribute Selection
• aka Filter approach to attribute selection:
assess attributes based on general
characteristics of the data
• Attributes are selected in a manner that is
independent of the target machine
learning scheme
• One method: find smallest subset of
attributes that separates data
• Another method: use a fast learning
scheme that is different from the target
learning scheme to find relevant
attributes
• E.g., use attributes selected by C4.5, or
coefficients of linear model, possibly
applied recursively (recursive feature
elimination)
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 10
By Lucien Mousin - Own work, GFDL, https://commons.wikimedia.org/w/index.php?curid=37776286
Recursive Feature Elimination
[Diagram: iteration 1 feeds all features F1, F2, F3 to the learning algorithm, which ranks them F2, F1, F3;
the lowest-ranked feature (F3) is eliminated. Iteration 2 feeds F1, F2 and obtains the ranking F1, F2.
Final ranking: F1 F2 F3.]
The learning algorithm should produce a ranking, e.g. a linear SVM,
where ranks are based on the sizes of the coefficients (a code sketch follows below)
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 11
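A minimal sketch of the recursive elimination loop above, in Python with numpy and scikit-learn (assumed libraries, not part of the slides); the linear SVM and the ranking by absolute coefficient size follow the note on the slide, everything else is illustrative.

import numpy as np
from sklearn.svm import LinearSVC

def recursive_feature_elimination(X, y, n_keep=1):
    """Return feature indices ranked best-first by repeatedly dropping the weakest one."""
    remaining = list(range(X.shape[1]))
    eliminated = []                                   # weakest features, in order of removal
    while len(remaining) > n_keep:
        model = LinearSVC(dual=False).fit(X[:, remaining], y)
        importance = np.abs(model.coef_).sum(axis=0)  # one weight per remaining feature
        weakest = remaining[int(np.argmin(importance))]
        eliminated.append(weakest)
        remaining.remove(weakest)
    return remaining + eliminated[::-1]               # survivors first, then last-removed first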
Correlation-based Feature Selection (CFS)
• Correlation between attributes measured by symmetric uncertainty:
$U(A,B) = \frac{2\,[H(A) + H(B) - H(A,B)]}{H(A) + H(B)} \in [0,1]$
where H is the entropy function:
$H(X) = -\sum_{x \in S(X)} p(x)\log p(x)$, $H(X,Y) = -\sum_{x \in S(X)}\sum_{y \in S(Y)} p(x,y)\log p(x,y)$
• Goodness of a subset of attributes measured by
$\sum_j U(A_j, C) \Big/ \sqrt{\sum_i \sum_j U(A_i, A_j)}$
where C is the class attribute, breaking ties in favour of smaller subsets.
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 12
The Weather Data
Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 13
Symmetric Uncertainty Example Calculation
• H(Outlook) = - 5/14log(5/14) – 4/14log(4/14) – 5/14 log(5/14) = 1.577
• H(Temperature) = - 4/14log(4/14) – 6/14log(6/14) – 4/14 log(4/14) = 1.556
• H(Outlook, Temperature) =
- p(s,h)logp(s,h) - p(s,m)logp(s,m) - p(s,c)logp(s,c) - p(o,h)logp(o,h) - p(o,m)logp(o,m) -
p(o,c)logp(o,c) - p(r,h)logp(r,h) - p(r,m)logp(r,m) - p(r,c)logp(r,c) =
- 2/14log(2/14) – 2/14log(2/14) – 1/14 log(1/14) - 2/14log(2/14) – 1/14log(1/14) – 1/14 log(1/14)
- 0/14log(0/14) – 3/14log(3/14) – 2/14 log(2/14) =
2.896
• U(Outlook, Temperature) = 2*(1.577 + 1.556 – 2.896)/(1.577 + 1.556) ≈ 0.1513
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 14
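The worked calculation above can be reproduced with a few lines of Python (an illustrative sketch, not part of the slides); the two attribute columns are typed in from the weather table.

import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

outlook = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
temperature = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
               'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild']

h_o = entropy(outlook)                           # ≈ 1.577
h_t = entropy(temperature)                       # ≈ 1.556
h_ot = entropy(list(zip(outlook, temperature)))  # ≈ 2.896
u = 2 * (h_o + h_t - h_ot) / (h_o + h_t)         # ≈ 0.1513
print(round(h_o, 3), round(h_t, 3), round(h_ot, 3), round(u, 4))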
Attribute subsets for weather data
The number of possible attribute
subsets increases exponentially with
the number of attributes, making
exhaustive search impractical on all
but the simplest problems.
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 15
Scheme-specific selection
• Wrapper approach to attribute selection: attributes
are selected with target scheme in the loop
• Implement “wrapper” around learning scheme
• Evaluation criterion: cross-validation performance
• Time consuming in general
• greedy approach with k attributes: evaluation time is
multiplied by a factor of k² in the worst case
• with a prior ranking of attributes: complexity is linear in k
• Can use significance test (paired t-test) to stop
cross-validation for a subset early if it is unlikely
to “win” (race search)
• Can be used with forward, backward selection, prior
ranking, or special-purpose schemata search
• Efficient for decision tables and Naïve Bayes
(Selective Naïve Bayes)
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 16
By Lastdreamer7591 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=37208688
Selective Naïve Bayes
• Use the forward selection algorithm
• Better able to detect a redundant attribute than backward elimination
• Use as metric the quality of an attribute to be simply the performance
on the training set
• We know that: Training set performance not a reliable indicator of
test set performance
• But Naïve Bayes is less likely to overfit
• Plus, as discussed, robust to random variables
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 17
Complexity example
• If I do 10-fold CV I must train the algorithm 10 times = 10
• I should do also the 10-fold CV 10 times to obtain a more reliable
estimate = 10*10
• If I have 10 features the total search space is 2^10 = 1024 different
subsets = 10*10*1024 = 102,400
• Then I should also tune the parameters of the learning algorithm…
• Or should I do that before…
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 18
Searching the attribute space
• Number of attribute subsets is
exponential in the number of attributes
• Common greedy approaches:
• forward selection
• backward elimination
• More sophisticated strategies:
• Bidirectional search
• Best-first search:
• can find optimum solution,
• does not just terminate when the
performance starts to drop; it keeps a list of all
attribute subsets evaluated so far, sorted in
order of the performance measure, so that it
can revisit an earlier configuration
• Beam search: approximation to best-first
search, keeps a truncated list
• Genetic algorithms
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 19
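To make the wrapper idea concrete, here is a hedged sketch of greedy forward selection with cross-validated accuracy as the evaluation criterion; Python with scikit-learn is assumed, and the Naïve Bayes estimator is only an example of a target scheme, not a prescription from the slides.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def forward_selection(X, y, estimator=None, cv=10):
    estimator = estimator or GaussianNB()
    selected, best_score = [], -np.inf
    candidates = set(range(X.shape[1]))
    while candidates:
        # score every candidate extension of the current subset by cross-validation
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean()
                  for f in candidates}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:   # stop as soon as adding a feature does not help
            break
        best_score = scores[f_best]
        selected.append(f_best)
        candidates.remove(f_best)
    return selected, best_score

Backward elimination is the mirror image: start from the full set and greedily drop the feature whose removal hurts the cross-validated score least.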
Discretization
Transforming numeric attributes into discrete ones
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 20
Motivation
• Essential if the task involves numeric attributes but the chosen
learning scheme can only handle categorical ones
• Schemes that can handle numeric attributes often produce better
results, or work faster, if the attributes are pre-discretized.
• The converse situation, in which categorical attributes must be
represented numerically, also occurs (although less often)
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 21
Attribute discretization
• Discretization can be useful even if a learning algorithm can be run on
numeric attributes directly
• Avoids normality assumption in Naïve Bayes and clustering
• Examples of discretization we have already encountered:
• Decision trees perform local discretization
• Global discretization can be advantageous because it is based on more data
• Apply learner to
• k-valued discretized attribute or to
• k – 1 binary attributes that code the cut points
• The latter approach often works better when learning decision trees or rule
sets
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 22
Discretization: unsupervised
• Unsupervised discretization: determine
intervals without knowing class labels
• When clustering, the only possible way!
• Two well-known strategies:
• Equal-interval binning
• Equal-frequency binning
(also called histogram equalization)
• Unsupervised discretization is normally
inferior to supervised schemes when
applied in classification tasks
• But equal-frequency binning works well
with Naïve Bayes if the number of intervals
is set to the square root of the size of
dataset (proportional k-interval
discretization)
[Figure: the same data discretized by equal interval width, equal frequency, and k-means binning.]
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 23
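A small sketch of the two unsupervised strategies in plain numpy (an assumption of this note, not a Weka recipe), applied to the temperature values used later in this section.

import numpy as np

def equal_width_bins(x, k):
    edges = np.linspace(x.min(), x.max(), k + 1)        # k intervals of equal width
    return np.digitize(x, edges[1:-1])                  # bin index 0..k-1 per value

def equal_frequency_bins(x, k):
    edges = np.quantile(x, np.linspace(0, 1, k + 1))    # cut points at the 1/k quantiles
    return np.digitize(x, edges[1:-1])

temperature = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])
print(equal_width_bins(temperature, 3))       # cut points at 71 and 78
print(equal_frequency_bins(temperature, 3))   # quantile cut points; ties can unbalance the bins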
Discretization: supervised
• Classic approach to supervised discretization is entropy-based
• This method builds a decision tree with pre-pruning on the attribute being
discretized
• Uses entropy as splitting criterion
• Uses the minimum description length principle as the stopping criterion for pre-pruning
• Works well: still the state of the art
• To apply the minimum description length principle, the “theory” is
• the splitting point (can be coded in log2[N – 1] bits)
• plus class distribution in each subset (a more involved expression)
• Description length is the number of bits needed for coding both the splitting
point and the class distributions
• Compare description lengths before/after adding split
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 24
Example: temperature attribute
Temperature: 64 65 68 69 70 71 72 72 75 75 80 81 83 85
Play: Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 25
Final discretization
It can be shown theoretically that a cut point that minimizes the information value
will never occur between two instances of the same class
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 26
Formula for MDL stopping criterion
• Can be formulated in terms of the information gain
• Assume we have N instances
• Original set: k classes, entropy E
• First subset: k1 classes, entropy E1
• Second subset: k2 classes, entropy E2
• If the information gain is greater than the expression on the right, we continue
splitting
• Results in no discretization intervals for the temperature attribute in the
weather data
• The attribute thus fails to play any role in the final decision structure
$\mathrm{gain} > \frac{\log_2(N-1)}{N} + \frac{\log_2(3^k - 2) - kE + k_1 E_1 + k_2 E_2}{N}$
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 27
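The criterion can be turned into a few lines of code; the sketch below (Python assumed, function names illustrative) accepts a candidate split of the class labels into a left and a right subset only if the information gain beats the MDL threshold above.

import math
from collections import Counter

def class_entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdl_accepts_split(labels, left, right):
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    e, e1, e2 = class_entropy(labels), class_entropy(left), class_entropy(right)
    gain = e - (len(left) / n) * e1 - (len(right) / n) * e2
    threshold = (math.log2(n - 1) / n
                 + (math.log2(3 ** k - 2) - k * e + k1 * e1 + k2 * e2) / n)
    return gain > threshold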
Supervised discretization: other methods
• Can replace top-down procedure by bottom-up method
• This bottom-up method has been applied in conjunction with the chi-
squared test
• Continue to merge intervals until they become significantly different
• Can use dynamic programming to find optimum k-way split for given
additive criterion
• Requires time quadratic in the number of instances
• But can be done in linear time if error rate is used instead of entropy
• Error rate: count the number of errors that a discretization makes when predicting
each training instance’s class, assuming that each interval receives the majority class.
• However, using error rate is generally not a good idea when discretizing an attribute
as we will see
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 28
Error-based vs. entropy-based
• Question:
could the best discretization ever have two adjacent intervals with the
same class?
• Wrong answer: No. For if so,
• Collapse the two
• Free up an interval
• Use it somewhere else
• (This is what error-based discretization will do)
• Right answer: Surprisingly, yes.
• (and entropy-based discretization can do it)
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 29
Error-based vs. entropy-based
A 2-class, 2-attribute problem:
Class 1: a1 < 0.3, or a1 < 0.7 and a2 < 0.5
Class 2: otherwise
Best discretization:
a2: no problem
a1: the middle interval will get whatever label happens to occur most often there
Entropy-based discretization can detect the change of class distribution (from 100% to 50%)
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 30
Data Projections and
Dimensionality Reduction
Projecting data into a more suitable space
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 31
Motivation
• Curse of Dimensionality
• Visualization
• Add new, synthetic attributes whose purpose is to present existing
information in a form that is suitable for the machine learning scheme
to pick up on.
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 32
Curse of Dimensionality
• When dimensions increase, data
become increasingly sparse
• Density and distance between
points, which are important
criteria for clustering and outlier
detection, lose their meaning
• Illustration: create 500 random points and
calculate the max and min distance between
any pair of points as the dimensionality grows (see the sketch below)
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 33
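The experiment mentioned on the slide can be sketched as follows (Python with numpy and scipy assumed): as the dimensionality grows, the largest and smallest pairwise distances among 500 random points become almost indistinguishable.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in [2, 4, 8, 16, 32, 64, 128]:
    points = rng.random((500, d))                 # 500 uniform random points in [0,1]^d
    dists = pdist(points)                         # all pairwise Euclidean distances
    relative_gap = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  (max-min)/min = {relative_gap:.2f}")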
Projections
• Definition: a projection is a kind of function or mapping that transforms data in
some way
• Simple transformations can often make a large difference in performance
• Example transformations (not necessarily for performance improvement):
• Difference of two date attributes → age
• Ratio of two numeric (ratio-scale) attributes
• Useful for algorithms doing axis parallel splits
• Concatenating the values of nominal attributes
• Encoding cluster membership
• Adding noise to data
• Removing data randomly or selectively
• Obfuscating the data
• Anonymising
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 34
Digits dataset
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 35
From: Learning from data MOOC, http://work.caltech.edu/telecourse.html
Input representation
• ‘raw’ input: x = (x0, x1, x2, …, x256)
• Linear model: (w0, w1, w2, …, w256)
• Features: extract useful information, e.g.,
• intensity and symmetry: x = (x0, x1, x2)
• Linear model: (w0, w1, w2)
From: Learning from data MOOC, http://work.caltech.edu/telecourse.html
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 36
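An illustrative sketch of the two extracted features, average intensity and a (negative) asymmetry score for a 16x16 grey-level digit image; the exact definitions used in the course may differ, so treat this as an assumption.

import numpy as np

def intensity(img):
    return img.mean()                              # average grey level of the image

def symmetry(img):
    # compare the image with its left-right mirror; 0 means perfectly symmetric
    return -np.abs(img - np.fliplr(img)).mean()

img = np.random.default_rng(1).random((16, 16))    # stand-in for a real digit image
x = np.array([1.0, intensity(img), symmetry(img)]) # x0 = 1 plays the role of the bias term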
Illustration of features
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 37
From: Learning from data MOOC, http://work.caltech.edu/telecourse.html
Another one
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 38
From: Learning from data MOOC, http://work.caltech.edu/telecourse.html
Methods
• Unsupervised
• Principal Components Analysis (PCA)
• Independent Component Analysis (ICA)
• Random Projections
• Supervised
• Partial Least Squares (PLS)
• Linear Discriminant Analysis (LDA)
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 39
Principal Components Analysis
aka PCA
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 40
Principal component analysis at a glance
• Unsupervised method for identifying the important directions in a
dataset
• We can then rotate the data into the new coordinate system that is given
by those directions
• Finally we can keep only the new dimensions that are of most importance
• PCA is a method for dimensionality reduction
• Algorithm:
1. Find direction (axis) of greatest variance
2. Find direction of greatest variance that is perpendicular to previous direction and
repeat
• Implementation: find eigenvectors of the covariance matrix of the data
• Eigenvectors (sorted by eigenvalues) are the directions
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 41
PCA problem formulation
Reduce from 2 dimensions to 1 dimension: Find a direction (a vector u⁽¹⁾)
onto which to project the data so as to minimize the projection error.
Reduce from n dimensions to k dimensions: Find k vectors u⁽¹⁾, u⁽²⁾, …, u⁽ᵏ⁾
onto which to project the data, so as to minimize the projection error.
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 42
Data in a matrix form
• Let there be n instances with d attributes. Every instance is described by d
numerical values.
• We represent our data as an n×d matrix A with real numbers.
• We can use linear algebra to process the matrix
• Our goal is to produce a new n×k matrix B such that:
• It contains as much information as the original matrix A
• It reveals something about the structure of the data in A
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 43
Principal Components
• The first principal component is the direction of the axis with the
largest variance in the data
• The second principal component is the next orthogonal direction with
the largest variance in the data
• And so on..
• The 1st PC accounts for the largest share of the variance
• The kth PC accounts for the kth largest share of the variance
• For n original dimensions, the covariance matrix is n×n and has up to
n eigenvectors. Thus, up to n PCs.
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 44
Example: 10-dimensional data
• Data is normally standardized or mean-centered for PCA
• Can also apply this recursively in a tree learner
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 45
PCA example
• Dataset with 2 attributes x1 and x2.
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 46
PCA example – Step 1: Get some data
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 47
X - Data:
x1 x2
2.5 2.4
0.5 0.7
2.2 2.9
1.9 2.2
3.1 3.0
2.3 2.7
2 1.6
1 1.1
1.5 1.6
1.1 0.9
Data Pre-processing
Preprocessing (feature scaling/mean normalization):
Compute the mean of each feature: μ_j = (1/m) Σ_i x_j^(i)
Replace each x_j^(i) with x_j^(i) − μ_j.
If different features are on different scales (e.g., size of house,
number of bedrooms), scale the features to have a comparable
range of values.
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 48
PCA example – Step 2: Subtract the mean
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 49
X’- Mean normalization:
x1 x2
.69 .49
-1.31 -1.21
.39 .99
.09 .29
1.29 1.09
.49 .79
.19 -.31
-.81 -.81
-.31 -.31
-.71 -1.01
Principal Component Analysis (PCA) algorithm
Reduce data from n dimensions to k dimensions
Compute the “covariance matrix”: Sigma = (1/m) Σ_{i=1..m} x⁽ⁱ⁾ (x⁽ⁱ⁾)ᵀ
Compute the “eigenvectors” of matrix Sigma:
[U,S,V] = svd(Sigma);
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 50
PCA Example – Step 3: Calculate the
covariance matrix
• Since the off-diagonal values are positive, we expect x1 and x2
to increase together (positive sign of cov(x1, x2))
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 51
S =
0.6165556 0.6154444
0.6154444 0.7165556
Principal Component Analysis (PCA) algorithm
From Sigma, we get: [U,S,V] = svd(Sigma); the columns of U are the eigenvectors u⁽¹⁾, …, u⁽ⁿ⁾
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 52
PCA Example – Step 4: Compute the
eigenvectors of S
• [U, D, V] = SVD(S)
• The 1st eigenvector has an
eigenvalue of 1.2840277, while
the 2nd an eigenvalue of
0.0490834
• Eigenvectors are perpendicular to
each other: orthogonal
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 53
U =
−0.6778734 −0.7351787
−0.7351787 0.6778734
Choosing k number of principal components
Typically, choose k to be the smallest value so that
(average squared projection error) / (total variation in the data) ≤ 0.01 (1%),
i.e. “99% of variance is retained”
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 54
Choosing k number of principal components
[U,S,V] = svd(Sigma)
Pick the smallest value of k for which Σ_{i=1..k} S_ii / Σ_{i=1..n} S_ii ≥ 0.99
(99% of variance retained)
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 55
PCA Example – Step 5: Choosing components
• Choosing the 1st component will
retain more than 95% of the
variance
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 56
Principal Component Analysis (PCA) algorithm
After mean normalization (ensure every feature has
zero mean) and optionally feature scaling:
Sigma = (1/m) * X' * X;
[U,S,V] = svd(Sigma);
Ureduce = U(:,1:k);
z = Ureduce’*x;
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 57
PCA Example – Step 6: Deriving the new data
• FinalData = RowFeatureVector × RowDataAdjust
• RowDataAdjust = (X′)ᵀ
• The mean-normalized data with every row being a
dimension and every column a point (transposed form)
• RowFeatureVector = Uᵀ
• Eigenvectors are in the rows, with the most important in
the first row
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 58
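For reference, an end-to-end numpy sketch of steps 2-6 on the toy dataset above (Python assumed); it reproduces the covariance matrix, the eigenvalues and the variance ratio quoted in the slides, up to the sign of the eigenvectors.

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

Xc = X - X.mean(axis=0)                  # step 2: subtract the mean
S = np.cov(Xc, rowvar=False)             # step 3: covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # step 4: eigen-decomposition of S
order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(S)                                 # ≈ [[0.6166, 0.6154], [0.6154, 0.7166]]
print(eigvals)                           # ≈ [1.2840, 0.0491]
print(eigvals[0] / eigvals.sum())        # ≈ 0.963: the first PC keeps >95% of the variance

k = 1                                    # step 5: keep only the first component
Z = Xc @ eigvecs[:, :k]                  # step 6: project the data onto the first PC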
Supervised learning speedup
Extract inputs:
Unlabeled dataset: x⁽¹⁾, x⁽²⁾, …, x⁽ᵐ⁾ ∈ Rⁿ → z⁽¹⁾, z⁽²⁾, …, z⁽ᵐ⁾ ∈ Rᵏ
New training set: (z⁽¹⁾, y⁽¹⁾), …, (z⁽ᵐ⁾, y⁽ᵐ⁾)
Note: the mapping x⁽ⁱ⁾ → z⁽ⁱ⁾ should be defined by running PCA
only on the training set. This mapping can then be applied as well to
the examples in the cross-validation and test sets.
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 59
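A sketch of the note above (Python/numpy assumed, function names illustrative): the mean and the principal directions are learned from the training inputs only and then reused, unchanged, on validation and test inputs.

import numpy as np

def fit_pca(X_train, k):
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    return mu, Vt[:k].T                  # the mean and the top-k principal directions

def apply_pca(X, mu, U_reduce):
    return (X - mu) @ U_reduce           # z = Ureduce' * (x - mu) for every example

# usage: mu, U = fit_pca(X_train, k=50); Z_train = apply_pca(X_train, mu, U)
#        Z_test = apply_pca(X_test, mu, U)   # same mapping, no re-fitting on test data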
Applications
Application of PCA
- Compression
- Reduce memory/disk needed to store data
- Speed up learning algorithm
- Visualization
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 60
Bad use of PCA: To prevent overfitting
Use z⁽ⁱ⁾ instead of x⁽ⁱ⁾ to reduce the number of
features to k < n.
Thus, fewer features, less likely to overfit.
This might work OK, but isn’t a good way to address
overfitting. Use regularization instead.
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 61
PCA is sometimes used where it shouldn’t be
Design of ML system:
- Get training set (x⁽¹⁾, y⁽¹⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)
- Run PCA to reduce x⁽ⁱ⁾ in dimension to get z⁽ⁱ⁾
- Train logistic regression on (z⁽¹⁾, y⁽¹⁾), …, (z⁽ᵐ⁾, y⁽ᵐ⁾)
- Test on test set: map each x_test⁽ⁱ⁾ to z_test⁽ⁱ⁾ and run the classifier on the mapped test examples
How about doing the whole thing without using PCA?
Before implementing PCA, first try running whatever you want to
do with the original/raw data x⁽ⁱ⁾. Only if that doesn’t do what
you want, then implement PCA and consider using z⁽ⁱ⁾.
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 62
Data Cleansing
Data Cleaning, Data Scrubbing, or Data Reconciliation
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 63
What is data cleansing?
• "Detect and remove errors and inconsistencies from data in order
to improve the quality of data" [Rahm]
• "The process of detecting and correcting (or removing) corrupt or
inaccurate records from a record set, table, or database"
[Wikipedia]
• Integral part of data processing and maintenance
• Usually semi-automatic process, highly application specific
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 64
How to
• Necessity of getting to know your data: understanding the meaning of
all the different attributes, the conventions used in coding them, the
significance of missing values and duplicate data, measurement noise,
typographical errors, and the presence of systematic errors—even
deliberate ones.
• There are also automatic methods of cleansing data, of detecting
outliers, and of spotting anomalies, which we describe—including a
class of techniques referred to as “one-class learning” in which only a
single class of instances is available at training time.
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 65
Data Cleansing vs. Data Validation
• Data validation almost invariably means that bad data is rejected from the
system at entry, and it is performed at entry time rather than on
batches of data.
• Example: Data validation in web forms
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 66
Anomalies Classification
• Syntactical Anomalies
• Lexical Errors (Gender: {M, M, F, 5' 8})
• Domain format errors (Smith, John vs. Smith John)
• Irregularities: non-uniform use of values, units, abbreviations (examples: different
currencies in the salaries, different use of abbreviations)
• Semantic Anomalies
• Integrity constraint violations (AGE >= 0)
• Contradictions (AGE vs. CURRENT_DATE - DATE_OF_BIRTH)
• Duplicates
• Invalid tuples
• Coverage Anomalies
• Missing values
• Missing tuples
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 67
Data Quality Criteria: Accuracy + Uniqueness
• Accuracy = Integrity + Consistency + Density
• Integrity = Completeness + Validity
• Completeness: M in D / M, e.g. I should have 1000 tuples and I have 500 ⇒ 50% (missing
values)
• Validity: M in D / D, e.g. of the 500 tuples I have in D, 400 are valid ⇒ 80% (illegal
values)
• Consistency = Schema conformance + Uniformity
• Schema conformance: tuples conforming to syntactical structure / overall number of
tuples (if in the database then it conforms)
• Uniformity: attributes with no irregularities (non-uniform use of values) / total number
of attributes
• Density: missing values in the tuples in D / total values in D
• Uniqueness: tuples of the same entity / total number of tuples
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 68
Data Cleansing Operations
1. Format adaptation for tuples and values
2. Integrity constraint enforcement
3. Derivation of missing values from existing ones
4. Removing contradictions within or between tuples
5. Merging and eliminating duplicates
6. Detection of outliers, i.e. tuples and values having a high potential
of being invalid
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 69
Data Cleansing Process
1. Data auditing
2. Workflow specification, i.e. choose appropriate methods to
automatically detect and remove them
3. Workflow execution, apply the methods to the tuples in the data
collection
4. Post-processing / Control
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 70
Data Auditing
• Data Profiling: Instance analysis
• Data Mining: Whole data collection analysis
• Examples
• Minimum, maximum values
• Value range
• Variance
• Uniqueness
• Null value occurrences
• Typical string patterns (through RegExps, for example)
• Search for characteristics that could be used for the correction of
anomalies
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 71
Methods for Data Cleansing
• Parsing (Syntax Errors)
• Data Transformation (source to target format)
• Integrity Constraint Enforcement (checking & maintenance)
• Duplicate Elimination
• Statistical Methods
• Outliers
• Detection: mean, std, range, clustering, association rules
• Remedy: set to average or other statistical value, censored, truncated (dropped)
• Missing
• Detection: It's missing :)
• Remedy: Filling-in (imputing) by a number of ways (mean, median, regression, propensity score,
Markov-Chain-Monte-Carlo method)
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 72
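As a small illustration of the simplest remedies listed above, mean and median imputation of a numeric column in Python/numpy (an assumption of this note); NaN marks a missing value, and the fancier remedies (regression, propensity scores, MCMC) are out of scope here.

import numpy as np

def impute(column, strategy="mean"):
    col = column.astype(float)
    missing = np.isnan(col)
    fill = np.nanmean(col) if strategy == "mean" else np.nanmedian(col)
    col[missing] = fill                    # replace every missing entry with the statistic
    return col

ages = np.array([23, 31, np.nan, 45, np.nan, 38])
print(impute(ages, "mean"))     # NaNs replaced by 34.25
print(impute(ages, "median"))   # NaNs replaced by 34.5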
Outlier (Error) Detection in Datasets
• Statistical: mean (μ), std (σ), range (Chebyshev's theorem)
• Accept (μ − ε·σ) < f < (μ + ε·σ), where e.g. ε = 5, else reject as an outlier
• Needs training/testing data for finding the best ε ∈ {3, 4, 5, 6, …}
• Boxplots (univariate data)
• Clustering, high-computational burden
• Pattern-based, i.e. find a pattern where 90% of data exhibit the
same characteristics
• Association rules: pattern = association rule with high confidence
and support
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 73
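The first rule above in a few lines of Python/numpy (assumed), with ε left as a parameter to be tuned on held-out data:

import numpy as np

def sigma_rule_outliers(x, eps=5.0):
    mu, sigma = x.mean(), x.std()
    # True where the value falls outside mu ± eps*sigma
    return (x < mu - eps * sigma) | (x > mu + eps * sigma)

values = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 55.0, 10.0])
print(sigma_rule_outliers(values, eps=2.0))   # flags only the 55.0 reading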
Detecting anomalies
• Visualization can help to detect anomalies
• Automatic approach: apply committee of different learning schemes,
e.g.,
• decision tree
• nearest-neighbor learner
• linear discriminant function
• Conservative consensus approach: delete instances incorrectly
classified by all of them
• Problem: might sacrifice instances of small classes
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 74
One-Class Learning
• Usually training data is available for all classes
• Some problems exhibit only a single class at training time
• Test instances may belong to this class or a new class not present at
training time
• This is the problem of one-class classification
• Predict either target or unknown
• Note that, in practice, some one-class problems can be re-
formulated into two-class ones by collecting negative data
• Other applications truly do not have negative data, e.g., password
hardening, nuclear plant operational status
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 75
Outlier detection
• One-class classification is often used for outlier/anomaly/novelty
detection
• First, a one-class model is built from the dataset
• Then, outliers are defined as instances that are classified as
unknown
• Another method: identify outliers as instances that lie beyond
distance d from percentage p of training data
• Density estimation is a very useful approach for one-class
classification and outlier detection
• Estimate density of the target class and mark low probability test instances
as outliers
• Threshold can be adjusted to calibrate sensitivity
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 76
Transforming multiple classes to
binary ones
Output processing
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 77
Transforming multiple classes to binary ones
• Some learning algorithms only work with two-class problems
• e.g., standard support vector machines
• Sophisticated multi-class variants exist in many cases but can be very
slow or difficult to implement
• A common alternative is to transform multi-class problems into multiple
two-class ones
• Simple methods:
• Discriminate each class against the union of the others – one-vs.-rest
• Build a classifier for every pair of classes – pairwise classification
• We will discuss error-correcting output codes, which can often improve
on these
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 78
Error-correcting output codes
• Multiclass problem → multiple binary
problems
• Simple one-vs.-rest scheme:
one-per-class coding
class   class vector
a       1000
b       0100
c       0010
d       0001
• base classifiers predict 1010 ⇒ true class = ??
• Idea: use error-correcting
codes instead
class   class vector
a       1111111
b       0000111
c       0011001
d       0101010
• base classifiers predict 1011111, true class = ??
• Use bit vectors (codes) so that we
have a large Hamming distance d between
any pair of bit vectors:
• Can correct up to (d – 1)/2 single-bit
errors
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 79
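Decoding by Hamming distance can be sketched in a few lines (Python assumed; the code words are the ones in the table above): each base classifier predicts one bit, and the class whose code word is closest to the predicted bit string wins.

codes = {'a': '1111111', 'b': '0000111', 'c': '0011001', 'd': '0101010'}

def decode(predicted_bits):
    # Hamming distance from the prediction to every class code word
    hamming = {c: sum(p != q for p, q in zip(predicted_bits, code))
               for c, code in codes.items()}
    return min(hamming, key=hamming.get)

print(decode('1011111'))   # -> 'a': the single flipped bit is corrected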
The End
Kyriakos C. Chatzidimitriou
http://kyrcha.info
kyrcha@gmail.com
November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 80
More Related Content

Viewers also liked

Grupos alimenticios
Grupos alimenticiosGrupos alimenticios
Grupos alimenticiosAli Crespo
 
Bloque Realidad - Provincias
Bloque Realidad - ProvinciasBloque Realidad - Provincias
Bloque Realidad - ProvinciasstjTeresianas
 
Gestion por procesos
Gestion por procesosGestion por procesos
Gestion por procesospaty172536
 
портфолио графический дизайнер
портфолио графический дизайнерпортфолио графический дизайнер
портфолио графический дизайнерKristina Rudneva
 
What do employers want?
What do employers want?What do employers want?
What do employers want?archana cks
 
CVswati main - Copy
CVswati main - CopyCVswati main - Copy
CVswati main - CopySwati Saini
 
Capítulo General Oracion Universal
Capítulo General Oracion UniversalCapítulo General Oracion Universal
Capítulo General Oracion UniversalstjTeresianas
 

Viewers also liked (14)

Paradox Thinking
Paradox ThinkingParadox Thinking
Paradox Thinking
 
iLink Presentation
iLink PresentationiLink Presentation
iLink Presentation
 
Linea 10
Linea 10Linea 10
Linea 10
 
Grupos alimenticios
Grupos alimenticiosGrupos alimenticios
Grupos alimenticios
 
Taller precalificaciones
Taller precalificacionesTaller precalificaciones
Taller precalificaciones
 
Bloque Realidad - Provincias
Bloque Realidad - ProvinciasBloque Realidad - Provincias
Bloque Realidad - Provincias
 
Gestion por procesos
Gestion por procesosGestion por procesos
Gestion por procesos
 
alphastor
alphastoralphastor
alphastor
 
портфолио графический дизайнер
портфолио графический дизайнерпортфолио графический дизайнер
портфолио графический дизайнер
 
What do employers want?
What do employers want?What do employers want?
What do employers want?
 
Dgm analisis
Dgm analisisDgm analisis
Dgm analisis
 
Trabajo
TrabajoTrabajo
Trabajo
 
CVswati main - Copy
CVswati main - CopyCVswati main - Copy
CVswati main - Copy
 
Capítulo General Oracion Universal
Capítulo General Oracion UniversalCapítulo General Oracion Universal
Capítulo General Oracion Universal
 

Similar to Ι/Ο Data Εngineering

ACRL 2011 Data-Driven Library Web Design
ACRL 2011 Data-Driven Library Web DesignACRL 2011 Data-Driven Library Web Design
ACRL 2011 Data-Driven Library Web DesignAmanda Dinscore
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsYalçın Yenigün
 
Beyond EXPLAIN: Query Optimization From Theory To Code
Beyond EXPLAIN: Query Optimization From Theory To CodeBeyond EXPLAIN: Query Optimization From Theory To Code
Beyond EXPLAIN: Query Optimization From Theory To CodeYuto Hayamizu
 
AzureML – zero to hero
AzureML – zero to heroAzureML – zero to hero
AzureML – zero to heroGovind Kanshi
 
Apache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesApache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesVinay Shukla
 
Introducing Apereo and the Apereo Learning Analytics Initiative
Introducing Apereo and the Apereo Learning Analytics InitiativeIntroducing Apereo and the Apereo Learning Analytics Initiative
Introducing Apereo and the Apereo Learning Analytics InitiativeIan Dolphin
 
Resources for Getting Started in Predictive Analytics
Resources for Getting Started in Predictive AnalyticsResources for Getting Started in Predictive Analytics
Resources for Getting Started in Predictive Analyticsmeepbobeep
 
Bike Sharing Demand: Akshay Patil
Bike Sharing Demand: Akshay PatilBike Sharing Demand: Akshay Patil
Bike Sharing Demand: Akshay PatilAkshay Patil
 
The Art of Intelligence – A Practical Introduction Machine Learning for Orac...
The Art of Intelligence – A Practical Introduction Machine Learning for Orac...The Art of Intelligence – A Practical Introduction Machine Learning for Orac...
The Art of Intelligence – A Practical Introduction Machine Learning for Orac...Lucas Jellema
 
Smarter Data for Smarter Libraries
Smarter Data for Smarter LibrariesSmarter Data for Smarter Libraries
Smarter Data for Smarter LibrariesOCLC
 
How to solve a problem with machine learning
How to solve a problem with machine learningHow to solve a problem with machine learning
How to solve a problem with machine learningAmendra Shrestha
 
crisp.ppt
crisp.pptcrisp.ppt
crisp.pptSK Chew
 
Stacked Ensembles in H2O
Stacked Ensembles in H2OStacked Ensembles in H2O
Stacked Ensembles in H2OSri Ambati
 
Machine Learning for Everyone
Machine Learning for EveryoneMachine Learning for Everyone
Machine Learning for EveryoneAly Abdelkareem
 
Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...Robert Williams
 

Similar to Ι/Ο Data Εngineering (20)

ACRL 2011 Data-Driven Library Web Design
ACRL 2011 Data-Driven Library Web DesignACRL 2011 Data-Driven Library Web Design
ACRL 2011 Data-Driven Library Web Design
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning Applications
 
Beyond EXPLAIN: Query Optimization From Theory To Code
Beyond EXPLAIN: Query Optimization From Theory To CodeBeyond EXPLAIN: Query Optimization From Theory To Code
Beyond EXPLAIN: Query Optimization From Theory To Code
 
Future se oct15
Future se oct15Future se oct15
Future se oct15
 
Data Science at Udemy
Data Science at UdemyData Science at Udemy
Data Science at Udemy
 
AzureML – zero to hero
AzureML – zero to heroAzureML – zero to hero
AzureML – zero to hero
 
Apache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesApache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenches
 
Introducing Apereo and the Apereo Learning Analytics Initiative
Introducing Apereo and the Apereo Learning Analytics InitiativeIntroducing Apereo and the Apereo Learning Analytics Initiative
Introducing Apereo and the Apereo Learning Analytics Initiative
 
Resources for Getting Started in Predictive Analytics
Resources for Getting Started in Predictive AnalyticsResources for Getting Started in Predictive Analytics
Resources for Getting Started in Predictive Analytics
 
Bike Sharing Demand: Akshay Patil
Bike Sharing Demand: Akshay PatilBike Sharing Demand: Akshay Patil
Bike Sharing Demand: Akshay Patil
 
The Art of Intelligence – A Practical Introduction Machine Learning for Orac...
The Art of Intelligence – A Practical Introduction Machine Learning for Orac...The Art of Intelligence – A Practical Introduction Machine Learning for Orac...
The Art of Intelligence – A Practical Introduction Machine Learning for Orac...
 
Internship Presentation.pdf
Internship Presentation.pdfInternship Presentation.pdf
Internship Presentation.pdf
 
Smarter Data for Smarter Libraries
Smarter Data for Smarter LibrariesSmarter Data for Smarter Libraries
Smarter Data for Smarter Libraries
 
How to solve a problem with machine learning
How to solve a problem with machine learningHow to solve a problem with machine learning
How to solve a problem with machine learning
 
crisp.ppt
crisp.pptcrisp.ppt
crisp.ppt
 
crisp.ppt
crisp.pptcrisp.ppt
crisp.ppt
 
Stacked Ensembles in H2O
Stacked Ensembles in H2OStacked Ensembles in H2O
Stacked Ensembles in H2O
 
Machine Learning for Everyone
Machine Learning for EveryoneMachine Learning for Everyone
Machine Learning for Everyone
 
Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...
 
Madlangsakay, Luzahlia
Madlangsakay, Luzahlia Madlangsakay, Luzahlia
Madlangsakay, Luzahlia
 

More from Kyriakos Chatzidimitriou

Simple rules for building robust machine learning models
Simple rules for building robust machine learning modelsSimple rules for building robust machine learning models
Simple rules for building robust machine learning modelsKyriakos Chatzidimitriou
 
Συμβουλές και στρατηγικές που αποκόμισα από το πρώτο μου εγχείρημα
Συμβουλές και στρατηγικές που αποκόμισα από το πρώτο μου εγχείρημαΣυμβουλές και στρατηγικές που αποκόμισα από το πρώτο μου εγχείρημα
Συμβουλές και στρατηγικές που αποκόμισα από το πρώτο μου εγχείρημαKyriakos Chatzidimitriou
 
Advices and strategies I learned from my first business attempt
Advices and strategies I learned from my first business attemptAdvices and strategies I learned from my first business attempt
Advices and strategies I learned from my first business attemptKyriakos Chatzidimitriou
 
Μηχανισμοί Ενισχυτικής Μάθησης και Εξελικτικής Υπολογιστικής για Αυτόνομους Π...
Μηχανισμοί Ενισχυτικής Μάθησης και Εξελικτικής Υπολογιστικής για Αυτόνομους Π...Μηχανισμοί Ενισχυτικής Μάθησης και Εξελικτικής Υπολογιστικής για Αυτόνομους Π...
Μηχανισμοί Ενισχυτικής Μάθησης και Εξελικτικής Υπολογιστικής για Αυτόνομους Π...Kyriakos Chatzidimitriou
 
An Adaptive Proportional Value-per-Click Agent for Bidding in Ad Auctions
An Adaptive Proportional Value-per-Click Agent for Bidding in Ad AuctionsAn Adaptive Proportional Value-per-Click Agent for Bidding in Ad Auctions
An Adaptive Proportional Value-per-Click Agent for Bidding in Ad AuctionsKyriakos Chatzidimitriou
 
Μια βραδιά στο μέλλον - Οι πράκτορες Mertacor
Μια βραδιά στο μέλλον - Οι πράκτορες MertacorΜια βραδιά στο μέλλον - Οι πράκτορες Mertacor
Μια βραδιά στο μέλλον - Οι πράκτορες MertacorKyriakos Chatzidimitriou
 
A NEAT Way for Evolving Echo State Networks
A NEAT Way for Evolving Echo State NetworksA NEAT Way for Evolving Echo State Networks
A NEAT Way for Evolving Echo State NetworksKyriakos Chatzidimitriou
 

More from Kyriakos Chatzidimitriou (7)

Simple rules for building robust machine learning models
Simple rules for building robust machine learning modelsSimple rules for building robust machine learning models
Simple rules for building robust machine learning models
 
Συμβουλές και στρατηγικές που αποκόμισα από το πρώτο μου εγχείρημα
Συμβουλές και στρατηγικές που αποκόμισα από το πρώτο μου εγχείρημαΣυμβουλές και στρατηγικές που αποκόμισα από το πρώτο μου εγχείρημα
Συμβουλές και στρατηγικές που αποκόμισα από το πρώτο μου εγχείρημα
 
Advices and strategies I learned from my first business attempt
Advices and strategies I learned from my first business attemptAdvices and strategies I learned from my first business attempt
Advices and strategies I learned from my first business attempt
 
Μηχανισμοί Ενισχυτικής Μάθησης και Εξελικτικής Υπολογιστικής για Αυτόνομους Π...
Μηχανισμοί Ενισχυτικής Μάθησης και Εξελικτικής Υπολογιστικής για Αυτόνομους Π...Μηχανισμοί Ενισχυτικής Μάθησης και Εξελικτικής Υπολογιστικής για Αυτόνομους Π...
Μηχανισμοί Ενισχυτικής Μάθησης και Εξελικτικής Υπολογιστικής για Αυτόνομους Π...
 
An Adaptive Proportional Value-per-Click Agent for Bidding in Ad Auctions
An Adaptive Proportional Value-per-Click Agent for Bidding in Ad AuctionsAn Adaptive Proportional Value-per-Click Agent for Bidding in Ad Auctions
An Adaptive Proportional Value-per-Click Agent for Bidding in Ad Auctions
 
Μια βραδιά στο μέλλον - Οι πράκτορες Mertacor
Μια βραδιά στο μέλλον - Οι πράκτορες MertacorΜια βραδιά στο μέλλον - Οι πράκτορες Mertacor
Μια βραδιά στο μέλλον - Οι πράκτορες Mertacor
 
A NEAT Way for Evolving Echo State Networks
A NEAT Way for Evolving Echo State NetworksA NEAT Way for Evolving Echo State Networks
A NEAT Way for Evolving Echo State Networks
 

Recently uploaded

Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxjana861314
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 

Recently uploaded (20)

Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 

Ι/Ο Data Εngineering

  • 1. I/O Data Engineering “Garbage in, garbage out” November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 1
  • 2. Preamble • This work is a fusion of ideas and work (slides, text, images etc.) I found on the internet or had and wrote on my own regarding the area of input/output data engineering in data mining and machine learning. • Attribution: • The slides are based on the PowerPoint accompanying slides of the Data Mining, Practical Machine Learning Tools, Witten et al., 4th ed., 2017 and in particular Chapter 8, available at: http://www.cs.waikato.ac.nz/ml/weka/book.html • Slides from the Machine Learning MOOC by Prof. Andrew Ng: http://ml-class.org (PCA parts) • Slides from Learning from Data MOOC by Prof. Yaser S. Abu-Mustafa its support site: http://work.caltech.edu/telecourse.html (The digits dataset and the non-linear transformation) • Slides from the Pattern Recognition class by Prof. Andreas L. Symeonidis, ECE department, Aristotle University of Thessaloniki • A tutorial on Principal Component Analysis by Lindsay I. Smith, February 2002 (http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf) • Introduction to Data Mining, Tan et al., 2006: http://www-users.cs.umn.edu/~kumar/dmbook/ November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 2
  • 3. Successful data mining: Just apply a learner! No? • Select the learning algorithm • Scheme/parameter selection/tuning • Treat selection process as part of the learning process to avoid optimistic performance estimates • Estimate the expected true performance of a learning scheme • Split • Cross-validation • Data Engineering • Engineering the input data into a form suitable for the learning scheme chosen • Data engineering to make learning possible or easier • Engineering the output to make it more effective • Converting multi-class problems into two-class ones • Re-calibrating probability estimates November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 3
  • 4. Data Transformations – Outline Topics covered: 1. Attribute Selection 2. Discretizing Numeric Attributes 3. Data Projection 4. Data Cleansing 5. Transforming Multiple Classes to Binary Ones November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 4
  • 5. It is a jungle out there! Data Transformations Feature Selection Feature Engineering Data Engineering Dimensionality Reduction Principal Components Analysis Pre- and post-processing Data Cleansing ETL Feature Learning Wrapper methods Filter methods Independent Component Analysis Outlier Detection November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 5
  • 6. Attribute/Feature Selection Removing attributes that are not useful to the task at hand November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 6
  • 7. Motivation • Experiments have showed that adding useless attributes causes the performance of learning schemes (decision tree and rules, linear regression, instance-based learners) to deteriorate • Adding a random binary variable effects: • Divide-and-conquer tree learners and separate-and-conquer rule learners • If you reach depths at which only a small amount of data is available for picking a split, the random attribute will look good by chance • C4.5 deterioration in performance 5-10% for 1 random variable • Instance-based learners • Susceptible as well, reason: work in local neighborhoods • The number of training instances needed to produce a predetermined level of performance for instance-based learning increases exponentially with the number of irrelevant attributes present • Naive Bayes • Not susceptible • It assumes by design that all attributes are independent of one another, an assumption that is just right for random “distracter” attributes • On the other hand: pays a heavy price in other ways because its operation is damaged by adding redundant attributes • Independence “thrown out of the window” • Conclusion: Relevant attributes can also be harmful if they mislead the learning algorithm November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 7
  • 8. Advantages of Attribute Selection • Improves performance of learning algorithms • Speeds them up (although the speedup may be outweighed by the computation involved in the attribute selection itself) • Yields a more compact, more easily interpretable representation November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 8
  • 9. Attribute Selection Types • Manually • The best way • Requires deep understanding of the learning problem and what the attributes actually mean • Filter-method – Scheme-Independent Attribute Selection • Make an independent assessment based on general characteristics of the data • Wrapper method – Scheme-Dependent Attribute Selection • Evaluate the subset using the machine learning algorithm that will ultimately be employed for learning November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 9
  • 10. Scheme-Independent Attribute Selection • aka Filter approach to attribute selection: assess attributes based on general characteristics of the data • Attributes are selected in a manner that is independent of the target machine learning scheme • One method: find smallest subset of attributes that separates data • Another method: use a fast learning scheme that is different from the target learning scheme to find relevant attributes • E.g., use attributes selected by C4.5, or coefficients of linear model, possibly applied recursively (recursive feature elimination) November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 10 By Lucien Mousin - Own work, GFDL, https://commons.wikimedia.org/w/index.php?curid=37776286
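As an illustration of the filter idea, the sketch below (assuming scikit-learn and a synthetic dataset; not part of the original slides) scores every attribute independently of the target learning scheme and keeps the top-ranked ones. Mutual information is used as the relevance score here; it is related to, though not identical to, the symmetric uncertainty measure discussed on the following slides.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=2, random_state=0)
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_reduced = selector.fit_transform(X, y)        # keep the 5 highest-scoring attributes
print(X_reduced.shape)                          # (300, 5)
print(selector.get_support(indices=True))       # indices of the selected attributes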
  • 11. Recursive Feature Elimination [diagram: train on F1, F2, F3 → ranking F2, F1, F3, drop F3; retrain on F1, F2 → ranking F1, F2; final ranking: F1, F2, F3] The learning algorithm should produce a ranking, e.g. a linear SVM, where ranks are based on the size of the coefficients November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 11
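A minimal sketch of recursive feature elimination as described above, assuming scikit-learn; the linear SVM supplies the coefficients whose magnitudes serve as the ranking at each round.

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
# The linear SVM provides coefficients; their absolute size ranks the attributes each round.
rfe = RFE(estimator=LinearSVC(dual=False), n_features_to_select=3, step=1)
rfe.fit(X, y)
print(rfe.support_)   # True for the 3 attributes kept at the end
print(rfe.ranking_)   # 1 = kept; higher numbers were eliminated in earlier rounds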
  • 12. Correlation-based Feature Selection (CFS) • Correlation between attributes measured by symmetric uncertainty: U(A, B) = 2 [H(A) + H(B) − H(A, B)] / [H(A) + H(B)] ∈ [0, 1], where H is the entropy function: H(X) = −Σx p(x) log(p(x)) and H(X, Y) = −Σx,y p(x, y) log(p(x, y)) • Goodness of a subset of attributes measured by Σj U(Aj, C) / sqrt(Σi Σj U(Ai, Aj)), where C is the class attribute, breaking ties in favour of smaller subsets. November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 12
  • 13. The Weather Data
  Outlook   Temperature  Humidity  Windy  Play
  Sunny     Hot          High      False  No
  Sunny     Hot          High      True   No
  Overcast  Hot          High      False  Yes
  Rainy     Mild         High      False  Yes
  Rainy     Cool         Normal    False  Yes
  Rainy     Cool         Normal    True   No
  Overcast  Cool         Normal    True   Yes
  Sunny     Mild         High      False  No
  Sunny     Cool         Normal    False  Yes
  Rainy     Mild         Normal    False  Yes
  Sunny     Mild         Normal    True   Yes
  Overcast  Mild         High      True   Yes
  Overcast  Hot          Normal    False  Yes
  Rainy     Mild         High      True   No
  November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 13
  • 14. Symmetric Uncertainty Example Calculation (all logs base 2) • H(Outlook) = - 5/14log(5/14) – 4/14log(4/14) – 5/14 log(5/14) = 1.577 • H(Temperature) = - 4/14log(4/14) – 6/14log(6/14) – 4/14 log(4/14) = 1.556 • H(Outlook, Temperature) = - p(s,h)logp(s,h) - p(s,m)logp(s,m) - p(s,c)logp(s,c) - p(o,h)logp(o,h) - p(o,m)logp(o,m) - p(o,c)logp(o,c) - p(r,h)logp(r,h) - p(r,m)logp(r,m) - p(r,c)logp(r,c) = - 2/14log(2/14) – 2/14log(2/14) – 1/14 log(1/14) - 2/14log(2/14) – 1/14log(1/14) – 1/14 log(1/14) - 0/14log(0/14) – 3/14log(3/14) – 2/14 log(2/14) = 2.896 • U(Outlook, Temperature) = 2*(1.577 + 1.556 – 2.896)/(1.577 + 1.556) = 0.1512927 November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 14
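A small numpy check of the hand calculation above (not in the original slides); the joint counts of Outlook and Temperature are taken from the weather data on slide 13. The result differs slightly from 0.1512927 only because the slide rounds the entropies before combining them.

import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

# Joint counts of (Outlook, Temperature) over the 14 weather instances
joint = np.array([[2, 2, 1],    # sunny:    hot, mild, cool
                  [2, 1, 1],    # overcast: hot, mild, cool
                  [0, 3, 2]])   # rainy:    hot, mild, cool
H_outlook = entropy(joint.sum(axis=1))   # ~1.577
H_temp = entropy(joint.sum(axis=0))      # ~1.557
H_joint = entropy(joint.ravel())         # ~2.896
U = 2 * (H_outlook + H_temp - H_joint) / (H_outlook + H_temp)
print(round(U, 4))                       # ~0.1515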
  • 15. Attribute subsets for weather data The number of possible attribute subsets increases exponentially with the number of attributes, making exhaustive search impractical on all but the simplest problems. November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 15
  • 16. Scheme-specific selection • Wrapper approach to attribute selection: attributes are selected with the target scheme in the loop • Implement “wrapper” around learning scheme • Evaluation criterion: cross-validation performance • Time consuming in general • with a greedy approach and k attributes, evaluation time is multiplied by a factor of k² in the worst case • with a prior ranking of attributes, complexity is linear in k • Can use significance test (paired t-test) to stop cross-validation for a subset early if it is unlikely to “win” (race search) • Can be used with forward, backward selection, prior ranking, or special-purpose schemata search • Efficient for decision tables and Naïve Bayes (Selective Naïve Bayes) November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 16 By Lastdreamer7591 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=37208688
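A sketch of the wrapper idea with forward selection, assuming scikit-learn 0.24 or later (SequentialFeatureSelector) and the built-in iris data; using Naive Bayes as the target scheme also mirrors the Selective Naive Bayes idea of the next slide, except that here each candidate subset is scored by 10-fold cross-validation rather than training-set performance.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
# The target scheme (Naive Bayes) sits inside the selection loop; every candidate
# subset is evaluated by cross-validating that scheme.
sfs = SequentialFeatureSelector(GaussianNB(), n_features_to_select=2,
                                direction='forward', cv=10)
sfs.fit(X, y)
print(sfs.get_support(indices=True))   # indices of the two attributes chosen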
  • 17. Selective Naïve Bayes • Use the forward selection algorithm • Forward selection is better able to detect redundant attributes than backward elimination • Use the performance on the training set as the metric of attribute quality • We know that training set performance is not a reliable indicator of test set performance • But Naïve Bayes is less likely to overfit • Plus, as discussed, it is robust to random (irrelevant) attributes November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 17
  • 18. Complexity example • If I do 10-fold CV I must train the algorithm 10 times = 10 • I should do also the 10-fold CV 10 times to obtain a more reliable estimate = 10*10 • If I have 10 features the total search space is 2^10 = 1024 different subsets = 10*10*1024 = 102,400 • Then I should also tune the parameters of the learning algorithm… • Or should I do that before… November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 18
  • 19. Searching the attribute space • Number of attribute subsets is exponential in the number of attributes • Common greedy approaches: • forward selection • backward elimination • More sophisticated strategies: • Bidirectional search • Best-first search: • can find the optimum solution • does not just terminate when the performance starts to drop; it keeps a list of all attribute subsets evaluated so far, sorted in order of the performance measure, so that it can revisit an earlier configuration • Beam search: approximation to best-first search that keeps only a truncated list • Genetic algorithms November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 19
  • 20. Discretization Transforming numeric attributes into discrete ones November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 20
  • 21. Motivation • Essential if the task involves numeric attributes but the chosen learning scheme can only handle categorical ones • Schemes that can handle numeric attributes often produce better results, or work faster, if the attributes are pre-discretized. • The converse situation, in which categorical attributes must be represented numerically, also occurs (although less often) November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 21
  • 22. Attribute discretization • Discretization can be useful even if a learning algorithm can be run on numeric attributes directly • Avoids normality assumption in Naïve Bayes and clustering • Examples of discretization we have already encountered: • Decision trees perform local discretization • Global discretization can be advantageous because it is based on more data • Apply learner to • k-valued discretized attribute or to • k – 1 binary attributes that code the cut points • The latter approach often works better when learning decision trees or rule sets November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 22
  • 23. Discretization: unsupervised • Unsupervised discretization: determine intervals without knowing class labels • When clustering, the only possible way! • Two well-known strategies: • Equal-interval binning • Equal-frequency binning (also called histogram equalization) • Unsupervised discretization is normally inferior to supervised schemes when applied in classification tasks • But equal-frequency binning works well with Naïve Bayes if the number of intervals is set to the square root of the size of the dataset (proportional k-interval discretization) [figure: the same data discretized by equal interval width, equal frequency, and k-means] November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 23
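The two unsupervised strategies above (plus k-means binning, which the slide's figure also shows) can be sketched with scikit-learn's KBinsDiscretizer; the data here are synthetic.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
# 'uniform' = equal-interval binning, 'quantile' = equal-frequency binning,
# 'kmeans' = bin edges derived from a 1-D k-means clustering
for strategy in ('uniform', 'quantile', 'kmeans'):
    disc = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy=strategy)
    x_binned = disc.fit_transform(x)
    print(strategy, np.round(disc.bin_edges_[0], 2))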
  • 24. Discretization: supervised • Classic approach to supervised discretization is entropy-based • This method builds a decision tree with pre-pruning on the attribute being discretized • Uses entropy as splitting criterion • Uses the minimum description length principle as the stopping criterion for pre-pruning • Works well: still the state of the art • To apply the minimum description length principle, the “theory” is • the splitting point (can be coded in log2[N – 1] bits) • plus class distribution in each subset (a more involved expression) • Description length is the number of bits needed for coding both the splitting point and the class distributions • Compare description lengths before/after adding split November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 24
  • 25. Example: temperature attribute
  Temperature: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
  Play:        Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No
  November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 25
  • 26. Final It can be shown theoretically that a cut point that minimizes the information value will never occur between two instances of the same class November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 26
  • 27. Formula for MDL stopping criterion • Can be formulated in terms of the information gain • Assume we have N instances • Original set: k classes, entropy E • First subset: k1 classes, entropy E1 • Second subset: k2 classes, entropy E2 • Splitting continues only if gain > log2(N − 1)/N + [log2(3^k − 2) − kE + k1E1 + k2E2]/N • Results in no discretization intervals for the temperature attribute in the weather data, so the attribute fails to play a role in the final decision structure November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 27
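A small sketch of the criterion above (assuming the Fayyad-Irani formulation the formula comes from): given the class counts of the full set and of the two subsets produced by a candidate cut point, it checks whether the information gain exceeds the MDL threshold. The example call uses the temperature attribute with a cut at 71.5; as the slide suggests, this cut is rejected.

import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def mdl_accepts_split(counts, counts1, counts2):
    # counts: class counts of the full set; counts1/counts2: class counts of the two subsets
    N = sum(counts)
    k = sum(1 for c in counts if c > 0)
    k1 = sum(1 for c in counts1 if c > 0)
    k2 = sum(1 for c in counts2 if c > 0)
    E, E1, E2 = entropy(counts), entropy(counts1), entropy(counts2)
    gain = E - (sum(counts1) * E1 + sum(counts2) * E2) / N
    threshold = np.log2(N - 1) / N + (np.log2(3**k - 2) - k * E + k1 * E1 + k2 * E2) / N
    return gain > threshold

# Temperature attribute of the weather data (9 yes / 5 no), candidate cut at 71.5:
# below the cut 4 yes / 2 no, above it 5 yes / 3 no
print(mdl_accepts_split([9, 5], [4, 2], [5, 3]))   # False -> the cut is rejected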
  • 28. Supervised discretization: other methods • Can replace top-down procedure by bottom-up method • This bottom-up method has been applied in conjunction with the chi-squared test • Continue to merge intervals until they become significantly different • Can use dynamic programming to find optimum k-way split for given additive criterion • Requires time quadratic in the number of instances • But can be done in linear time if error rate is used instead of entropy • Error rate: count the number of errors that a discretization makes when predicting each training instance’s class, assuming that each interval receives the majority class. • However, using error rate is generally not a good idea when discretizing an attribute as we will see November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 28
  • 29. Error-based vs. entropy-based • Question: could the best discretization ever have two adjacent intervals with the same class? • Wrong answer: No. For if so, • Collapse the two • Free up an interval • Use it somewhere else • (This is what error-based discretization will do) • Right answer: Surprisingly, yes. • (and entropy-based discretization can do it) November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 29
  • 30. Error-based vs. entropy-based A 2-class, 2-attribute problem Entropy-based discretization can detect change of class distribution (from 100% to 50%) Class 1: a1 < 0.3 or if a1 < 0.7 and a2 < 0.5 Class 2: otherwise Best discretization a2: no problem a1: middle will have whatever label happens to occur most November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 30
  • 31. Data Projections and Dimensionality Reduction Projecting data into a more suitable space November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 31
  • 32. Motivation • Curse of Dimensionality • Visualization • Add new, synthetic attributes whose purpose is to present existing information in a form that is suitable for the machine learning scheme to pick up on. November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 32
  • 33. Curse of Dimensionality • When dimensions increase, data become increasingly sparse • Density and distance between points, which are important criteria for clustering and outlier detection, lose their importance • Figure illustration: create 500 points and calculate the max and min distance between any pair of points November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 33
  • 34. Projections • Definition: a projection is a kind of function or mapping that transforms data in some way • Simple transformations can often make a large difference in performance • Example transformations (not necessarily for performance improvement): • Difference of two date attributes → age • Ratio of two numeric (ratio-scale) attributes • Useful for algorithms doing axis-parallel splits • Concatenating the values of nominal attributes • Encoding cluster membership • Adding noise to data • Removing data randomly or selectively • Obfuscating the data • Anonymising November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 34
  • 35. Digits dataset November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 35 From: Learning from data MOOC, http://work.caltech.edu/telecourse.html
  • 36. Input representation • ‘raw’ input x = (x0, x1, x2, …, x256) • Linear model: (w0, w1, w2, …, w256) • Features: extract useful information, e.g., • Intensity and symmetry x = (x0, x1, x2) • Linear model: (w0, w1, w2) From: Learning from data MOOC, http://work.caltech.edu/telecourse.html November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 36
  • 37. Illustration of features November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 37 From: Learning from data MOOC, http://work.caltech.edu/telecourse.html
  • 38. Another one November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 38 From: Learning from data MOOC, http://work.caltech.edu/telecourse.html
  • 39. Methods • Unsupervised • Principal Components Analysis (PCA) • Independent Component Analysis (ICA) • Random Projections • Supervised • Partial Least Squares (PLS) • Linear Discriminant Analysis (LDA) November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 39
  • 40. Principal Components Analysis aka PCA November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 40
  • 41. Principal component analysis at a glance • Unsupervised method for identifying the important directions in a dataset • We can then rotate the data into the new coordinate system that is given by those directions • Finally we can keep only the new dimensions that are of most importance • PCA is a method for dimensionality reduction • Algorithm: 1. Find the direction (axis) of greatest variance 2. Find the direction of greatest variance that is perpendicular to the previous direction, and repeat • Implementation: find the eigenvectors of the covariance matrix of the data • Eigenvectors (sorted by eigenvalues) are the directions November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 41
  • 42. PCA problem formulation Reduce from 2 dimensions to 1 dimension: Find a direction (a vector u(1)) onto which to project the data so as to minimize the projection error. Reduce from n dimensions to k dimensions: Find k vectors u(1), u(2), …, u(k) onto which to project the data, so as to minimize the projection error. November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 42
  • 43. Data in a matrix form • Let there be n instances with d attributes. Every instance is described by d numerical values. • We represent our data as an n×d matrix A with real numbers. • We can use linear algebra to process the matrix • Our goal is to produce a new n×k matrix B such that: • It contains as much information as the original matrix A • It reveals something about the structure of the data in A November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 43
  • 44. Principal Components • The first principal component is the direction of the axis with the largest variance in the data • The second principal component is the next orthogonal direction with the largest variance in the data • And so on… • The 1st PC captures the largest fraction of the variance • The kth PC captures the kth largest fraction of the variance • For n original dimensions, the covariance matrix is n×n and has up to n eigenvectors. Thus, up to n PCs. November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 44
  • 45. Example: 10-dimensional data • Data is normally standardized or mean-centered for PCA • Can also apply this recursively in a tree learner November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 45
  • 46. PCA example • Dataset with 2 attributes x1 and x2. November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 46
  • 47. PCA example – Step 1: Get some data
  X - Data:
  x1    x2
  2.5   2.4
  0.5   0.7
  2.2   2.9
  1.9   2.2
  3.1   3.0
  2.3   2.7
  2     1.6
  1     1.1
  1.5   1.6
  1.1   0.9
  November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 47
  • 48. Data Pre-processing Preprocessing (feature scaling/mean normalization): compute the mean μj of each feature and replace each xj(i) with xj(i) − μj. If different features are on different scales (e.g., size of house, number of bedrooms), scale features to have a comparable range of values. November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 48
  • 49. PCA example – Step 2: Subtract the mean
  X’ - Mean normalization:
  x1      x2
   .69     .49
  -1.31   -1.21
   .39     .99
   .09     .29
  1.29    1.09
   .49     .79
   .19    -.31
  -.81    -.81
  -.31    -.31
  -.71   -1.01
  November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 49
  • 50. Principal Component Analysis (PCA) algorithm Reduce data from n dimensions to k dimensions • Compute the “covariance matrix”: Sigma = (1/m) Σi x(i) x(i)T • Compute the “eigenvectors” of matrix Sigma: [U,S,V] = svd(Sigma); November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 50
  • 51. PCA Example – Step 3: Calculate the covariance matrix
  S = [ 0.6165556  0.6154444
        0.6154444  0.7165556 ]
  • Given that the non-diagonal values are positive, we expect that x1 and x2 will increase together (+ sign of cov(x1, x2))
  November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 51
  • 52. Principal Component Analysis (PCA) algorithm From Sigma, we get: [U,S,V] = svd(Sigma), where the columns of U are the principal directions u(1), …, u(n) November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 52
  • 53. PCA Example – Step 4: Compute the eigenvectors of S • [U, D, V] = SVD(S)
  U = [ −0.6778734  −0.7351787
        −0.7351787   0.6778734 ]
  • The 1st eigenvector (first column) has an eigenvalue of 1.2840277, while the 2nd has an eigenvalue of 0.0490834 • Eigenvectors are perpendicular to each other: orthogonal
  November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 53
  • 54. Choosing k, the number of principal components Typically, choose k to be the smallest value so that the average squared projection error divided by the total variation in the data is at most 0.01 (1%), i.e. “99% of variance is retained” November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 54
  • 55. Choosing k, the number of principal components [U,S,V] = svd(Sigma) Pick the smallest value of k for which (S11 + … + Skk) / (S11 + … + Snn) ≥ 0.99 (99% of variance retained), where Sii are the diagonal entries of S November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 55
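A quick numeric check of this rule (not in the slides) on the running example, using the two eigenvalues computed in Step 4; a 95% threshold is used here because that is what the next slide quotes.

import numpy as np

# Eigenvalues (= singular values of the covariance matrix) from the worked example
eigvals = np.array([1.2840277, 0.0490834])
retained = np.cumsum(eigvals) / eigvals.sum()
print(retained.round(4))                        # [0.9632 1.    ] -> k = 1 keeps ~96% of the variance
k = int(np.searchsorted(retained, 0.95) + 1)
print(k)                                        # 1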
  • 56. PCA Example – Step 5: Choosing components • Choosing the 1st component will retain more than 95% of the variance November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 56
  • 57. Principal Component Analysis (PCA) algorithm After mean normalization (ensure every feature has zero mean) and optionally feature scaling: Sigma = (1/m) * X' * X; [U,S,V] = svd(Sigma); Ureduce = U(:,1:k); z = Ureduce' * x; November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 57
  • 58. PCA Example – Step 6: Deriving the new data • FinalData = RowFeatureVector x RowDataAdjust • RowDataAdjust = X’ transposed: the mean-normalized data with every row being a dimension and every column a point (inverted form) • RowFeatureVector = U transposed: eigenvectors are in rows, with the most important in the first row November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 58
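The whole worked example (Steps 1-6) can be reproduced with a few lines of numpy; this is only a sketch, and the signs of the eigenvectors may come out flipped relative to the slides, which does not affect the result.

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
Xc = X - X.mean(axis=0)                      # step 2: subtract the mean
S = np.cov(Xc, rowvar=False)                 # step 3: covariance matrix (~[[0.6166, 0.6154], [0.6154, 0.7166]])
eigvals, U = np.linalg.eigh(S)               # step 4: eigenvalues/eigenvectors (ascending order)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]     # ~[1.2840, 0.0491]
Z = Xc @ U[:, :1]                            # step 6: project onto the first principal component
print(S.round(4), eigvals.round(4), Z.round(3), sep="\n")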
  • 59. Supervised learning speedup Extract inputs: from the labeled training set (x(1), y(1)), …, (x(m), y(m)) take the unlabeled dataset x(1), …, x(m), map it with PCA to z(1), …, z(m), and form the new training set (z(1), y(1)), …, (z(m), y(m)). Note: the mapping should be defined by running PCA only on the training set. This mapping can then be applied to the examples in the cross-validation and test sets as well. November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 59
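A sketch of this point with scikit-learn (the digits data and logistic regression are stand-ins, not from the slides): putting PCA and the classifier in one pipeline guarantees that the projection is fitted on the training inputs only and merely applied to the test set.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# PCA retaining 99% of the variance is fitted on X_train only; the same mapping
# is then applied to X_test before the classifier sees it.
model = make_pipeline(PCA(n_components=0.99), LogisticRegression(max_iter=5000))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))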
  • 60. Applications Application of PCA - Compression - Reduce memory/disk needed to store data - Speed up learning algorithm - Visualization November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 60
  • 61. Bad use of PCA: To prevent overfitting Use z(i) instead of x(i) to reduce the number of features to k < n. Thus, fewer features, less likely to overfit. This might work OK, but isn’t a good way to address overfitting. Use regularization instead. November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 61
  • 62. PCA is sometimes used where it shouldn’t be Design of ML system: - Get training set (x(1), y(1)), …, (x(m), y(m)) - Run PCA to reduce x(i) in dimension to get z(i) - Train logistic regression on the (z(i), y(i)) pairs - Test on test set: map each x(i) to z(i) and run the classifier on the mapped examples. How about doing the whole thing without using PCA? Before implementing PCA, first try running whatever you want to do with the original/raw data x(i). Only if that doesn’t do what you want, then implement PCA and consider using z(i). November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 62
  • 63. Data Cleansing Data Cleaning, Data Scrubbing, or Data Reconciliation November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 63
  • 64. What is data cleansing? • "Detect and remove errors and inconsistencies from data in order to improve the quality of data" [Rahm] • "The process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database" [Wikipedia] • Integral part of data processing and maintenance • Usually semi-automatic process, highly application specific November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 64
  • 65. How to • Necessity of getting to know your data: understanding the meaning of all the different attributes, the conventions used in coding them, the significance of missing values and duplicate data, measurement noise, typographical errors, and the presence of systematic errors—even deliberate ones. • There are also automatic methods of cleansing data, of detecting outliers, and of spotting anomalies, which we describe—including a class of techniques referred to as “one-class learning” in which only a single class of instances is available at training time. November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 65
  • 66. Data Cleansing vs. Data Validation • Data validation almost invariably means data is rejected from the system at entry and is performed at entry time, rather than on batches of data. • Example: Data validation in web forms November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 66
  • 67. Anomalies Classification • Syntactical Anomalies • Lexical Errors (Gender: {M, M, F, 5' 8}) • Domain format errors (Smith, John vs. Smith John) • Irregularities: non-uniform use of values, units, abbreviations (examples: different currencies in the salaries, different use of abbreviations) • Semantic Anomalies • Integrity constraint violations (AGE >= 0) • Contradictions (AGE vs. CURRENT_DATE - DATE_OF_BIRTH) • Duplicates • Invalid tuples • Coverage Anomalies • Missing values • Missing tuples November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 67
  • 68. Data Quality Criteria: Accuracy + Uniqueness • Accuracy = Integrity + Consistency + Density • Integrity = Completeness + Validity • Completeness: M in D / M i.e. I should have 1000 tuples and I have 500 => 50% (Missing values) • Validity: M in D / D, i.e. From the 500 tuples I have in D the 400 are valid => 80% (Illegal values) • Consistency = Schema conformance + Uniformity • Schema conformance: tuples conforming to syntactical structure / overall number of tuples (if in the database then it conforms) • Uniformity: attributes with no irregularities (non-uniform use of values) / total number of attributes • Density: missing values in the tuples in D / total values in D • Uniqueness: tuples of the same entity / total number of tuples November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 68
  • 69. Data Cleansing Operations 1. Format adaptation for tuples and values 2. Integrity constraint enforcement 3. Derivation of missing values from existing ones 4. Removing contradictions within or between tuples 5. Merging and eliminating duplicates 6. Detection of outliers, i.e. tuples and values having a high potential of being invalid November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 69
  • 70. Data Cleansing Process 1. Data auditing 2. Workflow specification, i.e. choose appropriate methods to automatically detect and remove them 3. Workflow execution, apply the methods to the tuples in the data collection 4. Post-processing / Control November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 70
  • 71. Data Auditing • Data Profiling: Instance analysis • Data Mining: Whole data collection analysis • Examples • Minimal, Maximal values • Value range • Variance • Uniqueness • Null value occurrences • Typical string patterns (through RegExps for example) • Search for characteristics that could be used for the correction of anomalies November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 71
  • 72. Methods for Data Cleansing • Parsing (Syntax Errors) • Data Transformation (source to target format) • Integrity Constraint Enforcement (checking & maintenance) • Duplicate Elimination • Statistical Methods • Outliers • Detection: mean, std, range, clustering, association rules • Remedy: set to average or other statistical value, censored, truncated (dropped) • Missing • Detection: It's missing :) • Remedy: Filling-in (imputing) in a number of ways (mean, median, regression, propensity score, Markov chain Monte Carlo method) November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 72
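For the simplest of the filling-in remedies listed above (mean or median imputation), a scikit-learn sketch follows; the regression, propensity-score, and MCMC-based schemes mentioned on the slide need dedicated multiple-imputation tooling and are not shown.

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])
# Mean imputation; strategy can also be 'median' or 'most_frequent'
imputer = SimpleImputer(strategy='mean')
print(imputer.fit_transform(X))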
  • 73. Outlier (Error) Detection in Datasets • Statistical: mean (μ), std (σ), range (Chebyshev's theorem) • Accept (μ - ε·σ) < f < (μ + ε·σ), where e.g. ε = 5, else reject • Needs training/testing data for finding the best ε (3, 4, 5, 6, etc.) • Boxplots (univariate data) • Clustering: high computational burden • Pattern-based, i.e. find a pattern where 90% of data exhibit the same characteristics • Association rules: pattern = association rule with high confidence and support November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 73
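The statistical accept/reject rule above in a few lines of numpy, on synthetic data with two injected outliers; ε is left as a parameter since, as noted, its best value has to be found empirically.

import numpy as np

def sigma_filter(values, eps=5.0):
    # Keep values f with (mu - eps*sigma) < f < (mu + eps*sigma); flag the rest as outliers
    mu, sigma = values.mean(), values.std()
    mask = np.abs(values - mu) < eps * sigma
    return values[mask], values[~mask]

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 500), [25.0, -30.0]])   # two injected outliers
kept, outliers = sigma_filter(data, eps=5.0)
print(outliers)   # the injected extreme values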
  • 74. Detecting anomalies • Visualization can help to detect anomalies • Automatic approach: apply committee of different learning schemes, e.g., • decision tree • nearest-neighbor learner • linear discriminant function • Conservative consensus approach: delete instances incorrectly classified by all of them • Problem: might sacrifice instances of small classes November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 74
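A sketch of the conservative consensus filter described above, assuming scikit-learn; out-of-fold predictions are used here (an extra precaution not stated on the slide) so that each instance is judged by models that did not see it during training.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, flip_y=0.1, random_state=0)
committee = [DecisionTreeClassifier(random_state=0),
             KNeighborsClassifier(),
             LinearDiscriminantAnalysis()]
preds = [cross_val_predict(clf, X, y, cv=10) for clf in committee]
wrong_by_all = np.all([p != y for p in preds], axis=0)       # consensus: all three misclassify
X_clean, y_clean = X[~wrong_by_all], y[~wrong_by_all]
print(wrong_by_all.sum(), "instances flagged as likely noisy/mislabeled")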
  • 75. One-Class Learning • Usually training data is available for all classes • Some problems exhibit only a single class at training time • Test instances may belong to this class or a new class not present at training time • This is the problem of one-class classification • Predict either target or unknown • Note that, in practice, some one-class problems can be reformulated into two-class ones by collecting negative data • Other applications truly do not have negative data, e.g., password hardening, nuclear plant operational status November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 75
  • 76. Outlier detection • One-class classification is often used for outlier/anomaly/novelty detection • First, a one-class model is built from the dataset • Then, outliers are defined as instances that are classified as unknown • Another method: identify outliers as instances that lie beyond distance d from percentage p of training data • Density estimation is a very useful approach for one-class classification and outlier detection • Estimate the density of the target class and mark low-probability test instances as outliers • Threshold can be adjusted to calibrate sensitivity November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 76
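A sketch of the density-estimation approach, assuming scikit-learn and synthetic data: a mixture model is fitted to the target class, a log-density threshold is calibrated on that class, and test instances below the threshold are marked as outliers.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
target = rng.normal(0, 1, size=(500, 2))                 # the single "target" class
test = np.vstack([rng.normal(0, 1, size=(10, 2)),        # normal-looking test points
                  rng.uniform(5, 8, size=(5, 2))])       # far-away points
gm = GaussianMixture(n_components=2, random_state=0).fit(target)
threshold = np.percentile(gm.score_samples(target), 5)   # calibrate sensitivity on the target class
is_outlier = gm.score_samples(test) < threshold
print(is_outlier)   # the last five points should be flagged as "unknown"/outliers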
  • 77. Transforming multiple classes to binary ones Output processing November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 77
  • 78. Transforming multiple classes to binary ones • Some learning algorithms only work with two-class problems, e.g., standard support vector machines • Sophisticated multi-class variants exist in many cases but can be very slow or difficult to implement • A common alternative is to transform multi-class problems into multiple two-class ones • Simple methods: • Discriminate each class against the union of the others – one-vs.-rest • Build a classifier for every pair of classes – pairwise classification • We will discuss error-correcting output codes, which can often improve on these November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 78
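Both simple methods are available as generic wrappers in scikit-learn; the sketch below (digits data as a stand-in, not from the slides) shows how many binary models each one builds for a 10-class problem.

from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
ovr = OneVsRestClassifier(LinearSVC(dual=False)).fit(X, y)   # one model per class vs. the rest
ovo = OneVsOneClassifier(LinearSVC(dual=False)).fit(X, y)    # one model per pair of classes
print(len(ovr.estimators_), len(ovo.estimators_))            # 10 and 10*9/2 = 45 binary models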
  • 79. Error-correcting output codes • Multiclass problem → multiple binary problems • Simple one-vs.-rest scheme: one-per-class coding (class → code vector: a → 1000, b → 0100, c → 0010, d → 0001); if the base classifiers output e.g. 1010, the class is ambiguous: ?? • Idea: use error-correcting codes instead (class → code vector: a → 1111111, b → 0000111, c → 0011001, d → 0101010); base classifiers predict 1011111, true class = ?? • Use bit vectors (codes) so that we have a large Hamming distance d between any pair of bit vectors: • Can correct up to (d – 1)/2 single-bit errors November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 79
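Decoding the example above in plain Python: the predicted bit string 1011111 is compared to each codeword, and the class with the smallest Hamming distance wins, which corrects the single-bit error.

# 7-bit error-correcting codes from the slide; any two codewords differ in at least 4 bits,
# so up to one erroneous base classifier can be corrected.
codes = {'a': '1111111', 'b': '0000111', 'c': '0011001', 'd': '0101010'}
predicted = '1011111'   # outputs of the 7 base classifiers

def hamming(u, v):
    return sum(b1 != b2 for b1, b2 in zip(u, v))

distances = {cls: hamming(code, predicted) for cls, code in codes.items()}
print(distances)                          # {'a': 1, 'b': 3, 'c': 3, 'd': 5}
print(min(distances, key=distances.get))  # 'a' -> decode to the nearest codeword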
  • 80. The End Kyriakos C. Chatzidimitriou http://kyrcha.info kyrcha@gmail.com November 2016 Kyriakos C. Chatzidimitriou - http://kyrcha.info 80
