1. Unit 3: Data Quality and
Preprocessing
Mr. V. H. Kondekar
E&TC Dept., WIT, Solapur
2. Course Outcomes
ET424.1 Discuss challenges in big data analytics and
describe fundamental techniques and principles for data
analytics.
ET424.2 Identify, organize and operate on the datasets to
compute statistics for data analysis
ET424.3 Select and implement appropriate data
visualizations to clearly communicate analytic insights.
ET424.4 Apply different preprocessing techniques for data
quality enhancement
ET424.5 Use the tools and techniques to apply different
algorithms and methodologies
3. Data Quality?
Depending on the type of data scale, different data quality and
preprocessing techniques can be used.
The following factors can generate data sets that are noisy,
inconsistent or contain duplicate records:
The nature of the application domain
Human error
The integration of different data sets (say, from different
devices)
The methodology used to collect data
4. Why Preprocessing?
When these data are used by algorithms that learn from
data (ML algorithms), the analysis problem can look
more complex than it really is if there is no data
preprocessing.
This increases the time required for the induction of
assumptions or models and results in models that do
not capture the true patterns present in the data set.
The elimination or even just the reduction of these
problems can lead to an improvement in the quality of
knowledge extracted by data analysis processes.
5. What affects data Quality?
Data quality is important and can be affected by internal
and external factors.
Internal factors can be linked to the measurement
process and the collection of information through the
attributes chosen.
External factors are related to faults in the data
collection process, and can involve the absence of
values for some attributes and the voluntary or
involuntary addition of errors to others.
6. What are the main problems affecting data
quality?
The main problems affecting data quality are
associated with
Missing values
Inconsistency
Redundancy
Noise
Outliers
8. Missing Values?
There are several causes of missing values, among them:
attribute values only recorded some time after the start of data collection,
so that early records do not have a value
the value of an attribute being unknown at time of collection
distraction, misunderstanding or refusal at time of collection
attribute not required for particular objects
non-existence of a value
fault in the data collection device
cost or difficulty of assigning a class label to an object in classification
problems.
9. How to deal with missing values?
Ignore missing values:
– Use for each object only the attributes with values, without paying
attention to missing values. This does not require any change in the
modeling algorithm used, but the distance function should ignore any
attribute for which at least one of the compared objects has a
missing value;
– Modify a learning algorithm to allow it to accept and work with
missing values.
Remove objects: Use only those objects with values for all
attributes.
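A minimal sketch of two of the options above: removing incomplete objects, and a distance function that skips attributes with a missing value. The records, the attribute names and the helpers remove_incomplete and distance_ignoring_missing are hypothetical, and None is assumed to mark a missing value.

```python
import math

# Hypothetical toy records; None marks a missing value (an assumption).
records = [
    {"age": 25, "income": 3000.0},
    {"age": None, "income": 4200.0},
    {"age": 41, "income": None},
    {"age": 33, "income": 3900.0},
]

def remove_incomplete(rows):
    """Remove objects: keep only objects with values for all attributes."""
    return [r for r in rows if all(v is not None for v in r.values())]

def distance_ignoring_missing(a, b):
    """Euclidean distance over the attributes present in both objects."""
    shared = [k for k in a if a[k] is not None and b.get(k) is not None]
    if not shared:
        return math.nan  # no attribute can be compared
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in shared))

complete = remove_incomplete(records)
d = distance_ignoring_missing(records[0], records[1])
```

Here only two of the four records survive removal, while the distance between the first two records is computed from income alone. Removing objects is simple but discards data, so it is usually reasonable only when few objects are incomplete.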
10. How to estimate values?
Several methods can be used:
• Fill with a location value: the mean or median for quantitative and ordinal
attributes, and the mode for nominal values. The mean is just the average
of the values and the mode is the value that appears most often in the
attribute. The median is the value that is greater than half of the values
and lower than the remaining half.
• For classification tasks, we can use the previous method restricted to
instances from the same class when calculating the location statistic. In
other words, if we intend to fill the value of attribute "at" of instance i
that belongs to class C1, we will use only instances from class C1 that do
not have missing values in the "at" attribute.
• A learning algorithm can be used as a prediction model giving a
replacement value for one that is missing in a particular attribute. The
learning algorithm uses all other attributes as predictors and the one to be
filled as the target.
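The first two methods above can be sketched as follows. This is only an illustration: the helpers fill_missing and fill_missing_by_class and the toy data are hypothetical, and None is assumed to mark a missing value.

```python
from statistics import mean, median, mode

def fill_missing(values, statistic):
    """Replace None entries with a location statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]

def fill_missing_by_class(values, labels, statistic):
    """Same idea, but the statistic is computed separately for each class.

    Raises statistics.StatisticsError if a class has no observed value.
    """
    fills = {}
    for c in set(labels):
        observed = [v for v, l in zip(values, labels)
                    if l == c and v is not None]
        fills[c] = statistic(observed)
    return [fills[l] if v is None else v for v, l in zip(values, labels)]

# Hypothetical data: a quantitative attribute, class labels, a nominal attribute.
heights = [160.0, None, 170.0, 180.0, None, 190.0]
classes = ["C1", "C1", "C1", "C2", "C2", "C2"]
colours = ["red", "blue", None, "red"]

filled_mean = fill_missing(heights, mean)                     # gaps become 175.0
filled_class = fill_missing_by_class(heights, classes, mean)  # 165.0 (C1), 185.0 (C2)
filled_mode = fill_missing(colours, mode)                     # gap becomes "red"
```

Passing median instead of mean gives a fill value that is less sensitive to extreme values.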
15. How to detect Outliers?
A simple yet effective method to detect outliers in
quantitative attributes is based on the interquartile range.
Let Q1 and Q3 be the first quartile and the
third quartile, respectively.
The interquartile range is given by IQ = Q3 − Q1.
Values below Q1 − 1.5 × IQ or above Q3 + 1.5 × IQ are
considered too far away from central values to be
reasonable.
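A small sketch of this rule using only the Python standard library. The data are made up, the helper iqr_outliers is hypothetical, and the "inclusive" quartile convention is an assumption: other quartile conventions give slightly different limits.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return (low, high, outliers) using the 1.5 x IQ rule."""
    # quantiles() with n=4 returns the three cut points Q1, Q2 and Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iq = q3 - q1
    low, high = q1 - 1.5 * iq, q3 + 1.5 * iq
    return low, high, [v for v in values if v < low or v > high]

# Hypothetical measurements with one suspicious value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
low, high, outliers = iqr_outliers(data)
# Q1 = 12.0, Q3 = 13.75, IQ = 1.75, so the limits are 9.375 and 16.375
# and the only value flagged is 102.
```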
17. Classification: Nominal, Ordinal, Interval,
Ratio
Data Element
Nominal: A scale of measurement where levels are distinct but
do not vary in magnitude.
Ordinal: A scale of measurement where levels vary in order of
magnitude but equal intervals between levels cannot be
assumed.
Interval: The interval level of measurement has the
characteristics of distinct levels, ordering in magnitude,
and equal intervals. Equal intervals are obtained if
equivalent differences between measurements represent the
same amount of difference in the property being measured.
Ratio: The ratio level of measurement has characteristics of
distinct levels, ordering in magnitude, equal intervals, and
an absolute zero. A measurement has an absolute zero when a
measurement of zero represents the absence of the property
being measured.
18. Contrasting Nominal, Ordinal,
Interval and Ratio
Scale has levels that are:   Nominal  Ordinal  Interval  Ratio
Distinctive                     X        X        X        X
Ordered                                  X        X        X
Equally spaced                                    X        X
Has an absolute zero                                       X
Nominal and ordinal scales are qualitative (categorical);
interval and ratio scales are quantitative (numerical).
19. Converting to a Different Scale Type?
Converting Nominal to Relative
The nominal scale does not assume an order between its
values. To keep this information, nominal values should
be converted to relative or binary values.
The most common conversion is called “1-of-n”, also
known as canonical or one-attribute-per-value
conversion, which transforms n values of a nominal
attribute into n binary attributes. A binary attribute has
only two values, 0 or 1.
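The 1-of-n conversion can be sketched as below. The helper one_of_n and the colour values are hypothetical, and sorting the categories is just one possible way to fix the order of the binary attributes.

```python
def one_of_n(values):
    """Map each nominal value to n binary (0/1) attributes."""
    categories = sorted(set(values))  # fixed column order (an assumption)
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

# Hypothetical nominal attribute with n = 3 distinct values.
colours = ["red", "green", "blue", "green"]
categories, encoded = one_of_n(colours)
# categories: ['blue', 'green', 'red']
# encoded:    [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each object gets exactly one 1 among its n binary attributes, so no artificial order between the nominal values is introduced.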