Fall 2024-25
Precision Agriculture
Dr. C. Moganapriya
moganapriya.c@vit.ac.in
Data Preprocessing
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Data Quality: Why Preprocess the Data?
• Measures for data quality: a multidimensional view
  • Accuracy: correct or wrong, accurate or not
  • Completeness: not recorded, unavailable, …
  • Consistency: some records modified but some not, dangling references, …
  • Timeliness: is the data updated in a timely manner?
  • Believability: how much can the data be trusted to be correct?
  • Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
• Data cleaning
• Data integration
• Data reduction
• Data transformation
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
  • Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., Occupation = “ ” (missing data)
  • Noisy: containing noise, errors, or outliers
    • e.g., Salary = “−10” (an error)
  • Inconsistent: containing discrepancies in codes or names, e.g.,
    • Age = “42”, Birthday = “03/07/2010”
    • rating was “1, 2, 3”, now rating is “A, B, C”
    • discrepancy between duplicate records
  • Intentional (e.g., disguised missing data)
    • Jan. 1 as everyone’s birthday?
How to Handle Missing Data?
• Fill in the missing value manually: tedious, and often infeasible
• Fill it in automatically with
  • a global constant, e.g., “unknown” (effectively a new class?!)
  • the attribute mean
  • the attribute mean for all samples belonging to the same class: smarter
  • the most probable value: inference-based, e.g., a Bayesian formula or a decision tree
• Manual filling suits small data sets; automatic filling is more efficient for larger data sets (a sketch of the automatic strategies follows below)
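As a minimal pandas sketch of the first three automatic strategies, assuming an invented crop table (the column names, values, and the −1.0 sentinel are all hypothetical):

```python
import pandas as pd

# Toy table with gaps in "yield_t_ha" (all values invented for illustration)
df = pd.DataFrame({
    "crop":       ["rice", "rice", "wheat", "wheat", "rice"],
    "yield_t_ha": [5.1, None, 3.2, None, 4.9],
})

# 1) Global constant: mark each gap with a sentinel "unknown"-style value
df["fill_const"] = df["yield_t_ha"].fillna(-1.0)

# 2) Attribute mean: the overall column mean replaces every gap
df["fill_mean"] = df["yield_t_ha"].fillna(df["yield_t_ha"].mean())

# 3) Class-conditional mean: the mean within the same crop class ("smarter")
df["fill_class_mean"] = df["yield_t_ha"].fillna(
    df.groupby("crop")["yield_t_ha"].transform("mean")
)
print(df)
```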
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitations
  • inconsistency in naming conventions
• Other data problems that require data cleaning
  • duplicate records
  • incomplete data
  • inconsistent data
How to Handle Noisy Data?
• Binning
  • first sort the data and partition it into (equal-frequency) bins
  • then smooth by bin means, by bin medians, by bin boundaries, etc.
• Regression
  • smooth by fitting the data to regression functions
• Clustering
  • similar items are grouped, and values that fall outside the groups are detected and removed as outliers (a sketch follows below)
• Combined computer and human inspection
  • detect suspicious values and have a human check them (e.g., deal with possible outliers)
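One simple clustering-based detector, sketched with scikit-learn's KMeans on invented readings (the data, the cluster count, and the singleton-cluster rule are assumptions, not the only way to do this):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical readings; 99.0 is a deliberately injected outlier
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 99.0]).reshape(-1, 1)

# Group similar values; a lone outlier tends to end up in its own cluster
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
labels, counts = np.unique(km.labels_, return_counts=True)

# Flag members of singleton clusters as suspects for human inspection
small = labels[counts == 1]
suspects = values.ravel()[np.isin(km.labels_, small)]
print("suspected outliers:", suspects)   # expect [99.]
```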
Data Integration
• Data integration: combines data from multiple heterogeneous sources into a coherent store
• Two types (a join sketch follows below):
  • Tight coupling
    • data is physically combined into a single location
  • Loose coupling
    • only an interface is created; data is combined and accessed through that interface
    • the data remains in the actual source databases only
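A minimal tight-coupling sketch with pandas, assuming two invented source tables keyed by a shared field_id (table names, columns, and values are hypothetical):

```python
import pandas as pd

# Two hypothetical heterogeneous sources describing the same fields
soil    = pd.DataFrame({"field_id": [1, 2, 3], "soil_ph": [6.5, 5.9, 7.1]})
weather = pd.DataFrame({"field_id": [1, 2, 3], "rain_mm": [12, 30, 8]})

# Tight coupling: physically materialize one coherent store via a key join
combined = soil.merge(weather, on="field_id", how="inner")
combined.to_csv("combined_store.csv", index=False)  # the stored physical copy
print(combined)
```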
Data Reduction
• The volume of data is reduced to make analysis easier
Data Reduction
• Dimensionality reduction
  • reduces the number of input variables in the data set, since a large number of input variables can lead to poor performance
• Data cube aggregation
  • data is combined to construct a data cube
• Attribute subset selection
  • only highly relevant attributes are kept and the other attributes are discarded; in this way the data is reduced
• Numerosity reduction
  • only a model of the data is stored instead of the entire data
    • parametric
    • non-parametric: histograms, clustering, sampling
(A sketch of dimensionality reduction and sampling follows below.)
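A minimal sketch of two of these strategies on invented data: PCA (a standard dimensionality-reduction technique, used here purely for illustration) and random sampling as a non-parametric numerosity reduction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples of 5 correlated attributes (2 true factors + noise)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(100, 5))

# Dimensionality reduction: project the 5 attributes onto 2 principal components
X2 = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", X2.shape)        # (100, 5) -> (100, 2)

# Numerosity reduction (non-parametric): keep a random sample instead of all rows
sample = X[rng.choice(len(X), size=10, replace=False)]
print("sampled rows:", sample.shape)  # (10, 5)
```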
Data Reduction
[Figure: data cube aggregation]
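In the same spirit, a small pandas roll-up on an invented quarterly sales table (a one-dimensional stand-in for cube aggregation; the figures are made up):

```python
import pandas as pd

# Hypothetical quarterly records; aggregation rolls them up per year,
# the way a data cube summarizes detail cells into higher-level cells
sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [200, 350, 300, 400, 250, 380, 320, 450],
})
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)   # 8 detailed rows reduced to 2 aggregated rows
```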
Data Compression
[Figure: lossless compression — the original data can be reconstructed exactly from the compressed data; lossy compression — only an approximation of the original data can be recovered]
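A minimal sketch of the distinction, using zlib for a lossless round-trip and float16 truncation as a simple stand-in for lossy compression (the data is invented):

```python
import zlib
import numpy as np

data = np.linspace(0.0, 1.0, 1000)   # hypothetical measurements
raw = data.tobytes()

# Lossless: zlib round-trips the exact original bytes
packed = zlib.compress(raw)
assert zlib.decompress(packed) == raw
print(len(raw), "->", len(packed), "bytes (lossless)")

# Lossy: store low-precision values; only an approximation comes back
approx = data.astype(np.float16).astype(np.float64)
print("max error after lossy round-trip:", np.abs(data - approx).max())
```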
Data Transformation
• Data is transformed into a form suitable for the mining process
• There are four methods of data transformation:
  1. Normalization
  2. Attribute selection
  3. Discretization
  4. Concept hierarchy generation

1. Normalization
Normalization scales the data values into a specified range, for example −1.0 to +1.0 or 0 to 1 (a min–max sketch follows below)
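Min–max normalization is one standard way to do this; a minimal sketch (the helper name and the reuse of the price list from the binning slide are our choices):

```python
import numpy as np

def min_max_normalize(x, new_min=0.0, new_max=1.0):
    """Min-max normalization: rescale values linearly into [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # data from the binning slide
print(min_max_normalize(prices))               # scaled to [0, 1]
print(min_max_normalize(prices, -1.0, 1.0))    # scaled to [-1, 1]
```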
Data Transformation
2. Attribute selection
New attributes are constructed from the existing ones
3. Discretization
Raw values are replaced by interval values (a sketch follows below)
4. Concept hierarchy generation
Attributes are converted from a low level to a higher level
Example: city to country
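A minimal discretization sketch with pandas, replacing the raw price values by three equal-width intervals (pd.cut is one standard tool for this; the label names are invented):

```python
import pandas as pd

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Discretization: replace raw values with interval values (3 equal-width bins)
intervals = pd.cut(prices, bins=3)
print(intervals)

# Or keep human-readable interval labels instead of the raw numbers
labels = pd.cut(prices, bins=3, labels=["low", "mid", "high"])
print(list(labels))
```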
Automatic Concept Hierarchy Generation
• Some hierarchies can be generated automatically by analyzing the number of distinct values per attribute in the data set
  • The attribute with the most distinct values is placed at the lowest level of the hierarchy
  • There are exceptions where the count heuristic fails, e.g., weekday, month, quarter, year (weekday has only 7 distinct values but does not belong at the top of the time hierarchy)
country (15 distinct values)
  province_or_state (365 distinct values)
    city (3,567 distinct values)
      street (674,339 distinct values)
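A minimal sketch of the counting heuristic on an invented address table (rows and names are hypothetical; real data would have far more rows):

```python
import pandas as pd

# Hypothetical address table; in practice this would be the real data set
addresses = pd.DataFrame({
    "country": ["IN", "IN", "IN", "US", "US", "US", "US"],
    "state":   ["TN", "TN", "KA", "CA", "CA", "TX", "TX"],
    "city":    ["Vellore", "Chennai", "Bengaluru", "Fresno", "Davis", "Austin", "Austin"],
    "street":  ["1st St", "MG Rd", "Elm St", "Oak Ave", "Pine Rd", "Main St", "2nd St"],
})

# Heuristic: fewer distinct values -> higher in the concept hierarchy
counts = addresses.nunique().sort_values()
print(counts)
print("generated hierarchy (top -> bottom):", " -> ".join(counts.index))
```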
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
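The same example, reproduced as a short NumPy sketch (the snapping rule for boundaries, with ties going to the lower boundary, is our reading of the slide's numbers):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-frequency (equi-depth) bins: sort, then split into 3 bins of 4 values
bins = np.sort(prices).reshape(3, 4)

# Smooth by bin means: every value becomes its bin's (rounded) mean
by_means = np.repeat(bins.mean(axis=1).round().astype(int), 4).reshape(3, 4)

# Smooth by bin boundaries: snap each value to the nearer of its bin's min/max
lo, hi = bins[:, :1], bins[:, -1:]
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)   # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)  # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```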
Discretization Without Using Class Labels (Binning vs. Clustering)
[Figure panels: the raw data; equal-interval-width binning; equal-frequency binning; and K-means clustering, which leads to better results]
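A side-by-side sketch of the three unsupervised discretizations on the price list from the binning slide (reusing that data is our choice; the cluster ids from KMeans are arbitrary labels):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

x = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)

width = pd.cut(x, bins=3, labels=False)    # equal interval width
depth = pd.qcut(x, q=3, labels=False)      # equal frequency
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x.reshape(-1, 1))

print("equal width:    ", width)
print("equal frequency:", depth)
print("k-means:        ", km.labels_)      # ids follow the data's natural groups
```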
Summary
• Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
• Data cleaning: e.g., missing/noisy values, outliers
• Data integration from multiple sources:
  • entity identification problem
  • remove redundancies
  • detect inconsistencies
• Data reduction
  • dimensionality reduction
  • numerosity reduction
  • data compression
• Data transformation
  • normalization
  • concept hierarchy generation