Introduction To Machine Learning
Data Processing
March 2024
S. Hassan Adelyar, Ph.D
Instructor of Computer Science Faculty
Kabul University
Data Processing
• Real-world data is often incomplete, inconsistent, or even erroneous.
• Data preprocessing resolves such issues.
• Data preprocessing is the first step in any data mining process.
• It transforms raw data into an understandable format.
Data Preprocessing Methods
• Data preprocessing is performed in the following stages:
  - Data Cleaning
  - Data Integration
  - Data Transformation
  - Data Reduction
Data Cleaning
• In data cleaning:
  - Missing values are filled.
  - Noisy data is smoothed.
  - Inconsistencies are resolved.
  - Outliers are identified and removed.
• Filling Missing Values
  - Missing values can be filled by several methods.
• Fill in the missing value manually.
• Use a global constant in place of the missing value, such as ‘Unknown’ or -∞.
• Use the attribute mean to fill in the missing value.
• Use the most probable value to fill in the missing value.
• Ignore the tuple.
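The filling strategies listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample values are made up, `None` is assumed to mark a missing value, and the most frequent observed value is used as a simple stand-in for the "most probable" value.

```python
from statistics import mean, mode

# Hypothetical records; None marks a missing value.
ages = [23, None, 31, 27, None, 23]
cities = ["Kabul", None, "Herat", "Kabul", "Kabul", None]

def fill_with_constant(values, constant):
    """Replace every missing entry with one global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace missing numeric entries with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]

def fill_with_most_frequent(values):
    """Replace missing entries with the most frequent observed value
    (an assumed stand-in for the 'most probable' value)."""
    observed = [v for v in values if v is not None]
    most_frequent = mode(observed)
    return [most_frequent if v is None else v for v in values]

def drop_incomplete(rows):
    """Ignore (drop) every tuple that has at least one missing field."""
    return [row for row in rows if None not in row]

filled_ages = fill_with_mean(ages)
filled_cities = fill_with_most_frequent(cities)
labelled_cities = fill_with_constant(cities, "Unknown")
complete_rows = drop_incomplete(list(zip(ages, cities)))
```

Which strategy is appropriate depends on the data: dropping tuples is the simplest option but discards information, while filling with the mean keeps every tuple at the cost of reducing the attribute's variance.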
• Handling Noisy Data
  - Noisy data is a data set that contains extra, meaningless data.
  - Noise can be defined as unwanted variance or random error in a measured variable.
  - Most data mining algorithms are adversely affected by noisy data.
• Clustering or outlier analysis
  - Clustering, or outlier analysis, detects outliers by grouping the data.
  - Values that are common or similar are organized into groups, or ‘clusters’; values that lie outside these clusters are treated as outliers or noise.
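As a rough illustration of the idea above, the sketch below clusters one-dimensional values by the gaps between them and flags values that end up in very small clusters. The gap threshold, the minimum cluster size, and the sample readings are all assumptions chosen for the example; real projects usually rely on a dedicated clustering algorithm. The input list is assumed to be non-empty.

```python
def cluster_1d(values, max_gap):
    """Group sorted values; a new cluster starts when the gap exceeds max_gap."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for value in ordered[1:]:
        if value - clusters[-1][-1] <= max_gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters

def find_outliers(values, max_gap, min_cluster_size):
    """Values in clusters smaller than min_cluster_size are flagged as outliers."""
    outliers = []
    for cluster in cluster_1d(values, max_gap):
        if len(cluster) < min_cluster_size:
            outliers.extend(cluster)
    return outliers

# Hypothetical sensor readings: two dense groups and one isolated value.
readings = [10, 11, 12, 13, 50, 51, 52, 95]
outliers = find_outliers(readings, max_gap=3, min_cluster_size=2)
```

Here the isolated reading 95 forms a cluster of its own and is therefore reported as an outlier, while the two dense groups are kept.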
• Regression
  - Regression is another method; it smooths data by fitting it to a function.
  - For example, linear regression, one of the most widely used methods, finds the line that best fits the values of two variables (attributes).
  - Its primary purpose is to predict the value of one variable from the other.
• Similarly, multiple regression is used when more than two variables are involved.
• Fitting the data to a mathematical equation removes noise and thus smooths the dataset.
• Linear regression is used to estimate the relationship between two variables.
• For example, we might assume that, given an increase in population, food production would increase at the same rate.
• To visualize this, consider a graph in which the Y-axis tracks population and the X-axis tracks food production.
• As the Y value increases, the X value increases at the same rate, so the relationship between them is a straight line.
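A minimal sketch of smoothing by linear regression, assuming ordinary least squares as the fitting method. The figures are made up for illustration and follow the axes used in the example above (food production on the X-axis, population on the Y-axis); the function and variable names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical, made-up figures: noisy observations around a straight line.
food_production = [1.0, 2.0, 3.0, 4.0, 5.0]   # X-axis
population = [2.1, 3.9, 6.2, 7.8, 10.0]       # Y-axis

slope, intercept = fit_line(food_production, population)

# Smoothing: replace each noisy observation with the value on the fitted line.
smoothed = [slope * x + intercept for x in food_production]
```

The fitted line can also be used for prediction: for a new X value, `slope * x + intercept` gives the estimated Y value.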
• Handling of Inconsistent Data
  - Data inconsistency arises when multiple tables within a database deal with the same data but may receive it from different inputs.
• Inconsistency is generally compounded by data redundancy.
• However, it differs from data redundancy in that it typically refers to problems with the content of a database rather than its
Data Integration
• Data integration is a process that combines data from many sources (such as multiple databases and flat files) into a unified data store.
• Data integration is an essential step in data analysis.
• During data integration, a number of tricky issues have to be considered.
• For example, how can the data analyst, or the analyzing machine, be sure that student_id in one database and student_number in another refer to the same entity?
• The solution lies in metadata.
• Databases and data warehouses contain metadata, which is data about data.
• The data analyst uses this metadata as a reference to avoid errors.
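The role of metadata in matching student_id with student_number can be sketched as a small field-mapping table. Everything except those two attribute names is hypothetical: the source names (`registrar_db`, `library_db`), the other fields, and the sample records are assumptions made for the example.

```python
# Hypothetical metadata: maps each source's column names to the unified schema.
FIELD_METADATA = {
    "registrar_db": {"student_id": "student_id", "full_name": "name"},
    "library_db": {"student_number": "student_id", "books_borrowed": "books_borrowed"},
}

def to_unified(record, source):
    """Rename a record's fields according to the metadata for its source."""
    mapping = FIELD_METADATA[source]
    return {mapping[field]: value for field, value in record.items()}

def integrate(sources):
    """Merge records from all sources into one store keyed by student_id."""
    unified = {}
    for source, records in sources.items():
        for record in records:
            row = to_unified(record, source)
            unified.setdefault(row["student_id"], {}).update(row)
    return unified

sources = {
    "registrar_db": [{"student_id": 101, "full_name": "Student A"}],
    "library_db": [{"student_number": 101, "books_borrowed": 3}],
}
store = integrate(sources)
```

Because the mapping is stored as data rather than hard-coded, adding a further source in this sketch only requires a new metadata entry.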
• Another issue that schema integration may cause is redundancy.
• In database terms, an attribute is redundant if it can be derived from some other table (of the same database).
• Mistakes in attribute naming can also lead to redundancies in the resulting dataset.
• A number of tools are available to integrate data from different sources into one unified schema.
Data Transformation
• Data transformation is a process in which data is consolidated or transformed into standard forms that are better suited to data mining.
• Normalization and standardization are the most popular and widely used data transformation methods.
• Normalization
  - In normalization, all attribute values are rescaled to a normalized score in the range [0, 1].
• The weakness of normalization is its sensitivity to outliers.
• An outlier tends to crunch all of the other values down toward zero.
• To understand this, suppose the range of students’ marks is 35 to 45 out of
• Then 35 is mapped to 0 and 45 to 1, and the students are distributed between 0 and 1 according to their marks.
• But if one student has a mark of 90, it acts as an outlier: 35 is mapped to 0 and 90 to 1.
• This crunches most of the values down toward zero.
• In this scenario, the solution is standardization.
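The marks example can be reproduced with min-max normalization, x' = (x - min) / (max - min). The sketch below is illustrative only; the individual marks between 35 and 45 are made up, and the function name is hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # All values are equal; map them all to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - low) / span for v in values]

# Hypothetical marks in the range 35 to 45.
marks = [35, 38, 40, 42, 45]
normalized = min_max_normalize(marks)

# Adding one outlier (90) squeezes the original marks toward zero.
marks_with_outlier = marks + [90]
squeezed = min_max_normalize(marks_with_outlier)
```

Without the outlier the marks spread over the whole interval from 0 to 1. With the outlier, the mark 45 drops from 1 to 10/55 (about 0.18), so all of the original marks are crowded into the lowest fifth of the range.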
• Standardization
  - In standardization, the values are rescaled so that they have a mean of 0 and a standard deviation of 1.
• There is no general rule for when to use normalization versus standardization.
• As a rule of thumb, if your data has outliers, use standardization; otherwise, use normalization.
• Standardization tends to make the values of all attributes fall into similar ranges, since every attribute ends up with the same standard deviation of 1.
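A minimal sketch of standardization as z-scores, z = (x - mean) / standard deviation. It assumes the population standard deviation (`statistics.pstdev`); the sample standard deviation is also commonly used. Both attributes and their values are made up for illustration.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert values to z-scores: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # All values are equal; there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Two hypothetical attributes measured on very different scales.
marks = [35, 38, 40, 42, 45, 90]
heights_cm = [150, 160, 165, 170, 180, 175]

z_marks = standardize(marks)
z_heights = standardize(heights_cm)
```

After the transformation both attributes have a standard deviation of 1, so they fall into comparable ranges even though the raw scales differ.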
Data Reduction
• Data reduction techniques represent a dataset in a reduced form without compromising the integrity of the original data.
• Data reduction means reducing either the dimensions (the total number of attributes) or the volume of the data.
• Moreover, mining carried out on reduced datasets is more efficient and often yields better accuracy.
• There are many methods for reducing large datasets to yield useful knowledge.
• A few of them are:
• Dimension reduction
  - In data warehousing, a ‘dimension’ provides structured labeling information, but not all dimensions (attributes) are necessary at a given time.
• Dimension reduction uses algorithms such as Principal Component Analysis (PCA).
• With such algorithms, one can detect and remove redundant and weakly relevant attributes or dimensions.
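To make the idea behind PCA concrete, here is a rough pure-Python sketch that finds the first principal component by power iteration on the covariance matrix and projects two attributes onto it. The data, the function names, and the choice of power iteration are assumptions for illustration; in practice a numerical library would normally be used.

```python
def covariance_matrix(rows):
    """Covariance matrix of the columns of rows (population form, divide by n)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    return cov, centered

def first_component(cov, iterations=200):
    """Approximate the dominant eigenvector of cov by power iteration."""
    d = len(cov)
    vector = [1.0] * d
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = sum(p * p for p in product) ** 0.5
        vector = [p / norm for p in product]
    return vector

# Hypothetical data: the second attribute is roughly twice the first,
# so the two attributes are largely redundant.
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.0]]
cov, centered = covariance_matrix(rows)
component = first_component(cov)

# Project each centered 2-attribute row onto the component: 2 attributes become 1.
reduced = [sum(c * v for c, v in zip(component, row)) for row in centered]
```

Because the two attributes are strongly correlated, a single projected value per row retains almost all of the variation in the original data.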
• Multiplicity (numerosity) reduction
  - This technique chooses smaller forms of data representation to reduce the dataset volume.
• Data compression
  - Data compression techniques reduce the dataset size.
  - These techniques use encoding mechanisms (e.g. Huffman coding).
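Huffman coding, mentioned above, assigns shorter bit strings to more frequent symbols. The following is a compact sketch using Python's `heapq`; the sample text and function names are made up, the input is assumed to be non-empty, and the bits are kept as a string of '0' and '1' characters for readability rather than packed into bytes.

```python
import heapq
from collections import Counter

def build_codes(text):
    """Build a Huffman code table (symbol -> bit string) for the given text."""
    frequencies = Counter(text)
    # Heap entries: (frequency, tie_breaker, {symbol: code_so_far}).
    heap = [(freq, index, {symbol: ""})
            for index, (symbol, freq) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single distinct symbol still needs a one-bit code.
        _, _, table = heap[0]
        return {symbol: "0" for symbol in table}
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        freq_a, _, table_a = heapq.heappop(heap)
        freq_b, _, table_b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in table_a.items()}
        merged.update({s: "1" + c for s, c in table_b.items()})
        heapq.heappush(heap, (freq_a + freq_b, counter, merged))
        counter += 1
    return heap[0][2]

def encode(text, codes):
    """Concatenate the code of every symbol in the text."""
    return "".join(codes[symbol] for symbol in text)

def decode(bits, codes):
    """Decode by matching prefixes; Huffman codes are prefix-free."""
    reverse = {code: symbol for symbol, code in codes.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            symbols.append(reverse[current])
            current = ""
    return "".join(symbols)

text = "data preprocessing reduces data"
codes = build_codes(text)
bits = encode(text, codes)
```

Decoding `bits` with the same code table recovers the original text, and the encoded length is shorter than the 8 bits per character of a fixed-width encoding.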
End of Lesson 1
Question / Discussion?
