Cloud Computing about Data Processing.pptx

Introduction To Machine Learning
Data
Processing
March 2024
S. Hassan Adelyar, Ph.D
Instructor of Computer Science Faculty
Kabul University
March 2024
Data Processing

Data Processing
• Real-world data is often incomplete, inconsistent, or erroneous.
• Data preprocessing resolves such issues.
• Data preprocessing is the first step of any data mining process.
• Data preprocessing transforms raw data into an understandable format.

Data Preprocessing Methods
• Data preprocessing is performed in the following stages:
  ◦ Data Cleaning
  ◦ Data Integration
  ◦ Data Transformation
  ◦ Data Reduction

Data Cleaning
• In data cleaning:
  ◦ Missing values are filled.
  ◦ Noisy data is smoothed.
  ◦ Inconsistencies are resolved.
  ◦ Outliers are identified and removed.
• Filling Missing Values
  ◦ Missing values can be filled by several methods.

• Methods for handling a missing value:
  ◦ Fill in the missing value manually.
  ◦ Use a global constant in place of the missing value, such as ‘Unknown’ or -∞.
  ◦ Use the attribute mean to fill in the missing value.
  ◦ Use the most probable value to fill in the missing value.
  ◦ Ignore the tuple.
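
The filling strategies listed above can be sketched in a few lines of Python. This is only an illustration: it assumes missing entries are represented by None, and the function names and sample values are hypothetical, not part of the lesson.

```python
import statistics

def fill_with_constant(values, constant="Unknown"):
    # Replace every missing entry with one global constant.
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    # Replace missing entries with the mean of the known values.
    known = [v for v in values if v is not None]
    mean = statistics.mean(known)
    return [mean if v is None else v for v in values]

def fill_with_most_probable(values):
    # Replace missing entries with the most frequent known value.
    known = [v for v in values if v is not None]
    most_common = statistics.mode(known)
    return [most_common if v is None else v for v in values]

def ignore_incomplete_tuples(rows):
    # Drop every record (tuple) that has at least one missing field.
    return [row for row in rows if all(v is not None for v in row.values())]

# Hypothetical sample data.
ages = fill_with_mean([10, None, 20])
cities = fill_with_constant(["Kabul", None])
colors = fill_with_most_probable(["red", None, "red", "blue"])
complete = ignore_incomplete_tuples([{"id": 1, "age": 20}, {"id": 2, "age": None}])
```

Which strategy is appropriate depends on the attribute: the mean only makes sense for numeric attributes, while a constant or the most frequent value also works for categorical ones.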

• Handling noisy data
  ◦ Noisy data is a data set that contains extra, meaningless data.
  ◦ Noise can be defined as unwanted variance or random error in a measured variable.
  ◦ Noisy data adversely affects most data mining algorithms.

• Clustering or outlier analysis
  ◦ Clustering, or outlier analysis, detects outliers by grouping the data into clusters.
  ◦ In clustering, values that are similar are organized into groups, or ‘clusters’; values that lie outside these clusters are treated as outliers or noise.
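
One simple way to illustrate this idea is a gap-based grouping of one-dimensional values. The sketch below is an assumption made for illustration, not the specific clustering algorithm of the lesson: the parameters max_gap and min_size and the sample readings are hypothetical.

```python
def cluster_1d(values, max_gap):
    # Sort the values and start a new cluster whenever the gap
    # to the previous value is larger than max_gap.
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def find_outliers(values, max_gap, min_size):
    # Values in clusters smaller than min_size are reported as outliers.
    clusters = cluster_1d(values, max_gap)
    return [v for c in clusters if len(c) < min_size for v in c]

# Hypothetical readings: two dense groups and one isolated value.
readings = [10, 11, 12, 13, 50, 51, 52, 95]
clusters = cluster_1d(readings, max_gap=3)
outliers = find_outliers(readings, max_gap=3, min_size=2)
```

Here the isolated value forms a cluster of its own and is therefore flagged, while the two dense groups are kept.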

• Regression
  ◦ Regression is another method; it smooths data by fitting it to a function.
  ◦ For example, linear regression, one of the most widely used methods, finds the best-fitting line through the values of two variables or attributes.
  ◦ The primary purpose is to predict the value of one variable from the other.

• Similarly, multiple regression is used when more than two variables are involved.
• Fitting the data to a regression function removes noise and hence smooths the dataset using mathematical equations.
• Linear regression is used to estimate the relationship between two variables.

• For example, we might assume that, given an increase in population, food production would increase at the same rate.
• To visualize this, consider a graph in which the Y-axis tracks population increase and the X-axis tracks food production.
• As the Y value increases, the X value increases at the same rate, so the relationship between them is a straight line.
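
Smoothing by linear regression can be sketched as an ordinary least-squares line fit, after which each noisy value is replaced by the value on the fitted line. The function name and the sample numbers below are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy measurements that roughly follow a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)

# Smoothed data: every y is replaced by the value predicted by the line.
smoothed = [slope * x + intercept for x in xs]
```

The same fitted line can also be used for prediction by evaluating slope * x + intercept at an x value that was not observed.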

• Handling of Inconsistent Data
  ◦ Data inconsistency is a situation in which multiple tables within a database deal with the same data but may receive it from different inputs.
  ◦ Inconsistency is generally compounded by data redundancy.
  ◦ However, it differs from data redundancy in that it typically refers to problems with the content of a database rather than its

Data Integration
• Data integration is a process that combines data from many sources (such as multiple databases and flat files) into a unified data store.
• Data integration is an essential step in data analysis.
• During data integration, a number of tricky issues have to be considered.

• For example, how can the data analyst or the computer be sure that student_id in one database and student_number in another database refer to the same entity?
• The solution to this problem lies in ‘metadata’.
• Databases and data warehouses contain metadata, which is data about data.
• The data analyst uses this metadata as a reference to avoid errors
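
As a rough sketch of how metadata can resolve the student_id / student_number question, the example below keeps a small mapping from each source's attribute names to shared canonical names and uses it while merging records. The source names, the mapping, and the records are all hypothetical; real databases expose metadata through their own catalogs.

```python
# Hypothetical metadata: for each source, which canonical attribute
# each of its column names refers to.
METADATA = {
    "registrar_db": {"student_id": "student_id", "name": "full_name"},
    "library_db": {"student_number": "student_id", "borrower": "full_name"},
}

def to_canonical(source, record):
    # Rename the record's columns using the metadata for its source.
    mapping = METADATA[source]
    return {mapping.get(col, col): value for col, value in record.items()}

def integrate(sources):
    # Merge records from all sources into one store keyed by student_id.
    unified = {}
    for source, records in sources.items():
        for record in records:
            row = to_canonical(source, record)
            unified.setdefault(row["student_id"], {}).update(row)
    return unified

# Hypothetical records describing the same student in two databases.
sources = {
    "registrar_db": [{"student_id": 101, "name": "Student A", "gpa": 3.4}],
    "library_db": [{"student_number": 101, "borrower": "Student A", "books_out": 2}],
}
unified = integrate(sources)
```

Because both column names map to the same canonical attribute, the two records are recognized as one entity and merged into a single row.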

• Another issue that schema integration may cause is redundancy.
• In database terms, an attribute is said to be redundant if it is derivable from some other table (of the same database).
• Mistakes in attribute naming can also lead to data redundancies in the resulting dataset.
• A number of tools are available to integrate data from different sources into one unified schema.

Data Transformation
• Data transformation is a process in which data is consolidated or transformed into standard forms that are better suited to data mining.
• Normalization and standardization are the most popular and widely used data transformation methods.

• Normalization
  ◦ In normalization, all attribute values are converted to a normalized score in the range 0 to 1.
  ◦ The main weakness of normalization is its sensitivity to outliers.
  ◦ An outlier tends to crunch all of the other values down toward zero.
  ◦ To understand this, suppose the range of students’ marks is 35 to 45 out of

• Then 35 is mapped to 0 and 45 to 1, and the students are distributed between 0 and 1 according to their marks.
• But if one student has a mark of 90, it acts as an outlier: 35 is mapped to 0 and 90 to 1.
• This crunches most of the values down toward zero.
• In this scenario, the solution is standardization.
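
The marks example can be reproduced with a short min-max normalization sketch. The endpoints 35, 45 and the outlier 90 come from the example above; the intermediate marks and the function name are assumptions added for illustration.

```python
def min_max_normalize(values):
    # Map the smallest value to 0 and the largest to 1.
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Marks between 35 and 45 (intermediate values are hypothetical).
marks = [35, 38, 40, 42, 45]
normalized = min_max_normalize(marks)

# The same marks plus one outlier of 90.
with_outlier = min_max_normalize(marks + [90])
```

Without the outlier the marks spread over the whole range from 0 to 1. With the outlier, 90 becomes 1 and every other mark is pushed below 0.2, which is the crunching effect described above.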

• Standardization
  ◦ In standardization, the values are rescaled so that the attribute has a mean of 0 and a standard deviation of 1.
  ◦ There is no strict rule for when to use normalization versus standardization.
  ◦ As a rule of thumb, if your data has outliers, use standardization; otherwise, use normalization.

  ◦ Standardization tends to bring the values of all attributes into similar ranges, since every attribute ends up with the same standard deviation of 1.
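
Standardization is commonly computed as a z-score: subtract the attribute mean and divide by the standard deviation. The sketch below assumes the population standard deviation; the function name and the sample marks are hypothetical.

```python
import statistics

def standardize(values):
    # z-score: (value - mean) / standard deviation.
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # All values are equal, so every z-score is 0.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

# Hypothetical marks including one outlier of 90.
marks = [35, 38, 40, 42, 45, 90]
z_scores = standardize(marks)
```

After this transformation the attribute has mean 0 and standard deviation 1, and the result is not confined to a fixed interval such as 0 to 1.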

Data Reduction
• Using data reduction techniques, a dataset can be represented in a reduced form without compromising the integrity of the original data.
• Data reduction means reducing either the dimensions (the total number of attributes) or the volume of the data.
• Moreover, mining carried out on reduced datasets is more efficient and often yields better accuracy.

• There are many methods for reducing large datasets while still yielding useful knowledge.
• A few of them are:

• Dimension reduction
  ◦ In data warehousing, a ‘dimension’ provides structured labeling information, but not all dimensions (attributes) are necessary at a time.
  ◦ Dimension reduction uses algorithms such as Principal Component Analysis (PCA).
  ◦ With such algorithms, one can detect and remove redundant and weakly relevant attributes or dimensions.
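
To give a feel for PCA, the sketch below handles only the simplest case of two attributes, where the principal direction of the 2x2 covariance matrix has a closed form. It is a simplified illustration under that assumption, not a general PCA implementation; the function names and sample data are hypothetical.

```python
import math

def principal_axis(xs, ys):
    # Returns the unit vector of the first principal component
    # and the attribute means, for exactly two attributes.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Entries of the sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    c = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    largest = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    if b == 0:
        # Attributes are uncorrelated: keep the one with more variance.
        direction = (1.0, 0.0) if a >= c else (0.0, 1.0)
    else:
        direction = (b, largest - a)
    length = math.hypot(direction[0], direction[1])
    return (direction[0] / length, direction[1] / length), (mean_x, mean_y)

def reduce_to_one_dimension(xs, ys):
    # Project each centered point onto the first principal component.
    (dx, dy), (mean_x, mean_y) = principal_axis(xs, ys)
    return [(x - mean_x) * dx + (y - mean_y) * dy for x, y in zip(xs, ys)]

# Hypothetical data in which the second attribute is twice the first,
# so one of the two attributes is redundant.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
(dx, dy), _ = principal_axis(xs, ys)
scores = reduce_to_one_dimension(xs, ys)
```

Because the second attribute is fully determined by the first, the single column of scores retains all of the variation in the two original attributes.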

• Numerosity reduction
  ◦ This technique chooses smaller forms of data representation to reduce the dataset volume.
• Data compression
  ◦ Data compression techniques reduce the dataset size.
  ◦ These techniques use encoding mechanisms (e.g. Huffman coding).
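
Huffman coding, mentioned above, assigns shorter bit strings to more frequent symbols. The following is a minimal sketch using only the Python standard library; the function names and the sample text are hypothetical, and the output is kept as a string of '0' and '1' characters for readability rather than packed into bytes.

```python
import heapq
from collections import Counter

def build_codes(text):
    # Build a prefix-free code table from symbol frequencies.
    if not text:
        return {}
    freq = Counter(text)
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {sym: "0" for sym in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees.
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    # Walk the bit string, emitting a symbol whenever a full code is read.
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)

# Hypothetical sample text with skewed symbol frequencies.
text = "aaaabbbccd"
codes = build_codes(text)
encoded = encode(text, codes)
restored = decode(encoded, codes)
```

For this sample the encoded form needs 19 bits instead of the 80 bits of a fixed 8-bit encoding, and decoding restores the original text exactly.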

End of Lesson 1
Question / Discussion?
