IT-601 Lecture Notes-UNIT-2.pdf Data Analysis
Data Analytics (KIT-601)
Unit-2: Data Analysis
Dr. Radhey Shyam
Professor
Department of Information Technology
SRMCEM Lucknow
(Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Lucknow)
Unit-2 has been prepared and compiled by Dr. Radhey Shyam, with grateful acknowledgment to those who made their course contents freely available or contributed directly or indirectly. Feel free to use this study material for your own academic purposes. For any query, communication can be made through this email: shyam0058@gmail.com.
March 11, 2024
Data Analytics (KIT-601)
Course Outcomes (CO) and Bloom's Knowledge Levels (KL)
At the end of the course, the student will be able to:
CO 1: Discuss various concepts of the data analytics pipeline (K1, K2)
CO 2: Apply classification and regression techniques (K3)
CO 3: Explain and apply mining techniques on streaming data (K2, K3)
CO 4: Compare different clustering and frequent pattern mining algorithms (K4)
CO 5: Describe the concept of R programming and implement analytics on Big Data using R (K2, K3)
DETAILED SYLLABUS 3-0-0

Unit I (08 lectures)
Introduction to Data Analytics: Sources and nature of data, classification of data (structured, semi-structured, unstructured), characteristics of data, introduction to Big Data platform, need of data analytics, evolution of analytic scalability, analytic process and tools, analysis vs reporting, modern data analytic tools, applications of data analytics.
Data Analytics Lifecycle: Need, key roles for successful analytic projects, various phases of data analytics lifecycle – discovery, data preparation, model planning, model building, communicating results, operationalization.

Unit II (08 lectures)
Data Analysis: Regression modeling, multivariate analysis, Bayesian modeling, inference and Bayesian networks, support vector and kernel methods, analysis of time series: linear systems analysis & nonlinear dynamics, rule induction, neural networks: learning and generalisation, competitive learning, principal component analysis and neural networks, fuzzy logic: extracting fuzzy models from data, fuzzy decision trees, stochastic search methods.

Unit III (08 lectures)
Mining Data Streams: Introduction to streams concepts, stream data model and architecture, stream computing, sampling data in a stream, filtering streams, counting distinct elements in a stream, estimating moments, counting oneness in a window, decaying window, Real-time Analytics Platform (RTAP) applications, Case studies – real time sentiment analysis, stock market predictions.

Unit IV (08 lectures)
Frequent Itemsets and Clustering: Mining frequent itemsets, market based modelling, Apriori algorithm, handling large data sets in main memory, limited pass algorithm, counting frequent itemsets in a stream, clustering techniques: hierarchical, K-means, clustering high dimensional data, CLIQUE and ProCLUS, frequent pattern based clustering methods, clustering in non-euclidean space, clustering for streams and parallelism.

Unit V (08 lectures)
Frame Works and Visualization: MapReduce, Hadoop, Pig, Hive, HBase, MapR, Sharding, NoSQL Databases, S3, Hadoop Distributed File Systems, Visualization: visual data analysis techniques, interaction techniques, systems and applications.
Introduction to R - R graphical user interfaces, data import and export, attribute and data types, descriptive statistics, exploratory data analysis, visualization before analysis, analytics for unstructured data.
Text books and References:
1. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer.
2. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press.
3. John Garrett, Data Analytics for IT Networks: Developing Innovative Use Cases, Pearson Education.
Unit-II: Data Analysis
Data analysis refers to the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, drawing conclusions, and supporting decision-making. It is a critical component of many fields, including business, finance, healthcare, engineering, and the social sciences.
The data analysis process typically involves the following steps:
• Data collection: This step involves gathering data from various sources, such as databases, surveys, sensors, and social media.
• Data cleaning: This step involves removing errors, inconsistencies, and outliers from the data. It may also involve imputing missing values, transforming variables, and normalizing the data.
• Data exploration: This step involves visualizing and summarizing the data to gain insights and identify patterns. This may include statistical analyses, such as descriptive statistics, correlation analysis, and hypothesis testing.
• Data modeling: This step involves developing mathematical models to predict or explain the behavior of the data. This may include regression analysis, time series analysis, machine learning, and other techniques.
• Data visualization: This step involves creating visual representations of the data to communicate insights and findings to stakeholders. This may include charts, graphs, tables, and other visualizations.
• Decision-making: This step involves using the results of the data analysis to make informed decisions, develop strategies, and take actions.
Data analysis is a complex and iterative process that requires expertise in statistics, programming, and domain knowledge. It is often performed using specialized software, such as R, Python, SAS, and Excel, as well as cloud-based platforms, such as Amazon Web Services and Google Cloud Platform. Effective data analysis can lead to better business outcomes, improved healthcare outcomes, and a deeper understanding of complex phenomena.
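As a small, hedged illustration of the first few steps (the column names and values below are invented for the example, not taken from the notes), a typical cleaning-and-exploration pass in Python with pandas might look like this:

import pandas as pd

# Hypothetical collected data (in practice this would come from a database,
# survey, sensor feed, etc.).
df = pd.DataFrame({"units": [10, 12, None, 9, 15],
                   "price": [2.5, 2.5, 2.4, 2.6, 2.4]})

df = df.dropna()                              # data cleaning: remove missing values
df["revenue"] = df["units"] * df["price"]     # transformation: derive a new variable
print(df.describe())                          # exploration: descriptive statistics
print(df.corr())                              # exploration: correlation analysis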
1 Regression Modeling
Regression modeling is a statistical technique used to examine the relationship between a dependent
variable (also called the outcome or response variable) and one or more independent variables (also called
predictors or explanatory variables). The goal of regression modeling is to identify the nature and strength
of the relationship between the dependent variable and the independent variable(s) and to use this infor-
mation to make predictions about the dependent variable.
There are many different types of regression models, including linear regression, logistic regression, polynomial regression [1], and multivariate regression. Linear regression is one of the most commonly used types of regression modeling, and it assumes that the relationship between the dependent variable and the independent variable(s) is linear.
Regression modeling is used in a wide range of fields, including economics, finance, psychology, and epidemiology [2], among others. It is often used to understand the relationships between different factors and to make predictions about future outcomes.
1.1 Regression
1.1.1 Simple Linear Regression
Linear Regression— In statistics, linear regression is a linear approach to modeling the relationship
between a scalar response (or dependent variable) and one or more explanatory variables (or independent
variables). The case of one explanatory variable is called simple linear regression.
• Linear regression is used to predict a continuous dependent variable using a given set of independent variables.
• Linear regression is used for solving regression problems.
• In linear regression, the values of continuous variables are predicted.
• Linear regression tries to find the best-fit line, through which the output can be easily predicted.
[1] In statistics, polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial in x.
[2] Epidemiology is the study (scientific, systematic, and data-driven) of the distribution (frequency, pattern) and determinants (causes, risk factors) of health-related states and events (not just diseases) in specified populations (neighborhood, school, city, state, country, global).
• The least squares estimation method [3] is used for estimation of accuracy [4].
• The output for linear regression must be a continuous value, such as price, age, etc.
• In linear regression, the relationship between the dependent variable and the independent variable must be linear.
• In linear regression, there may be collinearity [5] between the independent variables.
[3] The least squares method is a statistical procedure to find the best fit for a set of data points by minimizing the sum of the offsets of points from the plotted curve. Least squares regression is used to predict the behavior of dependent variables.
[4] Accuracy is how close a measured value is to the actual value. Precision is how close the measured values are to each other.
[5] Collinearity is a condition in which some of the independent variables are highly correlated.
Some regression examples:
• Regression analysis is used in statistics to find trends in data. For example, you might guess that there is a connection between how much you eat and how much you weigh; regression analysis can help you quantify that.
• Regression analysis will provide you with an equation for a graph so that you can make predictions about your data. For example, if you have been putting on weight over the last few years, it can predict how much you will weigh in ten years' time if you continue to put on weight at the same rate.
• It is also called simple linear regression. It establishes the relationship between two variables using a straight line. If two or more explanatory variables have a linear relationship with the dependent variable, the regression is called a multiple linear regression.
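As a minimal sketch of simple linear regression (toy numbers invented for illustration, not values from the notes), scikit-learn can fit the best-fit line and use it for prediction:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: a single independent variable x and a continuous outcome y.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([50.0, 55.0, 61.0, 64.0, 70.0])

model = LinearRegression().fit(x, y)          # least squares fit of y = w*x + b
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction at x = 6:", model.predict([[6.0]])[0])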
1.1.2 Logistic Regression
Logistic Regression is used to resolve classification problems where, given an element, you have to classify it into one of N categories. Typical examples: given a mail, classify it as spam or not; or given a vehicle, find to which category it belongs (car, truck, van, etc.). Basically, the output is a finite set of discrete values.
• Logistic regression is used to predict a categorical dependent variable using a given set of independent variables.
• Logistic regression is used for solving classification problems.
• In logistic regression, we predict the values of categorical variables.
• In logistic regression, we find the S-curve by which we can classify the samples.
• The maximum likelihood estimation method is used for estimation of accuracy.
• The output of logistic regression must be a categorical value such as 0 or 1, Yes or No, etc.
• In logistic regression, it is not required that the relationship between the dependent and independent variables be linear.
• In logistic regression, there should not be collinearity between the independent variables.
A minimal classification sketch based on these points is given below.
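The following is a small, hedged sketch of logistic regression as a classifier; the "message length vs. spam" data is invented only for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature: message length; label: 1 = spam, 0 = not spam.
X = np.array([[5], [12], [20], [35], [50], [80]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)          # fits the S-shaped (sigmoid) curve
print("P(spam | length = 30):", clf.predict_proba([[30]])[0, 1])
print("predicted class:", clf.predict([[30]])[0])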
2 Multivariate Analysis
Multivariate analysis is a statistical technique used to examine the relationships between multiple variables
simultaneously. It is used when there are multiple dependent variables and/or independent variables that
are interrelated.
Multivariate analysis is used in a wide range of fields, including social sciences, marketing, biology,
and finance, among others. There are many different types of multivariate analysis, including multivari-
ate regression, principal component analysis, factor analysis, cluster analysis, and discriminant
analysis.
Multivariate regression is similar to linear regression, but it involves more than one independent variable.
It is used to predict the value of a dependent variable based on two or more independent variables. Principal
component analysis (PCA) is a technique used to reduce the dimensionality of data by identifying patterns
and relationships between variables. Factor analysis is a technique used to identify underlying factors that
explain the correlations between multiple variables. Cluster analysis is a technique used to group objects or
individuals into clusters based on similarities or dissimilarities. Discriminant analysis is a technique used to
determine which variables discriminate between two or more groups.
Overall, multivariate analysis is a powerful tool for examining complex relationships between multiple
variables, and it can help researchers and analysts gain a deeper understanding of the data they are working
with.
3 Bayesian Modeling
Bayesian modeling is a statistical modeling approach that uses Bayesian inference to make predictions and
estimate parameters. It is named after Thomas Bayes, an 18th-century statistician who developed the Bayes
theorem, which is a key component of Bayesian modeling.
In Bayesian modeling, prior information about the parameters of interest is combined with data to
produce a posterior distribution. This posterior distribution represents the updated probability distribution
of the parameters given the data and the prior information. The posterior distribution is used to make
inferences and predictions about the parameters.
Bayesian modeling is particularly useful when there is limited data or when the data is noisy or uncertain.
It allows for the incorporation of prior knowledge and beliefs into the modeling process, which can improve
the accuracy and precision of predictions.
Bayesian modeling is used in a wide range of fields, including finance, engineering, ecology, and social
sciences. Some examples of Bayesian modeling applications include predicting stock prices, estimating the
prevalence of a disease in a population, and analyzing the effects of environmental factors on a species.
3.1 Bayes Theorem
• Goal: To determine the most probable hypothesis, given the data D plus any initial knowledge about the prior probabilities of the various hypotheses in H.
• Prior probability of h, P(h): reflects any background knowledge we have about the chance that h is a correct hypothesis (before having observed the data).
• Prior probability of D, P(D): reflects the probability that training data D will be observed given no knowledge about which hypothesis h holds.
• Conditional probability of observation D, P(D|h): denotes the probability of observing data D given some world in which hypothesis h holds.
• Posterior probability of h, P(h|D): represents the probability that h holds given the observed training data D. It reflects our confidence that h holds after we have seen the training data D, and it is the quantity that Machine Learning researchers are interested in.
• Bayes' theorem allows us to compute P(h|D):
P(h|D) = P(D|h) P(h) / P(D)
Maximum A Posteriori (MAP) Hypothesis and Maximum Likelihood
• Goal: To find the most probable hypothesis h from a set of candidate hypotheses H given the observed data D. The MAP hypothesis is
h_MAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) P(h) / P(D) = argmax_{h in H} P(D|h) P(h)
(P(D) can be dropped because it does not depend on h.)
• If every hypothesis in H is equally probable a priori, we only need to consider the likelihood of the data D given h, P(D|h). Then h_MAP becomes the Maximum Likelihood hypothesis:
h_ML = argmax_{h in H} P(D|h)
Overall, Bayesian modeling is a powerful tool for making predictions and estimating parameters in situ-
ations where there is uncertainty and prior information is available.
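To make Bayes' theorem and the MAP hypothesis concrete, here is a small Python sketch with two candidate hypotheses; the prior and likelihood numbers are assumptions chosen only for illustration:

# P(h|D) = P(D|h) P(h) / P(D), and h_MAP = argmax_h P(h|D).
priors = {"h1": 0.7, "h2": 0.3}           # P(h): assumed prior beliefs
likelihoods = {"h1": 0.2, "h2": 0.9}      # P(D|h): probability of the observed data

evidence = sum(priors[h] * likelihoods[h] for h in priors)            # P(D)
posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}

h_map = max(posteriors, key=posteriors.get)   # the MAP hypothesis
print(posteriors)                             # h2 ends up more probable (~0.66)
print("MAP hypothesis:", h_map)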
4 Inference and Bayesian networks
Inference in Bayesian networks is the process of using probabilistic reasoning to make predictions or draw
conclusions about a system or phenomenon. Bayesian networks are graphical models that represent the
relationships between variables using a directed acyclic graph, where nodes represent variables and edges
represent probabilistic dependencies between the variables.
Inference in Bayesian networks involves calculating the posterior probability distribution of one or more
variables given evidence about other variables in the network. This can be done using Bayesian inference,
which involves updating the prior probability distribution of the variables using Bayes’ theorem and the
observed evidence.
The posterior distribution can be used to make predictions or draw conclusions about the system or
phenomenon being modeled. For example, in a medical diagnosis system, the posterior probability of a
particular disease given a set of symptoms can be calculated using a Bayesian network. This can help
clinicians make a more accurate diagnosis and choose appropriate treatments.
Bayesian networks and inference are widely used in many fields, including artificial intelligence, decision
making, finance, and engineering. They are particularly useful in situations where there is uncertainty and
probabilistic relationships between variables need to be modeled and analyzed.
4.1 BAYESIAN NETWORKS
• Abbreviation: BBN (Bayesian Belief Network)
• Synonyms: Bayes(ian) network, Bayes(ian) model, belief network, decision network, or probabilistic directed acyclic graphical model.
• A BBN is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a Directed Acyclic Graph (DAG).
• BBNs enable us to model and reason about uncertainty. BBNs accommodate both subjective probabilities and probabilities based on objective data.
• The most important use of BBNs is in revising probabilities in the light of actual observations of events.
• Nodes represent variables in the Bayesian sense: observable quantities, hidden variables, or hypotheses. Edges represent conditional dependencies.
• Each node is associated with a probability function that takes, as input, a particular set of probabilities for the values of the node's parent variables, and outputs the probability of the values of the variable represented by the node.
• Prior probabilities: e.g. P(RAIN)
• Conditional probabilities: e.g. P(SPRINKLER | RAIN)
• Joint probability function: P(GRASS WET, SPRINKLER, RAIN) = P(GRASS WET | RAIN, SPRINKLER) * P(SPRINKLER | RAIN) * P(RAIN)
• Typically the probability functions are described in table form; a small worked sketch follows.
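The joint-probability factorization above can be turned directly into inference by enumeration. The following Python sketch does this for the RAIN / SPRINKLER / GRASS WET network; the conditional probability table values are illustrative assumptions, not numbers given in these notes:

# CPTs (assumed values for illustration only).
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: {True: 0.01, False: 0.99},     # P(S | R)
               False: {True: 0.4, False: 0.6}}
P_wet = {(True, True): 0.99, (True, False): 0.8,    # P(W = true | S, R)
         (False, True): 0.9, (False, False): 0.0}

def joint(w, s, r):
    # P(W, S, R) = P(W | S, R) * P(S | R) * P(R)
    pw = P_wet[(s, r)] if w else 1.0 - P_wet[(s, r)]
    return pw * P_sprinkler[r][s] * P_rain[r]

# Marginal probability that the grass is wet, and the revised (posterior)
# probability of rain given that the grass is observed to be wet.
p_wet = sum(joint(True, s, r) for s in (True, False) for r in (True, False))
p_rain_given_wet = sum(joint(True, s, True) for s in (True, False)) / p_wet
print("P(grass wet) =", round(p_wet, 4))
print("P(rain | grass wet) =", round(p_rain_given_wet, 4))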
• A BN cannot be used to model correlation relationships between random variables.
Overall, inference in Bayesian networks is a powerful tool for making predictions and drawing conclusions
in situations where there is uncertainty and complex probabilistic relationships between variables.
4.2 Support Vector and Kernel Methods
Support vector machines (SVMs) and kernel methods are commonly used in machine learning and pattern
recognition to solve classification and regression problems.
SVMs are a type of supervised learning algorithm that aims to find the optimal hyperplane that separates
the data into different classes. The optimal hyperplane is the one that maximizes the margin, or the distance
between the hyperplane and the closest data points from each class. SVMs can also use kernel functions to
transform the original input data into a higher dimensional space, where it may be easier to find a separating
hyperplane.
Kernel methods are a class of algorithms that use kernel functions to compute the similarity between
pairs of data points. Kernel functions can transform the input data into a higher dimensional feature space,
where linear methods can be applied more effectively. Some commonly used kernel functions include linear,
polynomial, and radial basis functions.
Kernel methods are used in a variety of applications, including image recognition, speech recognition,
and natural language processing. They are particularly useful in situations where the data is non-linear and
the relationship between variables is complex.
History of SVM [6]
• SVM is related to statistical learning theory.
• SVM was first introduced in 1992.
• SVM became popular because of its success in handwritten digit recognition: a 1.1% test error rate for SVM, which is the same as the error rate of a carefully constructed neural network.
• SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning.
Binary Classification
Given training data (xi, yi) for i = 1, ..., N, with xi in R^d and yi in {-1, 1}, learn a classifier f(x) such that
f(xi) >= 0 if yi = +1, and f(xi) < 0 if yi = -1,
i.e. yi f(xi) > 0 for a correct classification.
Linear separability: a data set may be linearly separable or not linearly separable.
[6] A support vector machine is a linear model and it always looks for a hyperplane to separate one class from another. The two-dimensional case is the easiest to comprehend and to visualize for intuition; however, bear in mind that the same holds in higher dimensions (simply lines change into planes, parabolas into paraboloids, etc.).
Linear classifiers
A linear classifier has the form f(x) = w'x + b.
• In 2D the discriminant f(x) = 0 is a line (with f(x) > 0 on one side and f(x) < 0 on the other); in 3D it is a plane, and in nD it is a hyperplane.
• w is the normal to the line and is known as the weight vector; b is the bias.
• For a K-NN classifier it is necessary to "carry" the training data; for a linear classifier, the training data is used to learn w and is then discarded. Only w is needed for classifying new data.
The Perceptron Classifier
Given linearly separable data xi labelled into two categories yi in {-1, 1}, find a weight vector w such that the discriminant function f(xi) = w'xi + b separates the categories for i = 1, ..., N. How can we find this separating hyperplane?
The Perceptron Algorithm
Write the classifier as f(xi) = w~'x~i + w0 = w'xi, where w = (w~, w0) and xi = (x~i, 1), i.e. the bias is folded into the weight vector.
• Initialize w = 0.
• Cycle through the data points {xi, yi}.
• If xi is misclassified, then update w <- w + alpha * yi * xi.
• Repeat until all the data is correctly classified.
For example, in 2D each update moves w toward the misclassified point xi (compare w before the update with w after the update). Note that after convergence, w = sum_i alpha_i xi, i.e. the final weight vector is a weighted sum of the training points.
(Figure: a perceptron example on a 2-D point set.)
• If the data is linearly separable, then the algorithm will converge.
• Convergence can be slow.
• The separating line may end up close to the training data; we would prefer a larger margin for generalization.
What is the best w?
• The maximum margin solution: it is the most stable under perturbations of the inputs. (A minimal perceptron sketch follows.)
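Below is a minimal Python sketch of the perceptron algorithm described above, using the standard misclassification update w <- w + alpha * yi * xi with the bias folded into w; the four training points are invented for illustration:

import numpy as np

X = np.array([[2.0, 1.0], [3.0, 4.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
X1 = np.hstack([X, np.ones((len(X), 1))])    # x~ = (x, 1): bias folded into w

w = np.zeros(X1.shape[1])                    # initialize w = 0
alpha = 1.0
for _ in range(100):                         # cycle through the data points
    errors = 0
    for xi, yi in zip(X1, y):
        if yi * np.dot(w, xi) <= 0:          # xi is misclassified
            w = w + alpha * yi * xi          # perceptron update
            errors += 1
    if errors == 0:                          # until all data correctly classified
        break
print("learned weight vector (w, b):", w)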
Tennis example
(Figure: training points plotted by Humidity and Temperature, labelled "play tennis" vs. "do not play tennis".)
Linear Support Vector Machines
Data: <xi, yi>, i = 1, ..., l, with xi in R^d and yi in {-1, +1}.
All hyperplanes in R^d are parameterized by a vector w and a constant b, and can be expressed as w·x + b = 0 (the equation of a hyperplane from algebra). Our aim is to find such a hyperplane f(x) = sign(w·x + b) that correctly classifies our data.
Definitions
Define the hyperplane H such that:
xi·w + b >= +1 when yi = +1
xi·w + b <= -1 when yi = -1
H1 and H2 are the planes:
H1: xi·w + b = +1
H2: xi·w + b = -1
The points on the planes H1 and H2 are the support vectors.
d+ = the shortest distance to the closest positive point; d- = the shortest distance to the closest negative point. The margin of a separating hyperplane is d+ + d-.

Maximizing the margin
We want a classifier with as big a margin as possible. Recall that the distance from a point (x0, y0) to the line Ax + By + c = 0 is |A x0 + B y0 + c| / sqrt(A^2 + B^2). The distance between H and H1 is therefore |w·x + b| / ||w|| = 1 / ||w||, and the distance between H1 and H2 is 2 / ||w||.
In order to maximize the margin, we need to minimize ||w||, with the condition that there are no data points between H1 and H2:
xi·w + b >= +1 when yi = +1
xi·w + b <= -1 when yi = -1
These can be combined into yi(xi·w + b) >= 1.
Constrained Optimization Problem
Minimize ||w||^2 / 2 subject to yi(xi·w + b) >= 1 for all i.
Lagrangian method: maximize, over alpha_i >= 0, the infimum over (w, b) of
L(w, b, alpha) = ||w||^2 / 2 - sum_i alpha_i [ yi(xi·w + b) - 1 ].
At the extremum, the partial derivatives of L with respect to both w and b must be 0. Taking the derivatives, setting them to 0, substituting back into L and simplifying yields the dual problem:
Maximize  sum_i alpha_i - (1/2) sum_i sum_j alpha_i alpha_j yi yj (xi·xj)
subject to  sum_i alpha_i yi = 0  and  alpha_i >= 0.
Quadratic Programming
• Why is this reformulation a good thing? The problem
Maximize  sum_i alpha_i - (1/2) sum_i sum_j alpha_i alpha_j yi yj (xi·xj)
subject to  sum_i alpha_i yi = 0  and  alpha_i >= 0
is an instance of what is called a positive semi-definite programming problem.
• For a fixed real-number accuracy, it can be solved in O(n log n) time = O(|D|^2 log |D|^2).
Problems with linear SVM
(Figure: two classes, labelled -1 and +1, that are not linearly separable.)
What if the decision function is not linear?
Kernel Trick
The data points become linearly separable in the feature space (x1^2, x2^2, sqrt(2) x1 x2).
We want to maximize
sum_i alpha_i - (1/2) sum_i sum_j alpha_i alpha_j yi yj F(xi)·F(xj),
where F is the feature mapping. Define K(xi, xj) = F(xi)·F(xj). The cool thing is that K is often easy to compute directly! Here,
K(xi, xj) = (xi·xj)^2.
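A quick numerical check of this kernel identity (a hedged sketch with arbitrary example vectors, not from the notes): the explicit feature mapping F(x) = (x1^2, x2^2, sqrt(2) x1 x2) and the direct computation (xi·xj)^2 give the same inner product.

import numpy as np

def phi(v):
    # Explicit feature mapping for the quadratic kernel in 2-D.
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = np.dot(phi(x), phi(z))    # inner product in the mapped feature space
rhs = np.dot(x, z) ** 2         # kernel computed directly: (x·z)^2
print(lhs, rhs)                 # both equal 1.0 for this example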
Other Kernels
The polynomial kernel: K(xi, xj) = (xi·xj + 1)^p, where p is a tunable parameter. Evaluating K only requires one addition and one exponentiation more than the original dot product.
Gaussian kernels (also called radial basis functions): K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2)).
Overtraining/overfitting
A well-known problem with machine learning methods is overtraining. This means that we have learned the training data very well, but we cannot classify unseen examples correctly. An example: a botanist who really knows trees sees a new tree and claims it is not a tree, because it differs slightly from the trees he has memorized.
It can be shown that the proportion n of unseen data that will be misclassified is bounded by:
n <= (number of support vectors) / (number of training examples).
This is a measure of the risk of overtraining with SVM (there are also other measures).
Ockham's razor principle: simpler systems are better than more complex ones. In the SVM case, fewer support vectors mean a simpler representation of the hyperplane. Example: understanding a certain cancer is easier if it can be described by one gene than if we have to describe it with 5000.
A practical example: protein localization
• Proteins are synthesized in the cytosol.
• They are transported into different subcellular locations where they carry out their functions.
• Aim: to predict in what location a certain protein will end up.
Overall, SVMs and kernel methods are powerful tools for solving classification and regression problems. They
can handle complex data and provide accurate predictions, making them valuable in many fields, including
finance, healthcare, and engineering.
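As a hedged end-to-end sketch (synthetic data and arbitrary parameter choices, not from the notes), scikit-learn's SVC trains a kernel SVM with the Gaussian/RBF kernel discussed above:

import numpy as np
from sklearn.svm import SVC

# Two classes that are not linearly separable: points inside vs. outside a circle.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.sum(X**2, axis=1) > 1.0).astype(int)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)   # RBF-kernel SVM
print("support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, y))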
5 Analysis of Time Series: Linear Systems Analysis & Nonlinear
Dynamics
Time series analysis is a statistical technique used to analyze time-dependent data. It involves studying the
patterns and trends in the data over time and making predictions about future values.
Linear systems analysis is a technique used in time series analysis to model the behavior of a system
using linear equations. Linear models assume that the relationship between variables is linear and that the
system is time-invariant, meaning that the relationship between variables does not change over time. Linear
systems analysis involves techniques such as autoregressive (AR) and moving average (MA) models, which
use past values of a variable to predict future values.
Nonlinear dynamics is another approach to time series analysis that considers systems that are not
described by linear equations. Nonlinear systems are often more complex and can exhibit chaotic behavior,
making them more difficult to model and predict. Nonlinear dynamics involves techniques such as chaos
theory and fractal analysis, which use mathematical concepts to describe the behavior of nonlinear systems.
Both linear systems analysis and nonlinear dynamics have applications in a wide range of fields, including
finance, economics, and engineering. Linear models are often used in situations where the data is relatively
simple and the relationship between variables is well understood. Nonlinear dynamics is often used in
situations where the data is more complex and the relationship between variables is not well understood.
There are several components of time series analysis, including:
1. Trend Analysis: Trend analysis is used to identify the long-term patterns and trends in the data. It
can be a linear or non-linear trend and may show an upward, downward or flat trend.
2. Seasonal Analysis: Seasonal analysis is used to identify the recurring patterns in the data that occur
within a fixed time period, such as a week, month, or year.
3. Cyclical Analysis: Cyclical analysis is used to identify the patterns that are not necessarily regular
or fixed in duration, but do show a tendency to repeat over time, such as economic cycles or business
cycles.
4. Irregular Analysis: Irregular analysis is used to identify any random fluctuations or noise in the
data that cannot be attributed to any of the above components.
5. Forecasting: Forecasting is the process of predicting future values of a time series based on its past
behavior. It can be done using various statistical techniques such as moving averages, exponential
smoothing, and regression analysis.
Overall, time series analysis is a powerful tool for studying time-dependent data and making predictions
about future values. Linear systems analysis and nonlinear dynamics are two approaches to time series
analysis that can be used in different situations to model and predict complex systems.
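As a minimal sketch of linear time-series modelling (synthetic data, invented coefficient), an autoregressive AR(1) model x_t = a * x_{t-1} + e_t can be fitted by ordinary least squares and used for a one-step-ahead forecast:

import numpy as np

rng = np.random.default_rng(1)
n, a_true = 200, 0.8
x = np.zeros(n)
for t in range(1, n):                        # simulate an AR(1) series
    x[t] = a_true * x[t - 1] + rng.normal(scale=0.5)

# Least-squares estimate of the AR(1) coefficient from lagged values.
a_hat = np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])
print("estimated AR(1) coefficient:", round(a_hat, 3))

# One-step-ahead forecast from the last observed value.
print("forecast for the next value:", round(a_hat * x[-1], 3))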
6 Rule Induction
Rule induction is a machine learning technique used to identify patterns in data and create a set of rules that
can be used to make predictions or decisions about new data. It is often used in decision tree algorithms
and can be applied to both classification and regression problems.
The rule induction process involves analyzing the data to identify common patterns and relationships
between the variables. These patterns are used to create a set of rules that can be used to classify or predict
new data. The rules are typically in the form of "if-then" statements, where the "if" part specifies the
conditions under which the rule applies and the "then" part specifies the action or prediction to be taken.
Rule induction algorithms can be divided into two main types: top-down and bottom-up. Top-down
algorithms start with a general rule that applies to the entire dataset and then refine the rule based on
the data. Bottom-up algorithms start with individual data points and then group them together based on
common attributes.
Rule induction has many applications in fields such as finance, healthcare, and marketing. For example,
it can be used to identify patterns in financial data to predict stock prices or to analyze medical data to
identify risk factors for certain diseases.
Overall, rule induction is a powerful machine learning technique that can be used to identify patterns
in data and create rules that can be used to make predictions or decisions. It is a useful tool for solving
classification and regression problems and has many applications in various fields.
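As a toy sketch of what induced "if-then" rules look like once extracted (the rules, attribute names, and thresholds below are invented purely for illustration):

def classify(record):
    # Rule 1: IF income is high AND debt is low THEN approve.
    if record["income"] > 50000 and record["debt"] < 10000:
        return "approve"
    # Rule 2: IF income is low THEN reject.
    if record["income"] <= 20000:
        return "reject"
    # Default rule when no other rule fires.
    return "review"

print(classify({"income": 60000, "debt": 5000}))   # approve
print(classify({"income": 15000, "debt": 2000}))   # reject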
7 Neural Networks: Learning and Generalization
Neural networks are a class of machine learning algorithms that are inspired by the structure and function
of the human brain. They are used to learn complex patterns and relationships in data and can be used for
a variety of tasks, including classification, regression, and clustering.
Learning in neural networks refers to the process of adjusting the weights and biases of the network to
improve its performance on a particular task. This is typically done through a process called backpropagation,
which involves propagating the errors from the output layer back through the network and adjusting the
weights and biases accordingly.
Generalization in neural networks refers to the ability of the network to perform well on new, unseen
data. A network that has good generalization performance is able to accurately predict the outputs for new
inputs that were not included in the training set. Generalization performance is typically evaluated using a
separate validation set or by cross-validation.
Overfitting is a common problem in neural networks, where the network becomes too complex and starts
to fit the noise in the training data, rather than the underlying patterns. This can result in poor generalization
performance on new data. Techniques such as regularization, early stopping, and dropout are often used to
prevent overfitting and improve generalization performance.
Overall, learning and generalization are two important concepts in neural networks. Learning involves
adjusting the weights and biases of the network to improve its performance, while generalization refers
to the ability of the network to perform well on new, unseen data. Effective techniques for learning and
generalization are critical for building accurate and useful neural network models.
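The following is a hedged sketch of learning by backpropagation: a tiny one-hidden-layer network trained with gradient descent on squared error for the XOR problem. Layer sizes, the learning rate, and the number of epochs are arbitrary illustrative choices.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output weights
lr = 0.5

for epoch in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the output error back through the network.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Adjust weights and biases in the direction that reduces the error.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(np.round(out.ravel(), 2))   # should approach [0, 1, 1, 0]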
8 Competitive Learning
Competitive learning is a type of machine learning technique in which a set of neurons compete to be
activated by input data. The neurons are organized into a layer, and each neuron receives the same input
data. However, only one neuron is activated, and the competition is based on a set of rules that determine
which neuron is activated.
The competition in competitive learning is typically based on a measure of similarity between the input
data and the weights of each neuron. The neuron with the highest similarity to the input data is activated,
and the weights of that neuron are updated to become more similar to the input data. This process is repeated
for multiple iterations, and over time, the neurons learn to become specialized in recognizing different types
of input data.
Competitive learning is often used for unsupervised learning tasks, such as clustering or feature extraction.
In clustering, the neurons learn to group similar input data into clusters, while in feature extraction, the
neurons learn to recognize specific features in the input data.
One of the advantages of competitive learning is that it can be used to discover hidden structures and
patterns in data without the need for labeled data. This makes it particularly useful for applications such
as image and speech recognition, where labeled data can be difficult and expensive to obtain.
Overall, competitive learning is a powerful machine learning technique that can be used for a variety of
unsupervised learning tasks. It involves a set of neurons that compete to be activated by input data, and
over time, the neurons learn to become specialized in recognizing different types of input data.
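A minimal sketch of winner-take-all competitive learning on invented two-cluster data: for each input, the neuron whose weight vector is most similar wins, and only its weights are moved toward that input.

import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical clusters of 2-D input data.
data = np.vstack([rng.normal([0, 0], 0.3, size=(50, 2)),
                  rng.normal([3, 3], 0.3, size=(50, 2))])

weights = rng.normal(size=(2, 2))   # one weight vector per competing neuron
lr = 0.1
for epoch in range(20):
    rng.shuffle(data)
    for x in data:
        # Competition: the neuron closest to the input wins.
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))
        # Only the winner's weights move toward the input.
        weights[winner] += lr * (x - weights[winner])

print(np.round(weights, 2))   # weight vectors drift toward the two cluster centres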
9 Principal Component Analysis and Neural Networks
Principal component analysis (PCA) and neural networks are both machine learning techniques that can be
used for a variety of tasks, including data compression, feature extraction, and dimensionality reduction.
PCA is a linear technique that involves finding the principal components of a dataset, which are the
directions of greatest variance. The principal components can be used to reduce the dimensionality of the
data, while preserving as much of the original variance as possible.
Neural networks, on the other hand, are nonlinear techniques that involve multiple layers of interconnected
neurons. Neural networks can be used for a variety of tasks, including classification, regression, and clustering.
They can also be used for feature extraction, where the network learns to identify the most important features
of the input data.
PCA and neural networks can be used together for a variety of tasks. For example, PCA can be used to
reduce the dimensionality of the data before feeding it into a neural network. This can help to improve the
performance of the network by reducing the amount of noise and irrelevant information in the input data.
Neural networks can also be used to improve the performance of PCA. In some cases, PCA can be
limited by its linear nature, and may not be able to capture complex nonlinear relationships in the data. By
combining PCA with a neural network, the network can learn to capture these nonlinear relationships and
improve the accuracy of the PCA results.
Overall, PCA and neural networks are both powerful machine learning techniques that can be used for
a variety of tasks. When used together, they can improve the performance and accuracy of each technique
and help to solve more complex problems.
31. Pattern Recognition
Principal Component Analysis | Dimension Reduction
Dimension Reduction-
In pattern recognition, Dimension Reduction is defined as-
It is a process of converting a data set having vast dimensions into a data set with lesser dimensions.
It ensures that the converted data set conveys similar information concisely.
Example-
Consider the following example-
The following graph shows two dimensions x1 and x2.
x1 represents the measurement of several objects in cm.
x2 represents the measurement of several objects in inches.
In machine learning, using both these dimensions conveys similar information.
They also introduce a lot of noise into the system.
So, it is better to use just one dimension.
Using dimension reduction techniques-
We convert the dimensions of data from 2 dimensions (x1 and x2) to 1 dimension (z1).
32. It makes the data relatively easier to explain.
Benefits-
Dimension reduction offers several benefits, such as-
It compresses the data and thus reduces the storage space requirements.
It reduces the time required for computation, since fewer dimensions require less computation.
It eliminates redundant features.
It improves the model performance.
Dimension Reduction Techniques-
The two popular and well-known dimension reduction techniques are-
1. Principal Component Analysis (PCA)
2. Fisher Linear Discriminant Analysis (LDA)
In this article, we will discuss Principal Component Analysis.
Principal Component Analysis-
Principal Component Analysis is a well-known dimension reduction technique.
It transforms the variables into a new set of variables called principal components.
These principal components are linear combinations of the original variables and are orthogonal.
33. The first principal component accounts for most of the possible variation in the original data.
The second principal component captures as much of the remaining variance as possible.
There can be only two principal components for a two-dimensional data set.
PCA Algorithm-
The steps involved in the PCA Algorithm are as follows-
Step-01: Get data.
Step-02: Compute the mean vector (µ).
Step-03: Subtract the mean from the given data.
Step-04: Calculate the covariance matrix.
Step-05: Calculate the eigen vectors and eigen values of the covariance matrix.
Step-06: Choose components and form a feature vector.
Step-07: Derive the new data set.
PRACTICE PROBLEMS BASED ON PRINCIPAL COMPONENT ANALYSIS-
Problem-01:
Given data = { 2, 3, 4, 5, 6, 7 ; 1, 5, 3, 6, 7, 8 }.
Compute the principal component using the PCA Algorithm.
OR
Consider the two-dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8).
Compute the principal component using the PCA Algorithm.
OR
OR
34. Compute the principal component of the following data-
CLASS 1
X = 2 , 3 , 4
Y = 1 , 5 , 3
CLASS 2
X = 5 , 6 , 7
Y = 6 , 7 , 8
Solution-
We use the PCA Algorithm discussed above-
Step-01:
Get data.
The given feature vectors are-
x1 = (2, 1)
x2 = (3, 5)
x3 = (4, 3)
x4 = (5, 6)
x5 = (6, 7)
x6 = (7, 8)
Step-02:
Calculate the mean vector (µ).
Mean vector (µ)
35. = ((2 + 3 + 4 + 5 + 6 + 7) / 6, (1 + 5 + 3 + 6 + 7 + 8) / 6)
= (4.5, 5)
Thus,
Step-03:
Subtract mean vector (µ) from the given feature vectors.
x1 – µ = (2 – 4.5, 1 – 5) = (-2.5, -4)
x2 – µ = (3 – 4.5, 5 – 5) = (-1.5, 0)
x3 – µ = (4 – 4.5, 3 – 5) = (-0.5, -2)
x4 – µ = (5 – 4.5, 6 – 5) = (0.5, 1)
x5 – µ = (6 – 4.5, 7 – 5) = (1.5, 2)
x6 – µ = (7 – 4.5, 8 – 5) = (2.5, 3)
Feature vectors (xi) after subtracting mean vector (µ) are-
Step-04:
Calculate the covariance matrix.
The covariance matrix is given by the average of the matrices mi = (xi – µ)(xi – µ)ᵀ over all the feature vectors:
37. = (m1 + m2 + m3 + m4 + m5 + m6) / 6
On adding the above matrices and dividing by 6, we get the covariance matrix
[ 2.92  3.67 ]
[ 3.67  5.67 ]
Step-05:
Calculate the eigen values and eigen vectors of the covariance matrix.
λ is an eigen value for a matrix M if it is a solution of the characteristic equation |M – λI| = 0.
So, we have-
From here,
(2.92 – λ)(5.67 – λ) – (3.67 x 3.67) = 0
16.56 – 2.92λ – 5.67λ + λ² – 13.47 = 0
λ² – 8.59λ + 3.09 = 0
Solving this quadratic equation, we get λ = 8.22, 0.38
38. Thus, the two eigen values are λ1 = 8.22 and λ2 = 0.38.
Clearly, the second eigen value is very small compared to the first eigen value.
So, the second eigen vector can be left out.
The eigen vector corresponding to the greatest eigen value is the principal component for the given data set.
So, we find the eigen vector corresponding to the eigen value λ1.
We use the following equation to find the eigen vector-
MX = λX
where-
M = Covariance Matrix
X = Eigen vector
λ = Eigen value
Substituting the values in the above equation, we get-
2.92 X1 + 3.67 X2 = 8.22 X1
3.67 X1 + 5.67 X2 = 8.22 X2
On simplification, we get-
5.3 X1 = 3.67 X2 ………(1)
3.67 X1 = 2.55 X2 ………(2)
From (1) and (2), X1 = 0.69 X2
Taking X2 = 1 in (2), the eigen vector is (X1, X2) = (0.69, 1).
39. Thus, the principal component for the given data set is the eigen vector (0.69, 1) corresponding to λ1 = 8.22.
Lastly, we project the data points onto the new subspace, i.e. each mean-subtracted point (xi – µ) is projected onto this principal component.
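The whole worked example can be reproduced with a short NumPy sketch (dividing by N = 6, as done above, rather than N – 1); it also projects the pattern (2, 1) from Problem-02 below onto the principal component:

import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)
mu = X.mean(axis=0)                     # mean vector (4.5, 5)
D = X - mu                              # mean-subtracted data
cov = D.T @ D / len(X)                  # covariance matrix [[2.92, 3.67], [3.67, 5.67]]

eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
print(np.round(eigvals, 2))             # approx [0.38, 8.21] (the notes round to 8.22)
pc = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue
print(np.round(pc, 2))                  # proportional to (0.69, 1), up to sign

# Projections of all data points, and of the pattern (2, 1), onto the PC.
print(np.round(D @ pc, 2))
print(np.round((np.array([2, 1]) - mu) @ pc, 2))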
Problem-02:
Use the PCA Algorithm to transform the pattern (2, 1) onto the eigen vector obtained in the previous question.
10 Fuzzy Logic: Extracting Fuzzy Models from Data
Fuzzy logic is a type of logic that allows for degrees of truth, rather than just true or false values. It is often
used in machine learning to extract fuzzy models from data.
A fuzzy model is a model that uses fuzzy logic to make predictions or decisions based on uncertain or
incomplete data. Fuzzy models are particularly useful in situations where traditional models may not work
well, such as when the data is noisy or when there is a lot of uncertainty or ambiguity in the data.
To extract a fuzzy model from data, the first step is to define the input and output variables of the
model. The input variables are the features or attributes of the data, while the output variable is the target
variable that we want to predict or classify.
Next, we use fuzzy logic to define the membership functions for each input and output variable. The
membership functions describe the degree of membership of each data point to each category or class. For
example, a data point may have a high degree of membership to the category "low", but a low degree of
membership to the category "high".
Once the membership functions have been defined, we can use fuzzy inference to make predictions or
decisions based on the input data. Fuzzy inference involves using the membership functions to determine
the degree of membership of each data point to each category or class, and then combining these degrees of
membership to make a prediction or decision.
Overall, extracting fuzzy models from data involves using fuzzy logic to define the membership functions
for each input and output variable, and then using fuzzy inference to make predictions or decisions based on
the input data. Fuzzy models are particularly useful in situations where traditional models may not work
well, and can help to improve the accuracy and robustness of machine learning models.
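As a small, hedged sketch of these ideas (the fuzzy sets, membership breakpoints, and rule weights below are invented for illustration), triangular membership functions and a simple weighted rule give a basic fuzzy inference:

def triangular(x, a, b, c):
    # Degree of membership of x in the triangular fuzzy set defined by (a, b, c).
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def low(temp):
    return triangular(temp, 0, 10, 25)

def high(temp):
    return triangular(temp, 15, 30, 40)

temp = 22.0
mu_low, mu_high = low(temp), high(temp)
print("membership in 'low':", round(mu_low, 2), "| in 'high':", round(mu_high, 2))

# Fuzzy inference: combine the degrees of membership to produce an output,
# e.g. IF temperature is high THEN fan speed is fast (weighted by membership).
fan_speed = 100 * mu_high + 20 * mu_low
print("inferred fan speed:", round(fan_speed, 1))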
10.1 Fuzzy Decision Trees
Fuzzy decision trees are a type of decision tree that use fuzzy logic to make decisions based on uncertain or
imprecise data. Decision trees are a type of supervised learning technique that involve recursively partitioning
the input space into regions that correspond to different classes or categories.
Fuzzy decision trees extend traditional decision trees by allowing for degrees of membership to each
category or class, rather than just a binary classification. This is particularly useful in situations where the
data is uncertain or imprecise, and where a single, crisp classification may not be appropriate.
To build a fuzzy decision tree, we start with a set of training data that consists of input-output pairs.
We then use fuzzy logic to determine the degree of membership of each data point to each category or class.
This is done by defining the membership functions for each input and output variable, and using these to
compute the degree of membership of each data point to each category or class.
Next, we use the fuzzy membership values to construct a fuzzy decision tree. The tree consists of a set of
nodes and edges, where each node represents a test on one of the input variables, and each edge represents
a decision based on the result of the test. The degree of membership of each data point to each category or
class is used to determine the probability of reaching each leaf node of the tree.
Fuzzy decision trees can be used for a variety of tasks, including classification, regression, and clustering.
They are particularly useful in situations where the data is uncertain or imprecise, and where traditional
decision trees may not work well.
Overall, fuzzy decision trees are a powerful machine learning technique that can be used to make decisions
based on uncertain or imprecise data. They extend traditional decision trees by allowing for degrees of
membership to each category or class, and can help to improve the accuracy and robustness of machine
learning models.
11 Stochastic Search Methods
Stochastic search methods are a class of optimization algorithms that use probabilistic techniques to search
for the optimal solution in a large search space. These methods are commonly used in machine learning to
find the best set of parameters for a model, such as the weights in a neural network or the parameters in a
regression model.
Stochastic search methods are often used when the search space is too large to exhaustively search all
possible solutions, or when the objective function is highly nonlinear and has many local optima. The
basic idea behind these methods is to explore the search space by randomly sampling solutions and using
probabilistic techniques to move towards better solutions.
One common stochastic search method is called the stochastic gradient descent (SGD) algorithm. In this
method, the objective function is optimized by iteratively updating the parameters in the direction of the
negative gradient of the objective function. The update rule includes a learning rate, which controls the step
size and the direction of the update. SGD is widely used in training neural networks and other deep learning
models.
Another stochastic search method is called simulated annealing. This method is based on the physical
process of annealing, which involves heating and cooling a material to improve its properties. In simulated
annealing, the search process starts with a high temperature and gradually cools down over time. At each
iteration, the algorithm randomly selects a new solution and computes its fitness. If the new solution is better
than the current solution, it is accepted. However, if the new solution is worse, it may still be accepted with
a certain probability that decreases as the temperature decreases.
Other stochastic search methods include evolutionary algorithms, such as genetic algorithms and particle
swarm optimization, which mimic the process of natural selection and evolution to search for the optimal
solution.
Overall, stochastic search methods are powerful optimization techniques that are widely used in machine
learning and other fields. These methods allow us to efficiently search large search spaces and find optimal
solutions in the presence of noise, uncertainty, and nonlinearity.
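As a minimal sketch of one stochastic search method, the following Python code applies simulated annealing to a one-dimensional function with several local minima; the objective, cooling schedule, and step size are arbitrary illustrative choices:

import math
import random

def f(x):
    # Objective function with multiple local optima.
    return x**2 + 10 * math.sin(x)

random.seed(0)
x = 5.0                        # initial solution
T = 10.0                       # initial (high) temperature
while T > 1e-3:
    x_new = x + random.uniform(-1, 1)      # randomly sample a neighbouring solution
    delta = f(x_new) - f(x)
    # Always accept improvements; accept worse moves with probability exp(-delta/T),
    # which decreases as the temperature cools.
    if delta < 0 or random.random() < math.exp(-delta / T):
        x = x_new
    T *= 0.99                              # gradually cool down
print("final solution:", round(x, 3), "f(x) =", round(f(x), 3))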
43. Printed Page: 1 of 2
Subject Code: KIT601
BTECH
(SEM VI) THEORY EXAMINATION 2021-22
DATA ANALYTICS
Time: 3 Hours Total Marks: 100
Note: Attempt all Sections. If you require any missing data, then choose suitably.
SECTION A
1. Attempt all questions in brief. 2*10 = 20
Qno Questions CO
(a) Discuss the need of data analytics. 1
(b) Give the classification of data. 1
(c) Define neural network. 2
(d) What is multivariate analysis? 2
(e) Give the full form of RTAP and discuss its application. 3
(f) What is the role of sampling data in a stream? 3
(g) Discuss the use of limited pass algorithm. 4
(h) What is the principle behind hierarchical clustering technique? 4
(i) List five R functions used in descriptive statistics. 5
(j) List the names of any 2 visualization tools. 5
SECTION B
2. Attempt any three of the following: 10*3 = 30
Qno Questions CO
(a) Explain the process model and computation model for Big data
platform.
1
(b) Explain the use and advantages of decision trees. 2
(c) Explain the architecture of data stream model. 3
(d) Illustrate the K-means algorithm in detail with its advantages. 4
(e) Differentiate between NoSQL and RDBMS databases. 5
SECTION C
3. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain the various phases of data analytics life cycle. 1
(b) Explain modern data analytics tools in detail. 1
4. Attempt any one part of the following: 10 *1 = 10
Qno Questions CO
(a) Compare various types of support vector and kernel methods of data
analysis.
2
(b) Given data= {2,3,4,5,6,7;1,5,3,6,7,8}. Compute the principal
component using PCA algorithm.
2
44. Printed Page: 2 of 2
Subject Code: KIT601
BTECH
(SEM VI) THEORY EXAMINATION 2021-22
DATA ANALYTICS
5. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain any one algorithm to count number of distinct elements in a
data stream.
3
(b) Discuss the case study of stock market predictions in detail. 3
6. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Differentiate between CLIQUE and ProCLUS clustering. 4
(b) A database has 5 transactions. Let min_sup=60% and min_conf=80%.
TID Items_Bought
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}
i) Find all frequent itemsets using Apriori algorithm.
ii) List all the strong association rules (with support s and confidence
c).
4
7. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain the HIVE architecture with its features in detail. 5
(b) Write R function to check whether the given number is prime or not. 5