Data Analytics (KIT-601)
Unit-2: Data Analysis
Dr. Radhey Shyam
Professor
Department of Information Technology
SRMCEM Lucknow
(Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Lucknow)
Unit-2 has been prepared and compiled by Dr. Radhey Shyam, with grateful acknowledgment to those who made their course contents freely available or contributed directly or indirectly. Feel free to use this study material for your own academic purposes. For any query, communication can be made through this email: shyam0058@gmail.com.
March 11, 2024
Data Analytics (KIT 601)
Course Outcomes (CO) and Bloom's Knowledge Levels (KL)
At the end of the course, the student will be able to:
CO 1: Discuss various concepts of the data analytics pipeline (K1, K2)
CO 2: Apply classification and regression techniques (K3)
CO 3: Explain and apply mining techniques on streaming data (K2, K3)
CO 4: Compare different clustering and frequent pattern mining algorithms (K4)
CO 5: Describe the concept of R programming and implement analytics on Big Data using R (K2, K3)
DETAILED SYLLABUS 3-0-0

Unit I (08 lectures). Introduction to Data Analytics: Sources and nature of data, classification of data (structured, semi-structured, unstructured), characteristics of data, introduction to Big Data platform, need of data analytics, evolution of analytic scalability, analytic process and tools, analysis vs reporting, modern data analytic tools, applications of data analytics. Data Analytics Lifecycle: Need, key roles for successful analytic projects, various phases of data analytics lifecycle: discovery, data preparation, model planning, model building, communicating results, operationalization.

Unit II (08 lectures). Data Analysis: Regression modeling, multivariate analysis, Bayesian modeling, inference and Bayesian networks, support vector and kernel methods, analysis of time series: linear systems analysis & nonlinear dynamics, rule induction, neural networks: learning and generalisation, competitive learning, principal component analysis and neural networks, fuzzy logic: extracting fuzzy models from data, fuzzy decision trees, stochastic search methods.

Unit III (08 lectures). Mining Data Streams: Introduction to streams concepts, stream data model and architecture, stream computing, sampling data in a stream, filtering streams, counting distinct elements in a stream, estimating moments, counting oneness in a window, decaying window, Real-time Analytics Platform (RTAP) applications, case studies: real-time sentiment analysis, stock market predictions.

Unit IV (08 lectures). Frequent Itemsets and Clustering: Mining frequent itemsets, market based modelling, Apriori algorithm, handling large data sets in main memory, limited pass algorithm, counting frequent itemsets in a stream, clustering techniques: hierarchical, K-means, clustering high dimensional data, CLIQUE and ProCLUS, frequent pattern based clustering methods, clustering in non-euclidean space, clustering for streams and parallelism.

Unit V (08 lectures). Frame Works and Visualization: MapReduce, Hadoop, Pig, Hive, HBase, MapR, Sharding, NoSQL Databases, S3, Hadoop Distributed File Systems, Visualization: visual data analysis techniques, interaction techniques, systems and applications. Introduction to R: R graphical user interfaces, data import and export, attribute and data types, descriptive statistics, exploratory data analysis, visualization before analysis, analytics for unstructured data.
Text books and References:
1. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer.
2. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press.
3. John Garrett, Data Analytics for IT Networks: Developing Innovative Use Cases, Pearson Education.
Unit-II: Data Analysis
Data analysis refers to the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, drawing conclusions, and supporting decision-making. It is a critical component of many fields, including business, finance, healthcare, engineering, and the social sciences.
The data analysis process typically involves the following steps:
• Data collection: This step involves gathering data from various sources, such as databases, surveys, sensors, and social media.
• Data cleaning: This step involves removing errors, inconsistencies, and outliers from the data. It may also involve imputing missing values, transforming variables, and normalizing the data.
• Data exploration: This step involves visualizing and summarizing the data to gain insights and identify patterns. This may include statistical analyses, such as descriptive statistics, correlation analysis, and hypothesis testing.
• Data modeling: This step involves developing mathematical models to predict or explain the behavior of the data. This may include regression analysis, time series analysis, machine learning, and other techniques.
• Data visualization: This step involves creating visual representations of the data to communicate insights and findings to stakeholders. This may include charts, graphs, tables, and other visualizations.
• Decision-making: This step involves using the results of the data analysis to make informed decisions, develop strategies, and take actions.
Data analysis is a complex and iterative process that requires expertise in statistics, programming, and
domain knowledge. It is often performed using specialized software, such as R, Python, SAS, and Excel, as
well as cloud-based platforms, such as Amazon Web Services and Google Cloud Platform. Effective data
analysis can lead to better business outcomes, improved healthcare outcomes, and a deeper understanding
of complex phenomena.
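As a small illustration of the exploration and modeling steps in R (one of the tools named above), using the built-in mtcars data set:

# Minimal base-R sketch of the exploration and modeling steps,
# using the built-in mtcars data set.
summary(mtcars$mpg)                 # descriptive statistics
cor(mtcars$wt, mtcars$mpg)          # correlation analysis
plot(mtcars$wt, mtcars$mpg)         # visualization before analysis
m <- lm(mpg ~ wt, data = mtcars)    # a first regression model
summary(m)                          # coefficients, R-squared, etc.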
1 Regression Modeling
Regression modeling is a statistical technique used to examine the relationship between a dependent
variable (also called the outcome or response variable) and one or more independent variables (also called
predictors or explanatory variables). The goal of regression modeling is to identify the nature and strength
of the relationship between the dependent variable and the independent variable(s) and to use this infor-
mation to make predictions about the dependent variable.
There are many different types of regression models, including linear regression, logistic regression, polynomial regression¹, and multivariate regression. Linear regression is one of the most commonly used types of regression modeling, and it assumes that the relationship between the dependent variable and the independent variable(s) is linear.
Regression modeling is used in a wide range of fields, including economics, finance, psychology, and epidemiology², among others. It is often used to understand the relationships between different factors and to make predictions about future outcomes.
1.1 Regression
1.1.1 Simple Linear Regression
Linear Regression— In statistics, linear regression is a linear approach to modeling the relationship
between a scalar response (or dependent variable) and one or more explanatory variables (or independent
variables). The case of one explanatory variable is called simple linear regression.
• Linear regression is used to predict a continuous dependent variable from a given set of independent variables.
• Linear regression is used for solving regression problems.
• In linear regression, the values of continuous variables are predicted.
• Linear regression tries to find the best-fit line, through which the output can be easily predicted.
¹ In statistics, polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial in x.
² Epidemiology is the study (scientific, systematic, and data-driven) of the distribution (frequency, pattern) and determinants (causes, risk factors) of health-related states and events (not just diseases) in specified populations (neighborhood, school, city, state, country, global).
• The least squares estimation method³ is used for estimation of accuracy⁴.
• The output of linear regression must be a continuous value, such as price, age, etc.
• In linear regression, the relationship between the dependent variable and the independent variable(s) must be linear.
• In linear regression, there may be collinearity⁵ between the independent variables.
³ The least squares method is a statistical procedure to find the best fit for a set of data points by minimizing the sum of the offsets of points from the plotted curve. Least squares regression is used to predict the behavior of dependent variables.
⁴ Accuracy is how close a measured value is to the actual value; precision is how close the measured values are to each other.
⁵ Collinearity is a condition in which some of the independent variables are highly correlated.
Some regression examples:
• Regression analysis is used in statistics to find trends in data. For example, you might guess that there is a connection between how much you eat and how much you weigh; regression analysis can help you quantify that.
• Regression analysis will provide you with an equation for a graph so that you can make predictions about your data. For example, if you've been putting on weight over the last few years, it can predict how much you'll weigh in ten years' time if you continue to put on weight at the same rate.
• Simple linear regression establishes the relationship between two variables using a straight line. If two or more explanatory variables have a linear relationship with the dependent variable, the regression is called a multiple linear regression.
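A minimal base-R sketch of the weight example above; the numbers are invented for illustration:

# Simple linear regression on invented data: weight (kg) vs. intake (kcal/day).
intake <- c(1800, 2100, 2300, 2500, 2700, 3000)
weight <- c(62, 66, 69, 74, 78, 83)
fit <- lm(weight ~ intake)                         # least squares best-fit line
coef(fit)                                          # intercept and slope
predict(fit, newdata = data.frame(intake = 3200))  # predict for a new intake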
1.1.2 Logistic Regression
Logistic Regression is used to solve classification problems: given an element, you have to assign it to one of N categories. Typical examples are classifying an email as spam or not, or
given a vehicle, finding which category it belongs to (car, truck, van, etc.). The output is thus a finite set of discrete values.
• Logistic regression is used to predict a categorical dependent variable from a given set of independent variables.
• Logistic regression is used for solving classification problems.
• In logistic regression, we predict the values of categorical variables.
• In logistic regression, we find the S-curve by which we can classify the samples.
• The maximum likelihood estimation method is used for estimation of accuracy.
• The output of logistic regression must be a categorical value such as 0 or 1, Yes or No, etc.
• In logistic regression, it is not required to have a linear relationship between the dependent and independent variables.
• In logistic regression, there should not be collinearity between the independent variables.
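A minimal base-R sketch of the points above: a logistic regression fitted by maximum likelihood with glm(), producing an S-curve probability (the pass/fail data are invented):

# Logistic regression in base R: predict a binary outcome (0/1)
# from one numeric predictor. Data are invented for illustration.
hours  <- c(0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5)
passed <- c(0,   0, 0,   0, 1,   0, 1,   1, 1,   1)
fit <- glm(passed ~ hours, family = binomial)   # maximum likelihood estimation
predict(fit, newdata = data.frame(hours = 2.8), type = "response")
# returns an estimated probability; the S-shaped curve is the logistic function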
2 Multivariate Analysis
Multivariate analysis is a statistical technique used to examine the relationships between multiple variables
simultaneously. It is used when there are multiple dependent variables and/or independent variables that
are interrelated.
Multivariate analysis is used in a wide range of fields, including social sciences, marketing, biology, and finance, among others. There are many different types of multivariate analysis, including multivariate regression, principal component analysis, factor analysis, cluster analysis, and discriminant analysis.
Multivariate regression is similar to linear regression, but it involves more than one independent variable.
It is used to predict the value of a dependent variable based on two or more independent variables. Principal
component analysis (PCA) is a technique used to reduce the dimensionality of data by identifying patterns
and relationships between variables. Factor analysis is a technique used to identify underlying factors that
explain the correlations between multiple variables. Cluster analysis is a technique used to group objects or
individuals into clusters based on similarities or dissimilarities. Discriminant analysis is a technique used to
determine which variables discriminate between two or more groups.
Overall, multivariate analysis is a powerful tool for examining complex relationships between multiple
variables, and it can help researchers and analysts gain a deeper understanding of the data they are working
with.
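Two of the techniques named above can be sketched in base R on the built-in mtcars data: a multivariate regression with two dependent variables, and a principal component analysis:

# Multivariate regression: two dependent variables modeled jointly (base R).
fit <- lm(cbind(mpg, qsec) ~ wt + hp, data = mtcars)
coef(fit)                       # one column of coefficients per response
# Principal component analysis on a few predictors:
pc <- prcomp(mtcars[, c("wt", "hp", "disp")], scale. = TRUE)
summary(pc)                     # proportion of variance per component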
3 Bayesian Modeling
Bayesian modeling is a statistical modeling approach that uses Bayesian inference to make predictions and
estimate parameters. It is named after Thomas Bayes, an 18th-century statistician who developed the Bayes
theorem, which is a key component of Bayesian modeling.
In Bayesian modeling, prior information about the parameters of interest is combined with data to
produce a posterior distribution. This posterior distribution represents the updated probability distribution
of the parameters given the data and the prior information. The posterior distribution is used to make
inferences and predictions about the parameters.
Bayesian modeling is particularly useful when there is limited data or when the data is noisy or uncertain.
It allows for the incorporation of prior knowledge and beliefs into the modeling process, which can improve
the accuracy and precision of predictions.
Bayesian modeling is used in a wide range of fields, including finance, engineering, ecology, and social
sciences. Some examples of Bayesian modeling applications include predicting stock prices, estimating the
prevalence of a disease in a population, and analyzing the effects of environmental factors on a species.
3.1 Bayes Theorem
• Goal: to determine the most probable hypothesis, given the data D plus any initial knowledge about the prior probabilities of the various hypotheses in H.
• Prior probability of h, P(h): reflects any background knowledge we have about the chance that h is a correct hypothesis (before having observed the data).
• Prior probability of D, P(D): reflects the probability that training data D will be observed given no knowledge about which hypothesis h holds.
• Conditional probability of observation D, P(D|h): denotes the probability of observing data D given some world in which hypothesis h holds.
• Posterior probability of h, P(h|D): represents the probability that h holds given the observed training data D. It reflects our confidence that h holds after we have seen the training data D, and it is the quantity that machine learning researchers are interested in.
• Bayes' theorem allows us to compute P(h|D):
P(h|D) = P(D|h) P(h) / P(D)
Maximum A Posteriori (MAP) Hypothesis and Maximum Likelihood
• Goal: to find the most probable hypothesis h from a set of candidate hypotheses H given the observed data D. The MAP hypothesis is
hMAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h),
where the last step holds because P(D) is constant with respect to h.
• If every hypothesis in H is equally probable a priori, we only need to consider the likelihood of the data D given h, P(D|h). Then hMAP becomes the maximum likelihood hypothesis,
hML = argmax_{h∈H} P(D|h).
Overall, Bayesian modeling is a powerful tool for making predictions and estimating parameters in situ-
ations where there is uncertainty and prior information is available.
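To make Bayes' theorem and the MAP hypothesis concrete, here is a tiny base-R sketch with two hypotheses about a coin; the priors and bias values are assumptions chosen purely for illustration:

# Two hypotheses about a coin: fair (P(heads) = 0.5) or biased (P(heads) = 0.9).
# Observed data D: 8 heads and 2 tails in 10 flips.
prior <- c(fair = 0.7, biased = 0.3)          # P(h), assumed priors
lik   <- c(fair   = 0.5^8 * 0.5^2,            # P(D|h); the binomial coefficient
           biased = 0.9^8 * 0.1^2)            # is the same for both and cancels
post  <- prior * lik / sum(prior * lik)       # Bayes' theorem: P(h|D)
post                                          # posterior probabilities
names(which.max(post))                        # hMAP, the most probable hypothesis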
4 Inference and Bayesian networks
Inference in Bayesian networks is the process of using probabilistic reasoning to make predictions or draw
conclusions about a system or phenomenon. Bayesian networks are graphical models that represent the
relationships between variables using a directed acyclic graph, where nodes represent variables and edges
represent probabilistic dependencies between the variables.
Inference in Bayesian networks involves calculating the posterior probability distribution of one or more
variables given evidence about other variables in the network. This can be done using Bayesian inference,
which involves updating the prior probability distribution of the variables using Bayes’ theorem and the
observed evidence.
The posterior distribution can be used to make predictions or draw conclusions about the system or
phenomenon being modeled. For example, in a medical diagnosis system, the posterior probability of a
particular disease given a set of symptoms can be calculated using a Bayesian network. This can help
clinicians make a more accurate diagnosis and choose appropriate treatments.
Bayesian networks and inference are widely used in many fields, including artificial intelligence, decision
making, finance, and engineering. They are particularly useful in situations where there is uncertainty and
probabilistic relationships between variables need to be modeled and analyzed.
4.1 BAYESIAN NETWORKS
• Abbreviation: BBN (Bayesian Belief Network).
• Synonyms: Bayes(ian) network, Bayes(ian) model, belief network, decision network, or probabilistic directed acyclic graphical model.
• A BBN is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a Directed Acyclic Graph (DAG).
• BBNs enable us to model and reason about uncertainty. BBNs accommodate both subjective probabilities and probabilities based on objective data.
• The most important use of BBNs is in revising probabilities in the light of actual observations of events.
• Nodes represent variables in the Bayesian sense: observable quantities, hidden variables, or hypotheses. Edges represent conditional dependencies.
• Each node is associated with a probability function that takes, as input, a particular set of probabilities for the values of the node's parent variables, and outputs the probability of the values of the variable represented by the node.
• Prior probabilities: e.g. P(RAIN).
• Conditional probabilities: e.g. P(SPRINKLER | RAIN).
• Joint probability function: P(GRASS WET, SPRINKLER, RAIN) = P(GRASS WET | RAIN, SPRINKLER) × P(SPRINKLER | RAIN) × P(RAIN) (a numeric sketch follows this list).
• Typically the probability functions are described in table form.
• A BN cannot be used to model purely correlational (undirected or cyclic) relationships between random variables; its edges encode directed conditional dependencies.
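A small numeric sketch of the joint probability factorization above, in base R; the probability values are assumed purely for illustration and are not given in the notes:

# Sprinkler network: P(W, S, R) = P(W | S, R) * P(S | R) * P(R).
p_rain <- 0.2                                      # P(RAIN), assumed
p_sprinkler_given_rain <- c(yes = 0.01, no = 0.4)  # P(SPRINKLER | RAIN = yes/no)
p_wet_given <- function(sprinkler, rain) {         # P(GRASS WET | SPRINKLER, RAIN)
  if (sprinkler && rain) 0.99 else if (sprinkler) 0.90
  else if (rain) 0.80 else 0.00
}
# Joint probability of: grass wet, sprinkler on, no rain.
p_wet_given(TRUE, FALSE) * p_sprinkler_given_rain["no"] * (1 - p_rain)
# 0.9 * 0.4 * 0.8 = 0.288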
Overall, inference in Bayesian networks is a powerful tool for making predictions and drawing conclusions
in situations where there is uncertainty and complex probabilistic relationships between variables.
4.2 Support Vector and Kernel Methods
Support vector machines (SVMs) and kernel methods are commonly used in machine learning and pattern
recognition to solve classification and regression problems.
SVMs are a type of supervised learning algorithm that aims to find the optimal hyperplane that separates
the data into different classes. The optimal hyperplane is the one that maximizes the margin, or the distance
between the hyperplane and the closest data points from each class. SVMs can also use kernel functions to
transform the original input data into a higher dimensional space, where it may be easier to find a separating
hyperplane.
Kernel methods are a class of algorithms that use kernel functions to compute the similarity between
pairs of data points. Kernel functions can transform the input data into a higher dimensional feature space,
where linear methods can be applied more effectively. Some commonly used kernel functions include linear,
polynomial, and radial basis functions.
Kernel methods are used in a variety of applications, including image recognition, speech recognition,
and natural language processing. They are particularly useful in situations where the data is non-linear and
the relationship between variables is complex.
History of SVM⁶
• SVM is related to statistical learning theory.
• SVM was first introduced in 1992.
• SVM became popular because of its success in handwritten digit recognition: a 1.1% test error rate for SVM, the same as the error rate of a carefully constructed neural network.
• SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning.
Binary Classification
Given training data (xi, yi) for i = 1, . . . , N, with xi ∈ R^d and yi ∈ {−1, 1}, learn a classifier f(x) such that
f(xi) ≥ 0 if yi = +1 and f(xi) < 0 if yi = −1,
i.e. yi f(xi) > 0 for a correct classification.
Linear separability
[Figure: examples of linearly separable and not linearly separable data.]
⁶ A support vector machine is a linear model and it always looks for a hyperplane to separate one class from another. We focus on the two-dimensional case because it is easier to comprehend and possible to visualize, to give some intuition; bear in mind that the same is true in higher dimensions (lines simply change into planes, parabolas into paraboloids, etc.).
Linear classifiers
A linear classifier has the form f(x) = wᵀx + b.
• In 2D the discriminant f(x) = 0 is a line; f(x) > 0 on one side of it and f(x) < 0 on the other.
• w is the normal to the line, and b is the bias.
• w is known as the weight vector.
[Figure: a separating line in the (x1, x2) plane.]
• In 3D the discriminant f(x) = wᵀx + b = 0 is a plane, and in nD it is a hyperplane.
For a K-NN classifier it was necessary to "carry" the training data; for a linear classifier, the training data is used to learn w and is then discarded. Only w is needed for classifying new data.
Given linearly separable data xi labelled into two categories yi ∈ {−1, 1}, find a weight vector w such that the discriminant function f(xi) = wᵀxi + b separates the categories for i = 1, . . . , N. How can we find this separating hyperplane?
The Perceptron Classifier
The Perceptron Algorithm
Write the classifier as f(xi) = w̃ᵀx̃i + w0 = wᵀxi, where w = (w̃, w0) and xi = (x̃i, 1).
• Initialize w = 0.
• Cycle through the data points {xi, yi}; if xi is misclassified, then update w ← w + α yi xi.
• Repeat until all the data points are correctly classified.
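A minimal base-R implementation of the algorithm above, run on an invented linearly separable data set:

# Perceptron in base R. X: rows are points x_i; y in {-1, +1}.
perceptron <- function(X, y, alpha = 1, max_epochs = 100) {
  X <- cbind(X, 1)                         # absorb the bias: x_i <- (x_i, 1)
  w <- rep(0, ncol(X))                     # initialize w = 0
  for (epoch in 1:max_epochs) {
    errors <- 0
    for (i in seq_len(nrow(X))) {
      if (y[i] * sum(w * X[i, ]) <= 0) {   # x_i is misclassified
        w <- w + alpha * y[i] * X[i, ]     # update: w <- w + alpha y_i x_i
        errors <- errors + 1
      }
    }
    if (errors == 0) break                 # all points correctly classified
  }
  w
}
# Toy data: class +1 above the line x2 = x1, class -1 below it.
X <- matrix(c(1, 2,  2, 3,  3, 1,  4, 2), ncol = 2, byrow = TRUE)
y <- c(1, 1, -1, -1)
perceptron(X, y)   # learned weight vector (last entry is the bias)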
For example, in 2D:
[Figure: the weight vector w before and after an update.]
NB: after convergence, w = Σi αi xi.
Properties of the perceptron algorithm:
• If the data is linearly separable, then the algorithm will converge.
• Convergence can be slow.
• The separating line can end up close to the training data; we would prefer a larger margin for generalization.
[Figure: a perceptron solution on example data.]
What is the best w? The maximum margin solution: it is the most stable under perturbations of the inputs.
Tennis example
[Figure: points in the (Temperature, Humidity) plane, labelled "play tennis" vs "do not play tennis".]
Linear Support Vector Machines
Data: ⟨xi, yi⟩, i = 1, . . . , l, with xi ∈ R^d and yi ∈ {−1, +1}.
All hyperplanes in R^d are parameterized by a vector w and a constant b, and can be expressed as w•x + b = 0 (remember the equation for a hyperplane from algebra). Our aim is to find such a hyperplane, f(x) = sign(w•x + b), that correctly classifies our data.
Definitions
Define the hyperplane H such that:
xi•w + b ≥ +1 when yi = +1
xi•w + b ≤ −1 when yi = −1
d+ = the shortest distance to the closest positive point; d− = the shortest distance to the closest negative point. The margin of a separating hyperplane is d+ + d−.
H1 and H2 are the planes:
H1: xi•w + b = +1
H2: xi•w + b = −1
The points on the planes H1 and H2 are the support vectors.
Maximizing the margin
We want a classifier with as big a margin as possible. Recall that the distance from a point (x0, y0) to the line Ax + By + c = 0 is |Ax0 + By0 + c| / sqrt(A² + B²). The distance between H and H1 is |w•x + b| / ||w|| = 1/||w||, so the distance between H1 and H2 is 2/||w||.
In order to maximize the margin, we need to minimize ||w||, with the condition that there are no data points between H1 and H2:
xi•w + b ≥ +1 when yi = +1
xi•w + b ≤ −1 when yi = −1
These can be combined into yi(xi•w + b) ≥ 1.
Constrained Optimization Problem
Minimize ||w||² subject to yi(xi•w + b) ≥ 1 for all i.
Lagrangian method: maximize inf over (w, b) of L(w, b, α), where
L(w, b, α) = (1/2)||w||² − Σi αi [yi(xi•w + b) − 1].
At the extremum, the partial derivatives of L with respect to both w and b must be 0. Taking the derivatives, setting them to 0, substituting back into L, and simplifying yields the dual problem:
Maximize L(α) = Σi αi − (1/2) Σi,j αi αj yi yj (xi•xj)
subject to Σi αi yi = 0 and αi ≥ 0.
Quadratic Programming
Why is this reformulation a good thing? The problem
Maximize L(α) = Σi αi − (1/2) Σi,j αi αj yi yj (xi•xj), subject to Σi αi yi = 0 and αi ≥ 0,
is an instance of what is called a positive semi-definite programming problem. For a fixed real-number accuracy, it can be solved in O(n log n) time = O(|D|² log |D|²).
Problems with linear SVM
[Figure: two classes that are not linearly separable.]
What if the decision function is not linear?
Kernel Trick
The data points become linearly separable in the space (x1², x2², √2·x1x2).
We want to maximize
L(α) = Σi αi − (1/2) Σi,j αi αj yi yj F(xi)•F(xj).
Define K(xi, xj) = F(xi)•F(xj).
Cool thing: K is often easy to compute directly! Here,
K(xi, xj) = (xi•xj)².
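A quick numerical check of this identity in base R; the feature map F below is exactly the one named above:

# Verify the kernel trick: for F(x) = (x1^2, x2^2, sqrt(2) x1 x2),
# the feature-space dot product F(x).F(z) equals K(x, z) = (x.z)^2.
F <- function(x) c(x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])
x <- c(1, 2); z <- c(3, -1)
sum(F(x) * F(z))   # explicit dot product in the feature space: 1
sum(x * z)^2       # kernel computed in the original space: (3 - 2)^2 = 1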
Other Kernels
The polynomial kernel: K(xi, xj) = (xi•xj + 1)^p, where p is a tunable parameter. Evaluating K only requires one addition and one exponentiation more than the original dot product.
Gaussian kernels (also called radial basis functions): K(xi, xj) = exp(−||xi − xj||² / (2σ²)).
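As a usage sketch, the following fits a kernel SVM in R; it assumes the add-on package e1071 (not part of base R), whose svm() function implements the method described above:

# Kernel SVM in R (assumes: install.packages("e1071")).
library(e1071)
d <- iris[iris$Species != "setosa", ]           # a two-class subset of iris
d$Species <- droplevels(d$Species)
fit <- svm(Species ~ Petal.Length + Petal.Width, data = d,
           kernel = "radial", gamma = 0.5, cost = 1)  # Gaussian (RBF) kernel
fit$tot.nSV                                     # number of support vectors
table(predict(fit, d), d$Species)               # training confusion matrix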
Overtraining/overfitting
[Figure: an overfitted decision boundary.]
An example: a botanist who really knows trees claims, every time he sees a new tree, that it is not a tree (because it differs slightly from the trees he has memorized). A well-known problem with machine learning methods is overtraining: we have learned the training data very well, but we cannot classify unseen examples correctly.
It can be shown that the portion n of unseen data that will be misclassified is bounded by
n ≤ (number of support vectors) / (number of training examples).
This is a measure of the risk of overtraining with SVM (there are also other measures). Ockham's razor principle: simpler systems are better than more complex ones. In the SVM case, fewer support vectors mean a simpler representation of the hyperplane. Example: understanding a certain cancer is easier if it can be described by one gene than if we have to describe it with 5000.
A practical example: protein localization
• Proteins are synthesized in the cytosol.
• They are transported into different subcellular locations where they carry out their functions.
• Aim: to predict in what location a certain protein will end up.
Overall, SVMs and kernel methods are powerful tools for solving classification and regression problems. They
can handle complex data and provide accurate predictions, making them valuable in many fields, including
finance, healthcare, and engineering.
5 Analysis of Time Series: Linear Systems Analysis & Nonlinear
Dynamics
Time series analysis is a statistical technique used to analyze time-dependent data. It involves studying the
patterns and trends in the data over time and making predictions about future values.
Linear systems analysis is a technique used in time series analysis to model the behavior of a system
using linear equations. Linear models assume that the relationship between variables is linear and that the
system is time-invariant, meaning that the relationship between variables does not change over time. Linear
systems analysis involves techniques such as autoregressive (AR) and moving average (MA) models, which
use past values of a variable to predict future values.
Nonlinear dynamics is another approach to time series analysis that considers systems that are not
described by linear equations. Nonlinear systems are often more complex and can exhibit chaotic behavior,
making them more difficult to model and predict. Nonlinear dynamics involves techniques such as chaos
theory and fractal analysis, which use mathematical concepts to describe the behavior of nonlinear systems.
Both linear systems analysis and nonlinear dynamics have applications in a wide range of fields, including
finance, economics, and engineering. Linear models are often used in situations where the data is relatively
simple and the relationship between variables is well understood. Nonlinear dynamics is often used in
situations where the data is more complex and the relationship between variables is not well understood.
There are several components of time series analysis, including:
1. Trend Analysis: Trend analysis is used to identify the long-term patterns and trends in the data. It
can be a linear or non-linear trend and may show an upward, downward or flat trend.
2. Seasonal Analysis: Seasonal analysis is used to identify the recurring patterns in the data that occur
within a fixed time period, such as a week, month, or year.
3. Cyclical Analysis: Cyclical analysis is used to identify the patterns that are not necessarily regular
or fixed in duration, but do show a tendency to repeat over time, such as economic cycles or business
cycles.
4. Irregular Analysis: Irregular analysis is used to identify any random fluctuations or noise in the
data that cannot be attributed to any of the above components.
5. Forecasting: Forecasting is the process of predicting future values of a time series based on its past behavior. It can be done using various statistical techniques such as moving averages, exponential smoothing, and regression analysis (a short R sketch follows this list).
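As a minimal sketch in base R, the following fits a seasonal ARIMA model, which combines the autoregressive (AR) and moving average (MA) ideas above, to the built-in AirPassengers series; the chosen model orders are illustrative, not tuned:

# AR/MA-style modeling and forecasting on the built-in AirPassengers series.
fit <- arima(AirPassengers, order = c(1, 1, 1),        # AR, differencing, MA
             seasonal = list(order = c(0, 1, 1), period = 12))
predict(fit, n.ahead = 12)$pred   # point forecasts for the next 12 months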
Overall, time series analysis is a powerful tool for studying time-dependent data and making predictions
about future values. Linear systems analysis and nonlinear dynamics are two approaches to time series
analysis that can be used in different situations to model and predict complex systems.
6 Rule Induction
Rule induction is a machine learning technique used to identify patterns in data and create a set of rules that
can be used to make predictions or decisions about new data. It is often used in decision tree algorithms
and can be applied to both classification and regression problems.
The rule induction process involves analyzing the data to identify common patterns and relationships between the variables. These patterns are used to create a set of rules that can be used to classify or predict new data. The rules are typically in the form of "if-then" statements, where the "if" part specifies the conditions under which the rule applies and the "then" part specifies the action or prediction to be taken.
Rule induction algorithms can be divided into two main types: top-down and bottom-up. Top-down
algorithms start with a general rule that applies to the entire dataset and then refine the rule based on
the data. Bottom-up algorithms start with individual data points and then group them together based on
common attributes.
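A minimal sketch of rule induction in base R, in the spirit of the classic 1R procedure: each value of a single attribute is mapped to its majority class, giving a set of if-then rules (the tennis data below are invented):

# One-attribute rule induction: value -> majority class, plus the rule's error.
one_rule <- function(attr, class) {
  tab  <- table(attr, class)
  rule <- colnames(tab)[apply(tab, 1, which.max)]  # majority class per value
  names(rule) <- rownames(tab)
  err <- 1 - sum(apply(tab, 1, max)) / length(class)
  list(rule = rule, error = err)
}
outlook <- c("sunny","sunny","overcast","rain","rain","overcast","sunny")
play    <- c("no",   "no",   "yes",     "yes", "no",  "yes",     "yes")
one_rule(outlook, play)   # e.g. "if outlook = overcast then play = yes"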
Rule induction has many applications in fields such as finance, healthcare, and marketing. For example,
it can be used to identify patterns in financial data to predict stock prices or to analyze medical data to
identify risk factors for certain diseases.
Overall, rule induction is a powerful machine learning technique that can be used to identify patterns
in data and create rules that can be used to make predictions or decisions. It is a useful tool for solving
classification and regression problems and has many applications in various fields.
7 Neural Networks: Learning and Generalization
Neural networks are a class of machine learning algorithms that are inspired by the structure and function
of the human brain. They are used to learn complex patterns and relationships in data and can be used for
a variety of tasks, including classification, regression, and clustering.
Learning in neural networks refers to the process of adjusting the weights and biases of the network to
improve its performance on a particular task. This is typically done through a process called backpropagation,
which involves propagating the errors from the output layer back through the network and adjusting the
weights and biases accordingly.
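A minimal backpropagation sketch in base R: one hidden layer of sigmoid units trained by full-batch gradient descent on the XOR problem. The architecture, learning rate, and iteration count are illustrative choices:

# One-hidden-layer network trained by backpropagation (squared error).
set.seed(1)
sigmoid <- function(z) 1 / (1 + exp(-z))
X  <- matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE)
y  <- c(0, 1, 1, 0)                          # XOR targets
Xb <- cbind(X, 1)                            # append 1s: bias absorbed in weights
W1 <- matrix(rnorm(3 * 3, sd = 0.5), 3, 3)   # (2 inputs + bias) -> 3 hidden units
W2 <- matrix(rnorm(4, sd = 0.5), 4, 1)       # (3 hidden + bias) -> 1 output
lr <- 0.5
for (step in 1:10000) {
  H    <- sigmoid(Xb %*% W1)                 # forward pass
  Hb   <- cbind(H, 1)
  yhat <- sigmoid(Hb %*% W2)
  d2 <- (yhat - y) * yhat * (1 - yhat)       # error signal at the output layer
  d1 <- (d2 %*% t(W2[1:3, , drop = FALSE])) * H * (1 - H)  # propagated back
  W2 <- W2 - lr * t(Hb) %*% d2               # gradient-descent weight updates
  W1 <- W1 - lr * t(Xb) %*% d1
}
round(yhat)  # ideally 0, 1, 1, 0; training can occasionally stall in a local minimum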
Generalization in neural networks refers to the ability of the network to perform well on new, unseen
data. A network that has good generalization performance is able to accurately predict the outputs for new
inputs that were not included in the training set. Generalization performance is typically evaluated using a
separate validation set or by cross-validation.
Overfitting is a common problem in neural networks, where the network becomes too complex and starts
to fit the noise in the training data, rather than the underlying patterns. This can result in poor generalization
performance on new data. Techniques such as regularization, early stopping, and dropout are often used to
prevent overfitting and improve generalization performance.
Overall, learning and generalization are two important concepts in neural networks. Learning involves
adjusting the weights and biases of the network to improve its performance, while generalization refers
to the ability of the network to perform well on new, unseen data. Effective techniques for learning and
generalization are critical for building accurate and useful neural network models.
7.1 Neural Network concepts
[Figures: neural network concepts.]
8 Competitive Learning
Competitive learning is a type of machine learning technique in which a set of neurons compete to be
activated by input data. The neurons are organized into a layer, and each neuron receives the same input
data. However, only one neuron is activated, and the competition is based on a set of rules that determine
which neuron is activated.
The competition in competitive learning is typically based on a measure of similarity between the input
data and the weights of each neuron. The neuron with the highest similarity to the input data is activated,
and the weights of that neuron are updated to become more similar to the input data. This process is repeated
for multiple iterations, and over time, the neurons learn to become specialized in recognizing different types
of input data.
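A minimal winner-take-all sketch in base R: two neurons compete for each input, and the winner's weights move toward that input, so each neuron specializes on one of two synthetic clusters (cluster positions and learning rate are invented):

# Winner-take-all competitive learning on two toy clusters.
set.seed(2)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # cluster near (0, 0)
           matrix(rnorm(40, mean = 5), ncol = 2))   # cluster near (5, 5)
W <- X[sample(nrow(X), 2), ]                        # initialize 2 neurons
lr <- 0.1
for (epoch in 1:20) {
  for (i in sample(nrow(X))) {
    d   <- rowSums((W - matrix(X[i, ], nrow(W), 2, byrow = TRUE))^2)
    win <- which.min(d)                             # the competition
    W[win, ] <- W[win, ] + lr * (X[i, ] - W[win, ]) # winner moves toward input
  }
}
W   # each neuron has specialized on one cluster (close to the cluster means)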
Competitive learning is often used for unsupervised learning tasks, such as clustering or feature extraction.
In clustering, the neurons learn to group similar input data into clusters, while in feature extraction, the
neurons learn to recognize specific features in the input data.
One of the advantages of competitive learning is that it can be used to discover hidden structures and
patterns in data without the need for labeled data. This makes it particularly useful for applications such
as image and speech recognition, where labeled data can be difficult and expensive to obtain.
Overall, competitive learning is a powerful machine learning technique that can be used for a variety of
unsupervised learning tasks. It involves a set of neurons that compete to be activated by input data, and
over time, the neurons learn to become specialized in recognizing different types of input data.
9 Principal Component Analysis and Neural Networks
Principal component analysis (PCA) and neural networks are both machine learning techniques that can be
used for a variety of tasks, including data compression, feature extraction, and dimensionality reduction.
PCA is a linear technique that involves finding the principal components of a dataset, which are the
directions of greatest variance. The principal components can be used to reduce the dimensionality of the
data, while preserving as much of the original variance as possible.
Neural networks, on the other hand, are nonlinear techniques that involve multiple layers of interconnected
neurons. Neural networks can be used for a variety of tasks, including classification, regression, and clustering.
They can also be used for feature extraction, where the network learns to identify the most important features
of the input data.
PCA and neural networks can be used together for a variety of tasks. For example, PCA can be used to
reduce the dimensionality of the data before feeding it into a neural network. This can help to improve the
performance of the network by reducing the amount of noise and irrelevant information in the input data.
Neural networks can also be used to improve the performance of PCA. In some cases, PCA can be
limited by its linear nature, and may not be able to capture complex nonlinear relationships in the data. By
combining PCA with a neural network, the network can learn to capture these nonlinear relationships and
improve the accuracy of the PCA results.
Overall, PCA and neural networks are both powerful machine learning techniques that can be used for
a variety of tasks. When used together, they can improve the performance and accuracy of each technique
and help to solve more complex problems.
Dimension Reduction
In pattern recognition, dimension reduction is defined as the process of converting a data set having vast dimensions into a data set with lesser dimensions, while ensuring that the converted data set conveys similar information concisely.
Example: consider a graph with two dimensions, x1 and x2, where x1 represents the measurement of several objects in cm and x2 represents the measurement of the same objects in inches. In machine learning, using both of these dimensions conveys similar information and introduces a lot of noise in the system, so it is better to use just one dimension. Using dimension reduction techniques, we convert the data from 2 dimensions (x1 and x2) to 1 dimension (z1).
It makes the data relatively easier to explain.
Benefits
Dimension reduction offers several benefits, such as:
• It compresses the data and thus reduces the storage space requirements.
• It reduces the time required for computation, since fewer dimensions require less computation.
• It eliminates redundant features.
• It improves model performance.
Dimension Reduction Techniques
The two popular and well-known dimension reduction techniques are:
1. Principal Component Analysis (PCA)
2. Fisher Linear Discriminant Analysis (LDA)
Here, we discuss Principal Component Analysis.
Principal Component Analysis
Principal Component Analysis is a well-known dimension reduction technique. It transforms the variables into a new set of variables called principal components. These principal components are linear combinations of the original variables and are orthogonal.
The first principal component accounts for most of the possible variation in the original data; the second principal component captures as much of the remaining variance as possible. There can be only two principal components for a two-dimensional data set.
PCA Algorithm
The steps involved in the PCA algorithm are as follows:
Step-01: Get the data.
Step-02: Compute the mean vector (µ).
Step-03: Subtract the mean from the given data.
Step-04: Calculate the covariance matrix.
Step-05: Calculate the eigenvectors and eigenvalues of the covariance matrix.
Step-06: Choose components and form a feature vector.
Step-07: Derive the new data set.
PRACTICE PROBLEMS BASED ON PRINCIPAL COMPONENT ANALYSIS
Problem-01:
Given data = { 2, 3, 4, 5, 6, 7 ; 1, 5, 3, 6, 7, 8 }. Compute the principal component using the PCA algorithm.
OR
Consider the two-dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8). Compute the principal component using the PCA algorithm.
OR
Compute the principal component of the following data:
CLASS 1: X = 2, 3, 4; Y = 1, 5, 3
CLASS 2: X = 5, 6, 7; Y = 6, 7, 8
Solution:
We use the PCA algorithm discussed above.
Step-01:
Get data.
The given feature vectors are-
x1 = (2, 1)
x2 = (3, 5)
x3 = (4, 3)
x4 = (5, 6)
x5 = (6, 7)
x6 = (7, 8)
Step-02:
Calculate the mean vector (µ).
Mean vector (µ)
= ((2 + 3 + 4 + 5 + 6 + 7) / 6, (1 + 5 + 3 + 6 + 7 + 8) / 6) = (4.5, 5)
Step-03:
Subtract mean vector (µ) from the given feature vectors.
x1 – µ = (2 – 4.5, 1 – 5) = (-2.5, -4)
x2 – µ = (3 – 4.5, 5 – 5) = (-1.5, 0)
x3 – µ = (4 – 4.5, 3 – 5) = (-0.5, -2)
x4 – µ = (5 – 4.5, 6 – 5) = (0.5, 1)
x5 – µ = (6 – 4.5, 7 – 5) = (1.5, 2)
x6 – µ = (7 – 4.5, 8 – 5) = (2.5, 3)
These are the feature vectors xi − µ after subtracting the mean vector (µ).
Step-04:
Calculate the covariance matrix.
The covariance matrix is given by
Covariance matrix = (m1 + m2 + m3 + m4 + m5 + m6) / 6,
where mi = (xi − µ)(xi − µ)ᵀ. On adding the above matrices and dividing by 6, we get
Covariance matrix = | 2.92  3.67 |
                    | 3.67  5.67 |
Step-05:
Calculate the eigenvalues and eigenvectors of the covariance matrix.
λ is an eigenvalue for a matrix M if it is a solution of the characteristic equation |M − λI| = 0. So, we have:
(2.92 − λ)(5.67 − λ) − (3.67 × 3.67) = 0
16.56 − 2.92λ − 5.67λ + λ² − 13.47 = 0
λ² − 8.59λ + 3.09 = 0
Solving this quadratic equation, we get λ = 8.22, 0.38.
Thus, the two eigenvalues are λ1 = 8.22 and λ2 = 0.38.
Clearly, the second eigenvalue is very small compared to the first, so the second eigenvector can be left out. The eigenvector corresponding to the greatest eigenvalue is the principal component for the given data set, so we find the eigenvector corresponding to eigenvalue λ1.
We use the equation MX = λX to find the eigenvector, where M is the covariance matrix, X the eigenvector, and λ the eigenvalue. Substituting the values into this equation, we get:
2.92X1 + 3.67X2 = 8.22X1
3.67X1 + 5.67X2 = 8.22X2
On simplification:
5.3X1 = 3.67X2 .........(1)
3.67X1 = 2.55X2 .........(2)
From (1) and (2), X1 = 0.69X2. From (2), the eigenvector is X = (2.55, 3.67)ᵀ (any scalar multiple of this direction is equivalent).
Thus, the principal component for the given data set is the direction (2.55, 3.67)ᵀ. Lastly, we project the data points onto this new one-dimensional subspace.
Problem-02:
Use the PCA algorithm to transform the pattern (2, 1) onto the eigenvector in the previous question.
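The worked example (and Problem-02) can be checked with a short base-R sketch that follows the same steps: subtract the mean, form the covariance matrix with divisor 6, and take the eigenvector of the largest eigenvalue:

# Verify the worked PCA example, and answer Problem-02.
x <- c(2, 3, 4, 5, 6, 7); y <- c(1, 5, 3, 6, 7, 8)
D  <- cbind(x, y)
mu <- colMeans(D)               # mean vector: (4.5, 5)
Dc <- sweep(D, 2, mu)           # Step-03: subtract the mean
S  <- t(Dc) %*% Dc / nrow(Dc)   # Step-04: covariance matrix (divisor n = 6)
e  <- eigen(S)                  # Step-05: eigenvalues and eigenvectors
e$values                        # approx. 8.22 and 0.38, as above
v  <- e$vectors[, 1]            # unit-length principal component,
v                               # proportional to (2.55, 3.67)
sum((c(2, 1) - mu) * v)         # Problem-02: coordinate of (2, 1) along the
                                # principal component (sign depends on the
                                # eigenvector's orientation)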
10 Fuzzy Logic: Extracting Fuzzy Models from Data
Fuzzy logic is a type of logic that allows for degrees of truth, rather than just true or false values. It is often
used in machine learning to extract fuzzy models from data.
A fuzzy model is a model that uses fuzzy logic to make predictions or decisions based on uncertain or
incomplete data. Fuzzy models are particularly useful in situations where traditional models may not work
well, such as when the data is noisy or when there is a lot of uncertainty or ambiguity in the data.
To extract a fuzzy model from data, the first step is to define the input and output variables of the
model. The input variables are the features or attributes of the data, while the output variable is the target
variable that we want to predict or classify.
Next, we use fuzzy logic to define the membership functions for each input and output variable. The
membership functions describe the degree of membership of each data point to each category or class. For
example, a data point may have a high degree of membership to the category "low", but a low degree of membership to the category "high".
Once the membership functions have been defined, we can use fuzzy inference to make predictions or
decisions based on the input data. Fuzzy inference involves using the membership functions to determine
the degree of membership of each data point to each category or class, and then combining these degrees of
membership to make a prediction or decision.
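A minimal sketch of membership functions in base R; the fuzzy sets "low", "medium", and "high" and their breakpoints are invented for illustration:

# Membership functions and degrees of membership for one input value.
tri <- function(x, a, b, c)                              # triangular, peak at b
  pmax(pmin((x - a) / (b - a), (c - x) / (c - b)), 0)
low    <- function(t) pmax(pmin((50 - t) / 50, 1), 0)    # shoulder: fully "low" at 0
medium <- function(t) tri(t, 20, 50, 80)
high   <- function(t) pmax(pmin((t - 30) / 50, 1), 0)    # shoulder: fully "high" at 80
t <- 40
c(low = low(t), medium = medium(t), high = high(t))
# -> low 0.20, medium 0.67, high 0.20: one input belongs to several fuzzy sets
# at once; fuzzy inference combines such degrees across rules (e.g. with min/max).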
Overall, extracting fuzzy models from data involves using fuzzy logic to define the membership functions
for each input and output variable, and then using fuzzy inference to make predictions or decisions based on
the input data. Fuzzy models are particularly useful in situations where traditional models may not work
well, and can help to improve the accuracy and robustness of machine learning models.
10.1 Fuzzy Decision Trees
Fuzzy decision trees are a type of decision tree that use fuzzy logic to make decisions based on uncertain or
imprecise data. Decision trees are a type of supervised learning technique that involve recursively partitioning
the input space into regions that correspond to different classes or categories.
Fuzzy decision trees extend traditional decision trees by allowing for degrees of membership to each
category or class, rather than just a binary classification. This is particularly useful in situations where the
data is uncertain or imprecise, and where a single, crisp classification may not be appropriate.
To build a fuzzy decision tree, we start with a set of training data that consists of input-output pairs.
We then use fuzzy logic to determine the degree of membership of each data point to each category or class.
This is done by defining the membership functions for each input and output variable, and using these to
compute the degree of membership of each data point to each category or class.
Next, we use the fuzzy membership values to construct a fuzzy decision tree. The tree consists of a set of
nodes and edges, where each node represents a test on one of the input variables, and each edge represents
a decision based on the result of the test. The degree of membership of each data point to each category or
class is used to determine the probability of reaching each leaf node of the tree.
Fuzzy decision trees can be used for a variety of tasks, including classification, regression, and clustering.
They are particularly useful in situations where the data is uncertain or imprecise, and where traditional
decision trees may not work well.
Overall, fuzzy decision trees are a powerful machine learning technique that can be used to make decisions
based on uncertain or imprecise data. They extend traditional decision trees by allowing for degrees of
membership to each category or class, and can help to improve the accuracy and robustness of machine
learning models.
11 Stochastic Search Methods
Stochastic search methods are a class of optimization algorithms that use probabilistic techniques to search
for the optimal solution in a large search space. These methods are commonly used in machine learning to
find the best set of parameters for a model, such as the weights in a neural network or the parameters in a
regression model.
Stochastic search methods are often used when the search space is too large to exhaustively search all
possible solutions, or when the objective function is highly nonlinear and has many local optima. The
basic idea behind these methods is to explore the search space by randomly sampling solutions and using
probabilistic techniques to move towards better solutions.
One common stochastic search method is called the stochastic gradient descent (SGD) algorithm. In this
method, the objective function is optimized by iteratively updating the parameters in the direction of the
negative gradient of the objective function. The update rule includes a learning rate, which controls the step
size and the direction of the update. SGD is widely used in training neural networks and other deep learning
models.
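A minimal SGD sketch in base R for simple linear regression: one randomly chosen observation per update, stepping along the negative gradient of the squared error (data, learning rate, and iteration count are illustrative):

# Stochastic gradient descent for simple linear regression.
set.seed(3)
x <- runif(200, 0, 10)
y <- 2 + 3 * x + rnorm(200)        # true intercept 2, slope 3, plus noise
w  <- c(0, 0)                      # parameters: (intercept, slope)
lr <- 0.01                         # learning rate (step size)
for (step in 1:5000) {
  i   <- sample(200, 1)            # sample one data point
  err <- (w[1] + w[2] * x[i]) - y[i]
  w   <- w - lr * err * c(1, x[i]) # step along the negative gradient
}
w   # close to the true values (2, 3)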
Another stochastic search method is called simulated annealing. This method is based on the physical
process of annealing, which involves heating and cooling a material to improve its properties. In simulated
annealing, the search process starts with a high temperature and gradually cools down over time. At each
iteration, the algorithm randomly selects a new solution and computes its fitness. If the new solution is better
than the current solution, it is accepted. However, if the new solution is worse, it may still be accepted with
a certain probability that decreases as the temperature decreases.
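Base R's optim() provides simulated annealing as method "SANN"; the sketch below minimizes an invented multimodal function with many local minima:

# Simulated annealing with stats::optim, method = "SANN".
f <- function(x) (x[1]^2 + x[2]^2) / 20 + sin(3 * x[1]) * cos(3 * x[2])
set.seed(4)
res <- optim(par = c(4, 4), fn = f, method = "SANN",
             control = list(maxit = 20000, temp = 10))  # cooling schedule
res$par     # candidate minimizer found by the annealing search
res$value   # objective value at that point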
Other stochastic search methods include evolutionary algorithms, such as genetic algorithms and particle
swarm optimization, which mimic the process of natural selection and evolution to search for the optimal
solution.
Overall, stochastic search methods are powerful optimization techniques that are widely used in machine
learning and other fields. These methods allow us to efficiently search large search spaces and find optimal
solutions in the presence of noise, uncertainty, and nonlinearity.
Printed Page: 1 of 2
Subject Code: KIT601
Roll No: ______________
B.TECH (SEM VI) THEORY EXAMINATION 2021-22
DATA ANALYTICS
Time: 3 Hours    Total Marks: 100
Note: Attempt all Sections. If you require any missing data, then choose suitably.
SECTION A
1. Attempt all questions in brief. (2 × 10 = 20)
Qno Questions CO
(a) Discuss the need of data analytics. 1
(b) Give the classification of data. 1
(c) Define neural network. 2
(d) What is multivariate analysis? 2
(e) Give the full form of RTAP and discuss its application. 3
(f) What is the role of sampling data in a stream? 3
(g) Discuss the use of limited pass algorithm. 4
(h) What is the principle behind hierarchical clustering technique? 4
(i) List five R functions used in descriptive statistics. 5
(j) List the names of any 2 visualization tools. 5
SECTION B
2. Attempt any three of the following: 10*3 = 30
Qno Questions CO
(a) Explain the process model and computation model for Big data
platform.
1
(b) Explain the use and advantages of decision trees. 2
(c) Explain the architecture of data stream model. 3
(d) Illustrate the K-means algorithm in detail with its advantages. 4
(e) Differentiate between NoSQL and RDBMS databases. 5
SECTION C
3. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain the various phases of data analytics life cycle. 1
(b) Explain modern data analytics tools in detail. 1
4. Attempt any one part of the following: 10 *1 = 10
Qno Questions CO
(a) Compare various types of support vector and kernel methods of data
analysis.
2
(b) Given data= {2,3,4,5,6,7;1,5,3,6,7,8}. Compute the principal
component using PCA algorithm.
2
5. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain any one algorithm to count number of distinct elements in a
data stream.
3
(b) Discuss the case study of stock market predictions in detail. 3
6. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Differentiate between CLIQUE and ProCLUS clustering. 4
(b) A database has 5 transactions. Let min_sup=60% and min_conf=80%.
TID Items_Bought
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}
i) Find all frequent itemsets using Apriori algorithm.
ii) List all the strong association rules (with support s and confidence
c).
4
7. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain the HIVE architecture with its features in detail. 5
(b) Write R function to check whether the given number is prime or not. 5
 
analysis plan.ppt
analysis plan.pptanalysis plan.ppt
analysis plan.ppt
 
Presentation1.pptx
Presentation1.pptxPresentation1.pptx
Presentation1.pptx
 

More from Dr. Radhey Shyam

KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and VisualizationKIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
Dr. Radhey Shyam
 
KIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and Clustering
KIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and ClusteringKIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and Clustering
KIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and Clustering
Dr. Radhey Shyam
 
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data StreamKIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
Dr. Radhey Shyam
 
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdfKIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
Dr. Radhey Shyam
 
SE-UNIT-3-II-Software metrics, numerical and their solutions.pdf
SE-UNIT-3-II-Software metrics, numerical and their solutions.pdfSE-UNIT-3-II-Software metrics, numerical and their solutions.pdf
SE-UNIT-3-II-Software metrics, numerical and their solutions.pdf
Dr. Radhey Shyam
 
Introduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycleIntroduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycle
Dr. Radhey Shyam
 
KCS-501-3.pdf
KCS-501-3.pdfKCS-501-3.pdf
KCS-501-3.pdf
Dr. Radhey Shyam
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdf
Dr. Radhey Shyam
 
KCS-055 U5.pdf
KCS-055 U5.pdfKCS-055 U5.pdf
KCS-055 U5.pdf
Dr. Radhey Shyam
 
KCS-055 MLT U4.pdf
KCS-055 MLT U4.pdfKCS-055 MLT U4.pdf
KCS-055 MLT U4.pdf
Dr. Radhey Shyam
 
Deep-Learning-2017-Lecture5CNN.pptx
Deep-Learning-2017-Lecture5CNN.pptxDeep-Learning-2017-Lecture5CNN.pptx
Deep-Learning-2017-Lecture5CNN.pptx
Dr. Radhey Shyam
 
SE UNIT-3 (Software metrics).pdf
SE UNIT-3 (Software metrics).pdfSE UNIT-3 (Software metrics).pdf
SE UNIT-3 (Software metrics).pdf
Dr. Radhey Shyam
 
SE UNIT-2.pdf
SE UNIT-2.pdfSE UNIT-2.pdf
SE UNIT-2.pdf
Dr. Radhey Shyam
 
SE UNIT-1 Revised.pdf
SE UNIT-1 Revised.pdfSE UNIT-1 Revised.pdf
SE UNIT-1 Revised.pdf
Dr. Radhey Shyam
 
SE UNIT-3.pdf
SE UNIT-3.pdfSE UNIT-3.pdf
SE UNIT-3.pdf
Dr. Radhey Shyam
 
Ip unit 5
Ip unit 5Ip unit 5
Ip unit 5
Dr. Radhey Shyam
 
Ip unit 4 modified on 22.06.21
Ip unit 4 modified on 22.06.21Ip unit 4 modified on 22.06.21
Ip unit 4 modified on 22.06.21
Dr. Radhey Shyam
 
Ip unit 3 modified of 26.06.2021
Ip unit 3 modified of 26.06.2021Ip unit 3 modified of 26.06.2021
Ip unit 3 modified of 26.06.2021
Dr. Radhey Shyam
 
Ip unit 2 modified on 8.6.2021
Ip unit 2 modified on 8.6.2021Ip unit 2 modified on 8.6.2021
Ip unit 2 modified on 8.6.2021
Dr. Radhey Shyam
 
Ip unit 1
Ip unit 1Ip unit 1
Ip unit 1
Dr. Radhey Shyam
 

More from Dr. Radhey Shyam (20)

KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and VisualizationKIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
 
KIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and Clustering
KIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and ClusteringKIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and Clustering
KIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and Clustering
 
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data StreamKIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
 
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdfKIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
 
SE-UNIT-3-II-Software metrics, numerical and their solutions.pdf
SE-UNIT-3-II-Software metrics, numerical and their solutions.pdfSE-UNIT-3-II-Software metrics, numerical and their solutions.pdf
SE-UNIT-3-II-Software metrics, numerical and their solutions.pdf
 
Introduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycleIntroduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycle
 
KCS-501-3.pdf
KCS-501-3.pdfKCS-501-3.pdf
KCS-501-3.pdf
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdf
 
KCS-055 U5.pdf
KCS-055 U5.pdfKCS-055 U5.pdf
KCS-055 U5.pdf
 
KCS-055 MLT U4.pdf
KCS-055 MLT U4.pdfKCS-055 MLT U4.pdf
KCS-055 MLT U4.pdf
 
Deep-Learning-2017-Lecture5CNN.pptx
Deep-Learning-2017-Lecture5CNN.pptxDeep-Learning-2017-Lecture5CNN.pptx
Deep-Learning-2017-Lecture5CNN.pptx
 
SE UNIT-3 (Software metrics).pdf
SE UNIT-3 (Software metrics).pdfSE UNIT-3 (Software metrics).pdf
SE UNIT-3 (Software metrics).pdf
 
SE UNIT-2.pdf
SE UNIT-2.pdfSE UNIT-2.pdf
SE UNIT-2.pdf
 
SE UNIT-1 Revised.pdf
SE UNIT-1 Revised.pdfSE UNIT-1 Revised.pdf
SE UNIT-1 Revised.pdf
 
SE UNIT-3.pdf
SE UNIT-3.pdfSE UNIT-3.pdf
SE UNIT-3.pdf
 
Ip unit 5
Ip unit 5Ip unit 5
Ip unit 5
 
Ip unit 4 modified on 22.06.21
Ip unit 4 modified on 22.06.21Ip unit 4 modified on 22.06.21
Ip unit 4 modified on 22.06.21
 
Ip unit 3 modified of 26.06.2021
Ip unit 3 modified of 26.06.2021Ip unit 3 modified of 26.06.2021
Ip unit 3 modified of 26.06.2021
 
Ip unit 2 modified on 8.6.2021
Ip unit 2 modified on 8.6.2021Ip unit 2 modified on 8.6.2021
Ip unit 2 modified on 8.6.2021
 
Ip unit 1
Ip unit 1Ip unit 1
Ip unit 1
 

Recently uploaded

1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
MadhavJungKarki
 
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUESAN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
drshikhapandey2022
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
VANDANAMOHANGOUDA
 
Blood finder application project report (1).pdf
Blood finder application project report (1).pdfBlood finder application project report (1).pdf
Blood finder application project report (1).pdf
Kamal Acharya
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
nedcocy
 
FULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back EndFULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back End
PreethaV16
 
Null Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAMNull Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAM
Divyanshu
 
Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...
pvpriya2
 
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
PriyankaKilaniya
 
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
ydzowc
 
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls ChennaiCall Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
paraasingh12 #V08
 
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICSUNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
vmspraneeth
 
DESIGN AND MANUFACTURE OF CEILING BOARD USING SAWDUST AND WASTE CARTON MATERI...
DESIGN AND MANUFACTURE OF CEILING BOARD USING SAWDUST AND WASTE CARTON MATERI...DESIGN AND MANUFACTURE OF CEILING BOARD USING SAWDUST AND WASTE CARTON MATERI...
DESIGN AND MANUFACTURE OF CEILING BOARD USING SAWDUST AND WASTE CARTON MATERI...
OKORIE1
 
openshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoinopenshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoin
snaprevwdev
 
SCALING OF MOS CIRCUITS m .pptx
SCALING OF MOS CIRCUITS m                 .pptxSCALING OF MOS CIRCUITS m                 .pptx
SCALING OF MOS CIRCUITS m .pptx
harshapolam10
 
OOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming languageOOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming language
PreethaV16
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
uqyfuc
 
Height and depth gauge linear metrology.pdf
Height and depth gauge linear metrology.pdfHeight and depth gauge linear metrology.pdf
Height and depth gauge linear metrology.pdf
q30122000
 
Presentation on Food Delivery Systems
Presentation on Food Delivery SystemsPresentation on Food Delivery Systems
Presentation on Food Delivery Systems
Abdullah Al Noman
 
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
Paris Salesforce Developer Group
 

Recently uploaded (20)

1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
 
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUESAN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
 
Blood finder application project report (1).pdf
Blood finder application project report (1).pdfBlood finder application project report (1).pdf
Blood finder application project report (1).pdf
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
 
FULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back EndFULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back End
 
Null Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAMNull Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAM
 
Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...
 
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
 
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
 
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls ChennaiCall Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
 
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICSUNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
 
DESIGN AND MANUFACTURE OF CEILING BOARD USING SAWDUST AND WASTE CARTON MATERI...
DESIGN AND MANUFACTURE OF CEILING BOARD USING SAWDUST AND WASTE CARTON MATERI...DESIGN AND MANUFACTURE OF CEILING BOARD USING SAWDUST AND WASTE CARTON MATERI...
DESIGN AND MANUFACTURE OF CEILING BOARD USING SAWDUST AND WASTE CARTON MATERI...
 
openshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoinopenshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoin
 
SCALING OF MOS CIRCUITS m .pptx
SCALING OF MOS CIRCUITS m                 .pptxSCALING OF MOS CIRCUITS m                 .pptx
SCALING OF MOS CIRCUITS m .pptx
 
OOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming languageOOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming language
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
 
Height and depth gauge linear metrology.pdf
Height and depth gauge linear metrology.pdfHeight and depth gauge linear metrology.pdf
Height and depth gauge linear metrology.pdf
 
Presentation on Food Delivery Systems
Presentation on Food Delivery SystemsPresentation on Food Delivery Systems
Presentation on Food Delivery Systems
 
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
 

IT-601 Lecture Notes-UNIT-2.pdf Data Analysis

Unit-II: Data Analysis

Data analysis refers to the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, drawing conclusions, and supporting decision-making. It is a critical component of many fields, including business, finance, healthcare, engineering, and the social sciences.

The data analysis process typically involves the following steps:

- Data collection: gathering data from various sources, such as databases, surveys, sensors, and social media.
- Data cleaning: removing errors, inconsistencies, and outliers from the data. It may also involve imputing missing values, transforming variables, and normalizing the data.
- Data exploration: visualizing and summarizing the data to gain insights and identify patterns. This may include statistical analyses such as descriptive statistics, correlation analysis, and hypothesis testing.
- Data modeling: developing mathematical models to predict or explain the behavior of the data. This may include regression analysis, time series analysis, machine learning, and other techniques.
- Data visualization: creating visual representations of the data to communicate insights and findings to stakeholders. This may include charts, graphs, tables, and other visualizations.
- Decision-making: using the results of the data analysis to make informed decisions, develop strategies, and take actions.

Data analysis is a complex and iterative process that requires expertise in statistics, programming, and domain knowledge. It is often performed using specialized software, such as R, Python, SAS, and Excel, as well as cloud-based platforms, such as Amazon Web Services and Google Cloud Platform. Effective data analysis can lead to better business outcomes, improved healthcare outcomes, and a deeper understanding of complex phenomena.
1 Regression Modeling

Regression modeling is a statistical technique used to examine the relationship between a dependent variable (also called the outcome or response variable) and one or more independent variables (also called predictors or explanatory variables). The goal of regression modeling is to identify the nature and strength of the relationship between the dependent variable and the independent variable(s) and to use this information to make predictions about the dependent variable.

There are many different types of regression models, including linear regression, logistic regression, polynomial regression, and multivariate regression. (In polynomial regression, the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial in x.) Linear regression is one of the most commonly used types of regression modeling, and it assumes that the relationship between the dependent variable and the independent variable(s) is linear.

Regression modeling is used in a wide range of fields, including economics, finance, psychology, and epidemiology, among others. (Epidemiology is the scientific, systematic, data-driven study of the distribution and determinants of health-related states and events in specified populations.) It is often used to understand the relationships between different factors and to make predictions about future outcomes.

1.1 Regression

1.1.1 Simple Linear Regression

Linear Regression — In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression.

- Linear regression is used to predict a continuous dependent variable from a given set of independent variables.
- Linear regression is used for solving regression problems.
- In linear regression, the values of continuous variables are predicted.
- Linear regression tries to find the best-fit line, through which the output can be easily predicted.
- The least-squares estimation method is used for parameter estimation. (The least-squares method is a statistical procedure that finds the best fit for a set of data points by minimizing the sum of squared offsets of the points from the fitted curve; least-squares regression is used to predict the behavior of dependent variables.)
- The output for linear regression must be a continuous value, such as price, age, etc.
- Linear regression requires that the relationship between the dependent variable and the independent variable(s) be linear.
- In linear regression, there may be collinearity between the independent variables. (Collinearity is a condition in which some of the independent variables are highly correlated.)

Some regression examples:

- Regression analysis is used in statistics to find trends in data. For example, you might guess that there is a connection between how much you eat and how much you weigh; regression analysis can help you quantify that.
- Regression analysis will provide you with an equation for a graph so that you can make predictions about your data. For example, if you've been putting on weight over the last few years, it can predict how much you'll weigh in ten years' time if you continue to put on weight at the same rate.
- It is also called simple linear regression. It establishes the relationship between two variables using a straight line. If two or more explanatory variables have a linear relationship with the dependent variable, the regression is called a multiple linear regression.
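To make the least-squares idea concrete, here is a minimal sketch (not part of the original notes) that fits a simple linear regression with NumPy; the intake/weight numbers are invented purely for illustration.

```python
import numpy as np

# Hypothetical data: daily calorie intake (x) and body weight in kg (y)
x = np.array([1800, 2000, 2200, 2400, 2600, 2800], dtype=float)
y = np.array([60.1, 63.0, 65.2, 68.4, 70.9, 73.5])

# Least-squares estimates for the line y = b0 + b1 * x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"intercept = {b0:.3f}, slope = {b1:.5f}")

# Predict the weight at a new intake level
print("predicted weight at 3000 kcal:", round(b0 + b1 * 3000, 2))
```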
1.1.2 Logistic Regression

Logistic Regression — used to solve classification problems where, given an element, you have to classify it into one of N categories. Typical examples: given an email, classify it as spam or not; or given a vehicle, find which category it belongs to (car, truck, van, etc.). Basically, the output is a finite set of discrete values.

- Logistic regression is used to predict a categorical dependent variable from a given set of independent variables.
- Logistic regression is used for solving classification problems.
- In logistic regression, we predict the values of categorical variables.
- In logistic regression, we find the S-curve (sigmoid) by which we can classify the samples.
- The maximum likelihood estimation method is used for parameter estimation.
- The output of logistic regression must be a categorical value such as 0 or 1, Yes or No, etc.
- In logistic regression, it is not required that the relationship between the dependent and independent variables be linear.
- In logistic regression, there should not be collinearity between the independent variables.
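As an illustrative sketch (assuming NumPy is available, with made-up spam data), the following fits a logistic regression by gradient ascent on the log-likelihood, i.e., maximum likelihood estimation:

```python
import numpy as np

# Hypothetical binary data: feature = message length, label = spam (1) or not (0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Add an intercept column so the bias is learned with the weights
Xb = np.hstack([np.ones((len(X), 1)), X])
w = np.zeros(Xb.shape[1])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Gradient ascent on the log-likelihood: gradient is X^T (y - p)
for _ in range(5000):
    p = sigmoid(Xb @ w)
    w += 0.1 * Xb.T @ (y - p) / len(y)

print("weights:", w)
print("P(spam | length = 4.5):", sigmoid(np.array([1.0, 4.5]) @ w))
```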
2 Multivariate Analysis

Multivariate analysis is a statistical technique used to examine the relationships between multiple variables simultaneously. It is used when there are multiple dependent variables and/or independent variables that are interrelated. Multivariate analysis is used in a wide range of fields, including the social sciences, marketing, biology, and finance, among others. There are many different types of multivariate analysis, including multivariate regression, principal component analysis, factor analysis, cluster analysis, and discriminant analysis.

Multivariate regression is similar to linear regression, but it involves more than one independent variable. It is used to predict the value of a dependent variable based on two or more independent variables. Principal component analysis (PCA) is a technique used to reduce the dimensionality of data by identifying patterns and relationships between variables. Factor analysis is a technique used to identify underlying factors that explain the correlations between multiple variables. Cluster analysis is a technique used to group objects or individuals into clusters based on similarities or dissimilarities. Discriminant analysis is a technique used to determine which variables discriminate between two or more groups.

Overall, multivariate analysis is a powerful tool for examining complex relationships between multiple variables, and it can help researchers and analysts gain a deeper understanding of the data they are working with.

3 Bayesian Modeling

Bayesian modeling is a statistical modeling approach that uses Bayesian inference to make predictions and estimate parameters. It is named after Thomas Bayes, an 18th-century statistician who developed Bayes' theorem, a key component of Bayesian modeling.

In Bayesian modeling, prior information about the parameters of interest is combined with data to produce a posterior distribution. This posterior distribution represents the updated probability distribution of the parameters given the data and the prior information, and it is used to make inferences and predictions about the parameters.

Bayesian modeling is particularly useful when there is limited data or when the data is noisy or uncertain. It allows for the incorporation of prior knowledge and beliefs into the modeling process, which can improve the accuracy and precision of predictions.

Bayesian modeling is used in a wide range of fields, including finance, engineering, ecology, and the social sciences. Some examples of Bayesian modeling applications include predicting stock prices, estimating the prevalence of a disease in a population, and analyzing the effects of environmental factors on a species.

3.1 Bayes Theorem

- Goal — to determine the most probable hypothesis, given the data D plus any initial knowledge about the prior probabilities of the various hypotheses in H.
- Prior probability of h, P(h) — reflects any background knowledge we have about the chance that h is a correct hypothesis (before having observed the data).
- Prior probability of D, P(D) — reflects the probability that training data D will be observed given no knowledge about which hypothesis h holds.
- Conditional probability of observation D, P(D|h) — denotes the probability of observing data D given some world in which hypothesis h holds.
- Posterior probability of h, P(h|D) — represents the probability that h holds given the observed training data D. It reflects our confidence that h holds after we have seen the training data D, and it is the quantity that machine learning researchers are interested in.
- Bayes' theorem allows us to compute P(h|D):

  P(h|D) = P(D|h) P(h) / P(D)

Maximum A Posteriori (MAP) Hypothesis and Maximum Likelihood

- Goal — to find the most probable hypothesis h from a set of candidate hypotheses H given the observed data D.

  hMAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h)

- If every hypothesis in H is equally probable a priori, we only need to consider the likelihood of the data D given h, P(D|h). Then hMAP reduces to the maximum likelihood hypothesis,

  hML = argmax_{h∈H} P(D|h)

Overall, Bayesian modeling is a powerful tool for making predictions and estimating parameters in situations where there is uncertainty and prior information is available.
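A small worked example of Bayes' theorem, with invented numbers for a diagnostic test, shows how the posterior P(h|D) is computed from the prior and the likelihood:

```python
# Hypothetical example: h = "patient has the disease", D = "test is positive"
p_h = 0.01              # prior P(h)
p_d_given_h = 0.95      # likelihood P(D|h), the test's sensitivity
p_d_given_not_h = 0.05  # false-positive rate P(D|~h)

# Total probability: P(D) = P(D|h)P(h) + P(D|~h)P(~h)
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Bayes' theorem: P(h|D) = P(D|h)P(h) / P(D)
p_h_given_d = p_d_given_h * p_h / p_d
print(f"P(h|D) = {p_h_given_d:.3f}")  # ~0.161: still unlikely despite a positive test
```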
4 Inference and Bayesian Networks

Inference in Bayesian networks is the process of using probabilistic reasoning to make predictions or draw conclusions about a system or phenomenon. Bayesian networks are graphical models that represent the relationships between variables using a directed acyclic graph, where nodes represent variables and edges represent probabilistic dependencies between the variables.

Inference in Bayesian networks involves calculating the posterior probability distribution of one or more variables given evidence about other variables in the network. This can be done using Bayesian inference, which involves updating the prior probability distribution of the variables using Bayes' theorem and the observed evidence. The posterior distribution can be used to make predictions or draw conclusions about the system or phenomenon being modeled.

For example, in a medical diagnosis system, the posterior probability of a particular disease given a set of symptoms can be calculated using a Bayesian network. This can help clinicians make a more accurate diagnosis and choose appropriate treatments.

Bayesian networks and inference are widely used in many fields, including artificial intelligence, decision making, finance, and engineering. They are particularly useful in situations where there is uncertainty and probabilistic relationships between variables need to be modeled and analyzed.

4.1 Bayesian Networks

- Abbreviation: BBN (Bayesian Belief Network).
- Synonyms: Bayes(ian) network, Bayes(ian) model, belief network, decision network, or probabilistic directed acyclic graphical model.
- A BBN is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a Directed Acyclic Graph (DAG).
- BBNs enable us to model and reason about uncertainty. BBNs accommodate both subjective probabilities and probabilities based on objective data.
- The most important use of BBNs is in revising probabilities in the light of actual observations of events.
- Nodes represent variables in the Bayesian sense: observable quantities, hidden variables, or hypotheses. Edges represent conditional dependencies.
- Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and outputs the probability of the values of the variable represented by the node.
- Prior probabilities: e.g., P(RAIN).
- Conditional probabilities: e.g., P(SPRINKLER | RAIN).
- Joint probability function: P(GRASS WET, SPRINKLER, RAIN) = P(GRASS WET | RAIN, SPRINKLER) × P(SPRINKLER | RAIN) × P(RAIN).
- Typically the probability functions are described in table form.
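The rain/sprinkler/grass-wet factorization above can be evaluated directly. The sketch below (with illustrative probabilities, not taken from the notes) encodes the conditional tables as dictionaries, computes the joint probability function, and performs inference by enumeration:

```python
# Illustrative conditional probability tables for the rain/sprinkler network
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: {True: 0.01, False: 0.99},   # P(SPRINKLER | RAIN)
               False: {True: 0.40, False: 0.60}}
p_wet = {(True, True): 0.99, (True, False): 0.80,  # P(WET | RAIN, SPRINKLER)
         (False, True): 0.90, (False, False): 0.00}

def joint(wet, sprinkler, rain):
    """P(WET, SPRINKLER, RAIN) = P(WET|RAIN,SPRINKLER) * P(SPRINKLER|RAIN) * P(RAIN)."""
    p_w = p_wet[(rain, sprinkler)] if wet else 1 - p_wet[(rain, sprinkler)]
    return p_w * p_sprinkler[rain][sprinkler] * p_rain[rain]

# Inference by enumeration: P(RAIN = True | WET = True)
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(True, s, r) for s in (True, False) for r in (True, False))
print(f"P(rain | grass wet) = {num / den:.3f}")
```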
- BBNs model directed conditional dependencies; they cannot be used to directly model undirected correlation relationships between random variables.

Overall, inference in Bayesian networks is a powerful tool for making predictions and drawing conclusions in situations where there is uncertainty and there are complex probabilistic relationships between variables.

4.2 Support Vector and Kernel Methods

Support vector machines (SVMs) and kernel methods are commonly used in machine learning and pattern recognition to solve classification and regression problems.

SVMs are a type of supervised learning algorithm that aims to find the optimal hyperplane that separates the data into different classes. The optimal hyperplane is the one that maximizes the margin, i.e., the distance between the hyperplane and the closest data points from each class. SVMs can also use kernel functions to transform the original input data into a higher-dimensional space, where it may be easier to find a separating hyperplane.

Kernel methods are a class of algorithms that use kernel functions to compute the similarity between pairs of data points. Kernel functions can transform the input data into a higher-dimensional feature space, where linear methods can be applied more effectively. Some commonly used kernel functions include linear, polynomial, and radial basis functions.

Kernel methods are used in a variety of applications, including image recognition, speech recognition, and natural language processing. They are particularly useful in situations where the data is non-linear and the relationship between variables is complex.
History of SVM

- SVM is related to statistical learning theory.
- SVM was first introduced in 1992.
- SVM became popular because of its success in handwritten digit recognition: a 1.1% test error rate, the same as the error rate of a carefully constructed neural network.
- SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning.

(Note: the support vector machine is a linear model, and it always looks for a hyperplane to separate one class from another. The two-dimensional case is easier to comprehend and to visualize, but the same ideas hold in higher dimensions, where lines become planes and hyperplanes, parabolas become paraboloids, etc.)

Binary Classification

Given training data (xi, yi) for i = 1, ..., N, with xi ∈ R^d and yi ∈ {−1, 1}, learn a classifier f(x) such that

f(xi) ≥ 0 when yi = +1, and f(xi) < 0 when yi = −1,

i.e., yi f(xi) > 0 for a correct classification.

Linear separability (figure): two example data sets, one linearly separable (a line can place all points of one class on each side) and one not linearly separable.
Linear Classifiers

A linear classifier has the form f(x) = wᵀx + b.

- In 2D the discriminant f(x) = 0 is a line; w is the normal to the line, and b is the bias. w is known as the weight vector. Points with f(x) > 0 lie on one side of the line, and points with f(x) < 0 on the other.
- In 3D the discriminant is a plane, and in nD it is a hyperplane.
- For a K-NN classifier it was necessary to "carry" the training data; for a linear classifier, the training data is used to learn w and is then discarded. Only w is needed for classifying new data.

The Perceptron Classifier

Given linearly separable data xi labelled into two categories yi ∈ {−1, 1}, find a weight vector w such that the discriminant function f(xi) = wᵀxi + b separates the categories for i = 1, ..., N. How can we find this separating hyperplane?

The Perceptron Algorithm

Write the classifier as f(xi) = w̃ᵀx̃i + w0 = wᵀxi, where w = (w̃, w0) and xi = (x̃i, 1), so the bias is absorbed into the weight vector. Then:

- Initialize w = 0.
- Cycle through the data points {xi, yi}; if xi is misclassified, update w ← w + α yi xi.
- Repeat until all the data is correctly classified.
For example, in 2D each update rotates the weight vector toward the misclassified point (figure: w before and after an update). Note that after convergence, w = Σi αi xi, a linear combination of the data points.

- If the data is linearly separable, then the algorithm will converge.
- Convergence can be slow.
- The separating line can end up close to the training data; we would prefer a larger margin for generalization.

(Figure: a perceptron example on a 2D point set.)

What is the best w? The maximum margin solution: it is the most stable under perturbations of the inputs.
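Here is a minimal sketch of the perceptron algorithm on a small invented data set (assuming the data is linearly separable, so the loop is guaranteed to terminate):

```python
import numpy as np

# Hypothetical linearly separable 2D data; labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [6.0, 5.0], [7.0, 7.0]])
y = np.array([-1, -1, 1, 1])

# Absorb the bias: augment each point with a constant 1
Xb = np.hstack([X, np.ones((len(X), 1))])
w = np.zeros(Xb.shape[1])
alpha = 0.1

# Cycle until every point satisfies y_i * f(x_i) > 0
converged = False
while not converged:
    converged = True
    for xi, yi in zip(Xb, y):
        if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
            w += alpha * yi * xi  # perceptron update
            converged = False

print("learned weight vector (w1, w2, bias):", w)
```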
Tennis example (figure): points plotted by humidity and temperature, with classes "play tennis" and "do not play tennis".

Linear Support Vector Machines

Data: (xi, yi), i = 1, ..., l, with xi ∈ R^d and yi ∈ {−1, +1}.

All hyperplanes in R^d are parameterized by a vector w and a constant b, and can be expressed as w·x + b = 0 (remember the equation for a hyperplane from algebra). Our aim is to find such a hyperplane, f(x) = sign(w·x + b), that correctly classifies our data.
Definitions

Define the hyperplane H such that: xi·w + b ≥ +1 when yi = +1, and xi·w + b ≤ −1 when yi = −1.

- d+ = the shortest distance to the closest positive point.
- d− = the shortest distance to the closest negative point.
- The margin of a separating hyperplane is d+ + d−.
- H1 and H2 are the planes H1: xi·w + b = +1 and H2: xi·w + b = −1. The points on the planes H1 and H2 are the support vectors.

Maximizing the Margin

We want a classifier with as big a margin as possible. Recall that the distance from a point (x0, y0) to a line Ax + By + c = 0 is |Ax0 + By0 + c| / sqrt(A² + B²). The distance between H and H1 is |w·x + b| / ||w|| = 1 / ||w||, so the distance between H1 and H2 is 2 / ||w||.

In order to maximize the margin, we need to minimize ||w||, with the condition that there are no data points between H1 and H2: xi·w + b ≥ +1 when yi = +1, and xi·w + b ≤ −1 when yi = −1. These can be combined into yi(xi·w + b) ≥ 1.

Constrained Optimization Problem

Minimize ||w||² / 2 subject to yi(xi·w + b) ≥ 1 for all i.

Lagrangian method: form L(w, b, α) = ||w||² / 2 − Σi αi [yi(xi·w + b) − 1] and maximize inf_{w,b} L(w, b, α) over α ≥ 0. At the extremum, the partial derivatives of L with respect to both w and b must be 0. Taking those derivatives, setting them to zero, substituting back into L, and simplifying yields the dual problem:

Maximize Σi αi − (1/2) Σ_{i,j} αi αj yi yj (xi·xj), subject to Σi αi yi = 0 and αi ≥ 0.
Quadratic Programming

Why is this reformulation a good thing? The problem

Maximize Σi αi − (1/2) Σ_{i,j} αi αj yi yj (xi·xj), subject to Σi αi yi = 0 and αi ≥ 0,

is an instance of a positive semi-definite programming problem. For a fixed real-number accuracy, it can be solved in O(n log n) time = O(|D|² log |D|²).

Problems with Linear SVM

What if the decision function is not linear?

Kernel Trick

Data points that are not linearly separable in the original space may become linearly separable in a transformed feature space, e.g., F(x) = (x1², x2², √2·x1·x2). We want to maximize

Σi αi − (1/2) Σ_{i,j} αi αj yi yj F(xi)·F(xj).

Define K(xi, xj) = F(xi)·F(xj). The cool thing: K is often easy to compute directly. Here, K(xi, xj) = (xi·xj)².
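The kernel identity above is easy to verify numerically; this small sketch checks that the explicit feature map F and the kernel K(a, b) = (a·b)² agree on arbitrary points:

```python
import numpy as np

# Feature map from the notes: F(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
def F(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

print("F(a).F(b) =", F(a) @ F(b))    # dot product in the feature space
print("(a.b)^2   =", (a @ b) ** 2)   # kernel computed directly; both print 1.0
```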
Other Kernels

- The polynomial kernel K(xi, xj) = (xi·xj + 1)^p, where p is a tunable parameter. Evaluating K only requires one addition and one exponentiation more than the original dot product.
- Gaussian kernels (also called radial basis functions): K(xi, xj) = exp(−||xi − xj||² / 2σ²).

Overtraining/Overfitting

An example: a botanist who really knows trees, yet every time he sees a new tree, he claims it is not a tree. A well-known problem with machine learning methods is overtraining: we have learned the training data very well, but we cannot classify unseen examples correctly.

It can be shown that the portion n of unseen data that will be misclassified is bounded by

n ≤ (number of support vectors) / (number of training examples).

This is a measure of the risk of overtraining with SVM (there are also other measures). By Ockham's razor principle, simpler systems are better than more complex ones; in the SVM case, fewer support vectors mean a simpler representation of the hyperplane. Example: understanding a certain cancer is easier if it can be described by one gene than if we have to describe it with 5000.
A practical example: protein localization

- Proteins are synthesized in the cytosol.
- They are transported into different subcellular locations, where they carry out their functions.
- Aim: to predict in what location a certain protein will end up.

Overall, SVMs and kernel methods are powerful tools for solving classification and regression problems. They can handle complex data and provide accurate predictions, making them valuable in many fields, including finance, healthcare, and engineering.

5 Analysis of Time Series: Linear Systems Analysis & Nonlinear Dynamics

Time series analysis is a statistical technique used to analyze time-dependent data. It involves studying the patterns and trends in the data over time and making predictions about future values.

Linear systems analysis is a technique used in time series analysis to model the behavior of a system using linear equations. Linear models assume that the relationship between variables is linear and that the system is time-invariant, meaning that the relationship between variables does not change over time. Linear systems analysis involves techniques such as autoregressive (AR) and moving average (MA) models, which use past values of a variable to predict future values.

Nonlinear dynamics is another approach to time series analysis that considers systems that are not described by linear equations. Nonlinear systems are often more complex and can exhibit chaotic behavior, making them more difficult to model and predict. Nonlinear dynamics involves techniques such as chaos theory and fractal analysis, which use mathematical concepts to describe the behavior of nonlinear systems.

Both linear systems analysis and nonlinear dynamics have applications in a wide range of fields, including finance, economics, and engineering. Linear models are often used in situations where the data is relatively simple and the relationship between variables is well understood. Nonlinear dynamics is often used in situations where the data is more complex and the relationship between variables is not well understood.
There are several components of time series analysis, including:

1. Trend analysis: used to identify the long-term patterns and trends in the data. A trend can be linear or non-linear and may be upward, downward, or flat.
2. Seasonal analysis: used to identify recurring patterns in the data that occur within a fixed time period, such as a week, month, or year.
3. Cyclical analysis: used to identify patterns that are not necessarily regular or fixed in duration but do show a tendency to repeat over time, such as economic cycles or business cycles.
4. Irregular analysis: used to identify any random fluctuations or noise in the data that cannot be attributed to any of the above components.
5. Forecasting: the process of predicting future values of a time series based on its past behavior. It can be done using various statistical techniques such as moving averages, exponential smoothing, and regression analysis (a forecasting sketch follows at the end of this section).

Overall, time series analysis is a powerful tool for studying time-dependent data and making predictions about future values. Linear systems analysis and nonlinear dynamics are two approaches to time series analysis that can be used in different situations to model and predict complex systems.
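As a hedged illustration of linear time-series modeling, the sketch below generates a synthetic AR(1) series and recovers its coefficient by least squares; the model, coefficient, and noise level are assumptions made purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic AR(1) series: x_t = 0.7 * x_{t-1} + noise
n = 200
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + rng.normal(scale=0.5)

# Fit the AR(1) coefficient by least squares: regress x_t on x_{t-1}
phi = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
print(f"estimated AR(1) coefficient: {phi:.3f}")  # should be near 0.7

# One-step-ahead forecast from the last observed value
print(f"forecast of next value: {phi * x[-1]:.3f}")
```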
6 Rule Induction

Rule induction is a machine learning technique used to identify patterns in data and create a set of rules that can be used to make predictions or decisions about new data. It is often used in decision tree algorithms and can be applied to both classification and regression problems.

The rule induction process involves analyzing the data to identify common patterns and relationships between the variables. These patterns are used to create a set of rules that can be used to classify or predict new data. The rules are typically in the form of "if-then" statements, where the "if" part specifies the conditions under which the rule applies and the "then" part specifies the action or prediction to be taken.

Rule induction algorithms can be divided into two main types: top-down and bottom-up. Top-down algorithms start with a general rule that applies to the entire dataset and then refine the rule based on the data. Bottom-up algorithms start with individual data points and then group them together based on common attributes.

Rule induction has many applications in fields such as finance, healthcare, and marketing. For example, it can be used to identify patterns in financial data to predict stock prices or to analyze medical data to identify risk factors for certain diseases.

Overall, rule induction is a powerful machine learning technique that can be used to identify patterns in data and create rules that can be used to make predictions or decisions. It is a useful tool for solving classification and regression problems and has many applications in various fields.

7 Neural Networks: Learning and Generalization

Neural networks are a class of machine learning algorithms that are inspired by the structure and function of the human brain. They are used to learn complex patterns and relationships in data and can be used for a variety of tasks, including classification, regression, and clustering.

Learning in neural networks refers to the process of adjusting the weights and biases of the network to improve its performance on a particular task. This is typically done through a process called backpropagation, which involves propagating the errors from the output layer back through the network and adjusting the weights and biases accordingly.

Generalization in neural networks refers to the ability of the network to perform well on new, unseen data. A network that has good generalization performance is able to accurately predict the outputs for new inputs that were not included in the training set. Generalization performance is typically evaluated using a separate validation set or by cross-validation.

Overfitting is a common problem in neural networks, where the network becomes too complex and starts to fit the noise in the training data rather than the underlying patterns. This can result in poor generalization performance on new data. Techniques such as regularization, early stopping, and dropout are often used to prevent overfitting and improve generalization performance.

Overall, learning and generalization are two important concepts in neural networks. Learning involves adjusting the weights and biases of the network to improve its performance, while generalization refers to the ability of the network to perform well on new, unseen data. Effective techniques for learning and generalization are critical for building accurate and useful neural network models.
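To make backpropagation concrete, here is a minimal sketch (not part of the original notes) that trains a tiny one-hidden-layer network on XOR; the architecture, learning rate, iteration count, and random seed are arbitrary assumptions, and a different seed may need more iterations to converge.

```python
import numpy as np

rng = np.random.default_rng(1)

# XOR data: not linearly separable, so a hidden layer is required
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)   # hidden layer (4 units)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # output layer
sig = lambda z: 1 / (1 + np.exp(-z))

for _ in range(20000):
    # forward pass
    h = sig(X @ W1 + b1)
    out = sig(h @ W2 + b2)
    # backward pass: propagate the output error back through the network
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent weight updates
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

print(np.round(out.ravel(), 2))  # should approach [0, 1, 1, 0]
```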
8 Competitive Learning

Competitive learning is a type of machine learning technique in which a set of neurons compete to be activated by input data. The neurons are organized into a layer, and each neuron receives the same input data. However, only one neuron is activated, and the competition is based on a set of rules that determine which neuron is activated.

The competition in competitive learning is typically based on a measure of similarity between the input data and the weights of each neuron. The neuron with the highest similarity to the input data is activated, and the weights of that neuron are updated to become more similar to the input data. This process is repeated for multiple iterations, and over time, the neurons learn to become specialized in recognizing different types of input data (see the sketch after this section).

Competitive learning is often used for unsupervised learning tasks, such as clustering or feature extraction. In clustering, the neurons learn to group similar input data into clusters, while in feature extraction, the neurons learn to recognize specific features in the input data.

One of the advantages of competitive learning is that it can be used to discover hidden structures and patterns in data without the need for labeled data. This makes it particularly useful for applications such as image and speech recognition, where labeled data can be difficult and expensive to obtain.

Overall, competitive learning is a powerful machine learning technique that can be used for a variety of unsupervised learning tasks. It involves a set of neurons that compete to be activated by input data, and over time, the neurons learn to become specialized in recognizing different types of input data.
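A minimal winner-take-all sketch (with invented two-cluster data) shows neurons competing and specializing into cluster prototypes; with an unlucky initialization one neuron can occasionally dominate (the classic "dead unit" issue), which the fixed seed here avoids.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 2D inputs drawn from two clusters
data = np.vstack([rng.normal([0, 0], 0.3, size=(50, 2)),
                  rng.normal([3, 3], 0.3, size=(50, 2))])

# Two competing neurons with random initial weight vectors
W = rng.normal(size=(2, 2))
lr = 0.1

for _ in range(20):
    for x in rng.permutation(data):
        # competition: the neuron whose weights are most similar to x wins
        winner = np.argmin(np.linalg.norm(W - x, axis=1))
        # only the winner's weights move toward the input
        W[winner] += lr * (x - W[winner])

print("learned prototypes:\n", np.round(W, 2))  # near (0,0) and (3,3)
```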
9 Principal Component Analysis and Neural Networks

Principal component analysis (PCA) and neural networks are both machine learning techniques that can be used for a variety of tasks, including data compression, feature extraction, and dimensionality reduction.

PCA is a linear technique that involves finding the principal components of a dataset, which are the directions of greatest variance. The principal components can be used to reduce the dimensionality of the data while preserving as much of the original variance as possible.

Neural networks, on the other hand, are nonlinear techniques that involve multiple layers of interconnected neurons. Neural networks can be used for a variety of tasks, including classification, regression, and clustering. They can also be used for feature extraction, where the network learns to identify the most important features of the input data.

PCA and neural networks can be used together for a variety of tasks. For example, PCA can be used to reduce the dimensionality of the data before feeding it into a neural network. This can help to improve the performance of the network by reducing the amount of noise and irrelevant information in the input data.

Neural networks can also be used to improve the performance of PCA. In some cases, PCA is limited by its linear nature and may not be able to capture complex nonlinear relationships in the data. By combining PCA with a neural network, the network can learn to capture these nonlinear relationships and improve the accuracy of the PCA results.

Overall, PCA and neural networks are both powerful machine learning techniques that can be used for a variety of tasks. When used together, they can improve the performance and accuracy of each technique and help to solve more complex problems.
Principal Component Analysis | Dimension Reduction

Dimension Reduction — In pattern recognition, dimension reduction is defined as the process of converting a data set having vast dimensions into a data set with fewer dimensions, while ensuring that the converted data set conveys similar information concisely.

Example: consider a graph with two dimensions, x1 and x2, where x1 represents the measurement of several objects in cm and x2 represents the measurement of the same objects in inches. In machine learning, using both dimensions conveys similar information and introduces a lot of noise into the system, so it is better to use just one dimension. Using dimension reduction techniques, we convert the data from 2 dimensions (x1 and x2) to 1 dimension (z1).
This makes the data relatively easier to explain.

Benefits — Dimension reduction offers several benefits:

- It compresses the data and thus reduces the storage space requirements.
- It reduces the time required for computation, since fewer dimensions require less computation.
- It eliminates redundant features.
- It improves model performance.

Dimension Reduction Techniques — The two popular and well-known dimension reduction techniques are:

1. Principal Component Analysis (PCA)
2. Fisher Linear Discriminant Analysis (LDA)

Here, we will discuss Principal Component Analysis.

Principal Component Analysis

Principal Component Analysis is a well-known dimension reduction technique. It transforms the variables into a new set of variables called principal components. These principal components are linear combinations of the original variables and are orthogonal.
• 33. The first principal component accounts for the largest possible share of the variance in the original data. The second principal component captures the maximum remaining variance, subject to being orthogonal to the first. A two-dimensional data set can have at most two principal components.
PCA Algorithm-
The steps involved in the PCA algorithm are as follows:
Step-01: Get the data.
Step-02: Compute the mean vector (µ).
Step-03: Subtract the mean from the given data.
Step-04: Calculate the covariance matrix.
Step-05: Calculate the eigenvectors and eigenvalues of the covariance matrix.
Step-06: Choose components and form a feature vector.
Step-07: Derive the new data set.
PRACTICE PROBLEMS BASED ON PRINCIPAL COMPONENT ANALYSIS-
Problem-01:
Given data = { 2, 3, 4, 5, 6, 7 ; 1, 5, 3, 6, 7, 8 }. Compute the principal component using the PCA algorithm.
OR
Consider the two-dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8). Compute the principal component using the PCA algorithm.
OR
• 34. Compute the principal component of the following data-
CLASS 1: X = 2, 3, 4 and Y = 1, 5, 3
CLASS 2: X = 5, 6, 7 and Y = 6, 7, 8
Solution-
We use the PCA algorithm discussed above.
Step-01: Get the data.
The given feature vectors are-
x1 = (2, 1), x2 = (3, 5), x3 = (4, 3), x4 = (5, 6), x5 = (6, 7), x6 = (7, 8)
Step-02: Calculate the mean vector (µ).
• 35. µ = ((2 + 3 + 4 + 5 + 6 + 7) / 6, (1 + 5 + 3 + 6 + 7 + 8) / 6) = (4.5, 5)
Step-03: Subtract the mean vector (µ) from the given feature vectors.
x1 – µ = (2 – 4.5, 1 – 5) = (-2.5, -4)
x2 – µ = (3 – 4.5, 5 – 5) = (-1.5, 0)
x3 – µ = (4 – 4.5, 3 – 5) = (-0.5, -2)
x4 – µ = (5 – 4.5, 6 – 5) = (0.5, 1)
x5 – µ = (6 – 4.5, 7 – 5) = (1.5, 2)
x6 – µ = (7 – 4.5, 8 – 5) = (2.5, 3)
Step-04: Calculate the covariance matrix.
The covariance matrix is given by Cov = (m1 + m2 + m3 + m4 + m5 + m6) / 6, where each mi = (xi – µ)(xi – µ)ᵀ.
• 37. On adding the matrices m1, ..., m6 and dividing by 6, we get-
Cov = | 2.92  3.67 |
      | 3.67  5.67 |
Step-05: Calculate the eigenvalues and eigenvectors of the covariance matrix.
λ is an eigenvalue of a matrix M if it is a solution of the characteristic equation |M – λI| = 0. So we have-
(2.92 – λ)(5.67 – λ) – (3.67 × 3.67) = 0
16.56 – 2.92λ – 5.67λ + λ² – 13.47 = 0
λ² – 8.59λ + 3.09 = 0
Solving this quadratic equation, we get λ = 8.22, 0.38.
• 38. Thus, the two eigenvalues are λ1 = 8.22 and λ2 = 0.38.
Clearly, the second eigenvalue is very small compared to the first, so the second eigenvector can be left out. The eigenvector corresponding to the greatest eigenvalue is the principal component of the given data set, so we find the eigenvector corresponding to λ1.
We use the following equation to find the eigenvector-
MX = λX
where M is the covariance matrix, X is the eigenvector, and λ is the eigenvalue. Substituting the values, we get the system-
2.92X1 + 3.67X2 = 8.22X1
3.67X1 + 5.67X2 = 8.22X2
On simplification, we get-
5.3X1 = 3.67X2 ………(1)
3.67X1 = 2.55X2 ………(2)
From (1) and (2), X1 = 0.69X2. Taking X2 = 1, the eigenvector is X = (0.69, 1).
• 39. Thus, the principal component for the given data set is the direction (0.69, 1), or (0.57, 0.82) after normalisation to unit length. Lastly, each data point is projected onto this new one-dimensional subspace by taking its dot product, after mean subtraction, with the unit eigenvector.
Problem-02:
Use the PCA algorithm to transform the pattern (2, 1) onto the eigenvector found in the previous question.
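The whole worked example can be verified with a few lines of NumPy (a sketch; the covariance divides by n = 6 to match the solution above, and eigh may return the eigenvector with its sign flipped):

import numpy as np

# The six 2-D patterns from Problem-01
X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)

mu = X.mean(axis=0)               # mean vector: (4.5, 5)
D = X - mu                        # mean-centred patterns
C = D.T @ D / len(X)              # covariance matrix ~ [[2.92, 3.67], [3.67, 5.67]]

vals, vecs = np.linalg.eigh(C)    # eigenvalues in ascending order
print(vals)                       # ~ [0.38, 8.21] (8.22 above comes from 2-d.p. rounding)
e = vecs[:, -1]                   # unit eigenvector of the largest eigenvalue
print(e)                          # ~ [0.57, 0.82], i.e. X1 ≈ 0.69·X2

# Problem-02: project the pattern (2, 1) onto the principal component
print((np.array([2.0, 1.0]) - mu) @ e)   # ~ -4.71 (sign depends on the eigenvector's direction)

The last line answers Problem-02: after mean subtraction, the pattern (2, 1) maps to roughly -4.71 along the principal component.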
• 40. 10 Fuzzy Logic: Extracting Fuzzy Models from Data
Fuzzy logic is a type of logic that allows for degrees of truth rather than just true or false values. In machine learning it is often used to extract fuzzy models from data. A fuzzy model uses fuzzy logic to make predictions or decisions from uncertain or incomplete data, and is particularly useful where traditional models work poorly, such as when the data is noisy, ambiguous, or incomplete.
To extract a fuzzy model from data, the first step is to define the input and output variables of the model. The input variables are the features or attributes of the data, while the output variable is the target we want to predict or classify. Next, we use fuzzy logic to define the membership functions for each input and output variable. A membership function describes the degree to which each data point belongs to each category or class; for example, a data point may have a high degree of membership in the category "low" but a low degree of membership in the category "high". Once the membership functions have been defined, we can use fuzzy inference to make predictions or decisions: the membership functions determine the degree of membership of each data point in each category, and these degrees are combined to produce the output. A small numerical sketch of these two steps follows.
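A minimal numerical sketch of membership functions and fuzzy inference (the temperature variable, the triangular set boundaries, and the two rule outputs are all made-up illustrations; the membership-weighted average is one common, zero-order Sugeno-style choice, not the only one):

import numpy as np

def triangular(x, a, b, c):
    # Membership rises linearly from a to the peak b, then falls to zero at c.
    return float(np.clip(min((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0))

t = 16.0  # an input reading, e.g. a temperature in degrees Celsius

# Step 1: membership functions for the fuzzy sets "low" and "high"
mu_low  = triangular(t, 0.0, 10.0, 25.0)    # ~0.60: mostly "low"
mu_high = triangular(t, 15.0, 30.0, 40.0)   # ~0.07: slightly "high"

# Step 2: fuzzy inference with two rules,
#   IF t is low  THEN power = 80
#   IF t is high THEN power = 10
# combined as a membership-weighted average.
power = (mu_low * 80 + mu_high * 10) / (mu_low + mu_high)
print(mu_low, mu_high, round(power, 1))     # ~0.6, ~0.07, ~73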
• 41. 10.1 Fuzzy Decision Trees
Fuzzy decision trees are decision trees that use fuzzy logic to make decisions from uncertain or imprecise data. Decision trees are a supervised learning technique that recursively partitions the input space into regions corresponding to different classes or categories. Fuzzy decision trees extend traditional decision trees by allowing degrees of membership in each category, rather than a single crisp, binary classification; this is particularly useful when the data is uncertain or imprecise.
To build a fuzzy decision tree, we start with a set of training data consisting of input-output pairs. We then use fuzzy logic to determine the degree of membership of each data point in each category: membership functions are defined for each input and output variable and used to compute these degrees. Next, the fuzzy membership values are used to construct the tree itself, which consists of nodes and edges; each node represents a test on one of the input variables, and each edge represents a decision based on the result of that test. The degree of membership of each data point in each category determines the probability of reaching each leaf node of the tree.
Fuzzy decision trees can be used for classification, regression, and clustering, and can improve the accuracy and robustness of models in situations where traditional, crisp decision trees work poorly.
11 Stochastic Search Methods
Stochastic search methods are a class of optimization algorithms that use probabilistic techniques to search for the optimal solution in a large search space. In machine learning they are commonly used to find the best set of parameters for a model, such as the weights of a neural network or the coefficients of a regression model. They are employed when the search space is too large to enumerate exhaustively, or when the objective function is highly nonlinear and has many local optima. The basic idea is to explore the search space by randomly sampling solutions and using probabilistic rules to move towards better ones.
One common stochastic search method is the stochastic gradient descent (SGD) algorithm. Here the objective function is optimized by iteratively updating the parameters in the direction of the negative gradient, w ← w − η∇L(w), where the learning rate η controls the step size. SGD is widely used for training neural networks and other deep learning models; a small sketch follows.
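As a minimal illustration (a sketch on synthetic data; the linear least-squares model, learning rate, and epoch count are all illustrative choices), SGD applies the update w ← w − η∇L(w) one sample at a time:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # 200 samples, 3 features
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)      # noisy linear targets

w = np.zeros(3)                                  # parameters to learn
eta = 0.05                                       # learning rate (step size)
for epoch in range(50):
    for i in rng.permutation(len(X)):            # visit the samples in random order
        grad = (X[i] @ w - y[i]) * X[i]          # gradient of the per-sample loss 0.5*(x·w - y)^2
        w -= eta * grad                          # step against the gradient
print(w)                                         # ~ [1.5, -2.0, 0.5]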
• 42. Another stochastic search method is simulated annealing, which is based on the physical process of annealing: heating and then slowly cooling a material to improve its properties. In simulated annealing, the search starts at a high temperature that is gradually lowered over time. At each iteration the algorithm randomly selects a new candidate solution and computes its fitness. If the candidate is better than the current solution, it is accepted; if it is worse, it may still be accepted with a probability that decreases as the temperature decreases, which allows the search to escape local optima early on.
Other stochastic search methods include evolutionary algorithms, such as genetic algorithms and particle swarm optimization, which mimic natural selection and swarm behaviour to search for the optimal solution.
Overall, stochastic search methods are powerful optimization techniques, widely used in machine learning and other fields, that can efficiently search large spaces and find good solutions in the presence of noise, uncertainty, and nonlinearity.
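A compact sketch of simulated annealing on a one-dimensional objective (the function, neighbourhood, cooling rate, and the standard exp(−Δ/T) acceptance rule are illustrative choices, not from these notes):

import math, random

random.seed(0)

def f(x):
    # An objective with several local minima; its global minimum is near x ≈ -0.5.
    return x * x + 10 * math.sin(3 * x)

x = 8.0                                           # arbitrary starting solution
T = 5.0                                           # initial (high) temperature
while T > 1e-3:
    cand = x + random.uniform(-1.0, 1.0)          # random nearby candidate
    delta = f(cand) - f(x)
    # Always accept improvements; accept worse moves with probability e^(-delta/T),
    # which shrinks as the temperature T falls.
    if delta < 0 or random.random() < math.exp(-delta / T):
        x = cand
    T *= 0.99                                     # geometric cooling schedule
print(round(x, 2), round(f(x), 2))                # a low value of f, typically near the global minimum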
• 43. Printed Page: 1 of 2    Subject Code: KIT601    Roll No: _ _ _ _ _ _ _ _ _ _ _ _ _
BTECH (SEM VI) THEORY EXAMINATION 2021-22
DATA ANALYTICS
Time: 3 Hours    Total Marks: 100
Note: Attempt all Sections. If you require any missing data, then choose suitably.
SECTION A
1. Attempt all questions in brief. 2*10 = 20
Qno | Questions | CO
(a) Discuss the need of data analytics. | 1
(b) Give the classification of data. | 1
(c) Define neural network. | 2
(d) What is multivariate analysis? | 2
(e) Give the full form of RTAP and discuss its application. | 3
(f) What is the role of sampling data in a stream? | 3
(g) Discuss the use of limited pass algorithm. | 4
(h) What is the principle behind hierarchical clustering technique? | 4
(i) List five R functions used in descriptive statistics. | 5
(j) List the names of any 2 visualization tools. | 5
SECTION B
2. Attempt any three of the following: 10*3 = 30
Qno | Questions | CO
(a) Explain the process model and computation model for Big data platform. | 1
(b) Explain the use and advantages of decision trees. | 2
(c) Explain the architecture of data stream model. | 3
(d) Illustrate the K-means algorithm in detail with its advantages. | 4
(e) Differentiate between NoSQL and RDBMS databases. | 5
SECTION C
3. Attempt any one part of the following: 10*1 = 10
Qno | Questions | CO
(a) Explain the various phases of data analytics life cycle. | 1
(b) Explain modern data analytics tools in detail. | 1
4. Attempt any one part of the following: 10*1 = 10
Qno | Questions | CO
(a) Compare various types of support vector and kernel methods of data analysis. | 2
(b) Given data = {2,3,4,5,6,7; 1,5,3,6,7,8}. Compute the principal component using PCA algorithm. | 2
• 44. Printed Page: 2 of 2    Subject Code: KIT601    Roll No: _ _ _ _ _ _ _ _ _ _ _ _ _
BTECH (SEM VI) THEORY EXAMINATION 2021-22
DATA ANALYTICS
5. Attempt any one part of the following: 10*1 = 10
Qno | Questions | CO
(a) Explain any one algorithm to count number of distinct elements in a data stream. | 3
(b) Discuss the case study of stock market predictions in detail. | 3
6. Attempt any one part of the following: 10*1 = 10
Qno | Questions | CO
(a) Differentiate between CLIQUE and ProCLUS clustering. | 4
(b) A database has 5 transactions. Let min_sup = 60% and min_conf = 80%.
TID | Items_Bought
T100 | {M, O, N, K, E, Y}
T200 | {D, O, N, K, E, Y}
T300 | {M, A, K, E}
T400 | {M, U, C, K, Y}
T500 | {C, O, O, K, I, E}
i) Find all frequent itemsets using Apriori algorithm.
ii) List all the strong association rules (with support s and confidence c). | 4
7. Attempt any one part of the following: 10*1 = 10
Qno | Questions | CO
(a) Explain the HIVE architecture with its features in detail. | 5
(b) Write R function to check whether the given number is prime or not. | 5