Multidimensional Data
Dr. Ashutosh Satapathy
Assistant Professor, Department of CSE
VR Siddhartha Engineering College
Kanuru, Vijayawada
October 19, 2022
Outline
1. Multivariate and High-Dimensional Problems
2. Visualisation
   - Three-Dimensional Visualisation
   - Parallel Coordinate Plots
3. Multivariate Random Vectors and Data
   - Population Case
   - Sample Case
   - Multivariate Random Vectors
   - Gaussian Random Vectors
   - Marginal and Conditional Normal Distributions
Multivariate and High-Dimensional Problems
Early in the twentieth century, scientists such as Pearson (1901),
Hotelling (1933) and Fisher (1936) developed methods for
analysing multivariate data in order to
1 Understand the structure in the data and summarise it in simpler
ways.
2 Understand the relationship of one part of the data to another part.
3 Make decisions and inferences based on the data.
The early methods these scientists developed are linear; as time
moved on, more complex methods were developed.
The essential structure of these data sets can often be obscured by noise.
The aim is to reduce the original data in such a way that informative and interesting structure in the data is preserved, while noisy, irrelevant or purely random variables, dimensions or features are removed, as these can adversely affect the analysis.
Multivariate and High-Dimensional Problems
Traditionally one assumes that the dimension d is small compared to
the sample size n.
Many recent data sets do not fit into this framework; we encounter
the following problems.
Data whose dimension is comparable to the sample size, and both
are large.
High-dimension, low sample size data, whose dimension d vastly exceeds the sample size n, so d ≫ n.
Functional data whose observations are functions.
High-dimensional and functional data pose special challenges.
Visualisation
Before we analyse a set of data, it is important to look at it.
Often we get useful clues such as skewness, bi- or multi-modality,
outliers, or distinct groupings.
Graphical displays are exploratory data-analysis tools, which, if
appropriately used, can enhance our understanding of data.
Visual clues are easier to understand and interpret than numbers
alone, and the information you can get from graphical displays can
help you understand answers that are based on numbers.
Three-Dimensional Visualisation
Two-dimensional scatter-plots are a natural – though limited – way
of looking at data with three or more variables.
As the number of variables, and therefore the dimension, increases, we can of course still display three of the d dimensions in scatter-plots, but it is less clear how one can look at more than three dimensions in a single plot.
Figure 2.1 displays 10,000 observations of the three variables CD3, CD8 and CD4 from the five-dimensional HIV+ and HIV− data sets.
The data sets contain measurements of blood cells relevant to HIV.
Three-Dimensional Visualisation
Figure 2.1: HIV+ data (left) and HIV- data (right) of variables CD3, CD8 and
CD4.
There are differences between the point clouds in the two figures, and an
important task is to exhibit and quantify the differences.
Three-Dimensional Visualisation
The data of Figure 2.1 are projected onto a number of orthogonal directions, and the lower-dimensional projected data are displayed in Figure 2.2.
Figure 2.2: Orthogonal projections of the HIV+ data (left) and the HIV− data (right).
Three-Dimensional Visualisation
We can see a smaller fourth cluster in the top right corner of the
HIV- data, which seems to have almost disappeared in the HIV+ data
in the left panel.
Many of the methods we explore use projections: Principal
Component Analysis, Factor Analysis, Multidimensional Scaling,
Independent Component Analysis and Projection Pursuit.
In each case the projections focus on different aspects and properties
of the data.
Three-Dimensional Visualisation
Figure 2.3: Three different species of Iris flowers.
We display the four variables of Fisher’s iris data – sepal length, sepal
width, petal length and petal width – in a sequence of three-
dimensional scatter-plots.
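As an illustration of such three-dimensional scatter-plots, the short sketch below draws one of the four panels of Figure 2.4 from Fisher's iris data. It is not the original code behind the slides; it assumes scikit-learn (for the iris data) and matplotlib are installed, and the colour coding follows the figure caption.

```python
# A minimal sketch (not from the slides): one 3-D scatter-plot of Fisher's
# iris data, features 1, 2 and 3, one colour per species.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target          # X: 150 x 4, columns = features 1..4

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # 3-D axes
for label, colour in zip(range(3), ["red", "green", "black"]):
    sel = y == label
    ax.scatter(X[sel, 0], X[sel, 1], X[sel, 2], c=colour,
               label=iris.target_names[label])
ax.set_xlabel("sepal length")
ax.set_ylabel("sepal width")
ax.set_zlabel("petal length")
ax.legend()
plt.show()
```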
Three-Dimensional Visualisation
Figure 2.4: Features 1, 2 and 3 (top left), features 1, 2 and 4 (top right), features
1, 3 and 4 (bottom left) and features 2, 3 and 4 (bottom right).
Red refers to Setosa, green to Versicolor and black to Virginica.
Parallel Coordinate Plots
As the dimension grows, three-dimensional scatter-plots become
less relevant, unless we know that only some variables are important.
An alternative, which allows us to see all variables at once, is to present
the data in the form of parallel coordinate plots.
The idea is to present the data as two-dimensional graphs.
The variable numbers are represented as values on the y-axis in a
vertical parallel coordinate plot.
For a vector X = [X1,..., Xd ]T we represent the first variable X1 by
the point (X1, 1) and the jth variable Xj by (Xj , j).
Finally, we connect the d points by a line which goes from (X1, 1) to
(X2, 2) and so on to (Xd , d).
We apply the same rule to each subsequent d-dimensional feature vector.
Figure 2.5 shows a vertical parallel coordinate plot for Fisher’s iris
data.
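The construction described above translates directly into code. The sketch below is not from the slides; it uses pandas.plotting.parallel_coordinates, which draws the variables along the x-axis (the horizontal variant discussed later), and it assumes pandas, matplotlib and scikit-learn are available.

```python
# A minimal sketch (not from the slides): a parallel coordinate plot of the
# iris data with one colour per species, as in Figure 2.5.
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.copy()                        # 4 feature columns + 'target'
df["species"] = iris.target_names[iris.target]

parallel_coordinates(df.drop(columns="target"), class_column="species",
                     color=["red", "green", "black"])
plt.ylabel("feature value")
plt.show()
```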
Parallel Coordinate Plots
Figure 2.5: Iris data with variables represented on the y-axis and separate colours
for the three species.
Red refers to the observations of Setosa, green to those of
Versicolor and black to those of Virginica.
Unlike the previous Figure 2.4, Figure 2.5 tells us that dimension 3 (petal length) separates the two groups most strongly.
Parallel Coordinate Plots
Instead of the three colours shown in Figure 2.5, a different colour can be used for each observation, as in Figure 2.6.
In a horizontal parallel coordinate plot, the x-axis represents the
variable numbers 1, ..., d. For a feature vector X = [X1 ··· Xd]T,
the first variable gives rise to the point (1, X1) and the jth variable
Xj to (j, Xj).
The d points are connected by a line, starting with (1, X1), then (2,
X2), until we reach (d, Xd).
Because the variables are presented along the x-axis, horizontal parallel coordinate plots are the form most often used.
The differently coloured lines make it easier to trace particular
observations in Figure 2.6.
Parallel Coordinate Plots
Figure 2.6: Parallel coordinate view of the illicit drug market data.
Figure 2.6 shows the 66 monthly observations on 15 features or
variables of the illicit drug market data.
Each observation (month) is displayed in a different colour.
Looking at variable 5, heroin overdose, the question arises whether
there could be two groups of observations corresponding to the high
and low values of this variable.
Population Case
In data science, population is the entire set of items from which you
draw data for a statistical study. It can be a group of individuals, a
set of items, etc.
Generally, population refers to the people who live in a particular area
at a specific time. But in data science, population refers to data on
your study of interest.
It can be a group of individuals, objects, events, organizations, etc.
You use populations to draw conclusions. An example of a population would be the entire student body at a school, with the problem statement being the percentage of students who speak English fluently.
If you had to collect the same data from the entire country of India, it would be impossible to draw reliable conclusions because of geographical and accessibility constraints, making the data biased towards certain regions or groups.
Sample Case
A sample is defined as a smaller and more manageable representation of a larger group: a subset of a larger population that retains the characteristics of that population.
A sample is used in testing when the population size is too large for
all members or observations to be included in the test.
The sample is an unbiased subset of the population that best
represents the whole data.
The process of collecting data from a small subsection of the
population and then using it to generalize over the entire set is called
sampling.
Samples are used when the population is too large or effectively unlimited in size, so that collecting reliable data on every member is not feasible.
A sample should generally be unbiased and satisfy all variations
present in a population. A sample should typically be chosen at
random.
Multivariate Random Vectors
Random vectors are vector-valued functions defined on a sample space.
A vector-valued function is a mathematical function of one or more variables whose range is a set of multidimensional vectors or infinite-dimensional vectors.
In Cartesian 3-space it can be represented as v(t) = ⟨f(t), g(t), h(t)⟩, where v(t) is the vector function and f(t), g(t) and h(t) are the coordinate functions.
We refer to a collection of random vectors as the data or the random sample.
Specific feature values are measured for each of the random vectors in the collection.
We call these values the realised or observed values of the data, or simply the observed data.
The observed values are no longer random.
The Population Case
Let
\[
X = [X_1\ X_2\ \cdots\ X_d]^T \tag{1}
\]
be a random vector from a distribution F: R^d → [0, 1]. The individual X_j, with j ≤ d, are random variables, also called the variables, components or entries of X; X is d-dimensional or d-variate.
X has a finite d-dimensional mean or expected value EX and a finite d×d covariance matrix var(X):
\[
\mu = EX, \qquad \Sigma = \operatorname{var}(X) = E\left[(X - \mu)(X - \mu)^T\right] \tag{2}
\]
The mean µ and covariance matrix Σ are
\[
\mu = [\mu_1\ \mu_2\ \cdots\ \mu_d]^T, \qquad
\Sigma = \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2
\end{pmatrix} \tag{3}
\]
The Population Case
Here σ_j^2 = var(X_j) and σ_jk = cov(X_j, X_k); we also write σ_jj for the diagonal elements σ_j^2 of Σ.
\[
X \sim (\mu, \Sigma) \tag{4}
\]
Equation 4 is shorthand for a random vector X which has mean µ and covariance matrix Σ.
If X is a d-dimensional random vector and A is a d × k matrix, for some k ≥ 1, then A^T X is a k-dimensional random vector.

Result 1.1
Let X ∼ (µ, Σ) be a d-variate random vector. Let A and B be matrices of size d × k and d × l, respectively.
a. The mean and covariance matrix of the k-variate random vector A^T X are A^T X ∼ (A^T µ, A^T Σ A).
b. The random vectors A^T X and B^T X are uncorrelated if and only if A^T Σ B = 0_{k×l} (all entries are 0).
The Population Case
Question 1: Suppose you have a set of n = 5 data items, representing 5 insects, where each data item has a height (X), width (Y) and speed (Z) (therefore d = 3).

Table 3.1: Three features of five different insects.

Insect  Height (cm)  Width (cm)  Speed (m/s)
I1      0.64         0.58        0.29
I2      0.66         0.57        0.33
I3      0.68         0.59        0.37
I4      0.69         0.66        0.46
I5      0.73         0.60        0.55

Solution:
\[
\mu = \left[\frac{0.64+0.66+0.68+0.69+0.73}{5},\ \frac{0.58+0.57+0.59+0.66+0.60}{5},\ \frac{0.29+0.33+0.37+0.46+0.55}{5}\right]^T = [0.68,\ 0.60,\ 0.40]^T
\]
The Population Case
\[
I_1 - \mu = [-0.04, -0.02, -0.11]^T, \quad (I_1-\mu)(I_1-\mu)^T = \begin{pmatrix} 0.0016 & 0.0008 & 0.0044 \\ 0.0008 & 0.0004 & 0.0022 \\ 0.0044 & 0.0022 & 0.0121 \end{pmatrix}
\]
\[
I_2 - \mu = [-0.02, -0.03, -0.07]^T, \quad (I_2-\mu)(I_2-\mu)^T = \begin{pmatrix} 0.0004 & 0.0006 & 0.0014 \\ 0.0006 & 0.0009 & 0.0021 \\ 0.0014 & 0.0021 & 0.0049 \end{pmatrix}
\]
\[
I_3 - \mu = [0, -0.01, -0.03]^T, \quad (I_3-\mu)(I_3-\mu)^T = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0.0001 & 0.0003 \\ 0 & 0.0003 & 0.0009 \end{pmatrix}
\]
\[
I_4 - \mu = [0.01, 0.06, 0.06]^T, \quad (I_4-\mu)(I_4-\mu)^T = \begin{pmatrix} 0.0001 & 0.0006 & 0.0006 \\ 0.0006 & 0.0036 & 0.0036 \\ 0.0006 & 0.0036 & 0.0036 \end{pmatrix}
\]
\[
I_5 - \mu = [0.05, 0, 0.15]^T, \quad (I_5-\mu)(I_5-\mu)^T = \begin{pmatrix} 0.0025 & 0 & 0.0075 \\ 0 & 0 & 0 \\ 0.0075 & 0 & 0.0225 \end{pmatrix}
\]
The Population Case
\[
\Sigma = \frac{1}{n}\sum_{i=1}^{n}(I_i - \mu)(I_i - \mu)^T
= \frac{1}{5}\begin{pmatrix} 0.0046 & 0.0020 & 0.0139 \\ 0.0020 & 0.0050 & 0.0082 \\ 0.0139 & 0.0082 & 0.0440 \end{pmatrix}
= \begin{pmatrix} 9.2\mathrm{E}{-4} & 4.0\mathrm{E}{-4} & 0.00278 \\ 4.0\mathrm{E}{-4} & 1.0\mathrm{E}{-3} & 0.00164 \\ 0.00278 & 0.00164 & 0.0088 \end{pmatrix}
\]
Definition
Mean: The mean is the average of a collection of numbers.
Variance: The expectation of the squared deviation of a random variable from its mean.
Covariance: A measure of the relationship between two random variables and of the extent to which they change together.
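The computation of Question 1 can be checked with a few lines of NumPy. The sketch below is ours, not from the slides; the array name insects and the layout (one row per insect) are our choices.

```python
# A minimal sketch (not from the slides) reproducing the population mean and
# covariance matrix of Question 1 with NumPy.
import numpy as np

# Rows: I1..I5; columns: height (cm), width (cm), speed (m/s).
insects = np.array([[0.64, 0.58, 0.29],
                    [0.66, 0.57, 0.33],
                    [0.68, 0.59, 0.37],
                    [0.69, 0.66, 0.46],
                    [0.73, 0.60, 0.55]])

mu = insects.mean(axis=0)                    # population mean
centred = insects - mu
Sigma = centred.T @ centred / len(insects)   # divide by n (population case)

print(mu)      # [0.68 0.60 0.40]
print(Sigma)   # matches the slide's matrix, up to floating-point rounding
```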
The Population Case
Question 2: Verify Result 1.1a, i.e. µ_{A^T X} = A^T µ_X and Σ_{A^T X} = A^T Σ_X A, for
\[
A = \begin{pmatrix} 0.02 & 0.01 \\ 0.01 & 0.03 \\ 0.03 & 0.02 \end{pmatrix}, \qquad
X = \begin{pmatrix} 0.64 & 0.66 & 0.68 & 0.69 & 0.73 \\ 0.58 & 0.57 & 0.59 & 0.66 & 0.60 \\ 0.29 & 0.33 & 0.37 & 0.46 & 0.55 \end{pmatrix}
\]
Solution:
\[
A^T X = \begin{pmatrix} 0.0273 & 0.0288 & 0.0306 & 0.0342 & 0.0371 \\ 0.0296 & 0.0303 & 0.0319 & 0.0359 & 0.0363 \end{pmatrix}
\]
\[
\mu_{A^TX} = \begin{pmatrix} 0.0316 \\ 0.0328 \end{pmatrix} \quad\text{and}\quad
A^T\mu_X = \begin{pmatrix} 0.02 & 0.01 & 0.03 \\ 0.01 & 0.03 & 0.02 \end{pmatrix}\begin{pmatrix} 0.68 \\ 0.60 \\ 0.40 \end{pmatrix} = \begin{pmatrix} 0.0316 \\ 0.0328 \end{pmatrix}
\]
Hence, µ_{A^T X} = A^T µ_X.
The Population Case
\[
A^T\Sigma_X A = \begin{pmatrix} 0.02 & 0.01 & 0.03 \\ 0.01 & 0.03 & 0.02 \end{pmatrix}
\begin{pmatrix} 9.2\mathrm{E}{-4} & 4.0\mathrm{E}{-4} & 0.00278 \\ 4.0\mathrm{E}{-4} & 1.0\mathrm{E}{-3} & 0.00164 \\ 0.00278 & 0.00164 & 0.0088 \end{pmatrix}
\begin{pmatrix} 0.02 & 0.01 \\ 0.01 & 0.03 \\ 0.03 & 0.02 \end{pmatrix}
= \begin{pmatrix} 0.000012868 & 0.000009794 \\ 0.000009794 & 0.000007832 \end{pmatrix}
\]
\[
(A^TX)_1 - \mu_{A^TX} = [-0.0043, -0.0032]^T, \quad ((A^TX)_1 - \mu_{A^TX})((A^TX)_1 - \mu_{A^TX})^T = \begin{pmatrix} 0.00001849 & 0.00001376 \\ 0.00001376 & 0.00001024 \end{pmatrix}
\]
\[
(A^TX)_2 - \mu_{A^TX} = [-0.0028, -0.0025]^T, \quad ((A^TX)_2 - \mu_{A^TX})((A^TX)_2 - \mu_{A^TX})^T = \begin{pmatrix} 0.00000784 & 0.000007 \\ 0.000007 & 0.00000625 \end{pmatrix}
\]
\[
(A^TX)_3 - \mu_{A^TX} = [-0.001, -0.0009]^T, \quad ((A^TX)_3 - \mu_{A^TX})((A^TX)_3 - \mu_{A^TX})^T = \begin{pmatrix} 0.000001 & 0.0000009 \\ 0.0000009 & 0.00000081 \end{pmatrix}
\]
The Population Case
\[
(A^TX)_4 - \mu_{A^TX} = [0.0026, 0.0031]^T, \quad ((A^TX)_4 - \mu_{A^TX})((A^TX)_4 - \mu_{A^TX})^T = \begin{pmatrix} 0.00000676 & 0.00000806 \\ 0.00000806 & 0.00000961 \end{pmatrix}
\]
\[
(A^TX)_5 - \mu_{A^TX} = [0.0055, 0.0035]^T, \quad ((A^TX)_5 - \mu_{A^TX})((A^TX)_5 - \mu_{A^TX})^T = \begin{pmatrix} 0.00003025 & 0.00001925 \\ 0.00001925 & 0.00001225 \end{pmatrix}
\]
\[
\Sigma_{A^TX} = \frac{1}{n}\sum_{i=1}^{n}\big((A^TX)_i - \mu_{A^TX}\big)\big((A^TX)_i - \mu_{A^TX}\big)^T
= \frac{1}{5}\begin{pmatrix} 0.00006434 & 0.00004897 \\ 0.00004897 & 0.00003916 \end{pmatrix}
= \begin{pmatrix} 0.000012868 & 0.000009794 \\ 0.000009794 & 0.000007832 \end{pmatrix}
\]
Hence, Σ_{A^T X} = A^T Σ_X A.
The Population Case
Question 3: Verify Result 1.1b, i.e. whether A^T X and B^T X are correlated, for
\[
A = \begin{pmatrix} 0.02 & 0.01 \\ 0.01 & 0.03 \\ 0.03 & 0.02 \end{pmatrix}, \quad
B = \begin{pmatrix} 0.02 & 0.01 \\ 0.01 & 0.03 \\ 0.03 & 0.02 \end{pmatrix}, \quad
X = \begin{pmatrix} 0.64 & 0.66 & 0.68 & 0.69 & 0.73 \\ 0.58 & 0.57 & 0.59 & 0.66 & 0.60 \\ 0.29 & 0.33 & 0.37 & 0.46 & 0.55 \end{pmatrix}
\]
Solution:
\[
\Sigma = \begin{pmatrix} 9.2\mathrm{E}{-4} & 4.0\mathrm{E}{-4} & 0.00278 \\ 4.0\mathrm{E}{-4} & 1.0\mathrm{E}{-3} & 0.00164 \\ 0.00278 & 0.00164 & 0.0088 \end{pmatrix}
\]
\[
A^T\Sigma B = \begin{pmatrix} 0.02 & 0.01 & 0.03 \\ 0.01 & 0.03 & 0.02 \end{pmatrix}
\begin{pmatrix} 9.2\mathrm{E}{-4} & 4.0\mathrm{E}{-4} & 0.00278 \\ 4.0\mathrm{E}{-4} & 1.0\mathrm{E}{-3} & 0.00164 \\ 0.00278 & 0.00164 & 0.0088 \end{pmatrix}
\begin{pmatrix} 0.02 & 0.01 \\ 0.01 & 0.03 \\ 0.03 & 0.02 \end{pmatrix}
= \begin{pmatrix} 0.000012868 & 0.000009794 \\ 0.000009794 & 0.000007832 \end{pmatrix}
\]
Since A^T Σ B ≠ 0_{2×2}, A^T X and B^T X are correlated with each other.
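Both parts of Result 1.1 can also be verified numerically. The following sketch is ours rather than the slides' code; it reuses the data matrix X and the matrices A and B of Questions 2 and 3.

```python
# A minimal sketch (not from the slides) checking Result 1.1 for Questions 2 and 3.
import numpy as np

X = np.array([[0.64, 0.66, 0.68, 0.69, 0.73],
              [0.58, 0.57, 0.59, 0.66, 0.60],
              [0.29, 0.33, 0.37, 0.46, 0.55]])          # d x n data matrix
A = np.array([[0.02, 0.01], [0.01, 0.03], [0.03, 0.02]])
B = A.copy()                                            # Question 3 uses B = A

mu = X.mean(axis=1)
Sigma = (X - mu[:, None]) @ (X - mu[:, None]).T / X.shape[1]

Y = A.T @ X                                             # projected data, 2 x n
mu_Y = Y.mean(axis=1)
Sigma_Y = (Y - mu_Y[:, None]) @ (Y - mu_Y[:, None]).T / Y.shape[1]

print(np.allclose(mu_Y, A.T @ mu))           # Result 1.1a, mean: True
print(np.allclose(Sigma_Y, A.T @ Sigma @ A)) # Result 1.1a, covariance: True
print(A.T @ Sigma @ B)                       # non-zero, so A^T X and B^T X
                                             # are correlated (Result 1.1b)
```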
The Sample Case
Let X_1, ..., X_n be d-dimensional random vectors. We assume that the X_i are independent and from the same distribution F: R^d → [0, 1] with finite mean µ and covariance matrix Σ.
We omit reference to F when knowledge of the distribution is not required.
In statistics one often identifies a random vector with its observed values and writes X_i = x_i.
We explore properties of random samples but only encounter observed values of random vectors. For this reason we write
\[
X = [X_1\ X_2\ \cdots\ X_n] \tag{5}
\]
for the sample of independent random vectors X_i and call this collection a random sample or data.
The Sample Case
\[
X = \begin{pmatrix}
X_{11} & X_{21} & \cdots & X_{n1} \\
X_{12} & X_{22} & \cdots & X_{n2} \\
\vdots & \vdots & & \vdots \\
X_{1d} & X_{2d} & \cdots & X_{nd}
\end{pmatrix}
= \begin{pmatrix} X_{\bullet 1} \\ X_{\bullet 2} \\ \vdots \\ X_{\bullet d} \end{pmatrix} \tag{6}
\]
The ith column of X is the ith random vector X_i, and the jth row X_{•j} is the jth variable across all n random vectors; the i in X_{ij} refers to the ith vector X_i, and the j refers to the jth variable.
For data, the mean µ and covariance matrix Σ are usually not known; instead, we work with the sample mean X̄ and the sample covariance matrix S. This is represented by
\[
X \sim \mathrm{Sam}(\bar{X}, S) \tag{7}
\]
The sample mean and sample covariance matrix depend on the sample size n.
The Sample Case
\[
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
S = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})^T \tag{8}
\]
Definitions of the sample covariance matrix in the literature use either the factor n^{-1} or (n-1)^{-1}; the factor (n-1)^{-1} is preferred because it yields an unbiased estimator of the population covariance matrix Σ.
\[
X_{\mathrm{cent}} = X - \bar{X} = [X_1 - \bar{X},\ X_2 - \bar{X},\ \ldots,\ X_n - \bar{X}] \tag{9}
\]
X_cent is the centred data, and it is of size d×n. Using this notation, the d×d sample covariance matrix S becomes
\[
S = \frac{1}{n-1}\, X_{\mathrm{cent}} X_{\mathrm{cent}}^T = \frac{1}{n-1}(X - \bar{X})(X - \bar{X})^T \tag{10}
\]
The Sample Case
The entries of the sample covariance matrix S are s_jk, where
\[
s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij} - m_j)(X_{ik} - m_k) \tag{11}
\]
Here X̄ = [m_1, ..., m_d]^T, and m_j is the sample mean of the jth variable.
As for the population, we write s_j^2 or s_jj for the diagonal elements of S.
Consider a ∈ R^d; then the projection of X onto a is a^T X.
Similarly, the projection of the data matrix X onto a is done element-wise for each random vector X_i and results in the 1×n vector a^T X.
The Sample Case
Question 4: The math and science scores of good, average and poor students from a class are given as follows:

Student  Math (X)  Science (Y)
1        92        68
2        55        30
3        100       78

Find the sample mean X̄, the sample covariance matrix S and s_12 of the above data.
Solution:
\[
X = \begin{pmatrix} 92 & 55 & 100 \\ 68 & 30 & 78 \end{pmatrix}, \qquad
\bar{X} = \begin{pmatrix} \frac{92+55+100}{3} \\[2pt] \frac{68+30+78}{3} \end{pmatrix} = \begin{pmatrix} 82.33 \\ 58.66 \end{pmatrix}
\]
\[
X_1 - \bar{X} = [9.67, 9.34]^T, \quad (X_1 - \bar{X})(X_1 - \bar{X})^T = \begin{pmatrix} 93.5089 & 90.3178 \\ 90.3178 & 87.2356 \end{pmatrix}
\]
\[
X_2 - \bar{X} = [-27.33, -28.66]^T, \quad (X_2 - \bar{X})(X_2 - \bar{X})^T = \begin{pmatrix} 746.9289 & 783.2778 \\ 783.2778 & 821.3956 \end{pmatrix}
\]
The Sample Case
\[
X_3 - \bar{X} = [17.67, 19.34]^T, \quad (X_3 - \bar{X})(X_3 - \bar{X})^T = \begin{pmatrix} 312.2289 & 341.7378 \\ 341.7378 & 374.0356 \end{pmatrix}
\]
\[
S = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})^T
= \frac{1}{2}\begin{pmatrix} 1152.6667 & 1215.3334 \\ 1215.3334 & 1282.6668 \end{pmatrix}
= \begin{pmatrix} 576.3335 & 607.6667 \\ 607.6667 & 641.3334 \end{pmatrix}
\]
From Equation 11,
\[
s_{12} = \frac{1}{2}\sum_{i=1}^{3}(X_{i1} - m_1)(X_{i2} - m_2)
= \frac{1}{2}\big[(92 - 82.33)(68 - 58.66) + (55 - 82.33)(30 - 58.66) + (100 - 82.33)(78 - 58.66)\big] = 607.6667
\]
The Sample Case
Question 5: Compute the projection of the matrix
\[
X = \begin{pmatrix} 92 & 55 & 100 \\ 68 & 30 & 78 \end{pmatrix}
\]
onto the vector a = [−45, 45]^T.
Solution:
\[
P = a^T X = \begin{pmatrix} -45 & 45 \end{pmatrix}\begin{pmatrix} 92 & 55 & 100 \\ 68 & 30 & 78 \end{pmatrix} = \begin{pmatrix} -1080 & -1125 & -990 \end{pmatrix}
\]
So the projection of X onto the vector [−45, 45]^T is a 1×3 matrix.
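Questions 4 and 5 can be reproduced with NumPy as a check on the hand calculations. The sketch below is ours; np.cov with ddof=1 implements the 1/(n-1) normalisation of Equation 8, and the final line is the projection a^T X of Question 5.

```python
# A minimal sketch (not from the slides) reproducing Questions 4 and 5.
import numpy as np

# Columns are students, rows are the variables math and science (a d x n matrix).
scores = np.array([[92.0, 55.0, 100.0],
                   [68.0, 30.0,  78.0]])

xbar = scores.mean(axis=1)                 # sample mean
S = np.cov(scores, ddof=1)                 # sample covariance, divides by n-1
print(xbar)      # approximately [82.33, 58.67]
print(S)         # approximately [[576.33, 607.67], [607.67, 641.33]]
print(S[0, 1])   # s12, approximately 607.67

# Question 5: projection of the data matrix onto a = [-45, 45]^T.
a = np.array([-45.0, 45.0])
print(a @ scores)                          # [-1080. -1125.  -990.]
```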
Gaussian Random Vectors
The univariate normal probability density function f is
\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \tag{12}
\]
\[
X \sim N(\mu, \sigma^2) \tag{13}
\]
Equation 13 is shorthand for a random value from the univariate normal distribution with mean µ and variance σ^2.
Figure 3.1: Three normal pdfs based on 1000 random values each, with (µ, σ) equal to (0, 0.8), (−2, 1) and (3, 2), respectively.
Gaussian Random Vectors
The d-variate normal probability density function f is
\[
f(X) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2}\exp\left(-\tfrac{1}{2}(X - \mu)^T\Sigma^{-1}(X - \mu)\right) \tag{14}
\]
\[
X \sim N(\mu, \Sigma) \tag{15}
\]
Equation 15 is shorthand for a d-dimensional random vector from the d-variate normal distribution with mean µ and covariance matrix Σ.
Figure 3.2: Two-dimensional normal pdf with µ = [1, 2]^T and Σ = [0.25 0.3; 0.3 1].
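As a quick check of Equation 14, the sketch below evaluates the density directly and compares it with scipy.stats.multivariate_normal, using the mean and covariance matrix of Figure 3.2. It is our own illustration, it assumes NumPy and SciPy are installed, and the evaluation point x is an arbitrary choice.

```python
# A minimal sketch (not from the slides): Equation (14) evaluated by hand and
# checked against SciPy for the parameters of Figure 3.2.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, 2.0])
Sigma = np.array([[0.25, 0.3], [0.3, 1.0]])
x = np.array([1.5, 2.5])          # arbitrary evaluation point

d = len(mu)
diff = x - mu
f = ((2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
     * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)))   # Equation (14)

print(f)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))        # same value
```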
Gaussian Random Vectors
Result 1.2
Let X ∼ N(µ, Σ) be d-variate, and assume that Σ^{-1} exists.
1. Let X_Σ = Σ^{-1/2}(X − µ); then X_Σ ∼ N(0, I_{d×d}), where I_{d×d} is the d×d identity matrix.
2. Let X² = (X − µ)^T Σ^{-1}(X − µ); then X² ∼ χ²_d, the chi-squared distribution with d degrees of freedom.

Question 6: Let X ∼ N(µ, Σ) be 2-variate, where µ = [2, 3]^T,
\[
\Sigma = \begin{pmatrix} 4 & 0 \\ 0 & 16 \end{pmatrix} \quad\text{and}\quad \Sigma^{-1} = \begin{pmatrix} 0.25 & 0 \\ 0 & 0.0625 \end{pmatrix}.
\]
Verify Results 1.2.1 and 1.2.2.
Solution: (worked out numerically with simulated data and plots on the original slides; a sketch of one such verification is given below.)
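A possible way to carry out this verification is sketched below, under our own choices of random seed and variable names (the sample size 5000 matches the slides): the standardised data should have mean close to 0 and covariance close to the identity (Result 1.2.1), and the squared Mahalanobis distances should have mean close to d and variance close to 2d, as for a χ²_d distribution (Result 1.2.2).

```python
# A minimal sketch (not from the slides) checking Result 1.2 by simulation
# for the parameters of Question 6.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([2.0, 3.0])
Sigma = np.array([[4.0, 0.0], [0.0, 16.0]])
n, d = 5000, 2

X = rng.multivariate_normal(mu, Sigma, size=n)            # n x d sample

# Result 1.2.1: X_Sigma = Sigma^(-1/2) (X - mu) should be approx N(0, I).
Sigma_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(Sigma)))   # Sigma is diagonal here
X_std = (X - mu) @ Sigma_inv_sqrt
print(X_std.mean(axis=0))          # approx [0, 0]
print(np.cov(X_std.T))             # approx identity

# Result 1.2.2: X^2 = (X - mu)^T Sigma^{-1} (X - mu) should be chi-squared(d).
X2 = np.sum(((X - mu) @ np.linalg.inv(Sigma)) * (X - mu), axis=1)
print(X2.mean(), X2.var())         # approx d = 2 and 2d = 4
```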
Gaussian Random Vectors
Hence, the quantity X² is a scalar random variable which has, as in the one-dimensional case, a χ² distribution, but this time with d degrees of freedom.
Fix a dimension d ≥ 1. Let X_i ∼ N(µ, Σ) be independent d-dimensional random vectors for i = 1, ..., n, with sample mean X̄ and sample covariance matrix S.
We define Hotelling's T² by
\[
T^2 = n(\bar{X} - \mu)^T S^{-1}(\bar{X} - \mu) \tag{16}
\]
Question 7: Compute Hotelling's T² for the sample X of size 2×5000 from Question 6.
Solution: (computed numerically on the original slides; a sketch follows below.)
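A sketch of the computation is given below; it is ours, not the slides' code, and it simulates a 2×5000 sample with the parameters of Question 6 before applying Equations 16 and 18.

```python
# A minimal sketch (not from the slides): Hotelling's T^2 for a simulated
# sample with the parameters of Question 6, and the scaled F statistic.
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([2.0, 3.0])
Sigma = np.array([[4.0, 0.0], [0.0, 16.0]])
n, d = 5000, 2

X = rng.multivariate_normal(mu, Sigma, size=n)

xbar = X.mean(axis=0)
S = np.cov(X.T, ddof=1)                                   # sample covariance

diff = xbar - mu
T2 = n * diff @ np.linalg.solve(S, diff)                  # Equation (16)
F_stat = (n - d) / ((n - 1) * d) * T2                     # Equation (18)
print(T2, F_stat)   # both small when the data are drawn with the true mu
```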
Gaussian Random Vectors
Further, let Z_j ∼ N(0, Σ) for j = 1, ..., m be independent d-dimensional random vectors, and let
\[
W = \sum_{j=1}^{m} Z_j Z_j^T \tag{17}
\]
Gaussian Random Vectors
In Equation 17, W is the d×d random matrix generated by the Z_j.
W has the Wishart distribution W(m, Σ) with m degrees of freedom and covariance matrix Σ; m is the number of summands and Σ is the common d×d covariance matrix.

Result 1.3
Let X_i ∼ N(µ, Σ) be d-dimensional random vectors for i = 1, ..., n. Let S be the sample covariance matrix, and assume that S is invertible.
1. The sample mean X̄ satisfies X̄ ∼ N(µ, Σ/n).
2. For n observations X_i and their sample covariance matrix S there exist n−1 independent random vectors Z_j ∼ N(0, Σ) such that
\[
S = \frac{1}{n-1}\sum_{j=1}^{n-1} Z_j Z_j^T,
\]
so that (n−1)S has a W(n−1, Σ) Wishart distribution.
Gaussian Random Vectors
Result 1.3 (continued)
3. Assume that n > d. Let T² be given by Equation 16. It follows that
\[
\frac{n-d}{(n-1)d}\, T^2 \sim F_{d,\, n-d} \tag{18}
\]
the F distribution with d and n−d degrees of freedom.
That is, for random data of size n and dimension d from a Gaussian distribution, the scaled statistic (n−d)/[(n−1)d] · T² has an F distribution with d and n−d degrees of freedom.
Gaussian Random Vectors
Question 8: Let Z ∼ N(µ, Σ) be 2-variate, where µ = [0, 0]^T and
\[
\Sigma = \begin{pmatrix} 4 & 2 \\ 2 & 16 \end{pmatrix}.
\]
Compute the W and (n−1)S matrices, and plot the sample mean distribution.
Solution: (computed and plotted on the original slides; a sketch follows below.)
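One way to set up this experiment is sketched below. It is our own illustration: the number of summands m and the seed are arbitrary choices, W is built as in Equation 17, and (n−1)S is computed from an independent sample with the same covariance matrix so that the two can be compared.

```python
# A minimal sketch (not from the slides) for Question 8: W of Equation (17)
# versus (n-1)S for the same covariance matrix.
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[4.0, 2.0], [2.0, 16.0]])
m = 999                                     # degrees of freedom for W

Z = rng.multivariate_normal(np.zeros(2), Sigma, size=m)
W = Z.T @ Z                                 # W = sum_j Z_j Z_j^T ~ W(m, Sigma)

n = m + 1                                   # so that (n-1)S has n-1 = m d.o.f.
X = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
S = np.cov(X.T, ddof=1)

print(W / m)                                # approx Sigma
print((n - 1) * S / m)                      # approx Sigma as well
```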
Gaussian Random Vectors
The W matrix computed from the sample covariance matrix S closely matches the W matrix computed using the population mean µ.
Gaussian Random Vectors
Let X ∼ (µ, Σ) be d-dimensional. The multivariate normal probability density function f is
\[
f(X_i) = (2\pi)^{-d/2}\det(\Sigma)^{-1/2}\exp\left(-\tfrac{1}{2}(X_i - \mu)^T\Sigma^{-1}(X_i - \mu)\right) \tag{19}
\]
where det(Σ) is the determinant of Σ and X = [X_1, X_2, ..., X_n] is a sample of independent random vectors from the normal distribution with mean µ and covariance matrix Σ.
The normal or Gaussian likelihood (function) L is a function of the parameter θ of interest, conditional on the data:
\[
L(\theta\,|\,X) = (2\pi)^{-nd/2}\det(\Sigma)^{-n/2}\exp\left(-\tfrac{1}{2}\sum_{i=1}^{n}(X_i - \mu)^T\Sigma^{-1}(X_i - \mu)\right) \tag{20}
\]
The parameters of interest are the mean µ and the covariance matrix Σ, so θ = (µ, Σ).
Gaussian Random Vectors
The maximum likelihood estimator (MLE) of θ, denoted by θ̂, is
\[
\hat{\theta} = (\hat{\mu}, \hat{\Sigma}) \tag{21}
\]
\[
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X} \tag{22}
\]
\[
\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})^T = \frac{n-1}{n}\,S \tag{23}
\]
Here µ̂, X̄, Σ̂ and S are the estimated population mean, the sample mean, the estimated population covariance matrix and the sample covariance matrix, respectively.
Gaussian Random Vectors
Question 9: From a sample of size 5000 (Question 6), compute the maximum likelihood estimates of the population mean and population covariance matrix.
Solution: (computed numerically on the original slides; a sketch follows below.)
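A sketch of the computation is given below; it is ours rather than the slides' code, and it draws a fresh sample of size 5000 with the parameters of Question 6 before applying Equations 22 and 23.

```python
# A minimal sketch (not from the slides) for Question 9: MLEs of mu and Sigma.
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([2.0, 3.0])
Sigma = np.array([[4.0, 0.0], [0.0, 16.0]])
n = 5000

X = rng.multivariate_normal(mu, Sigma, size=n)

mu_hat = X.mean(axis=0)                     # Equation (22): MLE of mu
S = np.cov(X.T, ddof=1)                     # sample covariance matrix
Sigma_hat = (n - 1) / n * S                 # Equation (23): MLE of Sigma

print(mu_hat)       # close to [2, 3]
print(Sigma_hat)    # close to [[4, 0], [0, 16]]
```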
Marginal and Conditional Normal Distributions
Consider a normal random vector X = [X_1, X_2, ..., X_d]^T. Let X^[1] be the vector consisting of the first d_1 entries of X, and let X^[2] be the vector consisting of the remaining d_2 entries:
\[
X = \begin{pmatrix} X^{[1]} \\ X^{[2]} \end{pmatrix} \tag{24}
\]
For ι = 1, 2 we let µ_ι be the mean of X^[ι] and Σ_ι its covariance matrix.

Question 10: Let X ∼ N(µ, Σ) be 4-variate, where µ = [2, 3, 2, 3]^T,
\[
\Sigma = \begin{pmatrix} 4 & 1 & 1 & 1 \\ 1 & 4 & 1 & 1 \\ 1 & 1 & 4 & 1 \\ 1 & 1 & 1 & 4 \end{pmatrix} \quad\text{and}\quad
\Sigma^{-1} = \begin{pmatrix} 0.286 & -0.048 & -0.048 & -0.048 \\ -0.048 & 0.286 & -0.048 & -0.048 \\ -0.048 & -0.048 & 0.286 & -0.048 \\ -0.048 & -0.048 & -0.048 & 0.286 \end{pmatrix}.
\]
Compute µ_1 and Σ_1 of X^[1] and µ_2 and Σ_2 of X^[2], where d_1 and d_2 are both 2. Analyse all the properties from Results 1.4 and 1.5.
Marginal and Conditional Normal Distributions
Result 1.4
Assume that X^[1], X^[2] and X are given by Equation 24 for some d_1, d_2 < d such that d_1 + d_2 = d. Assume also that X ∼ N(µ, Σ).
1. For j = 1, ..., d, the jth variable X_j of X has the distribution N(µ_j, σ_j^2).
2. For ι = 1, 2, X^[ι] has the distribution N(µ_ι, Σ_ι).
3. The (between) covariance matrix cov(X^[1], X^[2]) of X^[1] and X^[2] is the d_1×d_2 submatrix Σ_12 of
\[
\Sigma = \begin{pmatrix} \Sigma_1 & \Sigma_{12} \\ \Sigma_{12}^T & \Sigma_2 \end{pmatrix} \tag{25}
\]
The marginal distributions of normal random vectors are therefore normal, with the means and covariance matrices of the corresponding parts of the original random vector.
Marginal and Conditional Normal Distributions
Result 1.5
Assume that X^[1], X^[2] and X are given by Equation 24 for some d_1, d_2 < d such that d_1 + d_2 = d. Assume also that X ∼ N(µ, Σ) and that Σ_1 and Σ_2 are invertible.
If X^[1] and X^[2] are independent, then the covariance matrix Σ_12 of X^[1] and X^[2] satisfies
\[
\Sigma_{12} = 0_{d_1\times d_2} \tag{26}
\]
Assume that Σ_12 ≠ 0_{d_1×d_2}. Put X^[2|1] = X^[2] − Σ_12^T Σ_1^{-1} X^[1]. Then X^[2|1] is a d_2-dimensional random vector which is independent of X^[1], and X^[2|1] ∼ N(µ_{2|1}, Σ_{2|1}) with
\[
\mu_{2|1} = \mu_2 - \Sigma_{12}^T\Sigma_1^{-1}\mu_1 \quad\text{and}\quad \Sigma_{2|1} = \Sigma_2 - \Sigma_{12}^T\Sigma_1^{-1}\Sigma_{12} \tag{27}
\]
Marginal and Conditional Normal Distributions
Result 1.5 (continued)
Let (X^[1] | X^[2]) be the conditional random vector X^[1] given X^[2]. Then (X^[1] | X^[2]) ∼ N(µ_{X1|X2}, Σ_{X1|X2}) with
\[
\mu_{X_1|X_2} = \mu_1 + \Sigma_{12}\Sigma_2^{-1}\big(X^{[2]} - \mu_2\big) \tag{28}
\]
\[
\Sigma_{X_1|X_2} = \Sigma_1 - \Sigma_{12}\Sigma_2^{-1}\Sigma_{12}^T \tag{29}
\]
The first property states that independence always implies uncorrelatedness, and that for the normal distribution the converse holds too. The second property shows how one can uncorrelate the vectors X^[1] and X^[2]. The last property describes the adjustments that are needed when the sub-vectors have a non-zero covariance matrix.
Marginal and Conditional Normal Distributions
Q.10 solution: (worked out numerically on the original slides; a sketch of the computation of the marginal and conditional parameters follows below.)
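A sketch of the computation is given below. It is our own illustration: the partition uses d1 = d2 = 2 as in Question 10, and the value chosen for X^[2] in the conditional mean of Equation 28 is arbitrary.

```python
# A minimal sketch (not from the slides) for Question 10: partition mu and
# Sigma into the blocks of Equation (25) and apply Results 1.4 and 1.5.
import numpy as np

mu = np.array([2.0, 3.0, 2.0, 3.0])
Sigma = 3.0 * np.eye(4) + np.ones((4, 4))        # the matrix of Question 10
d1 = 2

mu1, mu2 = mu[:d1], mu[d1:]
Sigma1, Sigma2 = Sigma[:d1, :d1], Sigma[d1:, d1:]
Sigma12 = Sigma[:d1, d1:]                        # between-covariance block

print(mu1, Sigma1)                               # marginal parameters of X^[1]
print(mu2, Sigma2)                               # marginal parameters of X^[2]

# Result 1.5: parameters of the conditional distribution (X^[1] | X^[2] = x2).
x2 = np.array([1.0, 4.0])                        # illustrative value of X^[2]
mu_cond = mu1 + Sigma12 @ np.linalg.solve(Sigma2, x2 - mu2)        # Eq. (28)
Sigma_cond = Sigma1 - Sigma12 @ np.linalg.solve(Sigma2, Sigma12.T) # Eq. (29)
print(mu_cond)
print(Sigma_cond)
```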
Summary
Here, we have discussed
Different types of multivariate and high-dimensional problems.
Three-dimensional visualisation of features from the HIV and Iris
flower data-sets.
Data visualisation using vertical parallel coordinate plots.
Data visualisation using horizontal parallel coordinate plots.
Differentiation between population cases and sample cases.
Population mean, population covariance matrix, sample mean and
sample covariance matrix of multivariate random vectors.
Population mean, population covariance matrix, sample mean and
sample covariance matrix of Gaussian random vectors.
Parameters and properties of marginal and conditional normal
distributions.
For Further Reading I
I. Koch. Analysis of Multivariate and High-Dimensional Data (Vol. 32). Cambridge University Press, 2014.
F. Emdad and S. R. Zekavat. High Dimensional Data Analysis: Overview, Analysis and Applications. VDM Verlag, 2008.
More Related Content

Similar to Multidimensional Data

The Inquisitive Data Scientist: Facilitating Well-Informed Data Science throu...
The Inquisitive Data Scientist: Facilitating Well-Informed Data Science throu...The Inquisitive Data Scientist: Facilitating Well-Informed Data Science throu...
The Inquisitive Data Scientist: Facilitating Well-Informed Data Science throu...
Cagatay Turkay
 
Business statistics what and why
Business statistics what and whyBusiness statistics what and why
Business statistics what and whydibasharmin
 
Graphs in pharmaceutical biostatistics
Graphs in pharmaceutical biostatisticsGraphs in pharmaceutical biostatistics
Graphs in pharmaceutical biostatistics
VandanaGupta127
 
BRM_Data Analysis, Interpretation and Reporting Part II.ppt
BRM_Data Analysis, Interpretation and Reporting Part II.pptBRM_Data Analysis, Interpretation and Reporting Part II.ppt
BRM_Data Analysis, Interpretation and Reporting Part II.ppt
AbdifatahAhmedHurre
 
La statistique et le machine learning pour l'intégration de données de la bio...
La statistique et le machine learning pour l'intégration de données de la bio...La statistique et le machine learning pour l'intégration de données de la bio...
La statistique et le machine learning pour l'intégration de données de la bio...
tuxette
 
STATS 101 WK7 NOTE.pptx
STATS 101 WK7 NOTE.pptxSTATS 101 WK7 NOTE.pptx
STATS 101 WK7 NOTE.pptx
MulbahKromah
 
A Critique Of Anscombe S Work On Statistical Analysis Using Graphs (2013 Home...
A Critique Of Anscombe S Work On Statistical Analysis Using Graphs (2013 Home...A Critique Of Anscombe S Work On Statistical Analysis Using Graphs (2013 Home...
A Critique Of Anscombe S Work On Statistical Analysis Using Graphs (2013 Home...
Simar Neasy
 
Image investigation using higher moment statistics and edge detection for rec...
Image investigation using higher moment statistics and edge detection for rec...Image investigation using higher moment statistics and edge detection for rec...
Image investigation using higher moment statistics and edge detection for rec...
journalBEEI
 
Engineering Statistics
Engineering Statistics Engineering Statistics
Engineering Statistics
Bahzad5
 
Data structure
Data   structureData   structure
Lect 5 data models-gis
Lect 5 data models-gisLect 5 data models-gis
Lect 5 data models-gis
Rehana Jamal
 
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and AI...
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and  AI...Artificial Intelligence - Data Analysis, Creative & Critical Thinking and  AI...
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and AI...
deboshreechatterjee2
 
Chapter 2. Know Your Data.ppt
Chapter 2. Know Your Data.pptChapter 2. Know Your Data.ppt
Chapter 2. Know Your Data.ppt
Subrata Kumer Paul
 
Data Mining Exploring DataLecture Notes for Chapter 3
Data Mining Exploring DataLecture Notes for Chapter 3Data Mining Exploring DataLecture Notes for Chapter 3
Data Mining Exploring DataLecture Notes for Chapter 3
OllieShoresna
 
Math Stats Probability
Math Stats ProbabilityMath Stats Probability
Math Stats Probability
Mark Brahier
 
02Data.ppt
02Data.ppt02Data.ppt
02Data.ppt
AlwinHilton
 
02Data.ppt
02Data.ppt02Data.ppt
02Data.ppt
TanviBhasin2
 
Data science
Data scienceData science
Data science
Rakibul Hasan Pranto
 
statistics - Populations and Samples.pdf
statistics - Populations and Samples.pdfstatistics - Populations and Samples.pdf
statistics - Populations and Samples.pdf
kobra22
 
General Statistics boa
General Statistics boaGeneral Statistics boa
General Statistics boaraileeanne
 

Similar to Multidimensional Data (20)

The Inquisitive Data Scientist: Facilitating Well-Informed Data Science throu...
The Inquisitive Data Scientist: Facilitating Well-Informed Data Science throu...The Inquisitive Data Scientist: Facilitating Well-Informed Data Science throu...
The Inquisitive Data Scientist: Facilitating Well-Informed Data Science throu...
 
Business statistics what and why
Business statistics what and whyBusiness statistics what and why
Business statistics what and why
 
Graphs in pharmaceutical biostatistics
Graphs in pharmaceutical biostatisticsGraphs in pharmaceutical biostatistics
Graphs in pharmaceutical biostatistics
 
BRM_Data Analysis, Interpretation and Reporting Part II.ppt
BRM_Data Analysis, Interpretation and Reporting Part II.pptBRM_Data Analysis, Interpretation and Reporting Part II.ppt
BRM_Data Analysis, Interpretation and Reporting Part II.ppt
 
La statistique et le machine learning pour l'intégration de données de la bio...
La statistique et le machine learning pour l'intégration de données de la bio...La statistique et le machine learning pour l'intégration de données de la bio...
La statistique et le machine learning pour l'intégration de données de la bio...
 
STATS 101 WK7 NOTE.pptx
STATS 101 WK7 NOTE.pptxSTATS 101 WK7 NOTE.pptx
STATS 101 WK7 NOTE.pptx
 
A Critique Of Anscombe S Work On Statistical Analysis Using Graphs (2013 Home...
A Critique Of Anscombe S Work On Statistical Analysis Using Graphs (2013 Home...A Critique Of Anscombe S Work On Statistical Analysis Using Graphs (2013 Home...
A Critique Of Anscombe S Work On Statistical Analysis Using Graphs (2013 Home...
 
Image investigation using higher moment statistics and edge detection for rec...
Image investigation using higher moment statistics and edge detection for rec...Image investigation using higher moment statistics and edge detection for rec...
Image investigation using higher moment statistics and edge detection for rec...
 
Engineering Statistics
Engineering Statistics Engineering Statistics
Engineering Statistics
 
Data structure
Data   structureData   structure
Data structure
 
Lect 5 data models-gis
Lect 5 data models-gisLect 5 data models-gis
Lect 5 data models-gis
 
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and AI...
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and  AI...Artificial Intelligence - Data Analysis, Creative & Critical Thinking and  AI...
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and AI...
 
Chapter 2. Know Your Data.ppt
Chapter 2. Know Your Data.pptChapter 2. Know Your Data.ppt
Chapter 2. Know Your Data.ppt
 
Data Mining Exploring DataLecture Notes for Chapter 3
Data Mining Exploring DataLecture Notes for Chapter 3Data Mining Exploring DataLecture Notes for Chapter 3
Data Mining Exploring DataLecture Notes for Chapter 3
 
Math Stats Probability
Math Stats ProbabilityMath Stats Probability
Math Stats Probability
 
02Data.ppt
02Data.ppt02Data.ppt
02Data.ppt
 
02Data.ppt
02Data.ppt02Data.ppt
02Data.ppt
 
Data science
Data scienceData science
Data science
 
statistics - Populations and Samples.pdf
statistics - Populations and Samples.pdfstatistics - Populations and Samples.pdf
statistics - Populations and Samples.pdf
 
General Statistics boa
General Statistics boaGeneral Statistics boa
General Statistics boa
 

More from Ashutosh Satapathy

Introduction to Data Structures .
Introduction to Data Structures        .Introduction to Data Structures        .
Introduction to Data Structures .
Ashutosh Satapathy
 
Searching and Sorting Algorithms
Searching and Sorting AlgorithmsSearching and Sorting Algorithms
Searching and Sorting Algorithms
Ashutosh Satapathy
 
Time and Space Complexity
Time and Space ComplexityTime and Space Complexity
Time and Space Complexity
Ashutosh Satapathy
 
Algorithm Specification and Data Abstraction
Algorithm Specification and Data Abstraction Algorithm Specification and Data Abstraction
Algorithm Specification and Data Abstraction
Ashutosh Satapathy
 
ORAM
ORAMORAM
ObliVM
ObliVMObliVM
Secure Multi-Party Computation
Secure Multi-Party ComputationSecure Multi-Party Computation
Secure Multi-Party Computation
Ashutosh Satapathy
 

More from Ashutosh Satapathy (7)

Introduction to Data Structures .
Introduction to Data Structures        .Introduction to Data Structures        .
Introduction to Data Structures .
 
Searching and Sorting Algorithms
Searching and Sorting AlgorithmsSearching and Sorting Algorithms
Searching and Sorting Algorithms
 
Time and Space Complexity
Time and Space ComplexityTime and Space Complexity
Time and Space Complexity
 
Algorithm Specification and Data Abstraction
Algorithm Specification and Data Abstraction Algorithm Specification and Data Abstraction
Algorithm Specification and Data Abstraction
 
ORAM
ORAMORAM
ORAM
 
ObliVM
ObliVMObliVM
ObliVM
 
Secure Multi-Party Computation
Secure Multi-Party ComputationSecure Multi-Party Computation
Secure Multi-Party Computation
 

Recently uploaded

Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Subhajit Sahu
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
eddie19851
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 

Recently uploaded (20)

Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 

Multidimensional Data

  • 1. Multidimensional Data Dr. Ashutosh Satapathy Assistant Professor, Department of CSE VR Siddhartha Engineering College Kanuru, Vijayawada October 19, 2022 Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 1 / 86
  • 2. Outline 1 Multivariate and High-Dimensional Problems 2 Visualisation Three-Dimensional Visualisation Parallel Coordinate Plots 3 Multivariate Random Vectors and Data Population Case Sample Case Multivariate Random Vectors Gaussian Random Vectors Marginal and Conditional Normal Distributions Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 2 / 86
  • 3. Multivariate and High-Dimensional Problems Early in the twentieth century, scientists such as Pearson (1901), Hotelling (1933) and Fisher (1936) developed methods for analysing multivariate data in order to 1 Understand the structure in the data and summarise it in simpler ways. 2 Understand the relationship of one part of the data to another part. 3 Make decisions and inferences based on the data. The early methods these scientists developed are linear; as time moved on, more complex methods were developed. These data sets essential structure can often be obscured by noise. Reduce the original data in such a way that informative and interesting structure in the data is preserved while noisy, irrelevant or purely random variables, dimensions or features are removed, as these can adversely affect the analysis. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 3 / 86
  • 4. Multivariate and High-Dimensional Problems Traditionally one assumes that the dimension d is small compared to the sample size n. Many recent data sets do not fit into this framework; we encounter the following problems. Data whose dimension is comparable to the sample size, and both are large. High-dimension and low sample size data whose dimension d vastly exceeds the sample size n, so d ≥ n. Functional data whose observations are functions. High-dimensional and functional data pose special challenges. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 4 / 86
  • 5. Visualisation Before we analyse a set of data, it is important to look at it. Often we get useful clues such as skewness, bi- or multi-modality, outliers, or distinct groupings. Graphical displays are exploratory data-analysis tools, which, if appropriately used, can enhance our understanding of data. Visual clues are easier to understand and interpret than numbers alone, and the information you can get from graphical displays can help you understand answers that are based on numbers. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 5 / 86
  • 6. Outline 1 Multivariate and High-Dimensional Problems 2 Visualisation Three-Dimensional Visualisation Parallel Coordinate Plots 3 Multivariate Random Vectors and Data Population Case Sample Case Multivariate Random Vectors Gaussian Random Vectors Marginal and Conditional Normal Distributions Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 6 / 86
  • 7. Three-Dimensional Visualisation Two-dimensional scatter-plots are a natural – though limited – way of looking at data with three or more variables. As the number of variables, and therefore the dimension increases. We can, of course, still display three of the d dimensions in scatter-plots, but it is less clear how one can look at more than three dimensions in a single plot. Figure 2.1 display the 10,000 observations and the three variables CD3, CD8 and CD4 of the five-dimensional HIV+ and HIV- data sets. The data-sets contain measurements of blood cells relevant to HIV. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 7 / 86
  • 8. Three-Dimensional Visualisation Figure 2.1: HIV+ data (left) and HIV- data (right) of variables CD3, CD8 and CD4. There are differences between the point clouds in the two figures, and an important task is to exhibit and quantify the differences. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 8 / 86
  • 9. Three-Dimensional Visualisation Projecting the Figure 2.1 data onto a number of orthogonal directions and displaying the lower-dimensional projected data in Figure 2.2. Figure 2.2: Orthogonal projections of the HIV+ data (left) and the HIV- data(right). Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 9 / 86
  • 10. Three-Dimensional Visualisation We can see a smaller fourth cluster in the top right corner of the HIV- data, which seems to have almost disappeared in the HIV+ data in the left panel. Many of the methods we explore use projections: Principal Component Analysis, Factor Analysis, Multidimensional Scaling, Independent Component Analysis and Projection Pursuit. In each case the projections focus on different aspects and properties of the data. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 10 / 86
  • 11. Three-Dimensional Visualisation Figure 2.3: Three different species of Iris flowers. We display the four variables of Fisher’s iris data – sepal length, sepal width, petal length and petal width – in a sequence of three- dimensional scatter-plots. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 11 / 86
  • 12. Three-Dimensional Visualisation Figure 2.4: Features 1, 2 and 3 (top left), features 1, 2 and 4 (top right), features 1, 3 and 4 (bottom left) and features 2, 3 and 4 (bottom right). Red refers to Setosa, green to Versicolor and black to Virginica. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 12 / 86
  • 13. Outline 1 Multivariate and High-Dimensional Problems 2 Visualisation Three-Dimensional Visualisation Parallel Coordinate Plots 3 Multivariate Random Vectors and Data Population Case Sample Case Multivariate Random Vectors Gaussian Random Vectors Marginal and Conditional Normal Distributions Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 13 / 86
  • 14. Parallel Coordinate Plots As the dimension grows, three-dimensional scatter-plots become less relevant, unless we know that only some variables are important. An alternative, which allows us to see all variables at once, is to present the data in the form of parallel coordinate plots. The idea is to present the data as two-dimensional graphs. In a vertical parallel coordinate plot the variable numbers are represented as values on the y-axis. For a vector X = [X1, ..., Xd]T we represent the first variable X1 by the point (X1, 1) and the jth variable Xj by (Xj, j). Finally, we connect the d points by a line which goes from (X1, 1) to (X2, 2) and so on to (Xd, d). We apply the same rule to the next d-dimensional feature vector. Figure 2.5 shows a vertical parallel coordinate plot for Fisher’s iris data. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 14 / 86
  • 15. Parallel Coordinate Plots Figure 2.5: Iris data with variables represented on the y-axis and separate colours for the three species. Red refers to the observations of Setosa, green to those of Versicolor and black to those of Virginica. Unlike the previous Figure 2.4, Figure 2.5 tells us that dimension 3 separates the two groups most strongly. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 15 / 86
  • 16. Parallel Coordinate Plots Instead of the three colours of Figure 2.5, a different colour can be used for each observation, as in Figure 2.6. In a horizontal parallel coordinate plot, the x-axis represents the variable numbers 1, ..., d. For a feature vector X = [X1 ··· Xd]T, the first variable gives rise to the point (1, X1) and the jth variable Xj to (j, Xj). The d points are connected by a line, starting with (1, X1), then (2, X2), until we reach (d, Xd). Because the variables are laid out along the x-axis, horizontal parallel coordinate plots are often used. The differently coloured lines make it easier to trace particular observations in Figure 2.6; a sketch of how such a plot can be produced follows below. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 16 / 86
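The slides do not include the code behind Figures 2.5 and 2.6, so the following is only a minimal Python sketch (not the author's code) of a parallel coordinate plot for Fisher's iris data; it assumes pandas, matplotlib and scikit-learn are available and uses the built-in sklearn copy of the iris data as a stand-in.

  # Parallel coordinate plot of Fisher's iris data (sketch; not the author's code).
  import matplotlib.pyplot as plt
  from pandas.plotting import parallel_coordinates
  from sklearn.datasets import load_iris

  iris = load_iris(as_frame=True)
  df = iris.frame.rename(columns={"target": "species"})
  df["species"] = df["species"].map(dict(enumerate(iris.target_names)))

  # Each observation becomes one polygonal line across the d = 4 variables.
  parallel_coordinates(df, "species", color=["red", "green", "black"])
  plt.xlabel("variable")
  plt.ylabel("measurement (cm)")
  plt.show()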
  • 17. Parallel Coordinate Plots Figure 2.6: Parallel coordinate view of the illicit drug market data. Figure 2.6 shows the 66 monthly observations on 15 features or variables of the illicit drug market data. Each observation (month) is displayed in a different colour. Looking at variable 5, heroin overdose, the question arises whether there could be two groups of observations corresponding to the high and low values of this variable. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 17 / 86
  • 18. Outline 1 Multivariate and High-Dimensional Problems 2 Visualisation Three-Dimensional Visualisation Parallel Coordinate Plots 3 Multivariate Random Vectors and Data Population Case Sample Case Multivariate Random Vectors Gaussian Random Vectors Marginal and Conditional Normal Distributions Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 18 / 86
  • 19. Population Case In data science, a population is the entire set of items from which you draw data for a statistical study; it can be a group of individuals, objects, events, organisations, and so on. In everyday usage, population refers to the people who live in a particular area at a specific time, but in data science it refers to all the data relevant to your study of interest. You use populations to draw conclusions. An example of a population would be the entire student body at a school, with the question of interest being the percentage of students who speak English fluently. If you had to collect the same data for the entire country of India, it would be very difficult to draw reliable conclusions: geographical and accessibility constraints would make the data biased towards certain regions or groups. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 19 / 86
  • 20. Outline 1 Multivariate and High-Dimensional Problems 2 Visualisation Three-Dimensional Visualisation Parallel Coordinate Plots 3 Multivariate Random Vectors and Data Population Case Sample Case Multivariate Random Vectors Gaussian Random Vectors Marginal and Conditional Normal Distributions Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 20 / 86
  • 21. Sample Case A sample is a smaller, more manageable representation of a larger group: a subset of a larger population that retains the characteristics of that population. A sample is used when the population size is too large for all members or observations to be included in the study. Ideally, the sample is an unbiased subset of the population that best represents the whole data. The process of collecting data from a small subsection of the population and then using it to generalise over the entire set is called sampling. Samples are used when the population is too large, or effectively unlimited in size, for data collected on every member to be practical or reliable. A sample should generally be unbiased, should reflect all the variation present in the population, and should typically be chosen at random. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 21 / 86
  • 22. Outline 1 Multivariate and High-Dimensional Problems 2 Visualisation Three-Dimensional Visualisation Parallel Coordinate Plots 3 Multivariate Random Vectors and Data Population Case Sample Case Multivariate Random Vectors Gaussian Random Vectors Marginal and Conditional Normal Distributions Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 22 / 86
  • 23. Multivariate Random Vectors Random vectors are vector-valued functions defined on a sample space. A vector-valued function is a mathematical function of one or more variables whose range is a set of multidimensional vectors or infinite-dimensional vectors. In Cartesian 3-space it can be represented as v(t) = <f(t), g(t), h(t)>, where v(t) is the vector function and f(t), g(t) and h(t) are its coordinate functions. We refer to a collection of random vectors as the data or the random sample. Specific feature values are measured for each of the random vectors in the collection; we call these values the realised or observed values of the data, or simply the observed data. The observed values are no longer random. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 23 / 86
  • 24. The Population Case Let $X = [X_1\ X_2\ \cdots\ X_d]^T$ (1) be a random vector from a distribution $F: \mathbb{R}^d \to [0, 1]$. The individual $X_j$, with $j \le d$, are random variables, also called the variables, components or entries of X. X is d-dimensional or d-variate. X has a finite d-dimensional mean or expected value EX and a finite d×d covariance matrix var(X): $\mu = EX$, $\Sigma = \operatorname{var}(X) = E[(X - \mu)(X - \mu)^T]$ (2). The µ and Σ are $\mu = [\mu_1\ \mu_2\ \cdots\ \mu_d]^T$, $\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{pmatrix}$ (3). Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 24 / 86
  • 25. The Population Case Here $\sigma_j^2 = \operatorname{var}(X_j)$ and $\sigma_{jk} = \operatorname{cov}(X_j, X_k)$; we also write $\sigma_{jj}$ for the diagonal elements $\sigma_j^2$ of Σ. $X \sim (\mu, \Sigma)$ (4). Equation 4 is shorthand for a random vector X which has mean µ and covariance matrix Σ. If X is a d-dimensional random vector and A is a d × k matrix, for some k ≥ 1, then $A^T X$ is a k-dimensional random vector. Result 1.1 Let $X \sim (\mu, \Sigma)$ be a d-variate random vector. Let A and B be matrices of size d × k and d × l, respectively. The mean and covariance matrix of the k-variate random vector $A^T X$ are $A^T X \sim (A^T \mu, A^T \Sigma A)$. The random vectors $A^T X$ and $B^T X$ are uncorrelated if and only if $A^T \Sigma B = 0_{k \times l}$ (the k × l matrix of zeros). Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 25 / 86
  • 26. The Population Case Question 1: Suppose you have a set of n = 5 data items, representing 5 insects, where each data item has a height (X), width (Y) and speed (Z) (therefore d = 3). Table 3.1: Three features of five different insects, given as (Height (cm), Width (cm), Speed (m/s)): I1 = (0.64, 0.58, 0.29); I2 = (0.66, 0.57, 0.33); I3 = (0.68, 0.59, 0.37); I4 = (0.69, 0.66, 0.46); I5 = (0.73, 0.60, 0.55). Solution: mean $\mu = \left[\frac{0.64+0.66+0.68+0.69+0.73}{5},\ \frac{0.58+0.57+0.59+0.66+0.60}{5},\ \frac{0.29+0.33+0.37+0.46+0.55}{5}\right]^T = [0.68, 0.60, 0.40]^T$. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 26 / 86
  • 27. The Population Case $I_1 - \mu = [-0.04, -0.02, -0.11]^T$, $(I_1-\mu)(I_1-\mu)^T = \begin{pmatrix} 0.0016 & 0.0008 & 0.0044 \\ 0.0008 & 0.0004 & 0.0022 \\ 0.0044 & 0.0022 & 0.0121 \end{pmatrix}$; $I_2 - \mu = [-0.02, -0.03, -0.07]^T$, $(I_2-\mu)(I_2-\mu)^T = \begin{pmatrix} 0.0004 & 0.0006 & 0.0014 \\ 0.0006 & 0.0009 & 0.0021 \\ 0.0014 & 0.0021 & 0.0049 \end{pmatrix}$; $I_3 - \mu = [0, -0.01, -0.03]^T$, $(I_3-\mu)(I_3-\mu)^T = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0.0001 & 0.0003 \\ 0 & 0.0003 & 0.0009 \end{pmatrix}$; $I_4 - \mu = [0.01, 0.06, 0.06]^T$, $(I_4-\mu)(I_4-\mu)^T = \begin{pmatrix} 0.0001 & 0.0006 & 0.0006 \\ 0.0006 & 0.0036 & 0.0036 \\ 0.0006 & 0.0036 & 0.0036 \end{pmatrix}$; $I_5 - \mu = [0.05, 0, 0.15]^T$, $(I_5-\mu)(I_5-\mu)^T = \begin{pmatrix} 0.0025 & 0 & 0.0075 \\ 0 & 0 & 0 \\ 0.0075 & 0 & 0.0225 \end{pmatrix}$. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 27 / 86
  • 28. The Population Case Covariance matrix $\Sigma = \frac{1}{n}\sum_{i=1}^{n}(I_i - \mu)(I_i - \mu)^T = \frac{1}{5}\begin{pmatrix} 0.0046 & 0.0020 & 0.0139 \\ 0.0020 & 0.0050 & 0.0082 \\ 0.0139 & 0.0082 & 0.0440 \end{pmatrix} = \begin{pmatrix} 0.00092 & 0.0004 & 0.00278 \\ 0.0004 & 0.001 & 0.00164 \\ 0.00278 & 0.00164 & 0.0088 \end{pmatrix}$. Definitions. Mean: the average or the most common value in a collection of numbers. Variance: the expectation of the squared deviation of a random variable from its mean. Covariance: a measure of the relationship between two random variables and the extent to which they change together. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 28 / 86
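The arithmetic of Question 1 can be checked with a few lines of numpy; this sketch is an addition, not part of the original slides, and the division by n reflects the population covariance matrix used here.

  import numpy as np

  # Rows = insects I1..I5, columns = height, width, speed (Table 3.1).
  X = np.array([[0.64, 0.58, 0.29],
                [0.66, 0.57, 0.33],
                [0.68, 0.59, 0.37],
                [0.69, 0.66, 0.46],
                [0.73, 0.60, 0.55]])

  mu = X.mean(axis=0)                       # [0.68, 0.60, 0.40]
  centred = X - mu
  Sigma = centred.T @ centred / X.shape[0]  # divide by n: population covariance
  print(mu)
  print(Sigma)                              # matches the 3 x 3 matrix above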
  • 29. The Population Case Question 2: Verify Result 1.1a, $\mu_{A^TX} = A^T\mu_X$ and $\Sigma_{A^TX} = A^T\Sigma_X A$, for $A = \begin{pmatrix} 0.02 & 0.01 \\ 0.01 & 0.03 \\ 0.03 & 0.02 \end{pmatrix}$ and $X = \begin{pmatrix} 0.64 & 0.66 & 0.68 & 0.69 & 0.73 \\ 0.58 & 0.57 & 0.59 & 0.66 & 0.60 \\ 0.29 & 0.33 & 0.37 & 0.46 & 0.55 \end{pmatrix}$. Solution: $A^TX = \begin{pmatrix} 0.0273 & 0.0288 & 0.0306 & 0.0342 & 0.0371 \\ 0.0296 & 0.0303 & 0.0319 & 0.0359 & 0.0363 \end{pmatrix}$, $\mu_{A^TX} = \begin{pmatrix} 0.0316 \\ 0.0328 \end{pmatrix}$ and $A^T\mu_X = \begin{pmatrix} 0.02 & 0.01 & 0.03 \\ 0.01 & 0.03 & 0.02 \end{pmatrix}\begin{pmatrix} 0.68 \\ 0.60 \\ 0.40 \end{pmatrix} = \begin{pmatrix} 0.0316 \\ 0.0328 \end{pmatrix}$. Hence $\mu_{A^TX} = A^T\mu_X$. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 29 / 86
  • 30. The Population Case $A^T\Sigma_X A = \begin{pmatrix} 0.02 & 0.01 & 0.03 \\ 0.01 & 0.03 & 0.02 \end{pmatrix}\begin{pmatrix} 0.00092 & 0.0004 & 0.00278 \\ 0.0004 & 0.001 & 0.00164 \\ 0.00278 & 0.00164 & 0.0088 \end{pmatrix}\begin{pmatrix} 0.02 & 0.01 \\ 0.01 & 0.03 \\ 0.03 & 0.02 \end{pmatrix} = \begin{pmatrix} 0.000012868 & 0.000009794 \\ 0.000009794 & 0.000007832 \end{pmatrix}$. $(A^TX)_1 - \mu_{A^TX} = [-0.0043, -0.0032]^T$, $((A^TX)_1 - \mu_{A^TX})((A^TX)_1 - \mu_{A^TX})^T = \begin{pmatrix} 0.00001849 & 0.00001376 \\ 0.00001376 & 0.00001024 \end{pmatrix}$; $(A^TX)_2 - \mu_{A^TX} = [-0.0028, -0.0025]^T$, $((A^TX)_2 - \mu_{A^TX})((A^TX)_2 - \mu_{A^TX})^T = \begin{pmatrix} 0.00000784 & 0.000007 \\ 0.000007 & 0.00000625 \end{pmatrix}$; $(A^TX)_3 - \mu_{A^TX} = [-0.001, -0.0009]^T$, $((A^TX)_3 - \mu_{A^TX})((A^TX)_3 - \mu_{A^TX})^T = \begin{pmatrix} 0.000001 & 0.0000009 \\ 0.0000009 & 0.00000081 \end{pmatrix}$. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 30 / 86
  • 31. The Population Case $(A^TX)_4 - \mu_{A^TX} = [0.0026, 0.0031]^T$, $((A^TX)_4 - \mu_{A^TX})((A^TX)_4 - \mu_{A^TX})^T = \begin{pmatrix} 0.00000676 & 0.00000806 \\ 0.00000806 & 0.00000961 \end{pmatrix}$; $(A^TX)_5 - \mu_{A^TX} = [0.0055, 0.0035]^T$, $((A^TX)_5 - \mu_{A^TX})((A^TX)_5 - \mu_{A^TX})^T = \begin{pmatrix} 0.00003025 & 0.00001925 \\ 0.00001925 & 0.00001225 \end{pmatrix}$. Covariance matrix $\Sigma_{A^TX} = \frac{1}{n}\sum_{i=1}^{n}((A^TX)_i - \mu_{A^TX})((A^TX)_i - \mu_{A^TX})^T = \frac{1}{5}\begin{pmatrix} 0.00006434 & 0.00004897 \\ 0.00004897 & 0.00003916 \end{pmatrix} = \begin{pmatrix} 0.000012868 & 0.000009794 \\ 0.000009794 & 0.000007832 \end{pmatrix}$. Hence $\Sigma_{A^TX} = A^T\Sigma_X A$. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 31 / 86
  • 32. The Population Case Question 3: Verify Result 1.1b, i.e. that $A^TX$ and $B^TX$ are correlated, for $A = B = \begin{pmatrix} 0.02 & 0.01 \\ 0.01 & 0.03 \\ 0.03 & 0.02 \end{pmatrix}$ and $X = \begin{pmatrix} 0.64 & 0.66 & 0.68 & 0.69 & 0.73 \\ 0.58 & 0.57 & 0.59 & 0.66 & 0.60 \\ 0.29 & 0.33 & 0.37 & 0.46 & 0.55 \end{pmatrix}$. Solution: $\Sigma = \begin{pmatrix} 0.00092 & 0.0004 & 0.00278 \\ 0.0004 & 0.001 & 0.00164 \\ 0.00278 & 0.00164 & 0.0088 \end{pmatrix}$ and $A^T\Sigma B = \begin{pmatrix} 0.02 & 0.01 & 0.03 \\ 0.01 & 0.03 & 0.02 \end{pmatrix}\,\Sigma\,\begin{pmatrix} 0.02 & 0.01 \\ 0.01 & 0.03 \\ 0.03 & 0.02 \end{pmatrix} = \begin{pmatrix} 0.000012868 & 0.000009794 \\ 0.000009794 & 0.000007832 \end{pmatrix} \ne 0_{2\times 2}$. Hence $A^TX$ and $B^TX$ are correlated with each other. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 32 / 86
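The following numpy sketch (again an addition, not from the slides) verifies Result 1.1 for the A, B and X of Questions 2 and 3: the projected data A^T X has mean A^T µ and covariance A^T Σ A, and the non-zero A^T Σ B confirms that A^T X and B^T X are correlated.

  import numpy as np

  X = np.array([[0.64, 0.66, 0.68, 0.69, 0.73],    # d x n data matrix
                [0.58, 0.57, 0.59, 0.66, 0.60],
                [0.29, 0.33, 0.37, 0.46, 0.55]])
  A = np.array([[0.02, 0.01],
                [0.01, 0.03],
                [0.03, 0.02]])
  B = A.copy()                                     # Question 3 takes B equal to A

  mu = X.mean(axis=1)
  Sigma = np.cov(X, bias=True)                     # population-style: divide by n

  Y = A.T @ X                                      # projected 2 x 5 data
  print(np.allclose(Y.mean(axis=1), A.T @ mu))              # True: mean is A^T mu
  print(np.allclose(np.cov(Y, bias=True), A.T @ Sigma @ A)) # True: cov is A^T Sigma A
  print(A.T @ Sigma @ B)                           # nonzero, so A^T X and B^T X are correlated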
  • 33. The Sample Case Let $X_1, \dots, X_n$ be d-dimensional random vectors. We assume that the $X_i$ are independent and from the same distribution $F: \mathbb{R}^d \to [0, 1]$ with finite mean µ and covariance matrix Σ. We omit reference to F when knowledge of the distribution is not required. In statistics one often identifies a random vector with its observed values and writes $X_i = x_i$. We explore properties of random samples but only encounter observed values of random vectors. For this reason we write $X = [X_1\ X_2\ \cdots\ X_n]$ (5) for the d × n matrix whose columns are the independent random vectors $X_i$, and call this collection a random sample or data. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 33 / 86
  • 34. The Sample Case $X = \begin{pmatrix} X_{11} & X_{21} & \cdots & X_{n1} \\ X_{12} & X_{22} & \cdots & X_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ X_{1d} & X_{2d} & \cdots & X_{nd} \end{pmatrix} = \begin{pmatrix} X_{\bullet 1} \\ X_{\bullet 2} \\ \vdots \\ X_{\bullet d} \end{pmatrix}$ (6). The ith column of X is the ith random vector $X_i$, and the jth row $X_{\bullet j}$ is the jth variable across all n random vectors; in $X_{ij}$, the i refers to the ith vector $X_i$ and the j to the jth variable. For data, the mean µ and covariance matrix Σ are usually not known; instead, we work with the sample mean $\bar{X}$ and the sample covariance matrix S. This is represented by $X \sim \operatorname{Sam}(\bar{X}, S)$ (7). The sample mean and sample covariance matrix depend on the sample size n. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 34 / 86
  • 35. The Sample Case $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, $S = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})^T$ (8). Definitions of the sample covariance matrix in the literature use either $n^{-1}$ or $(n-1)^{-1}$; $(n-1)^{-1}$ is preferred because it gives an unbiased estimator of the population covariance matrix Σ. $X_{\mathrm{cent}} = X - \bar{X} = [X_1 - \bar{X}\ \ X_2 - \bar{X}\ \cdots\ X_n - \bar{X}]$ (9). $X_{\mathrm{cent}}$ is the centred data and it is of size d × n. Using this notation, the d × d sample covariance matrix S becomes $S = \frac{1}{n-1} X_{\mathrm{cent}} X_{\mathrm{cent}}^T = \frac{1}{n-1}(X - \bar{X})(X - \bar{X})^T$ (10). Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 35 / 86
  • 36. The Sample Case The entries of the sample covariance matrix S are $s_{jk}$, with $s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij} - m_j)(X_{ik} - m_k)$ (11), where $\bar{X} = [m_1, \dots, m_d]^T$ and $m_j$ is the sample mean of the jth variable. As for the population, we write $s_j^2$ or $s_{jj}$ for the diagonal elements of S. Consider $a \in \mathbb{R}^d$; then the projection of X onto a is $a^T X$. Similarly, the projection of the matrix X onto a is applied to each random vector $X_i$ and results in the 1 × n row vector $a^T X$. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 36 / 86
  • 37. The Sample Case Question 4: The math and science scores of good, average and poor students from a class are as follows: Student 1: Math (X) 92, Science (Y) 68; Student 2: Math 55, Science 30; Student 3: Math 100, Science 78. Find the sample mean $\bar{X}$, the covariance matrix S and $s_{12}$ for these data. Solution: $X = \begin{pmatrix} 92 & 55 & 100 \\ 68 & 30 & 78 \end{pmatrix}$, $\bar{X} = \begin{pmatrix} \frac{92+55+100}{3} \\ \frac{68+30+78}{3} \end{pmatrix} = \begin{pmatrix} 82.33 \\ 58.66 \end{pmatrix}$. $X_1 - \bar{X} = [9.67, 9.34]^T$, $(X_1 - \bar{X})(X_1 - \bar{X})^T = \begin{pmatrix} 93.5089 & 90.3178 \\ 90.3178 & 87.2356 \end{pmatrix}$; $X_2 - \bar{X} = [-27.33, -28.66]^T$, $(X_2 - \bar{X})(X_2 - \bar{X})^T = \begin{pmatrix} 746.9289 & 783.2778 \\ 783.2778 & 821.3956 \end{pmatrix}$. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 37 / 86
  • 38. The Sample Case $X_3 - \bar{X} = [17.67, 19.34]^T$, $(X_3 - \bar{X})(X_3 - \bar{X})^T = \begin{pmatrix} 312.2289 & 341.7378 \\ 341.7378 & 374.0356 \end{pmatrix}$. Sample covariance matrix $S = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})^T = \frac{1}{2}\begin{pmatrix} 1152.6667 & 1215.3334 \\ 1215.3334 & 1282.6668 \end{pmatrix} = \begin{pmatrix} 576.3335 & 607.6667 \\ 607.6667 & 641.3334 \end{pmatrix}$. From Equation 11, $s_{12} = \frac{1}{2}\sum_{i=1}^{3}(X_{i1} - m_1)(X_{i2} - m_2) = \frac{1}{2}[(92 - 82.33)(68 - 58.66) + (55 - 82.33)(30 - 58.66) + (100 - 82.33)(78 - 58.66)] = 607.6667$. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 38 / 86
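A short numpy check of Questions 4 and 5 (an added sketch, not part of the slides): np.cov with ddof=1 gives the (n-1)-divisor sample covariance matrix S of Equation 8, and the projection of Question 5 is a single matrix product.

  import numpy as np

  X = np.array([[92, 55, 100],     # row 1: math scores of the three students
                [68, 30,  78]])    # row 2: science scores

  xbar = X.mean(axis=1)            # sample mean, roughly [82.33, 58.67]
  S = np.cov(X, ddof=1)            # 2 x 2 sample covariance matrix, divisor n - 1
  print(xbar)
  print(S)                         # S[0, 1] is s12, about 607.67

  a = np.array([-45, 45])
  print(a @ X)                     # projection a^T X = [-1080, -1125, -990]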
  • 39. The Sample Case Question 5: Compute the projection of the matrix $X = \begin{pmatrix} 92 & 55 & 100 \\ 68 & 30 & 78 \end{pmatrix}$ onto the vector $a = [-45, 45]^T$. Solution: $P = a^T X = \begin{pmatrix} -45 & 45 \end{pmatrix}\begin{pmatrix} 92 & 55 & 100 \\ 68 & 30 & 78 \end{pmatrix} = \begin{pmatrix} -1080 & -1125 & -990 \end{pmatrix}$. So the projection of X onto the vector $[-45, 45]^T$ is a 1 × 3 matrix. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 39 / 86
  • 40. Outline 1 Multivariate and High-Dimensional Problems 2 Visualisation Three-Dimensional Visualisation Parallel Coordinate Plots 3 Multivariate Random Vectors and Data Population Case Sample Case Multivariate Random Vectors Gaussian Random Vectors Marginal and Conditional Normal Distributions Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 40 / 86
  • 41. Gaussian Random Vectors The univariate normal probability density function f is $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$ (12), and $X \sim N(\mu, \sigma^2)$ (13). Equation 13 is shorthand for a random variable from the univariate normal distribution with mean µ and variance σ². Figure 3.1: Three normal pdfs of 1000 random values each, with (µ, σ) equal to (0, 0.8), (-2, 1) and (3, 2) respectively. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 41 / 86
  • 42. Gaussian Random Vectors The d-variate normal probability density function f is $f(x) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2} \exp\!\left[-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu)\right]$ (14), and $X \sim N(\mu, \Sigma)$ (15). Equation 15 is shorthand for a d-dimensional random vector from the d-variate normal distribution with mean µ and covariance matrix Σ. Figure 3.2: Two-dimensional normal pdf with $\mu = [1, 2]^T$ and $\Sigma = \begin{pmatrix} 0.25 & 0.3 \\ 0.3 & 1 \end{pmatrix}$. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 42 / 86
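As an added illustration of Equation 14 (not the code that produced Figure 3.2), the bivariate density with the µ and Σ of Figure 3.2 can be evaluated both through scipy and directly from the formula; the evaluation point is chosen arbitrarily at the mean.

  import numpy as np
  from scipy.stats import multivariate_normal

  mu = np.array([1.0, 2.0])
  Sigma = np.array([[0.25, 0.3],
                    [0.3,  1.0]])

  rv = multivariate_normal(mean=mu, cov=Sigma)
  print(rv.pdf(mu))                # density at the mean

  # Direct evaluation of Equation 14 at the same point, for comparison.
  x, d = mu, 2
  val = ((2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
         * np.exp(-0.5 * (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)))
  print(val)                       # same value as rv.pdf(mu)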
  • 43. Gaussian Random Vectors Result 1.2 Let $X \sim N(\mu, \Sigma)$ be d-variate, and assume that $\Sigma^{-1}$ exists. 1. Let $X_\Sigma = \Sigma^{-1/2}(X - \mu)$; then $X_\Sigma \sim N(0, I_{d\times d})$, where $I_{d\times d}$ is the d × d identity matrix. 2. Let $X^2 = (X - \mu)^T \Sigma^{-1}(X - \mu)$; then $X^2 \sim \chi^2_d$, the chi-squared distribution with d degrees of freedom. Question 6: Let $X \sim N(\mu, \Sigma)$ be 2-variate, where $\mu = [2, 3]^T$, $\Sigma = \begin{pmatrix} 4 & 0 \\ 0 & 16 \end{pmatrix}$ and $\Sigma^{-1} = \begin{pmatrix} 0.25 & 0 \\ 0 & 0.0625 \end{pmatrix}$. Verify Results 1.2.1 and 1.2.2. Solution: Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 43 / 86
  • 44–52. Gaussian Random Vectors: solution to Question 6 (the code and output shown on these slides were not captured in the transcript).
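Since the solution slides 44-52 are not preserved in the transcript, here is a hedged Python sketch of one way to verify Result 1.2 empirically for the µ and Σ of Question 6; the sample size n = 5000 and the random seed are assumptions.

  import numpy as np
  from scipy.stats import chi2

  rng = np.random.default_rng(0)                       # seed is an arbitrary choice
  mu = np.array([2.0, 3.0])
  Sigma = np.array([[4.0, 0.0],
                    [0.0, 16.0]])
  n, d = 5000, 2

  X = rng.multivariate_normal(mu, Sigma, size=n)       # n x d sample

  # Result 1.2.1: X_Sigma = Sigma^{-1/2}(X - mu) should be roughly N(0, I).
  evals, evecs = np.linalg.eigh(Sigma)
  Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
  X_std = (X - mu) @ Sigma_inv_sqrt
  print(X_std.mean(axis=0))                            # close to [0, 0]
  print(np.cov(X_std.T))                               # close to the identity matrix

  # Result 1.2.2: (X - mu)^T Sigma^{-1} (X - mu) should be roughly chi^2_d.
  diff = X - mu
  m2 = np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1)
  print(m2.mean(), chi2(df=d).mean())                  # both close to d = 2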
  • 53. Gaussian Random Vectors Hence, the quantity $X^2$ is a scalar random variable which has, as in the one-dimensional case, a $\chi^2$ distribution, but this time with d degrees of freedom. Fix a dimension d ≥ 1. Let $X_i \sim N(\mu, \Sigma)$ be independent d-dimensional random vectors for i = 1, ..., n with sample mean $\bar{X}$ and sample covariance matrix S. We define Hotelling’s $T^2$ by $T^2 = n(\bar{X} - \mu)^T S^{-1}(\bar{X} - \mu)$ (16). Question 7: Compute Hotelling’s $T^2$ for the 2 × 5000 sample X of Question 6. Solution: Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 53 / 86
  • 54. Gaussian Random Vectors: solution to Question 7 (the code and output shown on this slide were not captured in the transcript).
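The computation on slide 54 is likewise not preserved; a minimal sketch of Hotelling's T² for a simulated sample with the parameters of Question 6 (sample size and seed are assumptions) is given below, together with the F-scaling of Result 1.3.3.

  import numpy as np
  from scipy.stats import f as f_dist

  rng = np.random.default_rng(1)                          # seed is an arbitrary choice
  mu = np.array([2.0, 3.0])
  Sigma = np.array([[4.0, 0.0],
                    [0.0, 16.0]])
  n, d = 5000, 2

  X = rng.multivariate_normal(mu, Sigma, size=n)
  xbar = X.mean(axis=0)
  S = np.cov(X.T, ddof=1)

  T2 = n * (xbar - mu) @ np.linalg.inv(S) @ (xbar - mu)   # Equation 16
  F_stat = (n - d) / ((n - 1) * d) * T2                   # scaling of Equation 18
  print(T2, F_stat, f_dist(d, n - d).mean())              # F_stat is one draw from F_{d, n-d}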
  • 55. Gaussian Random Vectors Further, let $Z_j \sim N(0, \Sigma)$ for j = 1, ..., m be independent d-dimensional random vectors, and let $W = \sum_{j=1}^{m} Z_j Z_j^T$ (17). Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 55 / 86
  • 56. Gaussian Random Vectors In Equation 17, W is the d × d random matrix generated by the $Z_j$. W has the Wishart distribution W(m, Σ) with m degrees of freedom and covariance matrix Σ; m is the number of summands and Σ is the common d × d covariance matrix. Result 1.3 Let $X_i \sim N(\mu, \Sigma)$ be d-dimensional random vectors for i = 1, ..., n. Let S be the sample covariance matrix, and assume that S is invertible. 1. The sample mean satisfies $\bar{X} \sim N(\mu, \Sigma/n)$. 2. For n observations $X_i$ and their sample covariance matrix S there exist n-1 independent random vectors $Z_j \sim N(0, \Sigma)$ such that $S = \frac{1}{n-1}\sum_{j=1}^{n-1} Z_j Z_j^T$, so that (n-1)S has a W(n-1, Σ) Wishart distribution. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 56 / 86
  • 57. Gaussian Random Vectors Result 1.3 (continued) 3. Assume that n > d. Let $T^2$ be given by Equation 16. It follows that $\frac{n-d}{(n-1)d}\, T^2 \sim F_{d,\, n-d}$ (18), the F distribution with d and n-d degrees of freedom. In other words, the scaled statistic $\frac{n-d}{(n-1)d} T^2$, computed from a Gaussian sample of size n and dimension d, has an F distribution with d and n-d degrees of freedom. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 57 / 86
  • 58. Gaussian Random Vectors Question 8: Let $Z \sim N(\mu, \Sigma)$ be 2-variate, where $\mu = [0, 0]^T$ and $\Sigma = \begin{pmatrix} 4 & 2 \\ 2 & 16 \end{pmatrix}$. Compute the W and (n-1)S matrices, and plot the distribution of the sample mean. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 58 / 86
  • 59–65. Gaussian Random Vectors: solution to Question 8 (the code, matrices and sample-mean plot shown on these slides were not captured in the transcript).
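Slides 59-65 contained the code, matrices and plot for Question 8 but were not captured; the sketch below (sample size, number of repetitions and seed are assumptions) shows one way to compute W from Equation 17, (n-1)S from a simulated sample, and an empirical check of Result 1.3.1.

  import numpy as np

  rng = np.random.default_rng(2)            # seed and sizes below are arbitrary choices
  mu = np.array([0.0, 0.0])
  Sigma = np.array([[4.0, 2.0],
                    [2.0, 16.0]])
  n = 1000

  # W = sum_j Z_j Z_j^T with Z_j ~ N(0, Sigma) and m = n - 1 summands (Equation 17).
  Z = rng.multivariate_normal(np.zeros(2), Sigma, size=n - 1)
  W = Z.T @ Z

  # (n-1)S from a sample X_i ~ N(mu, Sigma); by Result 1.3.2 it is W(n-1, Sigma) distributed.
  X = rng.multivariate_normal(mu, Sigma, size=n)
  S = np.cov(X.T, ddof=1)
  print(W / (n - 1))                        # both W/(n-1) and S estimate Sigma
  print(S)

  # Result 1.3.1: the sample mean is N(mu, Sigma/n) distributed.
  means = np.array([rng.multivariate_normal(mu, Sigma, size=n).mean(axis=0)
                    for _ in range(2000)])
  print(np.cov(means.T))                    # close to Sigma / n
  print(Sigma / n)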
  • 66. Gaussian Random Vectors The W matrix computed from the sample covariance matrix S is identical to the W matrix computed using the population mean µ (slide no. 62). Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 66 / 86
  • 67. Gaussian Random Vectors Let $X \sim (\mu, \Sigma)$ be d-dimensional. The multivariate normal probability density function f is $f(X_i) = (2\pi)^{-d/2} \det(\Sigma)^{-1/2} \exp\!\left[-\tfrac{1}{2}(X_i - \mu)^T \Sigma^{-1}(X_i - \mu)\right]$ (19), where det(Σ) is the determinant of Σ and $X = [X_1\ X_2\ \cdots\ X_n]$ is the data of independent random vectors from the normal distribution with mean µ and covariance matrix Σ. The normal or Gaussian likelihood (function) L, as a function of the parameter θ of interest conditional on the data, is $L(\theta \mid X) = (2\pi)^{-nd/2} \det(\Sigma)^{-n/2} \exp\!\left[-\tfrac{1}{2}\sum_{i=1}^{n}(X_i - \mu)^T \Sigma^{-1}(X_i - \mu)\right]$ (20). The parameter of interest θ is the mean µ and the covariance matrix Σ, so θ = (µ, Σ). Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 67 / 86
  • 68. Gaussian Random Vectors The maximum likelihood estimator (MLE) of θ, denoted by θ̂, is $\hat{\theta} = (\hat{\mu}, \hat{\Sigma})$ (21), with $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}$ (22) and $\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})^T = \frac{n-1}{n} S$ (23). Here $\hat{\mu}$, $\bar{X}$, $\hat{\Sigma}$ and S are the estimated population mean, the sample mean, the estimated population covariance matrix and the sample covariance matrix, respectively. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 68 / 86
  • 69. Gaussian Random Vectors Question 9: From the sample of size 5000 (Question 6), compute the maximum likelihood estimates of the population mean and population covariance matrix. Solution: (the code and output shown on this slide were not captured in the transcript). Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 69 / 86
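The numerical answer is not preserved in the transcript; a sketch under the assumption of a simulated sample of size 5000 with the parameters of Question 6 follows, applying Equations 22 and 23.

  import numpy as np

  rng = np.random.default_rng(3)            # seed is an arbitrary choice
  mu = np.array([2.0, 3.0])
  Sigma = np.array([[4.0, 0.0],
                    [0.0, 16.0]])
  n = 5000

  X = rng.multivariate_normal(mu, Sigma, size=n)

  mu_hat = X.mean(axis=0)                   # Equation 22: the MLE of mu is the sample mean
  S = np.cov(X.T, ddof=1)                   # sample covariance matrix, divisor n - 1
  Sigma_hat = (n - 1) / n * S               # Equation 23: the MLE of Sigma uses divisor n
  print(mu_hat)
  print(Sigma_hat)                          # both close to the population values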
  • 70. Outline 1 Multivariate and High-Dimensional Problems 2 Visualisation Three-Dimensional Visualisation Parallel Coordinate Plots 3 Multivariate Random Vectors and Data Population Case Sample Case Multivariate Random Vectors Gaussian Random Vectors Marginal and Conditional Normal Distributions Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 70 / 86
  • 71. Marginal and Conditional Normal Distributions Consider a normal random vector $X = [X_1, X_2, \dots, X_d]^T$. Let $X^{[1]}$ be the vector consisting of the first $d_1$ entries of X, and let $X^{[2]}$ be the vector consisting of the remaining $d_2$ entries: $X = \begin{pmatrix} X^{[1]} \\ X^{[2]} \end{pmatrix}$ (24). For ι = 1, 2 we let $\mu_\iota$ be the mean of $X^{[\iota]}$ and $\Sigma_\iota$ its covariance matrix. Question 10: Let $X \sim N(\mu, \Sigma)$ be 4-variate, where $\mu = [2, 3, 2, 3]^T$, $\Sigma = \begin{pmatrix} 4 & 1 & 1 & 1 \\ 1 & 4 & 1 & 1 \\ 1 & 1 & 4 & 1 \\ 1 & 1 & 1 & 4 \end{pmatrix}$ and $\Sigma^{-1} = \begin{pmatrix} 0.286 & -0.048 & -0.048 & -0.048 \\ -0.048 & 0.286 & -0.048 & -0.048 \\ -0.048 & -0.048 & 0.286 & -0.048 \\ -0.048 & -0.048 & -0.048 & 0.286 \end{pmatrix}$. Compute $\mu_1$ and $\Sigma_1$ of $X^{[1]}$ and $\mu_2$ and $\Sigma_2$ of $X^{[2]}$, where $d_1 = d_2 = 2$, and analyse all the properties from Results 1.4 and 1.5. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 71 / 86
  • 72. Marginal and Conditional Normal Distributions Result 1.4 Assume that $X^{[1]}$, $X^{[2]}$ and X are given by Equation 24 for some $d_1, d_2 < d$ such that $d_1 + d_2 = d$. Assume also that $X \sim N(\mu, \Sigma)$. 1. For j = 1, ..., d the jth variable $X_j$ of X has the distribution $N(\mu_j, \sigma_j^2)$. 2. For ι = 1, 2, $X^{[\iota]}$ has the distribution $N(\mu_\iota, \Sigma_\iota)$. 3. The (between) covariance matrix $\operatorname{cov}(X^{[1]}, X^{[2]})$ of $X^{[1]}$ and $X^{[2]}$ is the $d_1 \times d_2$ submatrix $\Sigma_{12}$ of $\Sigma = \begin{pmatrix} \Sigma_1 & \Sigma_{12} \\ \Sigma_{12}^T & \Sigma_2 \end{pmatrix}$ (25). The marginal distributions of normal random vectors are normal, with the means and covariance matrices of the corresponding parts of the original random vector. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 72 / 86
  • 73. Marginal and Conditional Normal Distributions Result 1.5 Assume that $X^{[1]}$, $X^{[2]}$ and X are given by Equation 24 for some $d_1, d_2 < d$ such that $d_1 + d_2 = d$. Assume also that $X \sim N(\mu, \Sigma)$ and that $\Sigma_1$ and $\Sigma_2$ are invertible. 1. If $X^{[1]}$ and $X^{[2]}$ are independent, the covariance matrix $\Sigma_{12}$ of $X^{[1]}$ and $X^{[2]}$ satisfies $\Sigma_{12} = 0_{d_1 \times d_2}$ (26). 2. Assume that $\Sigma_{12} \ne 0_{d_1 \times d_2}$. Put $X_{2|1} = X^{[2]} - \Sigma_{12}^T \Sigma_1^{-1} X^{[1]}$. Then $X_{2|1}$ is a $d_2$-dimensional random vector which is independent of $X^{[1]}$, and $X_{2|1} \sim N(\mu_{2|1}, \Sigma_{2|1})$ with $\mu_{2|1} = \mu_2 - \Sigma_{12}^T \Sigma_1^{-1} \mu_1$ and $\Sigma_{2|1} = \Sigma_2 - \Sigma_{12}^T \Sigma_1^{-1} \Sigma_{12}$ (27). Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 73 / 86
  • 74. Marginal and Conditional Normal Distributions Result 1.5 (continued) 3. Let $(X^{[1]} \mid X^{[2]})$ be the conditional random vector $X^{[1]}$ given $X^{[2]}$. Then $(X^{[1]} \mid X^{[2]}) \sim N(\mu_{X_1|X_2}, \Sigma_{X_1|X_2})$ with $\mu_{X_1|X_2} = \mu_1 + \Sigma_{12}\Sigma_2^{-1}(X^{[2]} - \mu_2)$ (28) and $\Sigma_{X_1|X_2} = \Sigma_1 - \Sigma_{12}\Sigma_2^{-1}\Sigma_{12}^T$ (29). The first property states that independence always implies uncorrelatedness, and that for the normal distribution the converse holds too. The second property shows how one can uncorrelate the vectors $X^{[1]}$ and $X^{[2]}$. The last property describes the adjustments that are needed when the sub-vectors have a non-zero between covariance matrix. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 74 / 86
  • 75–84. Marginal and Conditional Normal Distributions: solution to Question 10 (the worked computations shown on these slides were not captured in the transcript).
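Since slides 75-84 are not preserved, the following numpy sketch computes the quantities that Results 1.4 and 1.5 ask for in Question 10; the value of X^{[2]} used for the conditional distribution in Equations 28-29 is an arbitrary illustrative choice.

  import numpy as np

  mu = np.array([2.0, 3.0, 2.0, 3.0])
  Sigma = np.full((4, 4), 1.0) + 3.0 * np.eye(4)     # 4 on the diagonal, 1 elsewhere

  d1 = 2
  mu1, mu2 = mu[:d1], mu[d1:]
  S1 = Sigma[:d1, :d1]                               # Sigma_1
  S2 = Sigma[d1:, d1:]                               # Sigma_2
  S12 = Sigma[:d1, d1:]                              # Sigma_12, the between covariance

  # Result 1.5.2: uncorrelating transform X_{2|1} = X^{[2]} - Sigma_12^T Sigma_1^{-1} X^{[1]}.
  mu21 = mu2 - S12.T @ np.linalg.inv(S1) @ mu1
  Sig21 = S2 - S12.T @ np.linalg.inv(S1) @ S12
  print(mu21)
  print(Sig21)

  # Equations 28-29: conditional distribution of X^{[1]} given X^{[2]} = x2 (illustrative x2).
  x2 = np.array([2.5, 3.5])
  mu_cond = mu1 + S12 @ np.linalg.inv(S2) @ (x2 - mu2)
  Sig_cond = S1 - S12 @ np.linalg.inv(S2) @ S12.T
  print(mu_cond)
  print(Sig_cond)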
  • 85. Summary Here, we have discussed Different types of multivariate and high-dimensional problems. Three-dimensional visualisation of features from the HIV and Iris flower data sets. Data visualisation using vertical parallel coordinate plots. Data visualisation using horizontal parallel coordinate plots. Differentiation between population cases and sample cases. Population mean, population covariance matrix, sample mean and sample covariance matrix of multivariate random vectors. Population mean, population covariance matrix, sample mean and sample covariance matrix of Gaussian random vectors. Parameters and properties of marginal and conditional normal distributions. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 85 / 86
  • 86. For Further Reading: I. Koch, Analysis of Multivariate and High-Dimensional Data (Vol. 32), Cambridge University Press, 2014; F. Emdad and S. R. Zekavat, High Dimensional Data Analysis: Overview, Analysis and Applications, VDM Verlag, 2008. Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 86 / 86