1. Multivariate and High-Dimensional Problems
2. Visualisation
2.1 Three-Dimensional Visualisation
2.2 Parallel Coordinate Plots
3. Multivariate Random Vectors and Data
3.1 Population Case
3.2 Sample Case
3.3 Multivariate Random Vectors
3.4 Gaussian Random Vectors
3.5 Marginal and Conditional Normal Distributions
1. Multidimensional Data
Dr. Ashutosh Satapathy
Assistant Professor, Department of CSE
VR Siddhartha Engineering College
Kanuru, Vijayawada
October 19, 2022
Dr. Ashutosh Satapathy Multidimensional Data October 19, 2022 1 / 86
2. Outline
1 Multivariate and
High-Dimensional Problems
2 Visualisation
Three-Dimensional
Visualisation
Parallel Coordinate Plots
3 Multivariate Random Vectors
and Data
Population Case
Sample Case
Multivariate Random Vectors
Gaussian Random Vectors
Marginal and Conditional
Normal Distributions
3. Multivariate and High-Dimensional Problems
Early in the twentieth century, scientists such as Pearson (1901),
Hotelling (1933) and Fisher (1936) developed methods for
analysing multivariate data in order to
1 Understand the structure in the data and summarise it in simpler
ways.
2 Understand the relationship of one part of the data to another part.
3 Make decisions and inferences based on the data.
The early methods these scientists developed are linear; as time
moved on, more complex methods were developed.
The essential structure of these data sets can often be obscured by noise.
We therefore reduce the original data in such a way that informative and
interesting structure in the data is preserved, while noisy, irrelevant
or purely random variables, dimensions or features are removed,
as these can adversely affect the analysis.
4. Multivariate and High-Dimensional Problems
Traditionally one assumes that the dimension d is small compared to
the sample size n.
Many recent data sets do not fit into this framework; we encounter
the following problems.
Data whose dimension is comparable to the sample size, and both
are large.
High-dimension, low sample size data, whose dimension d
vastly exceeds the sample size n, so d ≫ n.
Functional data whose observations are functions.
High-dimensional and functional data pose special challenges.
5. Visualisation
Before we analyse a set of data, it is important to look at it.
Often we get useful clues such as skewness, bi- or multi-modality,
outliers, or distinct groupings.
Graphical displays are exploratory data-analysis tools, which, if
appropriately used, can enhance our understanding of data.
Visual clues are easier to understand and interpret than numbers
alone, and the information you can get from graphical displays can
help you understand answers that are based on numbers.
7. Three-Dimensional Visualisation
Two- and three-dimensional scatter-plots are a natural – though limited –
way of looking at data with three or more variables.
As the number of variables, and therefore the dimension, increases,
we can of course still display three of the d dimensions in
scatter-plots, but it is less clear how one can look at more than
three dimensions in a single plot.
Figure 2.1 displays the 10,000 observations and the three variables
CD3, CD8 and CD4 of the five-dimensional HIV+ and HIV- data
sets.
The data sets contain measurements of blood cells relevant to HIV.
8. Three-Dimensional Visualisation
Figure 2.1: HIV+ data (left) and HIV- data (right) of variables CD3, CD8 and
CD4.
There are differences between the point clouds in the two figures, and an
important task is to exhibit and quantify the differences.
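Three-dimensional scatter-plots such as Figure 2.1 can be produced with standard plotting libraries. Below is a minimal matplotlib sketch; since the HIV flow-cytometry data is not bundled with these notes, two synthetic point clouds stand in for the HIV+ and HIV- measurements, and only the axis names follow the figure.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Two synthetic point clouds playing the role of CD3/CD8/CD4 measurements.
cloud_a = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(500, 3))
cloud_b = rng.normal(loc=[3.0, 1.0, 2.0], scale=1.0, size=(500, 3))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(*cloud_a.T, s=2, label="cloud A")
ax.scatter(*cloud_b.T, s=2, label="cloud B")
ax.set_xlabel("CD3")
ax.set_ylabel("CD8")
ax.set_zlabel("CD4")
ax.legend()
fig.savefig("scatter3d.png")
```

Rotating such a plot interactively (omit the `Agg` backend) is often the quickest way to spot the differences between two point clouds.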
9. Three-Dimensional Visualisation
We project the Figure 2.1 data onto a number of orthogonal directions
and display the lower-dimensional projected data in Figure 2.2.
Figure 2.2: Orthogonal projections of the HIV+ data (left) and the HIV-
data (right).
10. Three-Dimensional Visualisation
We can see a smaller fourth cluster in the top right corner of the
HIV- data, which seems to have almost disappeared in the HIV+ data
in the left panel.
Many of the methods we explore use projections: Principal
Component Analysis, Factor Analysis, Multidimensional Scaling,
Independent Component Analysis and Projection Pursuit.
In each case the projections focus on different aspects and properties
of the data.
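Projecting data onto orthogonal directions, as done for Figure 2.2, amounts to multiplying the data matrix by a matrix with orthonormal columns. A minimal NumPy sketch on synthetic data; obtaining the directions from a QR decomposition of a random matrix is one convenient choice, not the method used for the figure.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))  # 1000 observations in d = 5 dimensions

# Two orthonormal directions in R^5, from the (reduced) QR decomposition
# of a random 5 x 2 matrix.
Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))

# Each row of `proj` is the 2-D projection of one observation;
# these pairs are what a panel of Figure 2.2 would scatter-plot.
proj = X @ Q
```

Methods such as Principal Component Analysis differ only in how they choose the columns of `Q`.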
11. Three-Dimensional Visualisation
Figure 2.3: Three different species of Iris flowers.
We display the four variables of Fisher’s iris data – sepal length, sepal
width, petal length and petal width – in a sequence of three-dimensional
scatter-plots.
12. Three-Dimensional Visualisation
Figure 2.4: Features 1, 2 and 3 (top left), features 1, 2 and 4 (top right), features
1, 3 and 4 (bottom left) and features 2, 3 and 4 (bottom right).
Red refers to Setosa, green to Versicolor and black to Virginica.
14. Parallel Coordinate Plots
As the dimension grows, three-dimensional scatter-plots become
less relevant, unless we know that only some of the variables are important.
An alternative, which allows us to see all variables at once, is to present
the data in the form of parallel coordinate plots.
The idea is to present the data as two-dimensional graphs.
In a vertical parallel coordinate plot, the variable numbers are
represented as values on the y-axis.
For a vector X = [X1, ..., Xd]^T we represent the first variable X1 by
the point (X1, 1) and the jth variable Xj by (Xj, j).
Finally, we connect the d points by a line which goes from (X1, 1) to
(X2, 2) and so on to (Xd, d).
We apply the same rule to the next d-dimensional feature vector.
Figure 2.5 shows a vertical parallel coordinate plot for Fisher’s iris
data.
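The construction just described – one polyline per observation, with the variables laid out along one axis – is exactly what pandas' `parallel_coordinates` helper draws (in its horizontal variant, variables along the x-axis). A minimal sketch with synthetic two-group data standing in for the iris measurements:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(2)
# Synthetic 4-variable data with two labelled groups (stand-in for species).
a = pd.DataFrame(rng.normal(0.0, 1.0, (30, 4)), columns=list("ABCD"))
a["class"] = "group 1"
b = pd.DataFrame(rng.normal(2.0, 1.0, (30, 4)), columns=list("ABCD"))
b["class"] = "group 2"
df = pd.concat([a, b], ignore_index=True)

# One coloured polyline per observation, coloured by class.
ax = parallel_coordinates(df, class_column="class", color=["red", "green"])
plt.savefig("parcoord.png")
```

With well-separated groups, the polylines form two visibly distinct bands, which is how plots like Figure 2.5 reveal which variable separates the groups.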
15. Parallel Coordinate Plots
Figure 2.5: Iris data with variables represented on the y-axis and separate colours
for the three species.
Red refers to the observations of Setosa, green to those of
Versicolor and black to those of Virginica.
Unlike the scatter-plots of Figure 2.4, Figure 2.5 shows that variable 3
(petal length) separates the two groups most strongly.
16. Parallel Coordinate Plots
Instead of the three colours shown in Figure 2.5, a different colour can
be used for each observation, as in Figure 2.6.
In a horizontal parallel coordinate plot, the x-axis represents the
variable numbers 1, ..., d. For a feature vector X = [X1 ··· Xd]^T,
the first variable gives rise to the point (1, X1) and the jth variable
Xj to (j, Xj).
The d points are connected by a line, starting with (1, X1), then (2,
X2), until we reach (d, Xd).
Since the variables are presented along the x-axis, horizontal parallel
coordinate plots are often used.
The differently coloured lines make it easier to trace particular
observations in Figure 2.6.
17. Parallel Coordinate Plots
Figure 2.6: Parallel coordinate view of the illicit drug market data.
Figure 2.6 shows the 66 monthly observations on 15 features or
variables of the illicit drug market data.
Each observation (month) is displayed in a different colour.
Looking at variable 5, heroin overdose, the question arises whether
there could be two groups of observations corresponding to the high
and low values of this variable.
19. Population Case
In data science, population is the entire set of items from which you
draw data for a statistical study. It can be a group of individuals, a
set of items, etc.
Generally, population refers to the people who live in a particular area
at a specific time. But in data science, population refers to data on
your study of interest.
It can be a group of individuals, objects, events, organizations, etc.
You use populations to draw conclusions. An example of a
population would be the entire student body at a school, where the
question of interest is the percentage of students who speak English
fluently.
If you had to collect the same data from the entire country of India,
geographical and accessibility constraints would make it impossible
to draw reliable conclusions, biasing the data towards certain
regions or groups.
21. Sample Case
A sample is defined as a smaller and more manageable
representation of a larger group. A subset of a larger population
that contains characteristics of that population.
A sample is used in testing when the population size is too large for
all members or observations to be included in the test.
The sample is an unbiased subset of the population that best
represents the whole data.
The process of collecting data from a small subsection of the
population and then using it to generalize over the entire set is called
sampling.
Samples are used when the population is too large or effectively
unlimited in size, so that collecting data from every member would be
impractical or unreliable.
A sample should generally be unbiased and reflect all the variation
present in the population; it should typically be chosen at random.
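Simple random sampling as described above can be sketched in a few lines. The synthetic "height" population below is a hypothetical stand-in; the point is that a modest random sample already approximates population-level summaries.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical population: 100,000 heights in cm.
population = rng.normal(loc=170.0, scale=10.0, size=100_000)

# Simple random sample without replacement: every member is
# equally likely to be chosen, which keeps the sample unbiased.
sample = rng.choice(population, size=1_000, replace=False)

# The sample mean estimates the population mean.
sample_mean = sample.mean()
population_mean = population.mean()
```

With 1,000 draws the two means typically agree to within a fraction of a centimetre, which is why sampling lets us avoid measuring the whole population.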
23. Multivariate Random Vectors
Random vectors are vector-valued functions defined on a sample
space.
A vector-valued function is a mathematical function of one or
more variables whose range is a set of multidimensional vectors or
infinite-dimensional vectors.
It can be represented as v(t) = < f(t), g(t), h(t) >, where v(t) is the
vector function and f(t), g(t) and h(t) are the coordinate functions
of Cartesian 3-space.
We refer to the collection of random vectors as the data or the
random sample.
Specific feature values are measured for each of the random vectors
in the collection.
We call these values the realised or observed values of the data, or
simply the observed data.
The observed values are no longer random.
24. The Population Case
Let
$$X = [X_1\; X_2\; \ldots\; X_d]^T \qquad (1)$$
be a random vector from a distribution $F: \mathbb{R}^d \to [0, 1]$. The individual $X_j$, with $j \le d$, are random variables, also called the variables, components or entries of $X$. $X$ is d-dimensional or d-variate.
$X$ has a finite d-dimensional mean or expected value $EX$ and a finite $d \times d$ covariance matrix $\mathrm{var}(X)$:
$$\mu = EX, \qquad \Sigma = \mathrm{var}(X) = E[(X - \mu)(X - \mu)^T] \qquad (2)$$
The $\mu$ and $\Sigma$ are
$$\mu = [\mu_1\; \mu_2\; \ldots\; \mu_d]^T, \qquad
\Sigma = \begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \ldots & \sigma_{1d} \\
\sigma_{21} & \sigma_2^2 & \ldots & \sigma_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{d1} & \sigma_{d2} & \ldots & \sigma_d^2
\end{bmatrix} \qquad (3)$$
25. The Population Case
Here $\sigma_j^2 = \mathrm{var}(X_j)$ and $\sigma_{jk} = \mathrm{cov}(X_j, X_k)$. We also write $\sigma_{jj}$ for the diagonal elements $\sigma_j^2$ of $\Sigma$.
$$X \sim (\mu, \Sigma) \qquad (4)$$
Equation 4 is shorthand for a random vector $X$ which has mean $\mu$ and covariance matrix $\Sigma$.
If $X$ is a d-dimensional random vector and $A$ is a $d \times k$ matrix, for some $k \ge 1$, then $A^TX$ is a k-dimensional random vector.
Result 1.1
Let $X \sim (\mu, \Sigma)$ be a d-variate random vector. Let $A$ and $B$ be matrices of size $d \times k$ and $d \times l$, respectively.
The mean and covariance matrix of the k-variate random vector $A^TX$ are given by $A^TX \sim (A^T\mu, A^T\Sigma A)$.
The random vectors $A^TX$ and $B^TX$ are uncorrelated if and only if $A^T\Sigma B = 0_{k \times l}$ (all entries are 0s).
26. The Population Case
Question 1: Suppose you have a set of n = 5 data items, representing 5 insects, where each data item has a height (X), width (Y) and speed (Z) (therefore d = 3).
Table 3.1: Three features of five different insects.
Height (cm) Width (cm) Speed (m/s)
I1 0.64 0.58 0.29
I2 0.66 0.57 0.33
I3 0.68 0.59 0.37
I4 0.69 0.66 0.46
I5 0.73 0.60 0.55
Solution:
$$\mu = \left[\tfrac{0.64+0.66+0.68+0.69+0.73}{5},\; \tfrac{0.58+0.57+0.59+0.66+0.60}{5},\; \tfrac{0.29+0.33+0.37+0.46+0.55}{5}\right]^T = [0.68,\; 0.60,\; 0.40]^T$$
28. The Population Case
Covariance matrix:
$$\Sigma = \frac{1}{n}\sum_{i=1}^{n}(I_i - \mu)(I_i - \mu)^T
= \frac{1}{5}\begin{bmatrix} 0.0046 & 0.0020 & 0.0139 \\ 0.0020 & 0.0050 & 0.0082 \\ 0.0139 & 0.0082 & 0.0440 \end{bmatrix}
= \begin{bmatrix} 9.2\mathrm{E}{-4} & 4.0\mathrm{E}{-4} & 0.00278 \\ 4.0\mathrm{E}{-4} & 1.0\mathrm{E}{-3} & 0.00164 \\ 0.00278 & 0.00164 & 0.0088 \end{bmatrix}$$
Definition
Mean: The arithmetic average of a collection of numbers.
Variance: The expectation of the squared deviation of a random variable from its mean.
Covariance: A measure of the relationship between two random variables and the extent to which they change together.
29. The Population Case
Question 2: Verify Result 1.1a, i.e. $\mu_{A^TX} = A^T\mu_X$ and $\Sigma_{A^TX} = A^T\Sigma_X A$, for
$$A = \begin{bmatrix} 0.02 & 0.01 \\ 0.01 & 0.03 \\ 0.03 & 0.02 \end{bmatrix} \quad \text{and} \quad
X = \begin{bmatrix} 0.64 & 0.66 & 0.68 & 0.69 & 0.73 \\ 0.58 & 0.57 & 0.59 & 0.66 & 0.60 \\ 0.29 & 0.33 & 0.37 & 0.46 & 0.55 \end{bmatrix}$$
Solution:
$$A^TX = \begin{bmatrix} 0.0273 & 0.0288 & 0.0306 & 0.0342 & 0.0371 \\ 0.0296 & 0.0303 & 0.0319 & 0.0359 & 0.0363 \end{bmatrix}$$
$$\mu_{A^TX} = \begin{bmatrix} 0.0316 \\ 0.0328 \end{bmatrix} \quad \text{and} \quad
A^T\mu_X = \begin{bmatrix} 0.02 & 0.01 & 0.03 \\ 0.01 & 0.03 & 0.02 \end{bmatrix}
\begin{bmatrix} 0.68 \\ 0.60 \\ 0.40 \end{bmatrix}
= \begin{bmatrix} 0.0316 \\ 0.0328 \end{bmatrix}$$
Hence, $\mu_{A^TX} = A^T\mu_X$.
31. The Population Case
$(A^TX)_4 - \mu_{A^TX} = [0.0026,\; 0.0031]^T$
$$((A^TX)_4 - \mu_{A^TX})((A^TX)_4 - \mu_{A^TX})^T = \begin{bmatrix} 0.00000676 & 0.00000806 \\ 0.00000806 & 0.00000961 \end{bmatrix}$$
$(A^TX)_5 - \mu_{A^TX} = [0.0055,\; 0.0035]^T$
$$((A^TX)_5 - \mu_{A^TX})((A^TX)_5 - \mu_{A^TX})^T = \begin{bmatrix} 0.00003025 & 0.00001925 \\ 0.00001925 & 0.00001225 \end{bmatrix}$$
$$\Sigma_{A^TX} = \frac{1}{n}\sum_{i=1}^{n}((A^TX)_i - \mu_{A^TX})((A^TX)_i - \mu_{A^TX})^T
= \frac{1}{5}\begin{bmatrix} 0.00006434 & 0.00004897 \\ 0.00004897 & 0.00003916 \end{bmatrix}
= \begin{bmatrix} 0.000012868 & 0.000009794 \\ 0.000009794 & 0.000007832 \end{bmatrix}$$
Hence $\Sigma_{A^TX} = A^T\Sigma_X A$.
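A quick NumPy check of Result 1.1a for this $A$ and $X$ (an illustrative sketch added here, not part of the slides):

```python
import numpy as np

X = np.array([[0.64, 0.66, 0.68, 0.69, 0.73],
              [0.58, 0.57, 0.59, 0.66, 0.60],
              [0.29, 0.33, 0.37, 0.46, 0.55]])
A = np.array([[0.02, 0.01],
              [0.01, 0.03],
              [0.03, 0.02]])

n = X.shape[1]
mu_X = X.mean(axis=1)
Sigma_X = (X - mu_X[:, None]) @ (X - mu_X[:, None]).T / n

Y = A.T @ X                     # the transformed 2 x 5 data A^T X
mu_Y = Y.mean(axis=1)
Sigma_Y = (Y - mu_Y[:, None]) @ (Y - mu_Y[:, None]).T / n

# Result 1.1a: mean and covariance transform as A^T mu and A^T Sigma A.
assert np.allclose(mu_Y, A.T @ mu_X)
assert np.allclose(Sigma_Y, A.T @ Sigma_X @ A)
```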
32. The Population Case
Question 3: Verify Result 1.1b; show that $A^TX$ and $B^TX$ are correlated for
$$A = B = \begin{bmatrix} 0.02 & 0.01 \\ 0.01 & 0.03 \\ 0.03 & 0.02 \end{bmatrix}, \qquad
X = \begin{bmatrix} 0.64 & 0.66 & 0.68 & 0.69 & 0.73 \\ 0.58 & 0.57 & 0.59 & 0.66 & 0.60 \\ 0.29 & 0.33 & 0.37 & 0.46 & 0.55 \end{bmatrix}$$
Solution:
$$\Sigma = \begin{bmatrix} 9.2\mathrm{E}{-4} & 4.0\mathrm{E}{-4} & 0.00278 \\ 4.0\mathrm{E}{-4} & 1.0\mathrm{E}{-3} & 0.00164 \\ 0.00278 & 0.00164 & 0.0088 \end{bmatrix}$$
$$A^T\Sigma B = \begin{bmatrix} 0.02 & 0.01 & 0.03 \\ 0.01 & 0.03 & 0.02 \end{bmatrix}
\begin{bmatrix} 9.2\mathrm{E}{-4} & 4.0\mathrm{E}{-4} & 0.00278 \\ 4.0\mathrm{E}{-4} & 1.0\mathrm{E}{-3} & 0.00164 \\ 0.00278 & 0.00164 & 0.0088 \end{bmatrix}
\begin{bmatrix} 0.02 & 0.01 \\ 0.01 & 0.03 \\ 0.03 & 0.02 \end{bmatrix}
= \begin{bmatrix} 0.000012868 & 0.000009794 \\ 0.000009794 & 0.000007832 \end{bmatrix}$$
Since $A^T\Sigma B \ne 0_{2 \times 2}$, $A^TX$ and $B^TX$ are correlated with each other.
33. The Sample Case
Let $X_1, \ldots, X_n$ be d-dimensional random vectors. We assume that the $X_i$ are independent and from the same distribution $F: \mathbb{R}^d \to [0, 1]$ with finite mean $\mu$ and covariance matrix $\Sigma$.
We omit reference to $F$ when knowledge of the distribution is not required.
In statistics one often identifies a random vector with its observed values and writes $X_i = x_i$.
We explore properties of random samples but only encounter observed values of random vectors. For this reason we write
$$X = [X_1\; X_2\; \ldots\; X_n] \qquad (5)$$
for the sample of independent random vectors $X_i$ and call this collection a random sample or data.
34. The Sample Case
$$X = \begin{bmatrix}
X_{11} & X_{21} & \ldots & X_{n1} \\
X_{12} & X_{22} & \ldots & X_{n2} \\
\vdots & \vdots & & \vdots \\
X_{1d} & X_{2d} & \ldots & X_{nd}
\end{bmatrix}
= \begin{bmatrix} X_{\bullet 1} \\ X_{\bullet 2} \\ \vdots \\ X_{\bullet d} \end{bmatrix} \qquad (6)$$
The ith column of $X$ is the ith random vector $X_i$, and the jth row $X_{\bullet j}$ is the jth variable across all n random vectors. In $X_{ij}$, the $i$ refers to the ith vector $X_i$, and the $j$ refers to the jth variable.
For data, the mean $\mu$ and covariance matrix $\Sigma$ are usually not known; instead, we work with the sample mean $\overline{X}$ and the sample covariance matrix $S$. This is represented by
$$X \sim \mathrm{Sam}(\overline{X}, S) \qquad (7)$$
The sample mean and sample covariance matrix depend on the sample size n.
35. The Sample Case
$$\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
S = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \overline{X})(X_i - \overline{X})^T \qquad (8)$$
Definitions of the sample covariance matrix in the literature use either $n^{-1}$ or $(n-1)^{-1}$. The divisor $(n-1)^{-1}$ is preferred because it yields an unbiased estimator of the population covariance matrix $\Sigma$.
$$X_{cent} = X - \overline{X} = [X_1 - \overline{X},\; X_2 - \overline{X},\; \ldots,\; X_n - \overline{X}] \qquad (9)$$
$X_{cent}$ is the centred data, and it is of size $d \times n$. Using this notation, the $d \times d$ sample covariance matrix $S$ becomes
$$S = \frac{1}{n-1} X_{cent} X_{cent}^T = \frac{1}{n-1}(X - \overline{X})(X - \overline{X})^T \qquad (10)$$
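Equation (10) can be sketched in NumPy; the toy $d \times n$ matrix below is a hypothetical example, and `np.cov` uses the same $(n-1)^{-1}$ divisor:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 10))             # a toy d x n sample, d = 3, n = 10

n = X.shape[1]
X_bar = X.mean(axis=1, keepdims=True)
X_cent = X - X_bar                       # centred data, Equation (9)
S = X_cent @ X_cent.T / (n - 1)          # Equation (10), unbiased divisor

# np.cov with rows as variables uses the same (n-1)^{-1} definition.
assert np.allclose(S, np.cov(X))
```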
36. The Sample Case
The entries of the sample covariance matrix $S$ are $s_{jk}$, and
$$s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij} - m_j)(X_{ik} - m_k) \qquad (11)$$
where $\overline{X} = [m_1, \ldots, m_d]^T$ and $m_j$ is the sample mean of the jth variable.
As for the population, we write $s_j^2$ or $s_{jj}$ for the diagonal elements of $S$.
Consider $a \in \mathbb{R}^d$; then the projection of a random vector $X$ onto $a$ is $a^TX$.
Similarly, the projection of the matrix $X$ onto $a$ is carried out element-wise for each random vector $X_i$ and results in the $1 \times n$ vector $a^TX$.
37. The Sample Case
Question 4: The math and science scores of good, average and poor students from a class are given as follows:
Student Math (X) Science (Y)
1 92 68
2 55 30
3 100 78
Find the sample mean $\overline{X}$, the covariance matrix $S$ and $s_{12}$ of the above data.
Solution:
$$X = \begin{bmatrix} 92 & 55 & 100 \\ 68 & 30 & 78 \end{bmatrix}, \qquad
\overline{X} = \begin{bmatrix} \frac{92+55+100}{3} \\ \frac{68+30+78}{3} \end{bmatrix}
= \begin{bmatrix} 82.33 \\ 58.66 \end{bmatrix}$$
$$X_1 - \overline{X} = [9.67,\; 9.34]^T, \qquad
(X_1 - \overline{X})(X_1 - \overline{X})^T = \begin{bmatrix} 93.5089 & 90.3178 \\ 90.3178 & 87.2356 \end{bmatrix}$$
$$X_2 - \overline{X} = [-27.33,\; -28.66]^T, \qquad
(X_2 - \overline{X})(X_2 - \overline{X})^T = \begin{bmatrix} 746.9289 & 783.2778 \\ 783.2778 & 821.3956 \end{bmatrix}$$
39. The Sample Case
Question 5: Compute the projection of the matrix $X = \begin{bmatrix} 92 & 55 & 100 \\ 68 & 30 & 78 \end{bmatrix}$ onto the vector $a = \begin{bmatrix} -45 \\ 45 \end{bmatrix}$.
Solution:
$$P = a^TX = \begin{bmatrix} -45 & 45 \end{bmatrix}
\begin{bmatrix} 92 & 55 & 100 \\ 68 & 30 & 78 \end{bmatrix}
= \begin{bmatrix} -1080 & -1125 & -990 \end{bmatrix}$$
So, the projection of $X$ onto the vector $[-45, 45]^T$ is a $1 \times 3$ row vector.
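The same projection in NumPy (an illustrative sketch):

```python
import numpy as np

X = np.array([[92, 55, 100],
              [68, 30,  78]])
a = np.array([-45, 45])

P = a @ X         # a^T X: projects each column (student) onto a
print(P)          # [-1080 -1125  -990]
```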
41. Gaussian Random Vectors
The univariate normal probability density function f is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \qquad (12)$$
$$X \sim N(\mu, \sigma^2) \qquad (13)$$
Equation 13 is shorthand for a random value from the univariate normal distribution with mean $\mu$ and variance $\sigma^2$.
Figure 3.1: Three normal pdfs of 1000 random values each, with $(\mu, \sigma)$ equal to (0, 0.8), (−2, 1) and (3, 2) respectively.
42. Gaussian Random Vectors
The d-variate normal probability density function f is
$$f(x) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right) \qquad (14)$$
$$X \sim N(\mu, \Sigma) \qquad (15)$$
Equation 15 is shorthand for a d-dimensional random vector from the d-variate normal distribution with mean $\mu$ and covariance matrix $\Sigma$.
Figure 3.2: 2-dimensional normal pdf with $\mu = [1, 2]$ and $\Sigma = \begin{bmatrix} 0.25 & 0.3 \\ 0.3 & 1 \end{bmatrix}$.
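Equation (14) can be written directly in NumPy; the parameters below are those of Figure 3.2, and `mvn_pdf` is an illustrative helper written here, not a library function:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """d-variate normal density, Equation (14)."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.array([1.0, 2.0])
Sigma = np.array([[0.25, 0.3],
                  [0.3,  1.0]])

# At the mean the exponent vanishes, so the density is (2*pi)^{-1} |Sigma|^{-1/2};
# here det(Sigma) = 0.16, so the peak height is 1 / (2*pi*0.4).
print(mvn_pdf(mu, mu, Sigma))
```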
43. Gaussian Random Vectors
Result 1.2
Let $X \sim N(\mu, \Sigma)$ be d-variate, and assume that $\Sigma^{-1}$ exists.
1. Let $X_\Sigma = \Sigma^{-1/2}(X - \mu)$; then $X_\Sigma \sim N(0, I_{d\times d})$, where $I_{d\times d}$ is the $d \times d$ identity matrix.
2. Let $X^2 = (X - \mu)^T\Sigma^{-1}(X - \mu)$; then $X^2 \sim \chi^2_d$, the chi-squared distribution in d degrees of freedom.
Question 6: Let $X \sim N(\mu, \Sigma)$ be 2-variate, where $\mu = [2, 3]^T$, $\Sigma = \begin{bmatrix} 4 & 0 \\ 0 & 16 \end{bmatrix}$ and $\Sigma^{-1} = \begin{bmatrix} 0.25 & 0 \\ 0 & 0.0625 \end{bmatrix}$. Verify Results 1.2.1 and 1.2.2.
Solution:
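The slide's worked solution (figures) is not reproduced here; one way to check Result 1.2 numerically is a Monte Carlo sketch (assumptions: NumPy, 100,000 simulated draws, and the diagonal $\Sigma$ of Question 6, so that $\Sigma^{-1/2}$ can be written down by hand):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([2.0, 3.0])
Sigma = np.array([[4.0, 0.0],
                  [0.0, 16.0]])
n = 100_000
X = rng.multivariate_normal(mu, Sigma, size=n)   # n x 2 sample

# Result 1.2.1: the whitened sample should have mean ~0 and covariance ~I.
Sigma_inv_sqrt = np.diag([0.5, 0.25])            # Sigma^{-1/2} (Sigma is diagonal)
X_white = (X - mu) @ Sigma_inv_sqrt
assert np.allclose(X_white.mean(axis=0), 0, atol=0.05)
assert np.allclose(np.cov(X_white.T), np.eye(2), atol=0.05)

# Result 1.2.2: the Mahalanobis distances are chi-squared with d = 2 degrees
# of freedom, so their sample mean should be close to d.
X2 = np.einsum('ij,jk,ik->i', X - mu, np.linalg.inv(Sigma), X - mu)
assert abs(X2.mean() - 2) < 0.05
```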
53. Gaussian Random Vectors
Hence, the quantity $X^2$ is a scalar random variable which has, as in the one-dimensional case, a $\chi^2$ distribution, but this time in d degrees of freedom.
Fix a dimension $d \ge 1$. Let $X_i \sim N(\mu, \Sigma)$ be independent d-dimensional random vectors for $i = 1, \ldots, n$ with sample mean $\overline{X}$ and sample covariance matrix $S$.
We define Hotelling's $T^2$ by
$$T^2 = n(\overline{X} - \mu)^T S^{-1} (\overline{X} - \mu) \qquad (16)$$
Question 7: Compute Hotelling's $T^2$ of the sample $X$ of size $2 \times 5000$ from Example 6.
Solution:
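A sketch of the computation in NumPy (the sample is simulated here with the parameters of Question 6, since the original sample from the slides is not available):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([2.0, 3.0])
Sigma = np.array([[4.0, 0.0],
                  [0.0, 16.0]])
n = 5000
X = rng.multivariate_normal(mu, Sigma, size=n)    # n x 2 sample

X_bar = X.mean(axis=0)
S = np.cov(X.T)                                   # sample covariance, (n-1)^{-1}
diff = X_bar - mu
T2 = n * diff @ np.linalg.solve(S, diff)          # Equation (16)
print(T2)
```

For large n, $T^2$ behaves approximately like a $\chi^2_2$ variable, so small values (of the order of a few units) are to be expected here.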
55. Gaussian Random Vectors
Further, let $Z_j \sim N(0, \Sigma)$ for $j = 1, \ldots, m$ be independent d-dimensional random vectors, and let
$$W = \sum_{j=1}^{m} Z_j Z_j^T \qquad (17)$$
56. Gaussian Random Vectors
In Equation 17,
$W$ is the $d \times d$ random matrix generated by the $Z_j$.
$W$ has the Wishart distribution $W(m, \Sigma)$ with m degrees of freedom and covariance matrix $\Sigma$.
m is the number of summands and $\Sigma$ is the common $d \times d$ covariance matrix.
Result 1.3
Let $X_i \sim N(\mu, \Sigma)$ be d-dimensional random vectors for $i = 1, \ldots, n$. Let $S$ be the sample covariance matrix, and assume that $S$ is invertible.
1. The sample mean $\overline{X}$ satisfies $\overline{X} \sim N(\mu, \Sigma/n)$.
2. For n observations $X_i$ and their sample covariance matrix $S$ there exist n−1 independent random vectors $Z_j \sim N(0, \Sigma)$ such that $S = \frac{1}{n-1}\sum_{j=1}^{n-1} Z_j Z_j^T$, so that $(n-1)S$ has a $W(n-1, \Sigma)$ Wishart distribution.
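Equation (17) can be simulated directly; the sketch below uses a hypothetical $\Sigma$ and checks that $W/m$ is close to its expectation $\Sigma$ (since $E[Z_j Z_j^T] = \Sigma$):

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.0, 2.0],
                  [2.0, 16.0]])
d, m = 2, 1000

# W = sum_j Z_j Z_j^T with Z_j ~ N(0, Sigma) has a W(m, Sigma) distribution
# (Equation 17); its expectation is m * Sigma.
Z = rng.multivariate_normal(np.zeros(d), Sigma, size=m)   # m x d
W = Z.T @ Z                                               # sum of Z_j Z_j^T

print(W / m)        # should be close to Sigma for large m
```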
57. Gaussian Random Vectors
Result 1.3 (continued)
3. Assume that $n > d$. Let $T^2$ be given by Equation 16. It follows that
$$\frac{n-d}{(n-1)d}\, T^2 \sim F_{d,\,n-d} \qquad (18)$$
the F distribution in d and n−d degrees of freedom.
That is, the statistic $\frac{n-d}{(n-1)d} T^2$, computed from different sets of random data of size n and dimension d from a Gaussian distribution, has an F distribution in d and n−d degrees of freedom.
58. Gaussian Random Vectors
Question 8: Let $Z \sim N(\mu, \Sigma)$ be 2-variate, where $\mu = [0, 0]^T$ and $\Sigma = \begin{bmatrix} 4 & 2 \\ 2 & 16 \end{bmatrix}$.
Compute the $W$ and $(n-1)S$ matrices, and plot the sample mean distribution.
66. Gaussian Random Vectors
The $W$ matrix computed from the sample covariance matrix $S$ is identical to the $W$ matrix computed using the population mean $\mu$ (Slide no. 62).
67. Gaussian Random Vectors
Let $X \sim (\mu, \Sigma)$ be d-dimensional. The multivariate normal probability density function f is
$$f(X_i) = (2\pi)^{-d/2}\det(\Sigma)^{-1/2}\exp\!\left(-\tfrac{1}{2}(X_i - \mu)^T\Sigma^{-1}(X_i - \mu)\right) \qquad (19)$$
where $\det(\Sigma)$ is the determinant of $\Sigma$ and $X = [X_1, X_2, \ldots, X_n]$ consists of independent random vectors from the normal distribution with mean $\mu$ and covariance matrix $\Sigma$.
We define the normal or Gaussian likelihood (function) L as a function of the parameter $\theta$ of interest, conditional on the data:
$$L(\theta\,|\,X) = (2\pi)^{-nd/2}\det(\Sigma)^{-n/2}\exp\!\left(-\tfrac{1}{2}\sum_{i=1}^{n}(X_i - \mu)^T\Sigma^{-1}(X_i - \mu)\right) \qquad (20)$$
The parameters of interest $\theta$ are the mean $\mu$ and the covariance matrix $\Sigma$, so $\theta = (\mu, \Sigma)$.
68. Gaussian Random Vectors
The maximum likelihood estimator (MLE) of $\theta$, denoted by $\hat\theta$, is
$$\hat\theta = (\hat\mu, \hat\Sigma) \qquad (21)$$
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} X_i = \overline{X} \qquad (22)$$
$$\hat\Sigma = \frac{1}{n}\sum_{i=1}^{n}(X_i - \overline{X})(X_i - \overline{X})^T = \frac{n-1}{n}\, S \qquad (23)$$
Here, $\hat\mu$, $\overline{X}$, $\hat\Sigma$ and $S$ are the estimated population mean, the sample mean, the estimated population covariance matrix and the sample covariance matrix, respectively.
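Equations (22) and (23) in NumPy (a sketch on simulated data; the `ddof=0` option of `np.cov` gives the same $1/n$ divisor as the MLE):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([2.0, 3.0])
Sigma = np.array([[4.0, 0.0],
                  [0.0, 16.0]])
n = 5000
X = rng.multivariate_normal(mu, Sigma, size=n)

mu_hat = X.mean(axis=0)                       # Equation (22): MLE of the mean
S = np.cov(X.T)                               # sample covariance, (n-1)^{-1}
Sigma_hat = (n - 1) / n * S                   # Equation (23): MLE rescales S

# Equivalent direct computation with the 1/n divisor:
assert np.allclose(Sigma_hat, np.cov(X.T, ddof=0))
```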
69. Gaussian Random Vectors
Question 9: From a sample of size 5000 (Example 6), compute the maximum likelihood estimates of the population mean and population covariance matrix.
Solution:
71. Marginal and Conditional Normal Distributions
Consider a normal random vector $X = [X_1, X_2, \ldots, X_d]^T$. Let $X^{[1]}$ be the vector consisting of the first $d_1$ entries of $X$, and let $X^{[2]}$ be the vector consisting of the remaining $d_2$ entries:
$$X = \begin{bmatrix} X^{[1]} \\ X^{[2]} \end{bmatrix} \qquad (24)$$
For $\iota = 1, 2$ we let $\mu_\iota$ be the mean of $X^{[\iota]}$ and $\Sigma_\iota$ its covariance matrix.
Question 10: Let $X \sim N(\mu, \Sigma)$ be 4-variate, where $\mu = [2, 3, 2, 3]^T$,
$$\Sigma = \begin{bmatrix} 4 & 1 & 1 & 1 \\ 1 & 4 & 1 & 1 \\ 1 & 1 & 4 & 1 \\ 1 & 1 & 1 & 4 \end{bmatrix}, \qquad
\Sigma^{-1} = \begin{bmatrix} 0.286 & -0.048 & -0.048 & -0.048 \\ -0.048 & 0.286 & -0.048 & -0.048 \\ -0.048 & -0.048 & 0.286 & -0.048 \\ -0.048 & -0.048 & -0.048 & 0.286 \end{bmatrix}$$
Compute $\mu_1$ and $\Sigma_1$ of $X^{[1]}$, and $\mu_2$ and $\Sigma_2$ of $X^{[2]}$, where $d_1$ and $d_2$ are 2. Analyse all the properties from Results 1.4 and 1.5.
72. Marginal and Conditional Normal Distributions
Result 1.4
Assume that $X^{[1]}$, $X^{[2]}$ and $X$ are given by Equation 24 for some $d_1, d_2 \le d$ such that $d_1 + d_2 = d$. Assume also that $X \sim N(\mu, \Sigma)$.
1. For $j = 1, \ldots, d$ the jth variable $X_j$ of $X$ has the distribution $N(\mu_j, \sigma_j^2)$.
2. For $\iota = 1, 2$, $X^{[\iota]}$ has the distribution $N(\mu_\iota, \Sigma_\iota)$.
3. The (between) covariance matrix $\mathrm{cov}(X^{[1]}, X^{[2]})$ of $X^{[1]}$ and $X^{[2]}$ is the $d_1 \times d_2$ submatrix $\Sigma_{12}$ of
$$\Sigma = \begin{bmatrix} \Sigma_1 & \Sigma_{12} \\ \Sigma_{12}^T & \Sigma_2 \end{bmatrix} \qquad (25)$$
The marginal distributions of normal random vectors are themselves normal, with the means and covariance matrices of the corresponding sub-vectors of the original random vector.
73. Marginal and Conditional Normal Distributions
Result 1.5
Assume that $X^{[1]}$, $X^{[2]}$ and $X$ are given by Equation 24 for some $d_1, d_2 \le d$ such that $d_1 + d_2 = d$. Assume also that $X \sim N(\mu, \Sigma)$ and that $\Sigma_1$ and $\Sigma_2$ are invertible.
If $X^{[1]}$ and $X^{[2]}$ are independent, then the covariance matrix $\Sigma_{12}$ of $X^{[1]}$ and $X^{[2]}$ satisfies
$$\Sigma_{12} = 0_{d_1 \times d_2} \qquad (26)$$
Assume that $\Sigma_{12} \ne 0_{d_1 \times d_2}$. Put $X_{2|1} = X^{[2]} - \Sigma_{12}^T\Sigma_1^{-1}X^{[1]}$. Then $X_{2|1}$ is a $d_2$-dimensional random vector which is independent of $X^{[1]}$, and $X_{2|1} \sim N(\mu_{2|1}, \Sigma_{2|1})$ with
$$\mu_{2|1} = \mu_2 - \Sigma_{12}^T\Sigma_1^{-1}\mu_1 \quad \text{and} \quad \Sigma_{2|1} = \Sigma_2 - \Sigma_{12}^T\Sigma_1^{-1}\Sigma_{12} \qquad (27)$$
74. Marginal and Conditional Normal Distributions
Result 1.5 (continued)
Let $(X^{[1]} \,|\, X^{[2]})$ be the conditional random vector $X^{[1]}$ given $X^{[2]}$. Then $(X^{[1]} \,|\, X^{[2]}) \sim N(\mu_{X^{[1]}|X^{[2]}}, \Sigma_{X^{[1]}|X^{[2]}})$ with
$$\mu_{X^{[1]}|X^{[2]}} = \mu_1 + \Sigma_{12}\Sigma_2^{-1}(X^{[2]} - \mu_2) \qquad (28)$$
$$\Sigma_{X^{[1]}|X^{[2]}} = \Sigma_1 - \Sigma_{12}\Sigma_2^{-1}\Sigma_{12}^T \qquad (29)$$
The first property states that independence always implies uncorrelatedness, and that for the normal distribution the converse holds too. The second property shows how one can uncorrelate the vectors $X^{[1]}$ and $X^{[2]}$. The last property is about the adjustments that are needed when the sub-vectors have a non-zero covariance matrix.
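Equations (28) and (29) applied to the partition of Question 10 (the observed value of $X^{[2]}$ below is hypothetical, chosen only to illustrate the computation):

```python
import numpy as np

# Partitioned parameters from Question 10 (d1 = d2 = 2).
mu = np.array([2.0, 3.0, 2.0, 3.0])
Sigma = np.full((4, 4), 1.0) + 3 * np.eye(4)   # 4 on the diagonal, 1 elsewhere

mu1, mu2 = mu[:2], mu[2:]
S1, S12, S2 = Sigma[:2, :2], Sigma[:2, 2:], Sigma[2:, 2:]

x2 = np.array([3.0, 5.0])                      # a hypothetical observed X[2]

# Equations (28) and (29): conditional mean and covariance of X[1] | X[2].
mu_cond = mu1 + S12 @ np.linalg.solve(S2, x2 - mu2)
Sigma_cond = S1 - S12 @ np.linalg.solve(S2, S12.T)

print(mu_cond)       # [2.6 3.6]
print(Sigma_cond)    # [[3.6 0.6], [0.6 3.6]]
```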
75. Marginal and Conditional Normal Distributions
Q.10 solution:
85. Summary
Here, we have discussed
Different types of multivariate and high-dimensional problems.
Three-dimensional visualisation of features from the HIV and Iris
flower data-sets.
Data visualisation using vertical parallel coordinate plots.
Data visualisation using horizontal parallel coordinate plots.
Differentiation between population cases and sample cases.
Population mean, population covariance matrix, sample mean and
sample covariance matrix of multivariate random vectors.
Population mean, population covariance matrix, sample mean and
sample covariance matrix of Gaussian random vectors.
Parameters and properties of marginal and conditional normal
distributions.
86. For Further Reading I
I. Koch.
Analysis of Multivariate and High-Dimensional Data (Vol. 32).
Cambridge University Press, 2014.
F. Emdad and S. R. Zekavat.
High Dimensional Data Analysis: Overview, Analysis and Applications.
VDM Verlag, 2008.