Introduction to Multivariate Data Analysis (MVA)
1
o Introduction to exploring data with MVA
o Tutorial on using R to perform multivariate analysis
What is Multivariate analysis?
•‘Multivariate’ means data represented by two or more variables
e.g. height, weight and gender of a person
• Majority of datasets collected in biomedical research are multivariate
• These datasets nearly always contain ‘noise’
• The aim of exploratory MVA is to discover patterns that exist within the data despite the noise
e.g. patterns may be subgroups of patients with a certain disease
• When we apply MV methods we study:
• Variation in each variable
• Similarity or distance between variables
• In MVA we work in multidimensional space
2
A Typical Multivariate Dataset has Independent and Dependent Variables
p1 p2 p3 p4 p5
g1 77.2 91.6 41.9 37.2 68.5
g2 74.2 66.9 21.2 31.4 57.1
g3 66.6 49.6 71.2 27.8 72.6
g4 28.9 0.2 17.7 1.4 8.1
g5 3.5 3.9 4.1 8.2 6.4
g6 18 47.4 94 59 7
g7 73.1 42.8 34.9 96.3 25
g8 66.7 34.3 48.2 44.3 51
g9 98.2 82.7 28.1 17.7 47.6
g10 20.3 61.6 45.5 83.5 70.9
g11 0.3 0.9 2.1 4.1 1.1
g12 34.1 12.3 90.6 73.4 90.9
g13 68 48.2 5.2 10.1 66.7
g14 5.3 74.6 64.1 19.4 16.8
g15 73.5 67.8 13.6 12.5 81.6
g16 4 14 16.5 22 16.5
g17 69.5 61.3 53.3 78.7 73.3
g18 0.9 7.4 12.5 1.4 15.9
g19 1.7 16.2 32.5 37.4 79.4
g20 49.8 52.4 85.7 47.7 84.8
Columns = Dependent Variables (DV's); rows = Independent Variables (IV's)
e.g. The expression levels for 20 genes in 5 patients
An expression level in a patient is dependent on the gene
3
Multivariate datasets can contain mixed data types:
Data in a variable can be:
Numerical 0, 1, 2, 3… or 0.1, 0.2, 0.3… e.g. height, gene expression level
Categorical (factor) A, B, AB, O… e.g. blood group
0, 1, 2, 3… e.g. immunohistochemistry score
0 or 1 e.g. survival status (0 = dead; 1 = alive)
Data types
P1 P2 P3 P4 P5
V1 77.2 74.2 66.6 28.9 3.5
V2 91.6 66.9 49.6 0.2 3.9
V3 41.9 21.2 71.2 17.7 4.1
V4 0 1 0 1 1
V5 A A C E B
(V1–V3 are numerical; V4 and V5 are categorical)
4
There are different categories of MVA methods
Multivariate statistics (Exploratory):
- Find underlying patterns in the data
- Determine groups e.g. similar genes
- Generate hypotheses
Machine learning (Modelling & Classification):
- Create models e.g. predict cancer
- Classify groups e.g. new cancer subgroup
5
MVA methods: we will look at multivariate statistical methods for exploratory analysis
Main categories of Exploratory MVA methods that we will look at
Clustering:
• Tree based: Hierarchical Cluster Analysis (HCA)
• Partition: K-Means; Partition Around Medoids (PAM)
Data Reduction:
• Principal Components Analysis (PCA)
All these methods allow good visualization of patterns in your data
6
Commonly used software for multivariate analysis in academia
Commercial:
SPSS - Limited
Minitab - Limited
Matlab - Comprehensive
Free & open source:
R - Comprehensive
Octave - Comprehensive
WEKA - Comprehensive
Many other (more limited) free software packages available here:
http://www.freestatistics.info/en/stat.php
7
This lecture focuses on how we can use R directly from within Microsoft Excel
R Statistical Analysis & Programming Environment
Download here: http://cran.r-project.org/
Introductory book: http://cran.r-project.org/doc/manuals/R-intro.pdf
Recommended book: R for Medicine and Biology, Jones & Bartlett, 2009
8
R can be your ‘hub’ for data analysis
9
Rest of the lecture: exploring our data using these methods…
1. Hierarchical Cluster Analysis
2. Partition Clustering
3. PCA
+ Examples
10
Please download the Demo.xlsx workbook from Blackboard
- This workbook contains all the R code you need to work through the lecture
The Excel Workbook for MVA – Demo.xlsx
Select Worksheet
Select Code
11
Hierarchical Cluster Analysis
1
12
Hierarchical Cluster Analysis
Objective:
We have a dataset of DV’s (columns) and IV’s
(rows)
We want to VISUALIZE how DV’s group together
according to how similar they are across the IV
scores or vice versa
So we measure Similarity = Distance
What does HCA give you?
A tree (or dendrogram)
A B C D
S1 42 18 4 37
S2 35 23 10 48
S3 39 25 7 22
... ... ... ... ...
S10 27 22 16 41
(rows S1–S10 = patients; columns A–D = genes)
13
Steps: 1. Data → distance matrix 2. Build tree 3. Visualize how many groups there are
The distance between two points is the length of the path connecting them.
The closer together two points (i.e. your variables) are, the more similar they are in what is being measured
What do we mean by distance?
14
(figure: two points, A and B, in multidimensional space)
Think of your data as being points in multidimensional space
1. Create a distance matrix: measure similarity between column variables
How similar are variables A & B across all cases S1…Sn?
(figure: A and B plotted against axes S1 and S2, each running 0–50; the sides of the right triangle are 24 and 12, giving a distance of 26.8)
AB = √(24² + 12²) ≈ 26.8
A B C D
S1 42 18 4 37
S2 35 23 10 48
S3 39 25 7 22
... ... ... ... ...
S10 27 22 16 41
(rows = patients; columns = genes)
15
Measure similarity between variables
(figures: A and B plotted for further case pairs – S1 v S2 gives 26.8, S1 v S3 gives 25.3, S1 v S10 gives 26.4, and so on)
Distance between A and B across ALL cases:
AB = √(24² + 12² + 8² + … + 5²)
A B C D
S1 42 18 4 37
S2 35 23 10 48
S3 39 25 7 22
... ... ... ... ...
S10 27 22 16 41
(rows = patients; columns = genes)
16
The distance matrix represents similarity measures for ALL pairs of variables across ALL cases
     A   B   C   D
A    0
B   26   0
C   18  32   0
D   31  22   9   0
17
The distance matrix
Tree Building from the distance matrix
1. Find the smallest distance value between a pair (here C–D, at 9)
2. Take the average and create a new matrix combining the pair

     A   B   C   D
A    0
B   26   0
C   18  32   0
D   31  22   9   0

      A     B  C&D
A     0
B    26     0
C&D  24.5  27    0

        B  A&C&D
B       0
A&C&D  26.5    0

(figure: dendrogram with leaves C, D, A, B; merge heights 9, 24.5 and 26.5)
18
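As a hedged sketch (not the workbook's code), the same agglomeration can be reproduced in R by feeding this distance matrix to hclust(). Note that R's 'average' linkage (UPGMA) averages over all pairs of objects, so the final merge height (about 26.7) differs slightly from the slide's pairwise average of 26.5:

m <- matrix(c( 0, 26, 18, 31,
              26,  0, 32, 22,
              18, 32,  0,  9,
              31, 22,  9,  0),
            nrow = 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))
hc <- hclust(as.dist(m), method = "average")  # as.dist() turns the matrix into a distance object
plot(hc)       # C and D merge first at height 9, then A joins them at 24.5
hc$height      # the merge heights of the tree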
Euclidean distance. This is probably the most commonly chosen type of distance. It is simply the geometric distance
in multidimensional space.
Squared Euclidean distance. You may want to square the standard Euclidean distance in order to place
progressively greater weight on objects that are further apart.
City-block (Manhattan) distance. This distance is simply the average difference across dimensions. In most
cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure, the
effect of single large differences (outliers) is dampened (since they are not squared).
Correlation distance – treats 1 − correlation as the distance, so strongly correlated variables are 'close'
Gower's distance – allows you to use mixed numerical and categorical data
Some common distance measures
19
This is what I just used
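In R these measures map onto calls like the following (a sketch, assuming the numeric data are in a data.frame called 'dat' and the cluster package is installed):

d.euc <- dist(dat, method = "euclidean")   # geometric distance
d.sq  <- d.euc^2                           # squared Euclidean distance
d.man <- dist(dat, method = "manhattan")   # city-block distance
d.cor <- as.dist(1 - cor(t(dat)))          # correlation distance between the rows of dat
library(cluster)
d.gow <- daisy(dat, metric = "gower")      # Gower's distance, for mixed data types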
Single linkage (nearest neighbor). The distance between two clusters is determined
by the distance of the two closest objects (nearest neighbors) in the different clusters. This rule will, in a
sense, string objects together to form clusters, and the resulting clusters tend to represent long
"chains.“
Complete linkage (furthest neighbor). In this method, the distances between
clusters are determined by the greatest distance between any two objects in the different clusters (i.e.,
by the "furthest neighbors"). This method usually performs quite well in cases when the objects actually
form naturally distinct "clumps." If the clusters tend to be somehow elongated or of a "chain" type
nature, then this method is inappropriate.
Unweighted pair-group average. In this method, the distance between two clusters is
calculated as the average distance between all pairs of objects in the two different clusters. This method
is also very efficient when the objects form natural distinct "clumps," however, it performs equally well
with elongated, "chain" type clusters.
Some common tree building algorithms
20
This is what I just used
21
Install all the required libraries for MVA in R
These libraries need to be downloaded into R
Copy the lines of code from the ‘Setup’ worksheet
Run the code in R (see next slide)
22
Select a Download Source…
Choose Bristol or London
Install all the required libraries for MVA in R
23
Install all the required libraries for MVA in R
Then load the libraries into R
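The exact package list is on the Setup worksheet; as a minimal sketch, installing and then loading one library the lecture relies on (the cluster package, used later for PAM and silhouette plots) looks like this:

install.packages("cluster")   # you will be prompted to pick a CRAN mirror
library(cluster)              # load the package into the current R session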
Select the data from the gray, highlighted area…
Paste it into a text file
Save the file as 'data.txt'
Load it into a data.frame called 'dat'
Use the code:
dat <- read.table("data.txt", header=TRUE, row.names=1)
Make sure that R is pointing to your directory/folder
24
Data Worksheet
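Pointing R at the right folder and checking the load can be done like this (the path below is a placeholder for wherever you saved data.txt):

getwd()                          # where is R currently looking?
setwd("C:/path/to/your/folder")  # hypothetical path - change it to your own folder
dat <- read.table("data.txt", header = TRUE, row.names = 1)
head(dat)                        # print the first rows to confirm the load worked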
Using Hierarchical Cluster Analysis in R
25
Using Hierarchical Cluster Analysis in R
Click on the HCA tab in the workbook
To Plot a dendrogram for DV’s with: Distance matrix= ‘correlation’, Tree building = ‘complete’
26
- Copy the code from cell A17 and run in R (the dendrogram should appear)
- The tree shows the similarities between patients according to gene expression levels
27
To Plot a dendrogram for IV’s with: Distance matrix= ‘correlation’, Tree building = ‘complete’
- Copy code from cell A22 and run in R
- The tree shows similarities for gene expression across patients
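The workbook cells hold the exact code; a sketch of equivalent calls (correlation distance, complete linkage, with 'dat' holding genes as rows and patients as columns) would be:

d.dv <- as.dist(1 - cor(dat))            # correlation distance between patients (columns)
plot(hclust(d.dv, method = "complete"))  # dendrogram for the DV's
d.iv <- as.dist(1 - cor(t(dat)))         # correlation distance between genes (rows)
plot(hclust(d.iv, method = "complete"))  # dendrogram for the IV's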
To plot a dendrogram and HEATMAP for IV’s and DV’s
28
- Run the code from cells C18:C23
- The trees are now visualized together and the heatmap colours are relative to the
expression levels of each gene in each patient (green = high; red = low; black = intermediate)
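A hedged equivalent uses base R's heatmap() (the workbook may use a different function); redgreen() from the gplots package gives the red-to-green scale described above:

library(gplots)
heatmap(as.matrix(dat),
        distfun   = function(x) as.dist(1 - cor(t(x))),        # correlation distance
        hclustfun = function(d) hclust(d, method = "complete"),
        col = redgreen(75))   # red = low, black = intermediate, green = high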
Summary of what HCA has shown us
HCA...
•Provides an overall feel for how our data
groups
• In the example, there might be:
•2 clusters of patients
•2 large clusters of genes
• 4 or 5 smaller sub-clusters of
genes
•Genes cluster according to patterns of
expression across patients
29
Confirm the number of groups in our data using
Partition Clustering
2
30
Partition Clustering
Objective:
We have a dataset of DV’s (columns) and IV’s
(rows)
We have a feel for how many clusters there are
in our dataset after using HCA
We want to assign our variables into distinct
clusters – so we use a partition clustering
method
What does Partition clustering give you?
A table showing the hard assignment of your
variables into discrete clusters
A B C D
S1 42 18 4 37
S2 35 23 10 48
S3 39 25 7 22
... ... ... ... ...
S10 27 22 16 41
(rows = patients; columns = genes)
31
Steps in Partition Clustering
1. Choose a partition clustering method suitable for your data
e.g. K-Means, Partition Around Medoids
2. Tell the method how many clusters you think there are in the dataset
e.g. 2, 3, 4…..
3. Read output table to see which cluster each variable has been assigned to
4. Try to assess the ‘fit’ of each variable in a cluster
i.e. how well has clustering worked?
5. Repeat with a different cluster number until you get the best fit
32
The most widely used method is K-Means clustering
K-Means uses Euclidean distance to create the distance matrix
Partition Clustering Algorithm Overview….
1. You have to define the number of clusters
2. A distance matrix is created between variables
3. Random cluster ‘centres’ are created in multidimensional space
4. Method then assigns samples to nearest cluster centre
5. Cluster centres are then moved to better fit the samples
6. Samples are reassigned to cluster centres
7. Process repeated until best fit is achieved
33
All this will be explained pictorially in the next few slides
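A minimal K-Means sketch in base R (the starting centres are random, so a seed is fixed for reproducibility; 'dat' and the choice of 4 clusters are assumptions):

set.seed(1)                                  # K-Means starts from random centres
km <- kmeans(dat, centers = 4, nstart = 25)  # try 25 random starts, keep the best fit
km$cluster                                   # the cluster each row was assigned to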
An Example … are there 4 clusters in this dataset?
Data Space...
The gray dots represent data and red squares possible cluster ‘centres’
35
Using the interactive tool at the URL below we can follow how K-Means partitions our data
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
36
K-Means starts by RANDOMLY assigning cluster centres to the data
Boundaries are drawn around the nearest data points that K-Means thinks should group with the cluster
centre. The cluster centre is then shifted towards the centre of these data points
37
The boundary lines are then redrawn around the data points that are closest to the new cluster centres
This means that some data points better fit a new cluster
38
It keeps doing this….
39
…and on….
40
41
…and on….
42
…and on….
43
…and on….
A best fit is achieved – it cannot get a better fit by moving centres around…until….
44
Variable Cluster Variable Cluster Variable Cluster
1 3 11 2 21 2
2 4 12 4 22 4
3 4 13 1 23 1
4 1 14 3 24 2
5 2 15 4 25 1
6 4 16 4
7 2 17 2
8 4 18 3
9 3 19 2
10 1 20 1
Variables are then listed according to cluster
45
Can Partition Clustering methods be used on categorical data?
•You just need to use a different method to create the distance matrix
•Do not use K-Means!
•Use Partition Around Medoids (PAM) with Gower's distance measure instead of K-Means.
Yes!
46
An alternative method to K-Means is K-Medoids clustering
The most common K-Medoids method is: Partition Around Medoids (PAM)
PAM measures the average DISSIMILARITY between variables in a cluster
Why use PAM? PAM is more robust than K-Means as…
• It gives a better approximation of the centre of a cluster
• It can use any type of distance matrix (not just Euclidean distance)
• It uses a novel visualization tool, the silhouette plot, to help you decide the
optimal number of clusters
47
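A sketch of running PAM via the cluster package ('dat' and k = 5 are assumptions; the Gower step is what lets it handle mixed data types):

library(cluster)
d   <- daisy(dat, metric = "gower")   # dissimilarity matrix, works for mixed data
fit <- pam(d, k = 5)                  # partition around 5 medoids
fit$clustering                        # hard cluster assignment for each variable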
Evaluating how well our clustering has worked
How good is the fit of the clusters across variables?
What is the optimal number of clusters?
The silhouette plot provides these answers
Clusters = 4; N = 75
Bars = fit of each sample in its cluster; bar length = goodness of fit
Each cluster has an average silhouette length (Si)
Average Silhouette Width = 0.74
Rough rule of thumb: an Average Silhouette Width > 0.4 is good; anything greater than 0.5 is a decent fit
48
If Clusters = 5 then the Average Silhouette Width decreases
Look at cluster 3: one sample has a poor fit (the 'not very good fit' in the plot) and the other samples do not fit so well
Keep trying different cluster numbers (k) to see how the average silhouette width changes
Choose the k that has the highest average silhouette width
49
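This trial-and-error over k can be scripted; a sketch using the average silhouette width that pam() stores with each fit ('d' is the dissimilarity matrix built earlier with daisy()):

for (k in 2:6) {
  fit <- pam(d, k = k)
  cat("k =", k, "average silhouette width =", fit$silinfo$avg.width, "\n")
}
plot(silhouette(pam(d, k = 5)))   # silhouette plot for the chosen k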
The K-Means & PAM Worksheet
50
Running PAM in R
Clustering IV’s
51
Change the value of K (no. clusters) and observe the average silhouette width
K=3: Average Silhouette Width = 0.45
K=4: Average Silhouette Width = 0.49
K=5: Average Silhouette Width = 0.59
52
Getting output to show cluster assignment
Click on a new worksheet and paste output from R
53
Summary of what PAM has shown us
•PAM told us that it is most likely that
there are 5 clusters of genes in our
dataset
•PAM assigned each gene to a definite
cluster
54
Visualize the relationship between variables in groups with
Principal Components Analysis
3
55
Principal Components Analysis (PCA)
What does it do…
• It is a data reduction technique
•It seeks a linear combination of variables such that the maximum variance is extracted
from the variables.
• PCA produces uncorrelated factors (components).
What does it give you…
• The components might represent underlying groups within the data
• By finding a small number of components you have reduced the dimensionality of
your data
56
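A minimal PCA sketch in base R; princomp() with cor = TRUE standardizes the variables first, and its summary prints the 'Importance of components' table shown later (prcomp() is an alternative):

pca <- princomp(dat, cor = TRUE)  # PCA on the correlation matrix
summary(pca)                      # proportion of variance explained per component
pca$scores                        # scores of each row on each component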
X Y
1 42 18
2 35 23
3 39 25
... ... ...
N 27 22
PCA – The Concepts
If we take data for two variables and plot them as a scatter plot, we can draw a
line of best fit through the data (its length spanning the two furthest data points)
By summing the distances between the points and the line we can determine
how much variation in the data each line captures.
We can then draw a second line, at right angles to the first, between the two furthest
data points in that direction, and this line captures the next largest amount of variation
57
•In multivariate data we have many variables plotted in multidimensional space
•So we draw many 'lines of best fit' – each line is called an eigenvector
•The variables have a score on each eigenvector depending on how much variation is
explained by that line (the eigenvalue)
•We refer to the eigenvectors as components
•Different variables will have similar or different correlations on each component
•Therefore we can group together variables according to these similarities
Each data point has a score on each component, like a correlation
(figure: an eigenvector and its eigenvalue drawn through the data)
PCA – The Concepts
58
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Proportion of Variance 0.62 0.24 0.08 0.04
Cumulative Proportion 0.62 0.86 0.95 1.00
How many groups are there?
Why is this important?
- It tells us how many components to retain (i.e. we throw out minor components)
- The number of components we retain is the number of groups in the data
Rough rule of thumb:
Retain components explaining >= 5% of the variation
59
Each component explains different amounts of variation in the data
Eigenvalues help us decide how many components to retain
A Scree plot will show you the eigenvalues
for each component
This scree plot shows the
variance of each component
Rough rule of thumb:
Look to see where the curve levels off
The Kaiser criterion:
Retain components having an eigenvalue > 1
60
How many groups are there?
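A sketch of drawing the scree plot and inspecting the eigenvalues from the princomp() fit above:

screeplot(pca, type = "lines")  # variance (eigenvalue) of each component
pca$sdev^2                      # the eigenvalues themselves, for the Kaiser criterion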
The PCA Worksheet
61
1. Click on a new worksheet
2. Paste output from R
Getting output to show scores of IV’s on components
62
The optimal number of components is 4,
where the variance explained is >= 5%
Generate a Variance Table & a Scree Plot
63
Visualizing the scores of IV’s on components using a scatterplot
This plot shows:
Component 1 (PC1)
v.
Component 2 (PC2)
• PC1 & PC2 separate groups
of genes and patients
64
You can see that P1 and P2 are similar due to levels of gene g9
P3 and P4 are similar
P5 is clearly different to the other patients according to gene expression levels
This plot shows:
Component 1 (PC1)
v.
Component 3 (PC3)
This plot gives another view on the data groups and the relationship between variables and components
65
Visualizing the scores of IV’s on components using a scatterplot
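Plots like these two can be produced with base R's biplot() on the princomp() fit (a sketch; the workbook's plotting code may differ):

biplot(pca, choices = c(1, 2))  # scores on PC1 v PC2, with variable loadings overlaid
biplot(pca, choices = c(1, 3))  # PC1 v PC3, for another view of the groups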
Putting it all together… a whole map of the patterns in our data….
(figure: the HCA dendrogram, PAM clusters and PCA plot side by side, with the same groups labelled A–E in each view)
…We have a consensus of how our variables group
We could generate new hypotheses from our data
66
Typical MVA workflow you can apply to your data in research projects
Dataset →
1. Estimate the number of groups with tree-based clustering (Hierarchical Cluster Analysis)
2. Confirm the number of groups with partition clustering (K-Means, PAM)
3. Visualize the relationships between variables with data reduction (Principal Components Analysis, PCA)
67