PMM23 Week 3 Lectures


  1. Introduction to Multivariate Data Analysis (MVA)
     • Introduction to exploring data with MVA
     • Tutorial on using Excel to perform multivariate analysis
  2. What is multivariate analysis?
     • ‘Multivariate’ means data represented by two or more variables, e.g. the height, weight and gender of a person
     • The majority of datasets collected in biomedical research are multivariate
     • These datasets nearly always contain ‘noise’
     • The aim of exploratory MVA is to discover patterns that exist within the data despite the noise, e.g. the patterns may be subgroups of patients with a certain disease
     • When we apply MV methods we study:
        • the variation in each of these variables
        • the similarity or distance between variables
     • In MVA we work in multidimensional space
  3. A typical multivariate dataset has independent and dependent variables,
     e.g. the expression levels for 20 genes in 5 patients. The patient columns
     (p1–p5) are the dependent variables (DVs); the gene rows (g1–g20) are the
     independent variables (IVs).

            p1    p2    p3    p4    p5
     g1   77.2  91.6  41.9  37.2  68.5
     g2   74.2  66.9  21.2  31.4  57.1
     g3   66.6  49.6  71.2  27.8  72.6
     g4   28.9   0.2  17.7   1.4   8.1
     g5    3.5   3.9   4.1   8.2   6.4
     g6   18.0  47.4  94.0  59.0   7.0
     g7   73.1  42.8  34.9  96.3  25.0
     g8   66.7  34.3  48.2  44.3  51.0
     g9   98.2  82.7  28.1  17.7  47.6
     g10  20.3  61.6  45.5  83.5  70.9
     g11   0.3   0.9   2.1   4.1   1.1
     g12  34.1  12.3  90.6  73.4  90.9
     g13  68.0  48.2   5.2  10.1  66.7
     g14   5.3  74.6  64.1  19.4  16.8
     g15  73.5  67.8  13.6  12.5  81.6
     g16   4.0  14.0  16.5  22.0  16.5
     g17  69.5  61.3  53.3  78.7  73.3
     g18   0.9   7.4  12.5   1.4  15.9
     g19   1.7  16.2  32.5  37.4  79.4
     g20  49.8  52.4  85.7  47.7  84.8

     An expression level in a patient is dependent on the gene.
  4. Data types
     Data in a variable can be:
     • Numerical: 0, 1, 2, 3… or 0.1, 0.2, 0.3… e.g. height, gene expression level
     • Categorical (factor): A, B, AB, O… e.g. blood group; 0, 1, 2, 3… e.g. immunohistochemistry score; 0 or 1 e.g. survival (0 = dead, 1 = alive)
     Multivariate datasets can contain mixed data types, e.g.:

           P1    P2    P3    P4    P5
     V1  77.2  74.2  66.6  28.9   3.5   numerical
     V2  91.6  66.9  49.6   0.2   3.9   numerical
     V3  41.9  21.2  71.2  17.7   4.1   numerical
     V4   0     1     0     1     1     categorical
     V5   A     A     C     E     B     categorical

     A short R sketch of this mixed table follows below.
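     Since the lecture works in R throughout, here is a minimal sketch of the mixed table above as an R data frame, transposed to R's usual cases-as-rows layout; the name `df` is just illustrative:

        # A data frame mixing numerical and categorical (factor) variables,
        # using the values from the slide; patients P1-P5 become rows.
        df <- data.frame(
          V1 = c(77.2, 74.2, 66.6, 28.9, 3.5),      # numerical
          V2 = c(91.6, 66.9, 49.6, 0.2, 3.9),       # numerical
          V3 = c(41.9, 21.2, 71.2, 17.7, 4.1),      # numerical
          V4 = factor(c(0, 1, 0, 1, 1)),            # categorical, coded 0/1
          V5 = factor(c("A", "A", "C", "E", "B")),  # categorical
          row.names = paste0("P", 1:5)
        )
        str(df)  # confirms which columns R treats as numeric and which as factors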
  5. There are different categories of MVA methods. We will look at multivariate
     statistical methods for exploratory analysis.
     • Multivariate statistics
        • Exploratory: find underlying patterns in the data; determine groups, e.g. similar genes; generate hypotheses
        • Modelling & classification: create models, e.g. predict cancer; classify groups, e.g. a new cancer subgroup
     • Machine learning
  6. Main categories of exploratory MVA methods that we will look at:
     • Clustering
        • Tree based: Hierarchical Cluster Analysis (HCA)
        • Partition: K-Means; Partition Around Medoids (PAM)
     • Data reduction
        • Principal Components Analysis (PCA)
     All these methods allow good visualization of patterns in your data.
  7. Commonly used software for multivariate analysis in academia
     • Commercial: SPSS (limited), Minitab (limited), Matlab (comprehensive)
     • Free & open source: R (comprehensive), Octave (comprehensive), WEKA (comprehensive)
     Many other (more limited) free software packages are listed here:
     http://www.freestatistics.info/en/stat.php
     This lecture focuses on how we can use R directly from within Microsoft Excel.
  8. The R statistical analysis & programming environment
     • Download here: http://cran.r-project.org/
     • Introductory book: http://cran.r-project.org/doc/manuals/R-intro.pdf
     • Recommended book: R for Medicine and Biology, Jones & Bartlett, 2009
  9. R can be your ‘hub’ for data analysis.
  10. You can use R directly from Excel
      Excel and R can be linked by installing a piece of ‘middleware’ called RCom (see next slide).
      Combining Excel and R provides you with an environment for complete data processing and analysis:
      1. Use Excel to put your data together
      2. Use a menu in Excel to analyse your data in R
      3. Open the demo workbook
      4. Use this workbook to analyze your data
  11. Full instructions for downloading and installing R for Excel
      1. Download and install R and the other software you need to use R in Microsoft Excel:
         http://cancerinformatics.swansea.ac.uk/pathology/pmm23/rexcel.htm
         ** PM-M23 students – you should already have installed this software in Week 2 **
      2. Download the Excel workbook that accompanies the lecture:
         http://cancerinformatics.swansea.ac.uk/pathology/pmm23/Demo.zip
  12. If you encounter the following error during installation, you will need to download, unzip and install the Office Service Pack 1 file:
      http://cancerinformatics.swansea.ac.uk/pathology/pmm23/officesp1.zip
      If an error occurs during that installation, you will need to download, unzip and install the Office Update file:
      http://cancerinformatics.swansea.ac.uk/pathology/pmm23/officeupdate.zip
  13. The Excel workbook for MVA – Demo.xlsx
      • Select worksheet
      • Select code
  14. The rest of the lecture explores our data using these methods in Excel:
      1. Hierarchical Cluster Analysis
      2. Partition Clustering
      3. PCA
      …plus examples.
  15. Part 1: Hierarchical Cluster Analysis
  16. Hierarchical Cluster Analysis (HCA)
      Objective: we have a dataset of DVs (columns, patients A–D) and IVs (rows, genes S1–S10):

             A    B    C    D
      S1    42   18    4   37
      S2    35   23   10   48
      S3    39   25    7   22
      ...  ...  ...  ...  ...
      S10   27   22   16   41

      We want to VISUALIZE how the DVs group together according to how similar
      they are across the IV scores, or vice versa. So we measure similarity as
      distance.
      What does HCA give you? A tree (or dendrogram) showing how many groups there are.
      Steps: 1. data → distance matrix; 2. build tree; 3. visualize (see the R sketch below).
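     The three steps map directly onto base R. A minimal sketch, assuming `dat` is the genes-by-patients matrix loaded into R (the name the workbook asks for later):

        # Step 1: distance matrix between the column variables (patients);
        # dist() works on rows, so we transpose first.
        d <- dist(t(dat), method = "euclidean")

        # Step 2: build the tree by agglomerative clustering.
        hc <- hclust(d, method = "complete")

        # Step 3: visualize the dendrogram.
        plot(hc)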
  17. What do we mean by distance?
      Think of your data as being points in multidimensional space. The distance
      between two points (Point A and Point B) is the length of the path connecting
      them. The closer together two points (i.e. your variables) are, the more
      similar they are in what is being measured.
  18. Step 1: Create a distance matrix
      Measure the similarity between the column variables. Plotting variables A
      and B against cases S1 and S2 gives differences of 24 and 12, so:
      How similar are variables A & B across all cases S1…Sn?
      AB = √((24)² + (12)²) = 26.8
  19. Measure the similarity between variables across every pair of cases.
      Plotting A against B for S1 v. S2, S1 v. S3, … S1 v. S10 gives pairwise
      distances of 26.8, 25.3, 26.4 and so on. The distance between A and B
      across all cases is then:
      AB = √((24)² + (12)² + (8)² + … + (5)²)
      (checked in R below)
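     The slide's arithmetic can be checked directly in R; 42, 35, 18 and 23 are the A and B scores for cases S1 and S2 in the example table:

        # Differences between A and B on S1 and S2 are 24 and 12:
        sqrt(24^2 + 12^2)     # 26.83..., shown as 26.8 on the slide

        # Equivalently, from the raw scores:
        A <- c(42, 35)        # A's scores on S1, S2
        B <- c(18, 23)        # B's scores on S1, S2
        dist(rbind(A, B))     # the same Euclidean distance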
  20. The distance matrix
      The distances represent similarity measures for ALL pairs of variables across ALL cases:

          A   B   C   D
      A   0
      B  26   0
      C  18  32   0
      D  31  22   9   0
  21. Tree building from the distance matrix
      1. Find the smallest distance value between a pair
      2. Take the average and create a new matrix combining the pair

      Starting matrix:        After merging C & D:     After merging A with C&D:
          A   B   C   D            A     B   C&D             B   A&C&D
      A   0                   A    0                   B     0
      B  26   0               B   26     0             A&C&D 26.5   0
      C  18  32   0           C&D 24.5  27    0
      D  31  22   9   0

      C and D merge first (distance 9), then A joins C&D at 24.5, and finally
      B joins A&C&D at 26.5, which gives the dendrogram.
  22. Some common distance measures
      • Euclidean distance (the measure just used above). This is probably the most commonly chosen type of distance. It is simply the geometric distance in multidimensional space.
      • Squared Euclidean distance. You may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart.
      • City-block (Manhattan) distance. This distance is simply the average difference across dimensions. In most cases it yields results similar to the simple Euclidean distance; note, however, that the effect of single large differences (outliers) is dampened (since they are not squared).
      • Correlation.
      • Gower’s distance – allows you to use mixed numerical and categorical data.
  23. Some common tree-building algorithms
      • Single linkage (nearest neighbour). The distance between two clusters is determined by the distance of the two closest objects (nearest neighbours) in the different clusters. This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains".
      • Complete linkage (furthest neighbour). The distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbours"). This method usually performs quite well when the objects actually form naturally distinct "clumps". If the clusters tend to be elongated or of a "chain" type nature, then this method is inappropriate.
      • Unweighted pair-group average (the method used in the tree-building example above). The distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is also very efficient when the objects form natural distinct "clumps", and it performs equally well with elongated, "chain" type clusters.
      A sketch of these choices in R follows below.
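     A sketch of how the distance measure and linkage method are expressed in R, again assuming `dat` is the numeric genes-by-patients matrix; `daisy` comes from the cluster package:

        # Distance measures between the column variables:
        d_euc <- dist(t(dat), method = "euclidean")   # geometric distance
        d_man <- dist(t(dat), method = "manhattan")   # city-block distance
        d_cor <- as.dist(1 - cor(dat))                # correlation-based distance

        # For mixed numerical/categorical data, Gower's distance:
        # library(cluster); d_gow <- daisy(df, metric = "gower")

        # Tree-building (linkage) algorithms:
        hc_single   <- hclust(d_cor, method = "single")    # nearest neighbour
        hc_complete <- hclust(d_cor, method = "complete")  # furthest neighbour
        hc_average  <- hclust(d_cor, method = "average")   # unweighted pair-group average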
  24. Using Hierarchical Cluster Analysis in Excel – 1. Start R
      1. Click on the Add-ins tab
      2. Click on the RExcel menu
      3. Click on ‘Connect R’
      These steps are always used to start R in Excel.
  25. Using Hierarchical Cluster Analysis in Excel – 2. Install libraries in R
      1. Highlight the cell A2
      2. Right click the selection
      3. Click on ‘Run Code’ to install
  26. Using Hierarchical Cluster Analysis in Excel – 3. Select a download source
      1. Choose Bristol or London
  27. Using Hierarchical Cluster Analysis in Excel – 4. Load the necessary libraries (Setup worksheet)
      1. Highlight the cells with the code
      2. Right click the selection
      3. Click on ‘Run Code’ to load the libraries in R
  28. Using Hierarchical Cluster Analysis in Excel – 5. Select data (Data worksheet)
      1. Highlight the dataset with column/row names
      2. Right click the selection
      3. Click on ‘Put R Var’
      4. Type ‘dat’ into the ‘Array name in R’ box
      5. Tick the ‘with rownames’ and ‘with columnames’ boxes
      6. Click OK
  29. Using Hierarchical Cluster Analysis in Excel
      Click on the HCA tab in the workbook.
  30. To plot a dendrogram for the DVs with distance matrix = ‘correlation’ and tree building = ‘complete’:
      • Right click the cell A19 and click on ‘Run code’ (the dendrogram should appear)
      • The tree shows the similarities between patients according to gene expression levels
  31. To plot a dendrogram for the IVs with distance matrix = ‘correlation’ and tree building = ‘complete’:
      • Right click the cell A22 and click on ‘Run code’
      • The tree shows similarities for gene expression across patients
  32. To plot a dendrogram and HEATMAP for the IVs and DVs together:
      • Highlight and right click the cells C18:C23 and click on ‘Run code’
      • The trees are now visualized together, and the heatmap colours are relative to the expression level of each gene in each patient (green = high; red = low; black = intermediate)
      A base-R sketch of the equivalent call follows below.
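     Behind the workbook's button, a base-R equivalent would look roughly like this; it is a sketch, not the workbook's exact code, and again assumes `dat` is the genes-by-patients matrix:

        # Dendrograms for IVs (rows) and DVs (columns) plus a heatmap, using a
        # correlation distance and complete linkage, with a red-black-green
        # palette (red = low, black = intermediate, green = high).
        heatmap(as.matrix(dat),
                distfun   = function(x) as.dist(1 - cor(t(x))),
                hclustfun = function(d) hclust(d, method = "complete"),
                col       = colorRampPalette(c("red", "black", "green"))(64))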
  33. Summary of what HCA has shown us
      HCA…
      • provides an overall feel for how our data groups
      • suggests that, in the example, there might be:
         • 2 clusters of patients
         • 2 large clusters of genes
         • 4 or 5 smaller sub-clusters of genes
      • shows that genes cluster according to patterns of expression across patients
  34. Part 2: Confirm the number of groups in our data using Partition Clustering
  35. Partition Clustering
      Objective: we have a dataset of DVs (columns) and IVs (rows), the same
      genes-by-patients table as before, and after using HCA we have a feel for
      how many clusters there are in our dataset. We now want to assign our
      variables to distinct clusters, so we use a partition clustering method.
      What does partition clustering give you? A table showing the hard
      assignment of your variables to discrete clusters.
  36. Steps in Partition Clustering
      1. Choose a partition clustering method suitable for your data, e.g. K-Means, Partition Around Medoids
      2. Tell the method how many clusters you think there are in the dataset, e.g. 2, 3, 4…
      3. Read the output table to see which cluster each variable has been assigned to
      4. Try to assess the ‘fit’ of each variable in a cluster, i.e. how well has the clustering worked?
      5. Repeat with a different cluster number until you get the best fit
  37. Partition clustering algorithm overview (all of this is illustrated pictorially in the next few slides):
      1. You have to define the number of clusters
      2. A distance matrix is created between variables
      3. Random cluster ‘centres’ are created in multidimensional space
      4. The method then assigns samples to the nearest cluster centre
      5. Cluster centres are then moved to better fit the samples
      6. Samples are reassigned to cluster centres
      7. The process is repeated until the best fit is achieved
      The most widely used method is K-Means clustering, which uses Euclidean
      distance to create the distance matrix (see the R sketch below).
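     In R this whole loop is a single call. A minimal sketch, assuming `dat` holds the numeric data and that we are looking for 4 clusters; `nstart` re-runs the random initialization several times to avoid a poor local optimum:

        set.seed(1)                                  # centres start randomly, so fix the seed
        km <- kmeans(dat, centers = 4, nstart = 25)  # steps 1-7 above, run to convergence
        km$cluster                                   # hard assignment of each row to a cluster
        km$centers                                   # the final cluster centres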
  38. An example: are there 4 clusters in this dataset?
      Data space… the grey dots represent data and the red squares possible cluster ‘centres’.
  39. Using the interactive tool at the URL below we can follow how K-Means partitions our data:
      http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
  40. K-Means starts by RANDOMLY assigning cluster centres to the data.
  41. Boundaries are drawn around the nearest data points that K-Means thinks should group with the cluster centre. The cluster centre is then shifted towards the centre of these data points.
  42. The boundary lines are then redrawn around the data points that are closest to the new cluster centres. This means that some data points now fit a new cluster better.
  43. It keeps doing this…
  44. …and on…
  45. …and on…
  46. …and on…
  47. …and on…
  48. …until a best fit is achieved – it cannot get a better fit by moving the centres around.
  49. Variables are then listed according to cluster:

      Variable  Cluster    Variable  Cluster    Variable  Cluster
      1         3          11        2          21        2
      2         4          12        4          22        4
      3         4          13        1          23        1
      4         1          14        3          24        2
      5         2          15        4          25        1
      6         4          16        4
      7         2          17        2
      8         4          18        3
      9         3          19        2
      10        1          20        1
  50. Can partition clustering methods be used on categorical data? Yes!
      • You just need to use a different method to create the distance matrix
      • Do not use K-Means!
      • Use Partition Around Medoids (PAM) instead of K-Means, with Gower’s distance measure
  51. An alternative method to K-Means: K-Medoids clustering
      The most common K-Medoids method is Partition Around Medoids (PAM).
      PAM measures the average DISSIMILARITY between variables in a cluster.
      Why use PAM? PAM is more robust than K-Means because…
      • it gives a better approximation of the centre of a cluster
      • it can use any type of distance matrix (not just Euclidean distance)
      • it uses a novel visualization tool, the silhouette plot, to help you decide the optimal number of clusters
      A sketch of PAM with Gower’s distance in R follows below.
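     A minimal PAM sketch using the cluster package; `df` stands for a data frame that may mix numerical and factor columns (an assumed name, as in the earlier data-types example):

        library(cluster)

        # Gower's distance handles mixed numerical and categorical variables.
        d_gow <- daisy(df, metric = "gower")

        # Partition Around Medoids with k = 5 clusters on the dissimilarity matrix.
        pm <- pam(d_gow, k = 5, diss = TRUE)
        pm$clustering              # cluster assignment for each case
        plot(pm, which.plots = 2)  # the silhouette plot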
  52. Evaluating how well our clustering has worked
      How good is the fit of the clusters across the variables? What is the
      optimal number of clusters? The silhouette plot provides these answers.
      In the example plot: clusters = 4, N = 75; each bar is the fit of a sample
      in its cluster, and bar length = goodness of fit. Each cluster has an
      average length (Si). Here the average silhouette width = 0.74.
      Rough rule of thumb: an average silhouette width > 0.4 is good, and
      anything greater than 0.5 is a decent fit.
  53. Keep trying different cluster numbers (k) to see how the average silhouette width changes.
      If clusters = 5, the average silhouette width decreases. Look at cluster 3:
      one sample has a poor fit, and the other samples do not fit so well either.
      Choose the k that has the highest average silhouette width (this scan is
      easy to script; see the sketch below).
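     A sketch of scanning k and comparing average silhouette widths, reusing the `d_gow` dissimilarity from the PAM example above:

        library(cluster)

        # Try a range of cluster numbers and report the average silhouette width.
        for (k in 2:6) {
          fit <- pam(d_gow, k = k, diss = TRUE)
          cat("k =", k, "-> average silhouette width:", fit$silinfo$avg.width, "\n")
        }
        # Pick the k with the highest width; > 0.4 is the rough rule of thumb above.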
  54. The K-Means & PAM worksheet
  55. Running PAM in Excel: clustering the IVs
  56. Change the value of K (no. of clusters) and observe the average silhouette width:
      • K = 3: average silhouette width = 0.45
      • K = 4: average silhouette width = 0.49
      • K = 5: average silhouette width = 0.59
  57. Getting output to show the cluster assignment
      1. Click on a new worksheet
      2. Right click a cell
      3. Click ‘Get R Output’
  58. Summary of what PAM has shown us
      • PAM told us that there are most likely 5 clusters of genes in our dataset
      • PAM assigned each gene to a definite cluster
  59. Part 3: Visualize the relationship between variables in groups with Principal Components Analysis
  60. Principal Components Analysis (PCA)
      What does it do?
      • It is a data reduction technique
      • It seeks a linear combination of variables such that the maximum variance is extracted from the variables
      • PCA produces uncorrelated factors (components)
      What does it give you?
      • The components might represent underlying groups within the data
      • By finding a small number of components you have reduced the dimensionality of your data
  61. PCA – the concepts
      If we take data for two variables, X and Y (e.g. X = 42, 35, 39, …, 27 and
      Y = 18, 23, 25, …, 22 over cases 1…N), and plot them as a scatter plot, we
      can draw a line of best fit through the data, its length running between
      the two furthest data points. By summing the distances between the points
      and the line we can determine how much variation in the data the line
      captures. We can then draw a second line at right angles to the first,
      between the two furthest data points in that direction; this line captures
      the next largest share of the variation.
  62. PCA – the concepts
      • In multivariate data we have many variables plotted in multidimensional space
      • So we draw many ‘lines of best fit’ – each line is called an eigenvector
      • The variables have a score on each eigenvector depending on how much variation is explained by that line (the eigenvalue)
      • We refer to the eigenvectors as components
      • Each data point has a score on each component, like a correlation
      • Different variables will have similar or different correlations on each component, so we can group variables together according to these similarities
  63. How many groups are there? Each component explains a different amount of variation in the data:

      Importance of components:   Comp.1  Comp.2  Comp.3  Comp.4
      Proportion of Variance        0.62    0.24    0.08    0.04
      Cumulative Proportion         0.62    0.86    0.95    1.00

      Why is this important?
      • It tells us how many components to retain (i.e. we throw out minor components)
      • The number of components we retain is the number of groups in the data
      Rough rule of thumb: retain components explaining >= 5% of the variation.
  64. How many groups are there? Eigenvalues help us decide how many components to retain.
      A scree plot shows the eigenvalue (the variance) of each component.
      • Rough rule of thumb: look to see where the curve levels off
      • The Kaiser criterion: retain components having an eigenvalue > 1
      Both checks are sketched in R below.
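     Both rules of thumb are easy to apply with base R's `prcomp`; a sketch, assuming `dat` is the genes-by-patients matrix (transposed so patients are the cases):

        p <- prcomp(t(dat), scale. = TRUE)  # PCA on standardized variables

        summary(p)                    # 'Proportion of Variance' and 'Cumulative Proportion'
        screeplot(p, type = "lines")  # scree plot: look for where the curve levels off

        eigenvalues <- p$sdev^2       # the component variances
        eigenvalues > 1               # Kaiser criterion: TRUE = retain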
  65. The PCA worksheet
  66. Getting output to show the scores of the IVs on the components
      1. Click on a new worksheet
      2. Right click a cell
      3. Click ‘Get R Output’
  67. Generate a variance table & a scree plot
      The optimal number of components is 4, where the variance explained is >= 5%.
  68. Visualizing the scores of the IVs on the components using a scatter plot
      This plot shows Component 1 (PC1) v. Component 2 (PC2):
      • PC1 & PC2 separate groups of genes and patients
      • You can see that P1 and P2 are similar due to levels of gene g9
      • P3 and P4 are similar
      • P5 is clearly different from the other patients according to gene expression levels
  69. Visualizing the scores of the IVs on the components using a scatter plot
      This plot shows Component 1 (PC1) v. Component 3 (PC3), giving another view
      on the data groups and the relationship between the variables and the
      components (see the sketch below).
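     These score plots correspond to R's biplot of a `prcomp` fit; a sketch, reusing `p` from the PCA example above:

        biplot(p, choices = c(1, 2))  # PC1 v. PC2: scores (cases) with loadings (variables)
        biplot(p, choices = c(1, 3))  # PC1 v. PC3: another view on the same groups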
  70. Putting it all together: a whole map of the patterns in our data.
      We now have a consensus of how our variables group, and we could generate
      new hypotheses from our data.
  71. A typical MVA workflow you can apply to your data in research projects:
      Dataset
      → Estimate the number of groups with tree-based clustering (Hierarchical Cluster Analysis)
      → Confirm the number of groups with partition clustering (K-Means, PAM)
      → Visualize the relationships between variables with data reduction (Principal Components Analysis, PCA)
