
# 8323 Stats - Lesson 1 - 04 Multivariate Vectors And Samples 2008



1. Multivariate Samples. In this lesson we: recall some very basic concepts of univariate and bivariate statistics; describe multivariate samples; analyze multivariate samples from a geometrical perspective; describe distances in Euclidean space.
2. The data we will consider. Example 1: Innovation and Research in Europe (Source: Eurostat). Variables:

   - Geo: country code
   - Country name
   - Region: European region of the country
   - E_gov_avail: e-government on-line availability (online availability of 20 basic public services)
   - HT_Exports: exports of high-technology products as a share of total exports
   - Y_Educ_Lev_m: % of males 20-24 having completed at least upper secondary education
   - Y_Educ_Lev_f: % of females 20-24 having completed at least upper secondary education
   - Y_Educ_Lev: youth education attainment level, total (% of the population 20-24 having completed at least upper secondary education)
   - Telec_Expenditure: expenditure on telecommunications as a % of GDP
   - IT_Expenditure: expenditure on information technology as a % of GDP
   - USTPO: number of patents granted by the US Patent and Trademark Office per million inhabitants
   - EPO: number of patent applications to the European Patent Office per million inhabitants
   - ST_grad_m: male tertiary graduates in S&T per 1000 males aged 20-29
   - ST_grad_f: female tertiary graduates in S&T per 1000 females aged 20-29
   - ST_grad: tertiary graduates in S&T per 1000 persons aged 20-29
   - Internet_Acc: level of Internet access (% of households with Internet access at home)
   - GERD_abroad: % of GERD financed from abroad
   - GERD_govern: % of GERD financed by government
   - GERD_industry: % of GERD financed by industry
   - GERD: gross domestic expenditure on R&D (GERD) as a % of GDP
   - Educ_Exp: spending on human resources (total public expenditure on education) as a % of GDP
3. Some basic concepts of univariate and bivariate statistics.
4. Back to basics: considering one variable. Let us consider one variable of interest, say EPO. In statistics a commonly used position measure is the arithmetic (sample) mean, obtained by summing all the observed values and dividing the result by the number of observations. Here Mean(EPO) = 127.6987.

   | country | region | Internet_Acc | EPO |
   |---|---|---|---|
   | Romania | Eastern | 6.00 | 1.31 |
   | Czech Republic | Eastern | 19.00 | 12.04 |
   | Lithuania | Northern | 12.00 | 2.78 |
   | Ireland | Northern | 40.00 | 79.87 |
   | Norway | Northern | 60.00 | 135.77 |
   | UK | Northern | 56.00 | 124.19 |
   | Finland | Northern | 51.00 | 309.09 |
   | Sweden | Northern | 73.00 | 293.32 |
   | Greece | Southern | 17.00 | 9.87 |
   | Italy | Southern | 34.00 | 84.14 |
   | Spain | Southern | 34.00 | 30.64 |
   | Netherlands | Western | 67.00 | 246.15 |
   | Belgium | Western | 50.00 | 141.80 |
   | Germany | Western | 60.00 | 299.99 |
   | France | Western | 34.00 | 144.52 |
5. Back to basics: considering one variable (variable of interest: EPO). The mean (127.6987) can be used to make a "prediction" of EPO for a generic country in the absence of any further information. To evaluate the reliability of the mean as a synthesis of the observed data, we can consider, for each observed value, the error incurred when substituting it with the sample mean (in the plot: the errors incurred when the mean replaces the values observed for the Netherlands and Spain). The TOTAL SUM OF SQUARES is the sum of these squared errors.
6. Back to basics: considering one variable (variable of interest: EPO). A synthesis of the errors, and a measure of the reliability of the mean as a synthesis of the observed data, is the (sample) variance: the average of the squared errors we incur when substituting the observed values with the sample mean, obtained by dividing the Total SS by the number of observations minus 1. The variance of EPO turns out to be 12646.5814. The error we can expect for a generic observation is the square root of the variance, called the standard deviation (here about 112.46).
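As a quick check, these quantities can be reproduced in plain Python (a minimal sketch, using the EPO values transcribed from the table in slide 4):

```python
# EPO values for the 15 countries listed in the table above
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mean = sum(epo) / n                            # sample mean
total_ss = sum((x - mean) ** 2 for x in epo)   # total sum of squares
variance = total_ss / (n - 1)                  # sample variance (n - 1 denominator)
std_dev = variance ** 0.5                      # sample standard deviation
```

The result matches the slide: mean of about 127.6987 and variance of about 12646.5814.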
7. Back to basics: considering one variable (variable of interest: EPO). In statistics we are mainly concerned with the explanation of variance: we are interested in explaining why a phenomenon varies, and we look for predictive tools characterized by low prediction errors. So the question now is: can we do better than the mean? That is, can we use external information (other variables) related to EPO, and hence useful for predicting the values of EPO with a lower error? In the following we consider two supporting variables with different characteristics, the Region (a categorical variable) and Internet_Access (a numerical variable), and we show how to evaluate the extent to which an external variable provides information about the variable of interest.
8. Back to basics: considering one variable. If we consider the region, can our prediction of EPO be better? We can use the conditional (within-region) means rather than the general mean (127.6987). This is worthwhile only if the prediction error is considerably lower (it can be shown that it is lower by construction). Values observed within the regions:

   | country | region | EPO |
   |---|---|---|
   | Romania | Eastern | 1.31 |
   | Czech Republic | Eastern | 12.04 |
   | Lithuania | Northern | 2.78 |
   | Ireland | Northern | 79.87 |
   | Norway | Northern | 135.77 |
   | UK | Northern | 124.19 |
   | Finland | Northern | 309.09 |
   | Sweden | Northern | 293.32 |
   | Greece | Southern | 9.87 |
   | Italy | Southern | 84.14 |
   | Spain | Southern | 30.64 |
   | Netherlands | Western | 246.15 |
   | Belgium | Western | 141.80 |
   | Germany | Western | 299.99 |
   | France | Western | 144.52 |
9. Back to basics: considering one variable. Consider the region to improve the prediction of EPO: use the conditional means. To evaluate the reliability of the conditional means as syntheses of the observed EPO data, we can consider the squared difference between each value and the corresponding conditional mean (in the plot: the errors for the Netherlands and Spain). The WITHIN SUM OF SQUARES of EPO given Region is the sum of the squared errors incurred when using the conditional means (by region) to predict EPO.
10. Back to basics: considering one variable. If we use the region, our improvement compared to the general mean is R² = 1 − (Within SS)/(Total SS) = 1 − 94296.85/177052.1395 ≈ 0.4674. The R² ranges from 0 to 1; it measures the ability of the categorical variable as a predictor of the numerical one, i.e., the % of the variance of EPO accounted for by Region. Comparing the general mean and the conditional means as predictors of EPO (TOTAL SS EPO = 177052.1395; WITHIN SS EPO | Region = 94296.85):

    | country | region | EPO | General mean | Squared error (general) | Conditional mean | Squared error (conditional) |
    |---|---|---|---|---|---|---|
    | Romania | Eastern | 1.31 | 127.6987 | 15974.1035 | 6.675 | 28.7832 |
    | Czech Republic | Eastern | 12.04 | 127.6987 | 13376.9349 | 6.675 | 28.7832 |
    | Lithuania | Northern | 2.78 | 127.6987 | 15604.6816 | 157.5033 | 23939.3 |
    | Ireland | Northern | 79.87 | 127.6987 | 2287.5845 | 157.5033 | 6026.929 |
    | Norway | Northern | 135.77 | 127.6987 | 65.1459 | 157.5033 | 472.3363 |
    | UK | Northern | 124.19 | 127.6987 | 12.311 | 157.5033 | 1109.776 |
    | Finland | Northern | 309.09 | 127.6987 | 32902.8037 | 157.5033 | 22978.53 |
    | Sweden | Northern | 293.32 | 127.6987 | 27430.415 | 157.5033 | 18446.18 |
    | Greece | Southern | 9.87 | 127.6987 | 13883.6025 | 41.55 | 1003.622 |
    | Italy | Southern | 84.14 | 127.6987 | 1897.3603 | 41.55 | 1813.908 |
    | Spain | Southern | 30.64 | 127.6987 | 9420.3912 | 41.55 | 119.0281 |
    | Netherlands | Western | 246.15 | 127.6987 | 14030.7105 | 208.115 | 1446.661 |
    | Belgium | Western | 141.80 | 127.6987 | 198.8467 | 208.115 | 4397.679 |
    | Germany | Western | 299.99 | 127.6987 | 29684.2921 | 208.115 | 8441.016 |
    | France | Western | 144.52 | 127.6987 | 282.9561 | 208.115 | 4044.324 |
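The within-group decomposition can be sketched in plain Python (an illustrative check with the values transcribed from the table, not the authors' own code):

```python
# EPO values grouped by European region (from the table above)
epo_by_region = {
    "Eastern":  [1.31, 12.04],
    "Northern": [2.78, 79.87, 135.77, 124.19, 309.09, 293.32],
    "Southern": [9.87, 84.14, 30.64],
    "Western":  [246.15, 141.80, 299.99, 144.52],
}

values = [x for group in epo_by_region.values() for x in group]
grand_mean = sum(values) / len(values)
total_ss = sum((x - grand_mean) ** 2 for x in values)

# Within SS: squared errors around each conditional (regional) mean
within_ss = 0.0
for group in epo_by_region.values():
    cond_mean = sum(group) / len(group)
    within_ss += sum((x - cond_mean) ** 2 for x in group)

r_squared = 1 - within_ss / total_ss  # share of EPO variance explained by Region
```

This reproduces the slide's Within SS of about 94296.85 and R² of about 0.467.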
11. Back to basics: considering one variable. If we consider Internet_Access, can our prediction of EPO be better? When considering numerical variables, we are interested in evaluating the existence of a linear association between them. To evaluate whether a linear relationship exists, and to determine its direction, we refer to the sample covariance (an absolute measure of linear association).

    | country | Internet_Acc | EPO |
    |---|---|---|
    | Romania | 6.00 | 1.31 |
    | Czech Republic | 19.00 | 12.04 |
    | Lithuania | 12.00 | 2.78 |
    | Ireland | 40.00 | 79.87 |
    | Norway | 60.00 | 135.77 |
    | UK | 56.00 | 124.19 |
    | Finland | 51.00 | 309.09 |
    | Sweden | 73.00 | 293.32 |
    | Greece | 17.00 | 9.87 |
    | Italy | 34.00 | 84.14 |
    | Spain | 34.00 | 30.64 |
    | Netherlands | 67.00 | 246.15 |
    | Belgium | 50.00 | 141.80 |
    | Germany | 60.00 | 299.99 |
    | France | 34.00 | 144.52 |
12. Back to basics: considering one variable. If we consider Internet_Access, can our prediction of EPO be better? The covariance between the two variables is Cov(EPO, Int_Acc) = 1868.5152. This measure only indicates that a linear relationship exists and that it is direct (an inspection of the scatter plot confirms this). Nevertheless, the value of the covariance depends on the units of measurement of the variables. A relative measure of linear association is the correlation coefficient, which ranges from −1 to +1: values close to +1 indicate a strong direct linear association, values close to −1 a strong inverse association, and values close to zero no linear relationship. Here Corr(EPO, Int_Acc) = 0.8527 (a strong direct association).
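Both measures can be reproduced in plain Python (a sketch with the values from the table; note that the slide's covariance of 1868.5152 corresponds to the n denominator rather than n − 1, a choice that leaves the correlation unaffected):

```python
# Paired (Internet_Acc, EPO) observations from the table above
internet = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mean_x = sum(internet) / n
mean_y = sum(epo) / n

# Sample covariance with the n denominator, as on the slide
cov = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(internet, epo)) / n
sd_x = (sum((x - mean_x) ** 2 for x in internet) / n) ** 0.5
sd_y = (sum((y - mean_y) ** 2 for y in epo) / n) ** 0.5
corr = cov / (sd_x * sd_y)  # scale-free: ranges from -1 to +1
```

The correlation of about 0.8527 is the same whichever denominator convention is used, since it cancels in the ratio.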
13. Back to basics: considering one variable. If we consider Internet_Access, can our prediction of EPO be better? The high value of the correlation tells us that observations tend to cluster around a line with a positive slope. This line, shown in the scatterplot, is called the regression line, and its analytical expression can easily be determined: EPO = −60.018 + 4.5934·Int_Acc.
14. Back to basics: considering one variable. Consider Internet_Access to improve the prediction of EPO: use the regression line EPO = −60.018 + 4.5934·Int_Acc. For each observation we can calculate the difference between the observed EPO value and the value predicted by the regression line (in the plot, the error is shown for Spain). The MODEL SUM OF SQUARES of EPO given Int_Acc (more precisely, the residual sum of squares around the line) is the sum of the squared errors incurred when using the line to predict EPO.
15. Back to basics: considering one variable. Notice the considerable decrease in the prediction errors. Comparing the general mean and the regression line (prediction = 4.5934·Int_Acc − 60.018) as predictors of EPO (TOTAL SS EPO = 177052.1395; MODEL SS EPO | Int_Acc = 48309.46):

    | country | Int_Acc | EPO | General mean | Squared error (mean) | Prediction (line) | Squared error (line) |
    |---|---|---|---|---|---|---|
    | Romania | 6.00 | 1.31 | 127.6987 | 15974.1035 | −32.4576 | 1140.251 |
    | Czech Republic | 19.00 | 12.04 | 127.6987 | 13376.9349 | 27.2566 | 231.5449 |
    | Lithuania | 12.00 | 2.78 | 127.6987 | 15604.6816 | −4.8972 | 58.9394 |
    | Ireland | 40.00 | 79.87 | 127.6987 | 2287.5845 | 123.718 | 1922.647 |
    | Norway | 60.00 | 135.77 | 127.6987 | 65.1459 | 215.586 | 6370.594 |
    | UK | 56.00 | 124.19 | 127.6987 | 12.311 | 197.2124 | 5332.271 |
    | Finland | 51.00 | 309.09 | 127.6987 | 32902.8037 | 174.2454 | 18183.07 |
    | Sweden | 73.00 | 293.32 | 127.6987 | 27430.415 | 275.3002 | 324.7132 |
    | Greece | 17.00 | 9.87 | 127.6987 | 13883.6025 | 18.0698 | 67.2367 |
    | Italy | 34.00 | 84.14 | 127.6987 | 1897.3603 | 96.1576 | 144.4227 |
    | Spain | 34.00 | 30.64 | 127.6987 | 9420.3912 | 96.1576 | 4292.556 |
    | Netherlands | 67.00 | 246.15 | 127.6987 | 14030.7105 | 247.7398 | 2.5275 |
    | Belgium | 50.00 | 141.80 | 127.6987 | 198.8467 | 169.652 | 775.7339 |
    | Germany | 60.00 | 299.99 | 127.6987 | 29684.2921 | 215.586 | 7124.035 |
    | France | 34.00 | 144.52 | 127.6987 | 282.9561 | 96.1576 | 2338.922 |
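The regression line and its error decomposition can be sketched as a minimal least-squares fit in plain Python (same data as above; an illustration, not the authors' computation):

```python
internet = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mx = sum(internet) / n
my = sum(epo) / n

# Least squares: slope = Cov(x, y) / Var(x), intercept = mean_y - slope * mean_x
sxy = sum((x - mx) * (y - my) for x, y in zip(internet, epo))
sxx = sum((x - mx) ** 2 for x in internet)
slope = sxy / sxx
intercept = my - slope * mx

# Sum of squared residuals around the line, and the resulting R^2
residual_ss = sum((y - (intercept + slope * x)) ** 2
                  for x, y in zip(internet, epo))
total_ss = sum((y - my) ** 2 for y in epo)
r_squared = 1 - residual_ss / total_ss
```

The fit reproduces the slide's line EPO ≈ −60.018 + 4.5934·Int_Acc, with R² ≈ 0.727 = 0.8527².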
16. Back to basics: considering one variable. If we consider Internet_Access, can our prediction of EPO be better? If we use the line (a function of Int_Acc), our improvement compared to the general mean is R² = 1 − 48309.46/177052.1395 ≈ 0.727, the % of the variance of EPO accounted for by Int_Acc. The R² index ranges from 0 to 1 and measures the ability of the numerical variable to predict the other one. It can be shown that this index coincides with the squared correlation coefficient (here 0.8527² ≈ 0.727). Hence the correlation measures the extent of linear association, whereas its square measures the percentage of the variance of one variable that can be explained by the other (numerical) variable.
17. Data Matrices (numerical variables only).
18. Data matrices. Example 1 (continued): Innovation and Research in Europe. For the sake of simplicity, we limit attention to a few variables and a few observations. The country variable is useful to identify the statistical units but is not itself an object of analysis; for the moment we consider only numerical variables. For each observation we have information collected on p variables; for each variable we have information collected on n observations. The data matrix contains the information available for the n cases (rows) on the p variables (columns). Here we have 15 rows (cases, n) and 7 columns (variables, p):

    | country | region | GERD | GERD_industry | GERD_govern | Internet_Acc | ST_grad | EPO | E_gov_avail |
    |---|---|---|---|---|---|---|---|---|
    | Romania | Eastern | 0.39 | 47.60 | 43.00 | 6.00 | 5.80 | 1.31 | 25.00 |
    | Czech Republic | Eastern | 1.20 | 52.50 | 43.60 | 19.00 | 6.00 | 12.04 | 30.00 |
    | Lithuania | Northern | 0.67 | 37.10 | 56.30 | 12.00 | 14.60 | 2.78 | 40.00 |
    | Ireland | Northern | 1.10 | 66.70 | 25.60 | 40.00 | 20.50 | 79.87 | 50.00 |
    | Norway | Northern | 1.60 | 51.60 | 39.80 | 60.00 | 7.70 | 135.77 | 56.00 |
    | UK | Northern | 1.83 | 45.60 | 28.80 | 56.00 | 20.30 | 124.19 | 59.00 |
    | Finland | Northern | 3.30 | 70.80 | 25.50 | 51.00 | 17.40 | 309.09 | 67.00 |
    | Sweden | Northern | 4.25 | 71.50 | 21.30 | 73.00 | 13.30 | 293.32 | 74.00 |
    | Greece | Southern | 0.64 | 33.00 | 46.60 | 17.00 | 8.00 | 9.87 | 32.00 |
    | Italy | Southern | 1.09 | 47.20 | 46.80 | 34.00 | 7.40 | 84.14 | 53.00 |
    | Spain | Southern | 0.91 | 47.20 | 39.90 | 34.00 | 11.90 | 30.64 | 55.00 |
    | Netherlands | Western | 1.80 | 51.90 | 35.80 | 67.00 | 6.60 | 246.15 | 32.00 |
    | Belgium | Western | 2.08 | 63.40 | 22.00 | 50.00 | 10.50 | 141.80 | 35.00 |
    | Germany | Western | 2.46 | 65.70 | 31.40 | 60.00 | 8.10 | 299.99 | 47.00 |
    | France | Western | 2.20 | 54.20 | 36.90 | 34.00 | 19.50 | 144.52 | 50.00 |
19. Data matrices. Example 1 (continued, subset; the data matrix is the one shown in slide 18, without the region column). To each observation a collection of p values is associated: the realizations observed, for that case, on each variable. Similarly, to each variable a collection of n values is associated: the values observed for all the cases. A collection of k values is usually called a vector. To avoid confusion, we will only consider column vectors, of dimension (k × 1), i.e., collections of values arranged in k rows and 1 column. A row (1 × k) vector can always be seen as the transpose of a column (k × 1) vector.
20. Data matrices. x_i = the (p × 1) vector containing the measurements on the p variables for the i-th case. x_(j) = the (n × 1) vector containing the n measurements on the j-th variable. Via the transposition operation, a data matrix (n individuals and p variables) can be seen as a collection of n row (transposed) vectors (the cases) and/or as a collection of p column vectors (the variables).
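In NumPy (an assumed but standard choice for illustrating these operations), rows and columns of a data matrix are sliced directly:

```python
import numpy as np

# Toy data matrix: n = 4 cases (rows) on p = 3 variables (columns)
X = np.array([[0.39, 47.6, 6.0],
              [1.20, 52.5, 19.0],
              [0.67, 37.1, 12.0],
              [1.10, 66.7, 40.0]])

x_2 = X[1, :]     # row of the 2nd case: its p measurements
x_col3 = X[:, 2]  # column of the 3rd variable: its n measurements
x_23 = X[1, 2]    # single entry x_23: 2nd case, 3rd variable

# A row (1 x p) vector is the transpose of a column (p x 1) vector
row = x_2.reshape(1, -1)  # shape (1, p)
col = row.T               # shape (p, 1)
```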
21. Data matrices. Example 1 (continued, subset; data matrix as in slide 18, without the region column). The row vector associated with "Belgium" collects its measurements on the 7 variables; the column vector associated with EPO collects the measurements on the 15 observations. The element in the i-th row and j-th column, x_ij, is the value observed for the i-th case on the j-th variable. In this simple example, x_13,6 is the value of EPO (6th variable) for Belgium (13th observation).
22. Data matrices – Vectors. A (k × 1) vector is an oriented segment in a k-dimensional space. (Figures: a one-dimensional vector, i.e. a scalar, v1; a two-dimensional vector with components v1, v2; a three-dimensional vector with components v1, v2, v3.) Vectors of higher dimension cannot be represented graphically in this way.
23. Data matrices – Vectors (length). For a given vector v in the k-dimensional space, we define its length as ‖v‖ = sqrt(v1² + v2² + … + vk²). It is the length of the segment connecting v to the origin 0 (illustrated in the figures for the one-, two- and three-dimensional cases).
24. Data matrices – Vectors (distance). Given two vectors v and u in the k-dimensional space, we define the Euclidean distance between v and u as the length of the segment connecting v to u: d(v, u) = sqrt((v1 − u1)² + … + (vk − uk)²) = ‖v − u‖. In the two-dimensional example the components of the difference are |v1 − u1| and |v2 − u2|. Note that the length of a vector v coincides with its distance from the origin 0.
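Both definitions are one-liners in NumPy (a sketch with made-up vectors):

```python
import numpy as np

v = np.array([3.0, 4.0])
u = np.array([0.0, 1.0])

length_v = np.sqrt(np.sum(v ** 2))       # ||v|| = sqrt(3^2 + 4^2) = 5
dist_vu = np.sqrt(np.sum((v - u) ** 2))  # Euclidean distance between v and u

# The length of v coincides with its distance from the origin 0
dist_v0 = np.linalg.norm(v - np.zeros(2))
```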
25. Analyze multivariate samples from a geometrical perspective; describe distances in the Euclidean space.
26. Data matrices. A data matrix can be seen as a collection of two kinds of vectors: row vectors x_i, which lie in the p-dimensional space, and column vectors x_(j), which lie in the n-dimensional space. Hence two spaces can be considered to analyze/describe a data matrix; of course, these spaces are related to each other. For the sake of simplicity, we will analyze in depth only the space of the observations.
27. Syntheses of variables. How do we arrange syntheses of the p variables, i.e., how do we synthesize the elements of the column vectors? The position: the sample mean (an unbiased estimator of the population mean) for the j-th variable (column) is x̄_j = (1/n) Σ_i x_ij. The vector of the p sample means is called the centroid. It may be seen as the vector associated with the "artificial case" mean: an unobserved case that is average with respect to all the variables. Remember: the mean is not robust (it is sensitive to extreme values).
28. The space of the observations. Consider a graphical representation we are used to: the 2-dimensional space (axes adjusted to have the same scale; the means of E_gov_indiv and Internet_Acc are marked). The centroid (the vector whose elements are the sample means) is the centre of gravity of the cloud: it is the point which is globally least distant from all the points.
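The "centre of gravity" property is easy to verify numerically (a sketch with hypothetical two-variable data):

```python
import numpy as np

# Hypothetical 4 x 2 data matrix (two variables)
X = np.array([[34.0, 1.31],
              [60.0, 299.99],
              [50.0, 141.80],
              [67.0, 246.15]])

centroid = X.mean(axis=0)  # vector of the sample means, one per column

def sum_sq_dist(point):
    """Sum of squared Euclidean distances from all cases to `point`."""
    return float(np.sum((X - point) ** 2))
```

Shifting the candidate point away from the centroid in any direction strictly increases the sum of squared distances.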
29. Syntheses of variables. The dispersion around the mean. The sample variance (an unbiased estimator of the population variance) for the j-th variable (column) is s_jj = (1/(n − 1)) Σ_i (x_ij − x̄_j)²: the average of the squared errors we incur when substituting the observed values with the sample mean, and hence the average squared distance between the observed values and the sample mean. The sample standard deviation for the j-th variable is its square root, s_j. The Std.Dev has the same unit of measurement as the variable; it measures the expected error (below or above the mean) we incur when substituting the mean for a generic case, and can be interpreted as the expected distance between a generic value and the mean. Being based upon averages, both the variance and the standard deviation are not robust (sensitive to extreme values).
30. The space of the observations. Consider again the 2-dimensional space, and the distance from Iceland (IS) to the centroid (axes adjusted to have the same scale). Its components are the absolute difference between Iceland's E_gov_Indiv value and the mean of E_gov_Indiv, and the absolute difference between Iceland's Internet_Acc value and the mean of Internet_Acc.
31. The space of the observations. Consider, in the 2-dimensional space, ALL THE DISTANCES FROM THE POINTS TO THE CENTROID (axes adjusted to have the same scale). The sum of the variances of the two variables, Var(E_gov_indiv) + Var(Internet_Acc), is proportional to the sum of the squared distances from the observations to the centroid.
32. Synthesis of association between variables. The linear association. The sample covariance for the j-th and h-th variables (columns) is s_jh = (1/(n − 1)) Σ_i (x_ij − x̄_j)(x_ih − x̄_h), an absolute measure of linear association. The sample correlation coefficient for the j-th and h-th variables is r_jh = s_jh/(s_j · s_h), a relative measure of linear association which ranges from −1 to +1. Remember: being based upon averages, the correlation coefficient is not robust (sensitive to extreme values).
33. The space of the observations. Consider again the 2-dimensional space (axes adjusted to have the same scale). Since the covariance and the correlation actually measure the concentration of points around a line, both indices give us information about the ORIENTATION of the scatter.
34. Variance and covariance matrix. Variances and covariances are arranged in the so-called variance and covariance matrix S. S is a square matrix (the number of rows equals the number of columns). The diagonal elements of S, s_jj, are the variances (notice that the variance can be regarded as the covariance between a variable and itself). The off-diagonal elements of S, s_jh, are the covariances. Since s_jh = s_hj, S is a symmetric matrix.
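In NumPy, S comes from `np.cov` (an illustrative sketch; note `rowvar=False` so that columns are treated as variables, and the n − 1 denominator is the default):

```python
import numpy as np

# n = 5 cases on p = 3 variables (hypothetical data)
X = np.array([[0.39, 47.6, 6.0],
              [1.20, 52.5, 19.0],
              [0.67, 37.1, 12.0],
              [1.10, 66.7, 40.0],
              [1.60, 51.6, 60.0]])

S = np.cov(X, rowvar=False)  # p x p variance-covariance matrix

variances = np.diag(S)       # diagonal elements: the p sample variances
```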
35. Correlation matrix. Correlations are arranged in the correlation matrix R. R is also a square matrix, and its diagonal elements are 1's (the correlation between a variable and itself is 1). Its off-diagonal elements, r_jh, are the correlations, and of course R is a symmetric matrix. Due to the relationship between covariances and correlations, r_jh = s_jh/(s_j · s_h), R can be obtained directly from the variance and covariance matrix.
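The conversion from S to R divides each covariance by the product of the two standard deviations (a sketch, reusing a small hypothetical data matrix):

```python
import numpy as np

X = np.array([[0.39, 47.6, 6.0],
              [1.20, 52.5, 19.0],
              [0.67, 37.1, 12.0],
              [1.10, 66.7, 40.0],
              [1.60, 51.6, 60.0]])

S = np.cov(X, rowvar=False)

# r_jh = s_jh / (s_j * s_h): divide S by the outer product of the std. devs
sd = np.sqrt(np.diag(S))
R = S / np.outer(sd, sd)
```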
36. The space of the observations. The centroid (the vector whose elements are the sample means) is the centre of gravity of the p-dimensional cloud. The elements of the variance and covariance matrix give us information about the dispersion around the centroid (remember the 2-dimensional example) and about the orientation of the cloud.
37. Measuring dispersion. How can we synthesize the dispersion of the n cases in the p-dimensional space? Two proposals. TOTAL VARIANCE. As we saw before, the sum of all the variances is proportional to the sum of the squared distances from the points to the centroid. Thus a first method to evaluate the dispersion of the points in the p-dimensional space is the so-called Total Variance: the sum of the diagonal elements of the var/cov matrix S. Since the sum of the diagonal elements of a square matrix is defined to be its trace, we have Total Variance = tr(S). Notice that this measure does not take into account the interrelationships between variables, i.e., the orientation of the cloud.
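The identity between the trace of S and the summed squared distances to the centroid can be checked numerically (a sketch with hypothetical data):

```python
import numpy as np

X = np.array([[0.39, 47.6, 6.0],
              [1.20, 52.5, 19.0],
              [0.67, 37.1, 12.0],
              [1.10, 66.7, 40.0],
              [1.60, 51.6, 60.0]])

S = np.cov(X, rowvar=False)
total_variance = np.trace(S)  # sum of the p variances

n = X.shape[0]
centroid = X.mean(axis=0)
sum_sq_dist = np.sum((X - centroid) ** 2)  # squared distances to the centroid

# With the n - 1 convention: tr(S) = sum_sq_dist / (n - 1)
```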
38. The space of the observations. To motivate the second measure of multivariate dispersion, consider the "portion" of space occupied by the data (the area of the ellipse). We will come back to this concept later, but we can intuitively understand that the area of the ellipse (in higher dimensions, the volume of an ellipsoid) is somehow related to the variances and to the covariances, i.e., to all the entries of the var/cov matrix S.
39. Measuring dispersion. THE GENERALIZED VARIANCE. The volume of the ellipsoid containing the points in the p-dimensional space can be shown to be related to a particular synthesis of the elements of S: the determinant of S, |S|. The determinant is a number that can be calculated for a square matrix; it equals zero if two columns of the matrix are proportional, i.e., if they share the same information. This measure is called the Generalized Variance: Generalized Variance = det(S) = |S|. Hence, to synthesize the dispersion of points in a p-dimensional space, two measures can be used, both related to the elements of the variance and covariance matrix S: the Total Variance takes into account only the diagonal elements of S, whilst the Generalized Variance is calculated from all the elements of S.
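A numerical illustration of the determinant collapsing when two columns share their information (simulated data, seed fixed for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # 50 cases, 3 independent variables
S = np.cov(X, rowvar=False)
gen_var = np.linalg.det(S)          # generalized variance |S| > 0

# Append a column proportional to the first: S becomes singular
X_dup = np.column_stack([X, 2.0 * X[:, 0]])
S_dup = np.cov(X_dup, rowvar=False)
gen_var_dup = np.linalg.det(S_dup)  # collapses to (numerically) zero
```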
40. The space of the observations. The variance and covariance matrix contains the relevant information to describe the points in a p-dimensional space, including information about their distances. We now consider different measures of distance between cases in the p-dimensional space, related to particular transformations of the original variables. Notice first that if the variables are centred on their means, nothing changes as concerns the dispersion of the points: this operation only amounts to a change of origin.
41. Multivariate samples – Transformations. TRANSFORMATION: variables centred on their means. The centred matrix is obtained by subtracting from each observation on a given variable the mean of the variable itself: from all the observations in a given column, say the j-th, the mean of the j-th variable is subtracted. Original data matrix: centroid = x̄, var/cov matrix S, corr matrix R. Centred data matrix: centroid = origin = 0, var/cov matrix S, corr matrix R (unchanged).
42. A closer look at the distance. The Euclidean distance is the length of the segment connecting a point to the origin. Consider, in the plot of the centred variables, Cyprus and Italy: their distances from the origin 0 are (almost) the same. This similar distance, however, is due to different combinations of x- and y-deviations from 0. Should the x- and y-deviations be evaluated in the same manner? Notice that the distance of Slovakia from the origin is higher; we will return to this later.
43. A closer look at the distance. Remember: the standard deviation of a variable is its typical deviation from the mean. Here Std.Dev(E_gov_Avail) = 15 and Std.Dev(Int_Acc) = 21.31. To compare the deviations from the origin adequately (the data are centred), we should take the Std.Dev into account (of course, squared deviations should be compared with variances). Internet_Acc has a higher std. dev.; hence a deviation D from the origin along the horizontal axis should "count less" than an equal deviation D along the vertical axis.
44. A closer look at the distance. In the Euclidean distance, the deviations are considered in absolute terms. When the variables have different Std.Dev's, we should instead consider relative deviations. To remove the effect of the Std.Dev, thus obtaining comparable deviations, we standardize the variables: for the j-th variable, z_ij = (x_ij − x̄_j)/s_j. The Euclidean distance between two standardized observations is called the Statistical Distance: a different weight (1/s_jj) is assigned to the squared deviation of each variable in the calculation of the distance. The statistical distance is proportional to the Euclidean one only if the variances are all equal.
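Standardization and the resulting statistical distance can be sketched as follows (hypothetical two-variable data):

```python
import numpy as np

X = np.array([[34.0, 50.0],
              [60.0, 47.0],
              [50.0, 35.0],
              [67.0, 32.0],
              [34.0, 55.0],
              [17.0, 32.0]])

# Standardize: subtract each column mean, divide by each column std. dev.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def stat_dist(i, j):
    """Statistical distance between cases i and j: the Euclidean distance
    between their standardized rows (each squared deviation weighted by 1/s_jj)."""
    return float(np.linalg.norm(Z[i] - Z[j]))
```

After the transformation each column has zero mean and unit variance, so deviations on the two variables are directly comparable.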
45. A closer look at the distance. The statistical distance (visualized in the original/centred space): x-deviations are penalized less than y-deviations, since the x-axis is characterized by a higher dispersion. Hence Cyprus, which shows a higher y-deviation from the origin than Italy, is characterized by a statistical distance from the origin which is higher than Italy's. (Plot: the curve of points having the same statistical distance from the origin.) Notice that Slovakia now has a statistical distance from 0 similar to that of Cyprus.
46. Multivariate samples – Transformations. TRANSFORMATION: standardized variables. The standardized matrix is obtained by subtracting from each observation the mean of its variable and dividing the difference by the Std.Dev. The centred variables have zero mean; the standardized variables also have variances all equal to 1 (the unit of measurement is removed). Since Variance = Std.Dev = 1 for each variable, the covariances coincide with the correlations (Corr = Cov / product of the Std.Dev's). Original data matrix: centroid = x̄, var/cov matrix S, corr matrix R. Standardized data matrix: centroid = origin = 0, var/cov matrix = R, corr matrix R.
47. A closer look at the distance. Euclidean distance in the standardized space: the standardization makes all the deviations comparable, so the Euclidean distance computed on the standardized data coincides with the statistical distance calculated in the original space. Notice that the cloud still has an orientation. (Plots: Euclidean distance in the original space; statistical distance in the original space.)
48. A closer look at the distance. In the statistical distance, deviations are adjusted by taking into account the dispersions of the variables, but no attention is paid to the "coherence" between each point and the cloud of points (standardization does not involve correlations). Slovakia and Cyprus are equally statistically distant from the origin; Lithuania is more statistically distant. Consider, though, the orientation of the cloud: the line connecting Lithuania to 0 has the same direction as the cloud. This is less true for Slovakia, and the line connecting Cyprus to the origin runs counter to the trend.
49. A closer look at the distance. In the statistical distance, the coherence with the orientation of the cloud is not considered. A transformation of the data which removes the effect of the Std.Dev and also penalizes deviations according to the orientation of the cloud of points is the so-called Mahalanobis transformation. We do not enter into details here; it is a particular linear combination of the considered variables. The Mahalanobis distance is then defined as the Euclidean distance calculated on the Mahalanobis-transformed observations.
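Without writing out the Mahalanobis transformation itself, the resulting distance can be computed directly from S via its inverse (a sketch; the correlated cloud is simulated, not the slide's data):

```python
import numpy as np

rng = np.random.default_rng(1)
# A correlated two-variable cloud (hypothetical data)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.6],
                                          [0.6, 1.0]])

centroid = X.mean(axis=0)
S = np.cov(X, rowvar=False)
S_inv = np.linalg.inv(S)

def mahalanobis(point):
    """Mahalanobis distance of `point` from the centroid of the cloud."""
    d = point - centroid
    return float(np.sqrt(d @ S_inv @ d))
```

A point lying along the cloud's main direction receives a smaller Mahalanobis distance than an equally (Euclidean-)far point lying across it.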
50. Multivariate samples – Transformations. TRANSFORMATION: Mahalanobis. The Mahalanobis distance is the Euclidean distance evaluated after transforming the data according to the Mahalanobis transformation. The Mahalanobis-transformed variables have zero means, variances all equal to 1 (the unit of measurement is removed), and zero correlations (the orientation of the cloud is removed). Original data matrix: centroid = x̄, var/cov matrix S, corr matrix R. Mahalanobis data matrix: centroid = origin = 0, var/cov matrix = I, corr matrix = I.
51. A closer look at the distance. Mahalanobis distance: deviations from the origin are adjusted by taking into account both the dispersions of the variables and their correlations (the orientation). Now Cyprus, being in countertendency with respect to the orientation of the cloud, is characterized by a Mahalanobis distance from 0 which is higher than Slovakia's; Lithuania has a Mahalanobis distance from 0 similar to that of Slovakia. (Plot: the curve of points having the same Mahalanobis distance from the origin.)
52. A closer look at the distance. Euclidean distance in the Mahalanobis space: by removing both dispersion and correlation, deviations become comparable also with respect to their orientation, so the Euclidean distance computed on the transformed data coincides with the Mahalanobis distance calculated in the original space. Notice that the cloud now has no orientation. (Plots: Euclidean, statistical, and Mahalanobis distances in the original space.)
53. Multivariate samples – Transformations. Conclusion: by transforming the data via standardization or the Mahalanobis transformation we are simply defining a new space such that the Euclidean distance calculated on the transformed points coincides, respectively, with: the statistical distance (standardization: deviations are evaluated differently depending on their Std.Dev's) and the Mahalanobis distance (Mahalanobis transformation: deviations are evaluated differently depending on the Std.Dev's and on the orientation of the cloud, i.e., the correlations/covariances). The latter transformation has not been explicitly defined here due to its analytical complexity; we will see later how to obtain Mahalanobis-transformed data.

    | | ORIGINAL (X) | CENTRED ON MEAN | STANDARDIZED (Z) | MAHALANOBIS (Z_M) |
    |---|---|---|---|---|
    | Means | x̄_j | 0 | 0 | 0 |
    | Variances | s_jj | s_jj | 1 | 1 |
    | Covariances | s_jk | s_jk | r_jk | 0 |
    | Correlations | r_jk | r_jk | r_jk | 0 |
    | Euclidean distance equals | Euclidean | Euclidean | Statistical | Mahalanobis |