8323 Stats - Lesson 1 - 02 Introduction General 2008
STATISTICS FOR ECONOMICS AND BUSINESS The course I loved to hate… ( S.B. )
STATISTICS FOR ECONOMICS AND BUSINESS <ul><li>The goals</li></ul><ul><li>The key aim is to provide you with basic skills in multivariate data analysis. In particular, we focus on techniques useful for analyzing and synthesizing data sets with many variables and/or many observations.</li></ul><ul><li>Great attention is devoted to applications. You will learn to identify a proper multivariate technique for a given problem, to analyze the data with the statistical software SAS, to interpret the results and to formulate the conclusions of the statistical analysis. We will (try to) refer to data sets relevant to your studies.</li></ul>We briefly present the course. A document with a more detailed description of the rules and criteria has already been uploaded to Learning Space.
STATISTICS FOR ECONOMICS AND BUSINESS The tools <ul><li>1. Frontal lessons (theory)</li></ul><ul><li>PowerPoint slides online before each lesson</li></ul><ul><li>2. Lab classes (applications)</li></ul><ul><li>Familiarize yourself with the statistical software SAS and interpret results</li></ul><ul><li>Extended solutions online after each lesson</li></ul><ul><li>3. Word documents with detailed descriptions of SAS programs</li></ul><ul><li>4. Tutor: Chiara Castellano, CLEMIT graduate student (twice a week)</li></ul><ul><li>5. Discussion list (LS or a specific one; please avoid personal email)</li></ul><ul><li>6. SAS installed on your laptop (see the Library for details)</li></ul><ul><li>7. Textbooks (see the Library for details; my slides should be sufficient, but if you do not attend the lessons, referring to a textbook is recommended)</li></ul>
STATISTICS FOR ECONOMICS AND BUSINESS Changes and Enhancements <ul><li>1. Introduction of graded assignments/group work</li></ul><ul><li>Some students experienced problems because they postponed studying. Many students asked for (graded) incentives for day-by-day study. An assessment method specific to attending students has been introduced and is strongly recommended.</li></ul><ul><li>2. Some variations in the organization of the lessons</li></ul>
STATISTICS FOR ECONOMICS AND BUSINESS Assessment Methods
For attending students the course grade is based on:
<ul><li>The analysis of a real data set (PC-lab session, 4 hours). Here the focus is on the proper use of statistical techniques and on the adequacy of the economic conclusions drawn from the obtained results. Documents with SAS procedures can be used during the exam (no other material is allowed).</li></ul>
<ul><li>A written exam concerning the methodological issues discussed during the course (content of the theoretical slides).</li></ul>
The two exams will be graded separately (maximum grades = 21 and 6, respectively).
<ul><li>2 assignments (group work). At least 2 lessons are dedicated to the discussion of the 2 assignments, and all group members must be present at the discussion. In these lessons one person picked at random from each group will illustrate (part of) the obtained results (material may be consulted). If that person answers reasonably, the group's assignment will be graded (0-2 for each assignment). Otherwise, all group members receive 0.</li></ul>
Non-attending students (those who did not hand in both assignments): extended practical and theoretical exams (maximum grades = 23 and 8, respectively).
STATISTICS FOR ECONOMICS AND BUSINESS Prerequisites <ul><li>1. Univariate descriptive statistics. Summary measures (mean, median, quartiles, percentiles, variance, standard deviation). Graphical tools (histogram, box plot). Extreme values.</li></ul><ul><li>2. Bivariate descriptive statistics. Contingency tables; joint, marginal and conditional distributions; measures of association. Conditional means and variances. Scatterplots, covariance, correlation coefficient.</li></ul><ul><li>3. Inference: random sample, estimators (point and interval) of the mean and of the variance. Hypothesis testing: notion of p-value.</li></ul>
Multivariate Data Analysis Techniques to analyze/synthesize data sets with many variables and/or many observations. MOTIVATION
Multivariate Data Analysis – Motivation Example 1. Innovation and Research in Europe (Source: Eurostat). Variables (name: description):
<ul>
<li>Geo: Country code</li>
<li>Country: Country name</li>
<li>Region: European region</li>
<li>E_gov_avail: E-government online availability (online availability of 20 basic public services)</li>
<li>HT_Exports: Exports of high technology products as a share of total exports</li>
<li>Y_Educ__Lev_m: % of males aged 20-24 having completed at least upper secondary education</li>
<li>Y_Educ_Lev_f: % of females aged 20-24 having completed at least upper secondary education</li>
<li>Y_Educ_Lev: Youth education attainment level, total (% of the population aged 20-24 who completed at least upper secondary education)</li>
<li>Telec_Expenditure: Expenditure on telecommunications as a % of GDP</li>
<li>IT_Expenditure: Expenditure on information technology as a % of GDP</li>
<li>USTPO: Number of patents granted by the US Patent and Trademark Office per million inhabitants</li>
<li>EPO: Number of patent applications to the European Patent Office per million inhabitants</li>
<li>ST_grad_m: Male tertiary graduates in S&T per 1000 males aged 20-29</li>
<li>ST_grad_f: Female tertiary graduates in S&T per 1000 females aged 20-29</li>
<li>ST_grad: Science and technology, tertiary graduates in S&T per 1000 persons aged 20-29</li>
<li>Internet_Acc: Level of Internet access (% of households who have Internet access at home)</li>
<li>GERD_abroad: % of GERD financed from abroad</li>
<li>GERD_govern: % of GERD financed by government</li>
<li>GERD_industry: % of GERD financed by industry</li>
<li>GERD: Gross domestic expenditure on R&D (GERD) as a % of GDP</li>
<li>Educ_Exp: Spending on human resources (total public expenditure on education) as a % of GDP</li>
</ul>
Multivariate Data Analysis – Motivation Example 1 (continued). Innovation and Research in Europe. For the sake of simplicity, we limit attention to a few observations and a few variables, transformed so that all variables have the same unit of measurement (we will show later how we obtain this result). How can we study the relationships among all the variables to understand the main tendencies of the data, i.e. whether there are groups of variables acting in the same or in the opposite direction?
Country          Region    HT_Exports  E_gov_avail    EPO  ST_grad  Internet_Acc  GERD_abroad  GERD_govern  GERD_industry   GERD  Educ_Exp
France           Western         0.84         0.21   0.15     1.41         -0.33        -0.14         0.07           0.04   0.47      0.51
Germany          Western         0.11         0.00   1.53    -0.69          0.92        -1.01        -0.47           1.02   0.72     -0.47
Belgium          Western        -0.62        -0.84   0.13    -0.25          0.44         0.77        -1.38           0.83   0.36      0.63
Netherlands      Western         0.53        -1.04   1.05    -0.97          1.25         0.56        -0.04          -0.16   0.09     -0.18
Spain            Southern       -0.83         0.56  -0.86     0.01         -0.33        -0.05         0.36          -0.56  -0.75     -0.81
Italy            Southern       -0.62         0.42  -0.39    -0.82         -0.33        -0.60         1.03          -0.56  -0.58     -0.45
Greece           Southern       -0.72        -1.04  -1.05    -0.71         -1.14         1.94         1.01          -1.77  -1.01     -1.10
Sweden           Northern        0.01         1.88   1.47     0.27          1.54        -0.85        -1.45           1.52   2.42      1.79
Finland          Northern        0.74         1.39   1.61     1.03          0.49        -1.01        -1.04           1.46   1.52      1.00
United Kingdom   Northern        1.57         0.84  -0.03     1.56          0.72         2.20        -0.72          -0.70   0.12      0.08
Norway           Northern       -0.93         0.63   0.07    -0.76          0.92        -0.16         0.35          -0.18  -0.10      1.91
Ireland          Northern        2.20         0.21  -0.43     1.60         -0.04        -0.36        -1.03           1.11  -0.57     -0.72
Lithuania        Northern       -1.25        -0.49  -1.11     0.51         -1.38        -0.25         1.95          -1.42  -0.98     -0.09
Czech Republic   Eastern        -0.20        -1.18  -1.03    -1.08         -1.05        -1.07         0.72          -0.11  -0.48     -0.60
Romania          Eastern        -0.83        -1.53  -1.12    -1.11         -1.67         0.04         0.66          -0.52  -1.24     -1.51
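The slide says only that the variables were transformed to share a unit of measurement and that the method comes later. One common way to do this, assumed here and not confirmed by the slides, is z-score standardization: subtract the mean and divide by the standard deviation. The course tool is SAS; Python is used below purely for illustration, and the raw values are hypothetical, not Eurostat figures.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(x - m) / s for x in values]


# Hypothetical raw values for one variable (NOT taken from the Eurostat data).
raw = [2.0, 1.0, 3.0]
z = standardize(raw)
print(z)
```

After this transformation every variable has mean 0 and standard deviation 1, so variables measured in percentages, counts per million, and so on can be compared on the same scale.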
Multivariate Data Analysis – Motivation Example 1 (continued). Innovation and Research in Europe (subset). How can we study the relationships among all the variables? 2) Obtain a line plot for VARIABLES: a line is associated with each variable. We can observe groups of vars with similar tendencies with respect to some variables, for example the orange-red ones, the green ones, or the blue ones. These three groups of vars show different tendencies.
Multivariate Data Analysis – Motivation Example 1 (continued). Innovation and Research in Europe (subset). Group 1: GERD, GERD_industry, Internet_Acc, EPO, Educ_Exp, E_gov_avail. Group 2: ST_grad, HT_Exports. Group 3: GERD_govern. How can we combine the information provided by all the vars to compare the innovation/research performance of each country? Should we consider the means for the previously observed groups OF VARIABLES? Are they sufficient to explain ALL the vars? Should we consider the 3 means, one for each group, and compare obs on the basis of them? Which is the most important index/mean? Should the 3 indices have the same weight when comparing variables? What if we want a single index? Is it possible, and how much information do we lose?
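As a concrete version of the "one mean per group" idea questioned above, here is a minimal sketch that computes the three group means for two countries. The values are the standardized figures for Sweden and Greece from the Example 1 table. The function name `group_means` and the equal weighting inside each group are illustrative assumptions, which is exactly the kind of subjective choice the slide is asking about.

```python
from statistics import mean

# Standardized values copied from the Example 1 table (two countries only).
data = {
    "Sweden": {"GERD": 2.42, "GERD_industry": 1.52, "Internet_Acc": 1.54,
               "EPO": 1.47, "Educ_Exp": 1.79, "E_gov_avail": 1.88,
               "ST_grad": 0.27, "HT_Exports": 0.01, "GERD_govern": -1.45},
    "Greece": {"GERD": -1.01, "GERD_industry": -1.77, "Internet_Acc": -1.14,
               "EPO": -1.05, "Educ_Exp": -1.10, "E_gov_avail": -1.04,
               "ST_grad": -0.71, "HT_Exports": -0.72, "GERD_govern": 1.01},
}

# Groups of variables as listed on the slide.
groups = {
    "group1": ["GERD", "GERD_industry", "Internet_Acc", "EPO",
               "Educ_Exp", "E_gov_avail"],
    "group2": ["ST_grad", "HT_Exports"],
    "group3": ["GERD_govern"],
}


def group_means(row, groups):
    """Equal-weight mean of the variables in each group (an assumption)."""
    return {name: mean(row[v] for v in variables)
            for name, variables in groups.items()}


indices = {country: group_means(row, groups) for country, row in data.items()}
for country, idx in indices.items():
    print(country, {k: round(v, 3) for k, v in idx.items()})
```

The two countries end up with opposite signs on every index, but the sketch does not say how much of the original information the three means retain, nor how they should be weighted.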
Multivariate Data Analysis – Motivation Example 1 (continued). Innovation and Research in Europe. How can we study the relationships among all the variables? Things become complicated when we consider more vars/obs. FINDING GROUPS OF VARIABLES WITH SIMILAR PATTERNS IS DIFFICULT.
Multivariate Data Analysis – Motivation - Vars High number of (numerical) variables: <ul><ul><li>Analyzing the relationships among variables</li></ul></ul><ul><ul><li>Synthesizing the variables</li></ul></ul>Techniques: Principal Component Analysis, Factor Analysis
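To give a feel for what "synthesizing the variables" means, the sketch below extracts only the first principal component of three variables (GERD, GERD_industry, GERD_govern) for six countries from the Example 1 table. It uses power iteration on the covariance matrix in plain Python. This is an illustrative shortcut, not the SAS procedure used in the course, and the choice of variables and countries is an assumption made to keep the example small.

```python
from math import sqrt

# Rows: Sweden, Finland, Italy, Greece, Romania, France.
# Columns: GERD, GERD_industry, GERD_govern (values from the Example 1 table).
rows = [
    [2.42, 1.52, -1.45],
    [1.52, 1.46, -1.04],
    [-0.58, -0.56, 1.03],
    [-1.01, -1.77, 1.01],
    [-1.24, -0.52, 0.66],
    [0.47, 0.04, 0.07],
]


def covariance_matrix(rows):
    """Sample covariance matrix (n - 1 denominator) of the columns."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(cov, iterations=500):
    """Largest eigenvalue and unit eigenvector of cov via power iteration."""
    p = len(cov)
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, v


cov = covariance_matrix(rows)
eigenvalue, loadings = first_component(cov)
share = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print("loadings:", [round(x, 3) for x in loadings])
print("share of total variance:", round(share, 3))
```

The loadings are the weights of a single data-driven index, and `share` is the fraction of the total variance that this index retains, which speaks to the earlier question of how much information is lost when the variables are summarized.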
Multivariate Data Analysis – Motivation Example 1 (continued). Innovation and Research in Europe (subset). How can we describe the main tendencies of European countries with respect to innovation? Are there countries with similar characteristics? Which are the main patterns/profiles in this data set? Obtain a line plot FOR OBSERVATIONS: a line is associated with each observation. We can observe groups of obs with similar tendencies (for example the orange-red ones). Tendencies are similar only with respect to some vars. Which vars should be given most weight? Who is “close” to whom? How can we describe in a simple way the similarity or dissimilarity between countries?
Multivariate Data Analysis – Motivation Example 1 (continued). Innovation and Research in Europe (subset). How can we identify groups of cases (countries) with similar characteristics? Sometimes the grouping is obtained on the basis of a priori knowledge. In this case, for example, we can group by region. Here, however, grouping obs by region is not a good idea: countries in the same region show different patterns.
Multivariate Data Analysis – Motivation Example1 (continued). Innovation and Research in Europe. How can we describe the main tendencies of European countries wrt innovation? Things become complicated when we consider more vars/obs. FINDING GROUPS OF OBSERVATIONS WITH SIMILAR PATTERNS IS DIFFICULT
Multivariate Data Analysis – Motivation High number of observations (variables either numerical or categorical): <ul><ul><li>Describing cases</li></ul></ul><ul><ul><li>Analysis of similarity/dissimilarity between cases</li></ul></ul><ul><ul><li>Identification of the main tendencies (groups of cases) in a data set</li></ul></ul>Techniques: Cluster Analysis (finding groups); Factor Analysis/Multidimensional Scaling (visualizing differences)
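Measuring similarity/dissimilarity between cases usually starts from a distance between observations. The sketch below computes Euclidean distances between three countries using all ten standardized variables from the Example 1 table. Euclidean distance is one common choice, assumed here for illustration, and the helper `nearest` is a hypothetical name, not part of any course material.

```python
from math import dist  # Euclidean distance, available from Python 3.8

# Standardized values from the Example 1 table, in the order:
# HT_Exports, E_gov_avail, EPO, ST_grad, Internet_Acc,
# GERD_abroad, GERD_govern, GERD_industry, GERD, Educ_Exp
countries = {
    "Sweden":  [0.01, 1.88, 1.47, 0.27, 1.54, -0.85, -1.45, 1.52, 2.42, 1.79],
    "Finland": [0.74, 1.39, 1.61, 1.03, 0.49, -1.01, -1.04, 1.46, 1.52, 1.00],
    "Greece":  [-0.72, -1.04, -1.05, -0.71, -1.14, 1.94, 1.01, -1.77, -1.01, -1.10],
}


def nearest(name, table):
    """Return the country whose profile is closest to `name`."""
    others = {k: dist(table[name], v) for k, v in table.items() if k != name}
    return min(others, key=others.get)


d_sf = dist(countries["Sweden"], countries["Finland"])
d_sg = dist(countries["Sweden"], countries["Greece"])
print(round(d_sf, 2), round(d_sg, 2), nearest("Sweden", countries))
```

On these values Sweden is much closer to Finland than to Greece. Cluster analysis builds groups from a full matrix of such pairwise distances.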
Multivariate Data Analysis – Motivation Example 2. Information about projects financed by the EU in 1995-1996. Variables (name: description):
<ul>
<li>Record: Project id</li>
</ul>
Information about the project:
<ul>
<li>Size: Number of organisations involved in the project</li>
<li>Topic: Topic of the project</li>
<li>Duration: Duration of the project</li>
</ul>
Information about the Responsible (the organization coordinating the project):
<ul>
<li>Proj_resp_1995: Number of projects coordinated by the Responsible before 1995</li>
<li>Proj_resp_end_1995: Number of projects coordinated by the Responsible that ended before 1995</li>
<li>Activity_partner: Evaluation of the activity of the Responsible as a partner in other projects before 1995 (8-point scale; 1 = very poor, 8 = excellent)</li>
<li>EMP: Employees of the Responsible</li>
<li>REV: Revenues of the Responsible</li>
<li>Type: Type of organisation (Industry, Education, Research, Commercial) of the Responsible</li>
<li>Country: Nationality of the Responsible</li>
</ul>
Multivariate Data Analysis Example 2 (continued). Projects financed by the EU in 1995-1996 (partial input). Is there an association between the country, the type of organization and the topic? Are there organizations/countries specialized in particular topics? If there is an association, what is it due to? Who is attracted by what?
RECORD  TOPIC                 SIZE  DURATION  PROJ_RESP_END_1995  PROJ_RESP_1995  ACTIVITY_PARTNER    EMP       REV  TYPE            COUNTRY
27410   STANDARDS                5        30                   0               3                 1   2633    248824  Industry        UK
24175   TELECOMMUNICATIONS       4        18                   1               1                 1   1394    208312  Industry        Netherlands
24174   TELECOMMUNICATIONS       3        18                   4               6                 2    363     18947  Industry        UK
24171   TELECOMMUNICATIONS       6        24                   0               2                 2    259      5706  Industry        Italy
23988   SAFETY                   7        33                   0               1                 1    199     23859  Industry        Italy
23985   SAFETY                   4        24                   0               7                 7  53164  15297220  Education       UK
23806   SAFETY                  10        24                   0              10                 7    594    168066  Non Commercial  France
23770   NATURAL_RESOURCES        6        36                   0               2                 2     12       974  Research        France
23682   ENERGY                   4        24                   1               3                 6  10343   4547875  Education       Germany
23611   ENERGY                   3        24                   0               1                 2  78701  15930801  Education       Germany
23601   NATURAL_RESOURCES        7        36                   0               1                 2    163     18400  Research        Netherlands
23596   NATURAL_RESOURCES        5        24                   0               3                 1    572     99404  Research        UK
23590   ENERGY                   5        18                   1               6                 6  34217   9969376  Industry        Italy
23386   MATERIALS TECHNOLOGY    15        24                   2              10                 6    310     39707  Education       Belgium
23376   MATERIALS TECHNOLOGY     6        24                   0               1                 2     49      6353  Research        France
Multivariate Data Analysis – Motivation Categorical variables (two or more) with many values: <ul><ul><li>Describing the association between categorical variables, i.e., understanding the main attraction/repulsion forces between categories</li></ul></ul><ul><ul><li>Identification of profiles of categories (i.e., typical combinations of categories)</li></ul></ul>Technique: Correspondence Analysis (simple and multiple)
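The association between two categorical variables is usually read off a contingency table and its row profiles, which are also the starting point of correspondence analysis. The sketch below builds the TYPE by TOPIC table from the 15 projects shown in Example 2 (partial input). It stops at the row profiles and does not perform correspondence analysis itself. The helper `row_profile` is a hypothetical name, and with so few records the profiles are only illustrative.

```python
from collections import Counter

# (TYPE, TOPIC) pairs for the 15 projects listed in Example 2 (partial input).
records = [
    ("Industry", "STANDARDS"),
    ("Industry", "TELECOMMUNICATIONS"),
    ("Industry", "TELECOMMUNICATIONS"),
    ("Industry", "TELECOMMUNICATIONS"),
    ("Industry", "SAFETY"),
    ("Education", "SAFETY"),
    ("Non Commercial", "SAFETY"),
    ("Research", "NATURAL_RESOURCES"),
    ("Education", "ENERGY"),
    ("Education", "ENERGY"),
    ("Research", "NATURAL_RESOURCES"),
    ("Research", "NATURAL_RESOURCES"),
    ("Industry", "ENERGY"),
    ("Education", "MATERIALS TECHNOLOGY"),
    ("Research", "MATERIALS TECHNOLOGY"),
]

counts = Counter(records)                       # joint frequencies
row_totals = Counter(t for t, _ in records)     # marginal frequencies of TYPE


def row_profile(org_type):
    """Conditional distribution of TOPIC given the organisation TYPE."""
    return {topic: n / row_totals[org_type]
            for (t, topic), n in counts.items() if t == org_type}


for org_type in row_totals:
    print(org_type, row_profile(org_type))
```

In this small extract, half of the Industry projects are in TELECOMMUNICATIONS and three quarters of the Research projects are in NATURAL_RESOURCES, the kind of "attraction" between categories that correspondence analysis summarizes for larger tables.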
Multivariate Data Analysis <ul><li>When dealing with many vars and/or obs it may be difficult to:</li></ul><ul><li>Describe, analyze and synthesize obs taking into account all the vars, identifying “typical” cases or tendencies in the obs</li></ul><ul><li>Study the relationships among vars and/or synthesize them jointly</li></ul><ul><li>One option is to group vars and/or obs according to some “natural” or somehow “intuitive” rules (e.g., the mean for the variables, the region or the wealth for countries, and so on)</li></ul><ul><li>These approaches: are subjective</li></ul><ul><li>Reproduce what we already know about the data and do not help us learn more about them</li></ul><ul><li>Sometimes cannot be applied (no natural grouping available, or similar patterns are difficult to identify)</li></ul>
Multivariate Data Analysis The aim of Multivariate Statistical Techniques is to extract the information contained in a given data set by simplifying and summarizing observations and/or variables using DATA-DRIVEN TOOLS. The tool (i.e., the compression/simplification/synthesis of data) used to make information available depends on the aim of the analysis and on the nature of the variables taken into account.