Data Analysis II: Explaining Observed Differences – Cross Tabulation, Correlation and Regression Lesson 16
Explaining Variation with Dependent and Independent Variables Why are there differences or variations in the data? Product usage, preferences and attitudes depend partly on marketing activities; hence such variables are called dependent variables. An independent variable is one which the researcher believes can explain the differences or variations which occur in a dependent variable, e.g., a brand’s price, package and advertising.
Assumptions The data to be analyzed are obtained from descriptive studies, not from experiments. The data are from very large samples, usually in excess of 300 and frequently as large as 1000. The data include measures on a number of variables for each respondent.
Method of Analysis Analyze patterns of change which are common to both a dependent variable and one or more independent variables. Cross-tabulation is applicable to data in which the dependent variable and the independent variables are categorical variables, or continuous variables which have been placed into categories. Correlation and regression analysis are commonly applied to situations where both the dependent variable and the independent variables are continuous.
Cross-Tabulation Independent variables in cross-tabulations include the respondents’ age, amount of education, household income, size of family and occupation. In the cells of a cross-tabulation researchers typically show percentages as well as actual counts of the number of different responses given.
Constructing and Interpreting a Cross-Tabulation Cross-tabulation is used on both types of categorical variables: those that are categorical by nature and continuous variables that have been placed into categories. Assign the categories of the dependent variable to the rows of the cross-tabulation and the categories of the independent variable(s) to the columns. Assign the top row to the dependent variable category with the largest quantifiable number; each succeeding row is assigned a category with a progressively lower quantifiable number. Alternatively, the top row can be assigned to the highest or most desirable category and the bottom row to the lowest or most undesirable category. The independent variable category with the largest quantifiable number is assigned to the rightmost column, and the category with the smallest quantifiable number to the leftmost column. Alternatively, assign the rightmost column to the highest or most desirable category and the leftmost column to the lowest or most undesirable category. The percentages in each column total 100%. To interpret the cross-tabulation, analyze the pattern of percentages across each row separately. If the percentages increase from left to right, the dependent variable category is positively associated with the independent variable. If the percentages decrease from left to right, the dependent variable category is negatively associated with the independent variable.
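The construction rules above can be sketched in a few lines of Python. The counts, the category labels and the names `counts`, `column_percentages` and `row_trend` are hypothetical and serve only as an illustration; they are not taken from the slides.

```python
# Hypothetical counts: rows are dependent-variable categories (highest first),
# columns are independent-variable categories (lowest on the left).
income_groups = ["low", "medium", "high"]
counts = {
    "heavy user": [10, 30, 45],
    "light user": [40, 30, 15],
}

def column_percentages(table):
    """Convert counts to percentages so that every column totals 100%."""
    col_totals = [sum(col) for col in zip(*table.values())]
    return {
        label: [100 * count / total for count, total in zip(row, col_totals)]
        for label, row in table.items()
    }

def row_trend(percentages):
    """Read one row from left to right and classify the association."""
    pairs = list(zip(percentages, percentages[1:]))
    if all(left < right for left, right in pairs):
        return "positive"
    if all(left > right for left, right in pairs):
        return "negative"
    return "no clear pattern"

percentages = column_percentages(counts)
for label, row in percentages.items():
    print(label, row, row_trend(row))
```

Each row is read separately, as the slide recommends: a row whose percentages rise from left to right is reported as positively associated with the independent variable.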
Three Useful Questions Does the cross-tabulation show a valid or spurious relationship? How many independent variables should be used in the cross-tabulation? Are the differences seen in the cross-tabulation statistically significant, or could they have occurred by chance due to sampling variation?
Cross-Tabulation shows a valid explanation Changes in the independent variables are believed to cause changes, or to explain variation, in the dependent variable.
Cross-Tabulation shows a spurious explanation A relationship is spurious if the implied relationship between the dependent and independent variables does not seem to be logical.
How many independent variables should be used? Three or four variables at most.
Are the differences statistically significant? If a cross-tabulation reveals an interesting relationship between a dependent and an independent variable, a chi-square analysis can be used to test whether the observed differences are statistically significant. The expected count in the i-th cell is
$$E_i = \frac{(\text{total observations in the row containing cell } i) \times (\text{total observations in the column containing cell } i)}{\text{total observations in all cells}}$$
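Creating Expected Count For example, the expected count for the first cell is 100 × 54 / 200 = 27. The expected counts in the two rows of the table are 27, 26, 23, 24 and 27, 26, 23, 24.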
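A minimal sketch of the expected-count rule and the chi-square statistic built from it follows. The observed counts are hypothetical: they were chosen only so that the row totals (100 and 100) and column totals (54, 52, 46, 48) reproduce the expected counts shown on the slide. The function names are likewise illustrative.

```python
# Hypothetical observed counts (not from the slides): 2 rows x 4 columns.
observed = [
    [30, 28, 22, 20],
    [24, 24, 24, 28],
]

def expected_counts(table):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(table):
    """Sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

expected = expected_counts(observed)
statistic = chi_square(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
print(expected)
print(round(statistic, 3), degrees_of_freedom)
```

The statistic would then be compared with a chi-square critical value for the given degrees of freedom; that lookup is left out of the sketch.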
Introductory Comments on Correlation and Regression Analysis Correlation and regression analysis can be used in situations where both the dependent and independent variables are continuous, for example to examine the relationship between milk consumption and both age and income. They give more accurate representations of the relationships between variables, and their results are arrived at more objectively than similar results from cross-tabulations.
CORRELATION Regression and correlation can be applied to data with the following properties: the variables are continuous; more than one variable is measured for each respondent; and the number of respondents is greater than the number of variables.
Correlation Analysis A positive relationship is said to exist between two variables when larger values of one variable are associated with larger values of the other variable. Thus, wine consumption and income are said to be positively related if higher levels of wine consumption tend to be associated with higher income. Wine consumption and income are said to be negatively related if lower levels of wine consumption tend to be associated with higher income.
Correlation Coefficient r A measure of the relationship between two variables is the correlation coefficient r. Perfect positive correlation is indicated by r = +1.00, perfect negative correlation by r = -1.00, and no relationship by r = 0.
$$r = \frac{\sum (Y_i - \bar{Y})(X_i - \bar{X})}{\sqrt{\sum (Y_i - \bar{Y})^2 \, \sum (X_i - \bar{X})^2}}$$
where $\bar{Y}$ = average wine consumption, $\bar{X}$ = average income, $Y_i$ = wine consumption of individual households (HHs), and $X_i$ = income of individual households.
Household Wine Consumption and Income : City A & B
Scatter Diagram of Wine Consumption and Income for City A (figure not reproduced; horizontal axis: annual income, $000)
Scatter Diagram of Wine Consumption and Income for City B (figure not reproduced; horizontal axis: annual income, $000)
Calculations
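Applying the correlation coefficient formula for Cities A and B
$$r = \frac{\sum (Y_i - \bar{Y})(X_i - \bar{X})}{\sqrt{\sum (Y_i - \bar{Y})^2 \, \sum (X_i - \bar{X})^2}}$$
City A: $r = \frac{8}{\sqrt{(10)(10)}} = \frac{8}{10} = +0.80$
City B: $r = \frac{10}{\sqrt{(10)(10)}} = \frac{10}{10} = +1.00$ (perfect positive correlation)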
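The formula can be sketched directly in Python. The household data table is not part of this transcript, so the values below are hypothetical; they were picked so that the sums of squares (10 and 10) and the cross-products (8 and 10) match the figures used in the calculation above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r from deviations about the two means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((y - mean_y) * (x - mean_x) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_y * ss_x)

# Hypothetical data: annual income in $000 and wine consumption per household.
income = [12, 13, 14, 15, 16]
wine_city_a = [2, 1, 3, 5, 4]   # cross-product sum 8
wine_city_b = [1, 2, 3, 4, 5]   # cross-product sum 10

r_a = pearson_r(income, wine_city_a)
r_b = pearson_r(income, wine_city_b)
print(round(r_a, 2), round(r_b, 2))
```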
Interpretation of r If r is 0.8 or larger (disregarding sign), there is a very strong or high relationship between the variables. If r is between 0.4 and 0.8 (disregarding sign), the relationship between the variables is considered moderate to high. For lower values of r, the relationship is small to insignificant: when a correlation analysis results in an r of less than 0.4, researchers do not have strong evidence that there is a relationship between the dependent and independent variables. The + or – sign on the correlation coefficient indicates whether the relationship is positive or negative. For example, r = -0.8 or -1.00 means that wine consumption is lower among households with higher income. Correlation analysis, like cross-tabulation, attempts to identify patterns of variation common to a dependent variable and an independent variable.
REGRESSION ANALYSIS The correlation coefficient is a summary measure which indicates the relative strength and the + or – direction of a relationship between two variables, but it does not describe the underlying relationship. It cannot be used to predict the size of the change to expect in the dependent variable if the independent variable is changed by one unit, i.e., how large a change in wine consumption would occur with a given increase in income. Hence, an equation is needed.
Regression Analysis Regression analysis is a technique whereby a mathematical equation is “fitted” to a set of data. Four elements are involved: the set of data; the mathematical equation; the technique which fits the equation to the data; and how the equation is evaluated to see how well it “describes” the data.
City C Data
Data Set Data set must consist of measures of two or more continuous variables and the sample size must be at least two or three times as large as the number of measured variables, preferably much larger.
Mathematical Equation In most applications of regression analysis, the equation which is used is that of a straight line. The general form of the equation of a straight line is Y = a + bX, where: Y = the dependent variable; X = the independent variable; b = a coefficient which indicates the effect on Y of a one unit change in X (that is, if b = +8.2, a one unit increase in X corresponds to an 8.2 unit increase in Y); a = the coefficient which identifies the value of Y when the value of X is equal to 0 (that is, a is the value of Y at which the straight line intersects the Y axis).
Linear Regression Equation A simple linear regression equation can be applied to the wine consumption data available from City C: Y = a + bX, where: Y = the predicted values of the dependent variable, that is, monthly household wine consumption in quarts; X = the observed values of the independent variable, i.e., the annual household income in thousands of dollars; b = a coefficient which indicates by how much household wine consumption in quarts is expected to increase with a $1000 increase in annual household income; a = the point at which the straight line intersects the Y axis.
Pictorial Presentation of Regression Analysis (figure not reproduced: scatter diagram with a fitted regression line; vertical axis: household wine consumption Y; horizontal axis: household annual income X; the location of each dot shows one household’s wine consumption and its annual income)
Fitting the equation to the observed data Plot all wine consumption and annual income data from a very large sample on a scatter diagram. Envision fitting a line through the points in the manner which results in “the best possible fit”. The fitted regression line can be viewed as a “predictor line” in the sense that it “predicts” household wine consumption for each value of annual household income. For every observed value of annual household income Xi, the regression line provides a predicted value of wine consumption Y*i. The difference between the wine consumption reported by household i (Yi) and its predicted wine consumption (Y*i) is (Yi - Y*i), and this difference is called a residual. The procedure commonly used to calculate the regression line which “best fits” a particular set of data is called the “least squares method”. This procedure identifies the one equation which, when fitted to the observed data, minimizes the sum of the squares of all residuals. That is, the procedure minimizes $\sum_i (Y_i - Y^*_i)^2$.
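Calculating a and b values $\bar{X} = \sum X_i / n = 70/5 = 14$ and $\bar{Y} = \sum Y_i / n = 15/5 = 3$
Calculating a and b
$$b = \frac{n \sum X_i Y_i - \left(\sum X_i\right)\left(\sum Y_i\right)}{n \sum X_i^2 - \left(\sum X_i\right)^2} = \frac{5(226) - (70)(15)}{5(1020) - (70)^2} = \frac{1130 - 1050}{5100 - 4900} = \frac{80}{200} = +0.40$$
Substituting into $a = \bar{Y} - b\bar{X}$ gives $a = 3 - (0.40)(14) = 3 - 5.6 = -2.60$. Completing the regression equation: Y = -2.60 + 0.40X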
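The same least-squares arithmetic can be sketched in Python. The City C data table is not included in this transcript, so the five households below are hypothetical, chosen so that their sums match the ones used above (ΣX = 70, ΣY = 15, ΣXY = 226, ΣX² = 1020).

```python
# Hypothetical City C data reproducing the sums on the slides.
income = [10, 12, 14, 16, 18]   # annual household income, $000
wine = [2, 1, 3, 5, 4]          # monthly wine consumption, quarts

n = len(income)
sum_x = sum(income)                                # 70
sum_y = sum(wine)                                  # 15
sum_xy = sum(x * y for x, y in zip(income, wine))  # 226
sum_xx = sum(x * x for x in income)                # 1020

# Slope and intercept of the least-squares line Y = a + bX.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a = sum_y / n - b * (sum_x / n)

print(round(a, 2), round(b, 2))
```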
a, b and slope The regression line intersects the Y axis at -2.60, that is, at the value of the a coefficient. The b coefficient indicates that for each $1000 increase in annual household income, monthly wine consumption is predicted to increase by 0.40 quarts. Graphically, the regression line has a slope of +0.40, so it rises to the right.
Advantages of Regression The result demonstrates the advantage of a regression analysis over a correlation analysis. With r = +0.80, the correlation analysis only identifies the presence of a strong positive relationship between wine consumption and income. The regression analysis leads to a more complete description of the relationship; for example, predicted monthly wine consumption for selected annual incomes is shown in the table above. Because the b coefficient indicates by how much the dependent variable will change for a given change in the independent variable, a regression equation is a type of descriptive relationship which can help researchers arrive at a better understanding of variation in the dependent variable.
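As an illustration of how the fitted equation is used for prediction, the sketch below evaluates Y = -2.60 + 0.40X at a few incomes. The function name `predict_wine` and the selected incomes are assumptions made for this example; the slide's own table of selected incomes is not part of the transcript.

```python
def predict_wine(income_thousands, a=-2.60, b=0.40):
    """Predicted monthly wine consumption (quarts) from Y = a + bX."""
    return a + b * income_thousands

# Hypothetical selection of annual incomes, in $000.
selected_incomes = [10, 15, 20]
predictions = {x: round(predict_wine(x), 2) for x in selected_incomes}
print(predictions)

# A $1000 increase in income changes the prediction by b quarts.
step = predict_wine(16) - predict_wine(15)
print(round(step, 2))
```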
Evaluating the Regression Equation Regression procedures also calculate a measure called the “coefficient of determination”, which is identified as R². This coefficient ranges from 0 to a maximum value of 1.00. An R² value of 1.00 indicates that the regression equation “explains” 100 percent of the variance in the dependent variable about its mean. This variance would be explained perfectly if every dot in the scatter diagram fell precisely on the regression line, that is, if all of the residuals were equal to 0. When the regression equation does not fit the data perfectly, some of the residuals will differ from 0. Those residuals form a distribution around the regression line, and this distribution can be used as a measure of how much variance is “unexplained” by the regression equation.
Regression Equation
$$R^2 = \frac{(\text{total variance in the dependent variable}) - (\text{variance unexplained by the regression equation})}{\text{total variance in the dependent variable}} = \frac{\sum (Y_i - \bar{Y})^2 - \sum (Y_i - Y^*_i)^2}{\sum (Y_i - \bar{Y})^2}$$
If the regression line explains all the variance in Y, all the residuals will be 0, the variance “unexplained” by the regression equation will be 0, and the coefficient of determination will be
$$R^2 = \frac{(\text{total variance in the dependent variable}) - 0}{\text{total variance in the dependent variable}} = 1.00$$
R² values in the 0.50-1.00 range are usually interpreted to mean that the regression equation does a good job of explaining the Y variation.
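Calculating R² for the Wine Consumption Regression Equation for City C With $\bar{Y} = \sum Y_i / n = 15/5 = 3$, the total variation is $\sum (Y_i - \bar{Y})^2 = 10$ and the unexplained variation is $\sum (Y_i - Y^*_i)^2 = 3.6$, so $R^2 = (10 - 3.6)/10 = 0.64$.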
Interpretation of R² Values in Explaining Variance The equation Y*i = -2.60 + 0.40Xi is capable of explaining about 64 percent of the total variance observed in the dependent variable, monthly household wine consumption. Put differently, 36 percent of the total variance in household wine consumption is “unexplained” by the regression equation.
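The R² calculation can be checked with a short sketch. The five households are hypothetical (the City C table is not in this transcript); they were chosen to reproduce the slide's totals of 10 for total variation and 3.6 for unexplained variation.

```python
# Hypothetical City C data consistent with the sums on the slides.
income = [10, 12, 14, 16, 18]   # annual household income, $000
wine = [2, 1, 3, 5, 4]          # monthly wine consumption, quarts
a, b = -2.60, 0.40              # fitted coefficients from the slides

mean_y = sum(wine) / len(wine)
predicted = [a + b * x for x in income]

# Total variation about the mean and variation left in the residuals.
total = sum((y - mean_y) ** 2 for y in wine)
unexplained = sum((y - p) ** 2 for y, p in zip(wine, predicted))

r_squared = (total - unexplained) / total
print(round(total, 2), round(unexplained, 2), round(r_squared, 2))
```

With a single independent variable, R² equals the square of the correlation coefficient, which agrees with r = +0.80 reported for this equation.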
R² Explained If the regression line does not explain any of the variance in Y, the residuals will be large, and the variance “unexplained” by the regression equation will be approximately equal to the total Y variance. The two terms in the numerator of the R² formula will then be equal, so the numerator will be 0 and R² will be 0. An R² value approximating 0 indicates that the regression equation does not explain any of the variance observed in Y. In general, R² values of 0.25 or less indicate that the regression equation is of little use in explaining variance. Regression equations with R² values in the 0.25-0.50 range are typically judged to be of only moderate use in explaining the variance observed in a dependent variable.
Multiple Linear Regression Simple Linear Regression uses one independent variable. When two or more independent variables are used in a linear regression analysis, it is called multiple linear regression. Y=a+bX1+cX2+dX3+… Where Y is the dependent variable and X1,X2,X3… are independent variables. The additional coefficients (c,d) are similar to the b coefficient, except that they are associated with independent variables X2 and X3.
Calculations needed for the Multiple Linear Regression of Wine Consumption in City C $\bar{Y} = 15/5 = 3$, $\bar{X}_1 = 70/5 = 14$, $\bar{X}_2 = 150/5 = 30$
Calculations
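Calculating b Using deviations from the means, $y = Y - \bar{Y}$, $x_1 = X_1 - \bar{X}_1$, $x_2 = X_2 - \bar{X}_2$:
$$b = \frac{(\sum y x_1)(\sum x_2^2) - (\sum y x_2)(\sum x_1 x_2)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2} = \frac{(16)(74) - (25)(46)}{(40)(74) - (46)^2} = \frac{1184 - 1150}{2960 - 2116} = \frac{34}{844} \approx 0.0402$$
Calculating c
$$c = \frac{(\sum y x_2)(\sum x_1^2) - (\sum y x_1)(\sum x_1 x_2)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2} = \frac{(25)(40) - (16)(46)}{(40)(74) - (46)^2} = \frac{1000 - 736}{2960 - 2116} = \frac{264}{844} \approx 0.312$$
Calculating a $a = \bar{Y} - b\bar{X}_1 - c\bar{X}_2 = 3.0 - (0.0402)(14) - (0.312)(30) \approx -6.923$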
The REGRESSION EQUATION Y* = -6.923 + 0.0402X1 + 0.312X2 From this equation, researchers see that a greater increase in wine consumption is associated with a one-year increase in age (0.312 quarts) than with a $1000 increase in annual income (0.0402 quarts). As in simple linear regression, the a coefficient (-6.923) should be interpreted as a structural coefficient.
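The two-variable formulas can be sketched using only the deviation sums and means quoted on the slides; the variable names are illustrative. Note that carrying the unrounded coefficients through gives an intercept near -6.95, whereas the slide's -6.923 is obtained from b and c rounded to 0.0402 and 0.312.

```python
# Deviation sums and means as quoted on the slides.
sum_yx1, sum_yx2 = 16, 25
sum_x1x1, sum_x2x2, sum_x1x2 = 40, 74, 46
mean_y, mean_x1, mean_x2 = 3, 14, 30

# Common denominator of the two slope formulas.
denominator = sum_x1x1 * sum_x2x2 - sum_x1x2 ** 2   # 844

b = (sum_yx1 * sum_x2x2 - sum_yx2 * sum_x1x2) / denominator   # 34 / 844
c = (sum_yx2 * sum_x1x1 - sum_yx1 * sum_x1x2) / denominator   # 264 / 844

# Intercept from the unrounded slopes.
a = mean_y - b * mean_x1 - c * mean_x2

# Intercept as on the slide, from the rounded slopes.
a_slide = mean_y - 0.0402 * mean_x1 - 0.312 * mean_x2

print(round(b, 4), round(c, 4), round(a, 3), round(a_slide, 3))
```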
Stepwise Multiple Linear Regression Suppose five independent variables (X1, X2, X3, X4, X5) are available. Stepwise multiple regression first evaluates each independent variable separately to determine which one results in the largest R², that is, which one explains most of the variation in the dependent variable. If it is X3, that independent variable is selected for the regression equation.
Stepwise Multiple Linear Regression Next, the stepwise regression evaluates X3, in combination with each of the remaining independent variables (one at a time) to determine which one of the latter results in the largest increase in R². If it is X5, that variable becomes the second independent variable selected for the regression equation. Likewise, remaining variables are evaluated one at a time with the two already selected ones to determine which one is the third variable. This continues until R² can no longer be increased significantly by adding another independent variable to the regression equation.  The independent variables which are not selected do not become part of the regression equation.
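The selection procedure described above can be sketched as a forward stepwise loop in plain Python. Everything here is an assumption made for illustration: the data are invented, the helper names are hypothetical, and the stopping rule is a simple minimum gain in R² (`min_gain`) standing in for the significance test a statistics package would apply.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def r_squared(columns, y):
    """R² of the least-squares fit of y on the columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    XtX = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)]
           for p in range(k)]
    Xty = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    coef = solve(XtX, Xty)
    predicted = [sum(c * v for c, v in zip(coef, row)) for row in X]
    mean_y = sum(y) / n
    total = sum((v - mean_y) ** 2 for v in y)
    unexplained = sum((v - p) ** 2 for v, p in zip(y, predicted))
    return 1 - unexplained / total

def forward_stepwise(candidates, y, min_gain=0.01):
    """Add the variable with the largest R² gain until the gain is too small."""
    selected, best_r2 = [], 0.0
    remaining = dict(candidates)
    while remaining:
        trial = {name: r_squared([candidates[s] for s in selected] + [col], y)
                 for name, col in remaining.items()}
        name = max(trial, key=trial.get)
        if trial[name] - best_r2 < min_gain:
            break
        selected.append(name)
        best_r2 = trial[name]
        del remaining[name]
    return selected, best_r2

# Hypothetical data: y is built as 2*x3 + 3*x5, the other variables are unrelated.
candidates = {
    "x1": [3, 1, 4, 1, 5, 9, 2, 6],
    "x2": [2, 7, 1, 8, 2, 8, 1, 8],
    "x3": [1, 2, 3, 4, 5, 6, 7, 8],
    "x4": [5, 3, 8, 6, 1, 7, 4, 2],
    "x5": [1, -1, 1, -1, 1, -1, 1, -1],
}
y = [5, 1, 9, 5, 13, 9, 17, 13]

selected, best_r2 = forward_stepwise(candidates, y)
print(selected, round(best_r2, 3))
```

In this constructed example the loop picks x3 first, then x5, and stops because no remaining variable raises R² by at least `min_gain`, mirroring the order described on the slides.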
Problems in using Regression Analysis Inadequate sample size: the sample should be at least two to three times the number of variables. The independent variables measured do not have a direct effect on the dependent variable. The independent variables are highly correlated with each other; their effect is then the same as that of a single variable used twice in the equation. The relationship between the dependent and independent variables is not linear, i.e., it has an unusual shape, and hence cannot be analyzed by the linear regression techniques described here.

  • 1. Data Analysis II: Explaining Observed Differences – Cross Tabulation, Correlation and Regression Lesson 16
  • 2. Explaining Variation with Dependent and Independent Variables Why there are differences or variations in the data? Product usage, preferences and attitudes are partially dependent upon the marketing activities hence such variables are called Dependent Variables. Independent Variable is one which the researcher believes can explain the differences or variations which occur in dependent variables. Eg., Brand’s price, package and advertising.
  • 3. Assumptions The data to be analyzed are obtained from descriptive studies, not from experiments. The data are from very large samples, usually in excess of 300 and frequently as large as 1000. The data includes measures on a number of variables for each respondent.
  • 4. Method of Analysis Analyze patterns of change hich are common to both a dependent variable and one or more independent variables. Cross-tabulation is applicable to data in which the dependent variable and the indpendent variables are categorical variables or continuous variables which have been placed into categories. Correlation and regression analysis are commonly applied to situations where both the dependent variable and the independent variables are continuous.
  • 5. Cross-Tabulation Independent variables in cross-tabulations include the respondents’ age, amount of education, household income, size of family and occupation. In the cells of a cross-tabulation researchers typically show percentages as well as actual counts of the number of different responses given.
  • 6. Constructing and Interpreting a Cross-Tabulation: Cross-tabulation is used with both kinds of categorical variables: those that are categorical to begin with and continuous variables which have been placed into categories. Assign the categories of the dependent variable to the rows of the cross-tabulation and the categories of the independent variable(s) to the columns. Assign the top row to the dependent variable category with the largest quantifiable number; each succeeding row is assigned a category with a progressively lower quantifiable number. Alternatively, the top row can be assigned to the highest or most desirable category and the bottom row to the lowest or most undesirable category. The independent variable category with the largest quantifiable number is assigned to the rightmost column and the category with the smallest quantifiable number to the leftmost column. Alternatively, assign the rightmost column to the highest or most desirable category and the leftmost column to the lowest or most undesirable category. The percentages in each column total 100%. To interpret the cross-tabulation, analyze the pattern of percentages across each row separately. If the percentages increase from left to right, the dependent variable category is positively associated with the independent variable. If the percentages decrease from left to right, the dependent variable category is negatively associated with the independent variable.
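The row-by-row reading rule on slide 6 can be sketched in a few lines of Python. The counts, the category labels and the helper names `column_percentages` and `direction` are all hypothetical, chosen only to illustrate the layout (dependent variable in rows, independent variable in columns ordered from lowest to highest); they are not taken from the lesson's data.

```python
# Hypothetical counts (not from the lesson): rows are categories of the
# dependent variable, columns are income bands ordered lowest -> highest.
income_bands = ["Lowest", "Lower-middle", "Upper-middle", "Highest"]
counts = {
    "Uses product": [35, 30, 20, 15],
    "Does not use": [19, 22, 26, 33],
}


def column_percentages(table):
    """Express every cell as a percentage of its column total."""
    col_totals = [sum(col) for col in zip(*table.values())]
    return {
        label: [100.0 * n / total for n, total in zip(row, col_totals)]
        for label, row in table.items()
    }


def direction(percentages):
    """Classify a row's left-to-right pattern of percentages."""
    pairs = list(zip(percentages, percentages[1:]))
    if all(a < b for a, b in pairs):
        return "positive"
    if all(a > b for a, b in pairs):
        return "negative"
    return "mixed"


pct = column_percentages(counts)
for label, row in pct.items():
    cells = "  ".join(f"{p:5.1f}%" for p in row)
    print(f"{label:<13} {cells}  -> {direction(row)} association")
```

Each column of percentages sums to 100, and each row is read on its own from left to right, as the slide describes.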
  • 7. Three Useful Questions: Does the cross-tabulation show a valid or spurious relationship? How many independent variables should be used in the cross-tabulation? Are the differences seen in the cross-tabulation statistically significant, or could they have occurred by chance due to sampling variation?
  • 8. Cross-Tabulation Shows a Valid Explanation: The relationship is valid when changes in the independent variables are believed to cause changes, or to explain variation, in the dependent variable.
  • 9. Cross-Tabulation Shows a Spurious Explanation: The relationship is spurious if the implied relationship between the dependent and independent variables does not seem to be logical.
  • 10. How many independent variables should be used? Three or four at most.
  • 11. Are the differences statistically significant? If a cross-tabulation reveals an interesting relationship between a dependent and an independent variable, a Chi-square analysis can be used to test whether the observed differences are statistically significant. The expected count in the ith cell is $E_i = \dfrac{R_i \times C_i}{N}$, where $R_i$ is the total number of observations in the row in which cell i is located, $C_i$ is the total number of observations in the column in which cell i is located, and $N$ is the total number of observations in all cells.
  • 12. Creating Expected Count: For the first cell, 100 × 54 / 200 = 27. Expected counts, first row: 27, 26, 23, 24. Expected counts, second row: 27, 26, 23, 24.
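A minimal sketch of the expected-count formula and the Chi-square statistic follows. The observed counts are hypothetical: the transcript does not reproduce the observed table, so the values below were simply chosen to have the same margins as slide 12 (two rows of 100, column totals of 54, 52, 46 and 48). The function names are likewise illustrative.

```python
# Hypothetical observed counts with the margins implied by slide 12:
# row totals 100 and 100, column totals 54, 52, 46 and 48.
observed = [
    [35, 30, 20, 15],
    [19, 22, 26, 33],
]


def expected_counts(table):
    """Expected count per cell = row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


expected = expected_counts(observed)
statistic = chi_square(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# Standard 5% critical value of the Chi-square distribution for 3 degrees of freedom.
CRITICAL_VALUE_5_PERCENT = 7.815

print("expected counts:", expected)
print(f"chi-square = {statistic:.2f} with {degrees_of_freedom} degrees of freedom")
print("significant at the 5% level:", statistic > CRITICAL_VALUE_5_PERCENT)
```

When the statistic exceeds the critical value for the table's degrees of freedom, the differences are unlikely to be due to sampling variation alone.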
  • 13. Introductory Comments on Correlation and Regression Analysis: Correlation and regression analysis can be used in situations where both the dependent and independent variables are of the continuous type, for example, to examine the relationship between milk consumption and both age and income. Compared with cross-tabulation, these methods give more accurate representations of the relationships between variables, and their results are arrived at more objectively.
  • 14. Correlation: Regression and correlation can be applied to data that meet three conditions: the variables are continuous; more than one variable is measured for each respondent; and the number of respondents is greater than the number of variables.
  • 15. Correlation Analysis: A positive relationship is said to exist between two variables when larger values of one variable are associated with larger values of the other variable. Thus, wine consumption and income are said to be positively related if higher levels of wine consumption tend to be associated with higher income. Wine consumption and income are said to be negatively related if lower levels of wine consumption tend to be associated with higher income.
  • 16. Correlation Coefficient r: A measure of the relationship between two variables is the correlation coefficient r. Perfect positive correlation is indicated by r = +1.00, perfect negative correlation by r = -1.00, and no relationship by r = 0.
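  • 17. r: $r = \dfrac{\sum (Y_i - \bar{Y})(X_i - \bar{X})}{\sqrt{\sum (Y_i - \bar{Y})^2 \, \sum (X_i - \bar{X})^2}}$, where $\bar{Y}$ = average wine consumption, $\bar{X}$ = average income, $Y_i$ = wine consumption of individual households, and $X_i$ = income of individual households.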
  • 18. Household Wine Consumption and Income : City A & B
  • 19. Scatter Diagram of Wine Consumption and Income for City A (horizontal axis: Annual Income, $000)
  • 20. Scatter Diagram of Wine Consumption and Income for City B (horizontal axis: Annual Income, $000)
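  • 22. Applying the correlation coefficient formula for City A & B: $r = \dfrac{\sum (Y_i - \bar{Y})(X_i - \bar{X})}{\sqrt{\sum (Y_i - \bar{Y})^2 \, \sum (X_i - \bar{X})^2}}$. For City A: $r = \dfrac{8}{\sqrt{(10)(10)}} = \dfrac{8}{10} = +0.80$. For City B: $r = \dfrac{10}{\sqrt{(10)(10)}} = \dfrac{10}{10} = +1.00$ (perfect positive correlation).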
  • 23. Interpreting r: If r is 0.8 or larger (disregarding sign), there is a very strong or high relationship between the variables. If r is between 0.4 and 0.8 (disregarding sign), the relationship between the variables is considered moderate to high. For lower values of r, the relationship is small to insignificant: when a correlation analysis results in an r of less than 0.4, researchers do not have strong evidence that there is a relationship between the dependent and independent variables. The + or – sign on the correlation coefficient indicates whether the relationship is positive or negative. For example, if r = -0.8 or -1.00, wine consumption is lower among households with higher income. Correlation analysis, like cross-tabulation, attempts to identify patterns of variation common to a dependent variable and an independent variable.
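The correlation formula can be checked with a short Python sketch. The household table for City A and City B is not reproduced in this transcript, so the five households below are hypothetical values chosen only so that the sums match the worked figures (a cross-product sum of 8 for City A and 10 for City B, with both sums of squares equal to 10). The function name `correlation` is an assumption.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient r computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical households: annual income ($000) and wine consumption.
income = [12, 13, 14, 15, 16]
wine_city_a = [2, 1, 3, 5, 4]   # rises with income, but not perfectly
wine_city_b = [1, 2, 3, 4, 5]   # rises in lock-step with income

r_city_a = correlation(income, wine_city_a)
r_city_b = correlation(income, wine_city_b)
print(f"City A: r = {r_city_a:+.2f}")
print(f"City B: r = {r_city_b:+.2f}")
```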
  • 24. Regression Analysis: The correlation coefficient is a summary measure which indicates the relative strength and the + or – direction of a relationship between two variables, but it does not describe the underlying relationship. It cannot be used to predict the size of the change to expect in the dependent variable if the independent variable is changed by one unit, i.e., how large a change in wine consumption would occur with a given increase in income. Hence, an equation is needed.
  • 25. Regression Analysis: Regression analysis is a technique whereby a mathematical equation is “fitted” to a set of data. Four elements are involved: the set of data; the mathematical equation; the technique which fits the equation to the data; and how the equation is evaluated to see how well it “describes” the data.
  • 27. Data Set: The data set must consist of measures of two or more continuous variables, and the sample size must be at least two or three times as large as the number of measured variables, preferably much larger.
  • 28. Mathematical Equation: In most applications of regression analysis, the equation which is used is that of a straight line. The general form of the equation of a straight line is Y = a + bX, where: Y = the dependent variable; X = the independent variable; b = a coefficient which indicates the effect on Y of a one unit change in X (that is, if b = +8.2, a one unit increase in X will result in an 8.2 unit increase in Y); a = the coefficient which identifies the value of Y when the value of X is equal to 0 (that is, a is the value of Y at which the straight line intersects the Y axis).
  • 29. Linear Regression Equation: A simple linear regression equation can be applied to the wine consumption data available from City C: Y = a + bX, where: Y = the predicted values of the dependent variable, that is, monthly household wine consumption in quarts; X = the observed values of the independent variable, i.e., annual household income in thousands of dollars; b = a coefficient which indicates by how much household wine consumption in quarts is expected to increase with a $1000 increase in annual household income; a = the point at which the straight line intersects the Y axis.
  • 30. Pictorial Presentation of Regression Analysis: [Figure: scatter diagram with a fitted regression line. Vertical axis: household wine consumption (Y). Horizontal axis: household annual income (X). The location of each dot shows one household’s wine consumption and its annual income.]
  • 31. Fitting the equation to the observed data: Plot all wine consumption and annual income data from a very large sample on a scatter diagram. Envision fitting a line through the points in the manner which results in “the best possible fit”. The fitted regression line can be viewed as a “predictor line” in the sense that it “predicts” household wine consumption for each different value of annual household income. For each and every observed value of annual household income $X_i$, the regression line provides a predicted value of wine consumption $Y_i^*$. The difference between the wine consumption reported by household i ($Y_i$) and its predicted wine consumption ($Y_i^*$) is $(Y_i - Y_i^*)$, and this difference is called a residual. The procedure commonly used to calculate the regression line which “best fits” a particular set of data is called the “least squares method”. This procedure identifies the one equation which, when fitted to the observed data, minimizes the sum of the squares of all residuals. That is, the procedure minimizes $\sum (Y_i - Y_i^*)^2$ over all i.
  • 32. Calculating a and b values: $\bar{X} = \sum X_i / n = 70/5 = 14$ and $\bar{Y} = \sum Y_i / n = 15/5 = 3$.
  • 33. Calculating a and b: $b = \dfrac{n \sum X_i Y_i - (\sum X_i)(\sum Y_i)}{n \sum X_i^2 - (\sum X_i)^2} = \dfrac{5(226) - (70)(15)}{5(1020) - (70)^2} = \dfrac{1130 - 1050}{5100 - 4900} = \dfrac{80}{200} = +0.40$
  • 34. Substituting: $a = \bar{Y} - b\bar{X} = 3 - (0.40)(14) = 3 - 5.6 = -2.60$. Completing the regression equation: $Y = -2.60 + 0.40X$.
  • 35. a, b and slope: The regression line intersects the Y axis at -2.60, that is, at the value of the a coefficient. The b coefficient indicates that for each $1000 increase in annual household income, monthly wine consumption is predicted to increase by 0.40 quarts. This corresponds to a regression line with a slope of +0.40, which rises to the right.
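The least squares calculation of a and b can be sketched in Python as follows. The City C household table is not reproduced in this transcript, so the five (income, wine) pairs below are illustrative values chosen so that their sums match the worked figures (ΣX = 70, ΣY = 15, ΣXY = 226, ΣX² = 1020); the function names `fit_line` and `predict` are assumptions.

```python
def fit_line(xs, ys):
    """Least squares intercept a and slope b for Y = a + bX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n   # a = mean(Y) - b * mean(X)
    return a, b


def predict(a, b, x):
    """Predicted value on the fitted line for a given X."""
    return a + b * x


# Illustrative data: annual income ($000) and monthly wine consumption (quarts).
income = [10, 12, 14, 16, 18]
wine = [2, 1, 3, 5, 4]

a, b = fit_line(income, wine)
print(f"Y = {a:.2f} + {b:.2f}X")
for x in (10, 15, 20):   # hypothetical incomes, in $000
    print(f"income {x}: predicted consumption {predict(a, b, x):.1f} quarts")
```

The slope b is the predicted change in monthly consumption for each additional $1000 of income, which is what separates a regression result from a bare correlation coefficient.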
  • 36. Advantages of Regression: The result demonstrates the advantage of regression analysis over correlation analysis. With r = +0.80, the correlation analysis only identifies the presence of a strong positive relationship between wine consumption and income. The regression analysis leads to a more complete description of the relationship; for example, predicted monthly wine consumption for selected annual incomes is shown in the table above. Because the b coefficient indicates by how much the dependent variable will change for a given change in the independent variable, a regression equation is a type of descriptive relationship which can help researchers arrive at a better understanding of variation in the dependent variable.
  • 37. Evaluating the Regression Equation: All regression procedures also calculate a measure called the “coefficient of determination”, which is identified as R². This coefficient ranges from 0 to a maximum value of 1.00. An R² value of 1.00 indicates that the regression equation “explains” 100 percent of the variance in the dependent variable about its mean. This variance would be explained perfectly if every dot in the scatter diagram fell precisely on the regression line, that is, if all of the residuals were equal to 0. When the regression equation does not fit the data perfectly, some of the residuals will differ from 0. Those residuals form a distribution around the regression line, and this distribution can be used as a measure of how much variance is “unexplained” by the regression equation.
  • 38. Regression Equation: $R^2 = \dfrac{(\text{total variance in the dependent variable}) - (\text{variance unexplained by the regression equation})}{\text{total variance in the dependent variable}} = \dfrac{\sum (Y_i - \bar{Y})^2 - \sum (Y_i - Y_i^*)^2}{\sum (Y_i - \bar{Y})^2}$. If the regression line explains all the variance in Y, all the residuals will be 0, the variance “unexplained” by the regression equation will be 0, and the coefficient of determination will be $R^2 = \dfrac{(\text{total variance in the dependent variable}) - 0}{\text{total variance in the dependent variable}} = 1.00$. R² values in the 0.50-1.00 range are usually interpreted to mean that the regression equation does a good job of explaining the Y variation.
  • 39. Calculating R² for the Wine Consumption Regression Equation for City C: $\bar{Y} = \sum Y_i / n = 15/5 = 3$ and $R^2 = (10 - 3.6)/10 = 0.64$.
  • 40. Interpretation of R² Values in Explaining Variance: The equation $Y_i^* = -2.60 + 0.40X_i$ is capable of explaining about 64 percent of the total variance observed in the dependent variable, monthly household wine consumption. Put another way, 36 percent of the total variance in household wine consumption is “unexplained” by the regression equation.
  • 41. R² Explained: If the regression line does not explain any of the variance in Y, all the residuals will be large, and the variance “unexplained” by the regression equation will be approximately equal to the total Y variance. The two terms in the numerator of the R² formula will then be equal, the numerator will be 0 and R² will be 0. An R² value approximating 0 indicates that the regression equation does not explain any of the variance observed in Y. In general, R² values of 0.25 or less indicate that the regression equation is of little use in explaining variance. Regression equations with R² values in the 0.25-0.50 range are typically judged to be of only moderate use in explaining the variance observed in a dependent variable.
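The R² calculation can be reproduced with a short sketch. The five households are illustrative values (the City C table is not in this transcript) chosen so that the total variation is 10 and the unexplained variation is 3.6, as in the worked figures; `r_squared` is a hypothetical helper name.

```python
def r_squared(xs, ys, a, b):
    """Coefficient of determination for the fitted line Y* = a + bX."""
    y_bar = sum(ys) / len(ys)
    total = sum((y - y_bar) ** 2 for y in ys)
    # Sum of squared residuals: observed value minus value predicted by the line.
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return (total - unexplained) / total


# Illustrative data: annual income ($000) and monthly wine consumption (quarts).
income = [10, 12, 14, 16, 18]
wine = [2, 1, 3, 5, 4]

r2 = r_squared(income, wine, a=-2.60, b=0.40)
print(f"R^2 = {r2:.2f}")
print(f"unexplained share = {1 - r2:.2f}")
```

For a simple linear regression R² equals the square of r, which is why r = +0.80 and R² = 0.64 appear together in this example.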
  • 42. Multiple Linear Regression: Simple linear regression uses one independent variable. When two or more independent variables are used in a linear regression analysis, it is called multiple linear regression: Y = a + bX1 + cX2 + dX3 + …, where Y is the dependent variable and X1, X2, X3, … are independent variables. The additional coefficients (c, d) are similar to the b coefficient, except that they are associated with independent variables X2 and X3.
  • 43. Calculations needed for the Multiple Linear Regression of Wine Consumption in City C: $\bar{Y} = 15/5 = 3$, $\bar{X}_1 = 70/5 = 14$, $\bar{X}_2 = 150/5 = 30$.
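  • 45. Calculating b: Using deviations from the means, $y = Y - \bar{Y}$, $x_1 = X_1 - \bar{X}_1$ and $x_2 = X_2 - \bar{X}_2$: $b = \dfrac{(\sum y x_1)(\sum x_2^2) - (\sum y x_2)(\sum x_1 x_2)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2} = \dfrac{(16)(74) - (25)(46)}{(40)(74) - (46)^2} = \dfrac{1184 - 1150}{2960 - 2116} = \dfrac{34}{844} \approx 0.0402$
  • 46. Calculating c: $c = \dfrac{(\sum y x_2)(\sum x_1^2) - (\sum y x_1)(\sum x_1 x_2)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2} = \dfrac{(25)(40) - (16)(46)}{(40)(74) - (46)^2} = \dfrac{1000 - 736}{2960 - 2116} = \dfrac{264}{844} \approx 0.312$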
  • 47. Calculating a: $a = \bar{Y} - b\bar{X}_1 - c\bar{X}_2 = 3.0 - (0.0402)(14) - (0.312)(30) = -6.923$
  • 48. The Regression Equation: Y* = -6.923 + 0.0402X1 + 0.312X2. From this equation, researchers see that a greater increase in wine consumption is associated with a one-year increase in age (0.312 quarts) than with a $1000 increase in annual income (0.0402 quarts). As in simple linear regression, the a coefficient (-6.923) should be interpreted as a structural coefficient.
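The two-predictor formulas for b, c and a can be sketched in Python using deviations from the means. The income, age and wine values are illustrative: the City C table is not reproduced in this transcript, so they were chosen to reproduce the deviation sums in the worked figures (Σyx1 = 16, Σyx2 = 25, Σx1² = 40, Σx2² = 74, Σx1x2 = 46). The function name `fit_two_predictors` is an assumption.

```python
def fit_two_predictors(x1s, x2s, ys):
    """Least squares a, b, c for Y = a + b*X1 + c*X2 via deviation sums."""
    n = len(ys)
    x1_bar, x2_bar, y_bar = sum(x1s) / n, sum(x2s) / n, sum(ys) / n
    # Deviations of each observation from its variable's mean.
    d1 = [x - x1_bar for x in x1s]
    d2 = [x - x2_bar for x in x2s]
    dy = [y - y_bar for y in ys]
    s_y1 = sum(y * x for y, x in zip(dy, d1))
    s_y2 = sum(y * x for y, x in zip(dy, d2))
    s_11 = sum(x * x for x in d1)
    s_22 = sum(x * x for x in d2)
    s_12 = sum(p * q for p, q in zip(d1, d2))
    denominator = s_11 * s_22 - s_12 ** 2
    b = (s_y1 * s_22 - s_y2 * s_12) / denominator
    c = (s_y2 * s_11 - s_y1 * s_12) / denominator
    a = y_bar - b * x1_bar - c * x2_bar
    return a, b, c


# Illustrative data: income ($000), age (years), monthly wine consumption (quarts).
income = [10, 12, 14, 16, 18]
age = [25, 26, 32, 35, 32]
wine = [2, 1, 3, 5, 4]

a, b, c = fit_two_predictors(income, age, wine)
print(f"Y* = {a:.3f} + {b:.4f}*X1 + {c:.3f}*X2")
```

Carrying full precision gives an intercept of about -6.948 rather than -6.923; the difference arises only because the worked figures round b and c before substituting them into the formula for a.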
  • 49. Stepwise Multiple Linear Regression: Suppose there are five independent variables (X1, X2, X3, X4, X5). Stepwise multiple regression first evaluates each independent variable separately to determine which one results in the largest R², that is, which one explains most of the variation in the dependent variable. If it is X3, that independent variable is selected for the regression equation.
  • 50. Stepwise Multiple Linear Regression: Next, the stepwise regression evaluates X3 in combination with each of the remaining independent variables (one at a time) to determine which of them results in the largest increase in R². If it is X5, that variable becomes the second independent variable selected for the regression equation. Likewise, the remaining variables are evaluated one at a time together with the two already selected to determine which one is the third variable. This continues until R² can no longer be increased significantly by adding another independent variable to the regression equation. The independent variables which are not selected do not become part of the regression equation.
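The selection loop on slides 49 and 50 can be sketched as a forward stepwise procedure. Several things here are assumptions rather than part of the lesson: the data are synthetic (generated so that only X3 and X5 actually drive Y), the helper names are hypothetical, and the stopping rule is a fixed minimum gain in R² (`min_gain`) standing in for the significance test that statistical packages normally apply.

```python
import random


def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def r_squared(columns, ys):
    """R^2 of a least squares fit of ys on the given columns plus an intercept."""
    n = len(ys)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Normal equations: (X'X) coefs = X'y
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y for row, y in zip(design, ys)) for p in range(k)]
    coefs = solve(xtx, xty)
    y_bar = sum(ys) / n
    total = sum((y - y_bar) ** 2 for y in ys)
    unexplained = sum(
        (y - sum(c * v for c, v in zip(coefs, row))) ** 2
        for row, y in zip(design, ys)
    )
    return (total - unexplained) / total


def forward_stepwise(predictors, ys, min_gain=0.01):
    """Add one variable at a time, always the one giving the largest R^2."""
    selected, best_r2 = [], 0.0
    remaining = list(range(len(predictors)))
    while remaining:
        scores = {
            j: r_squared([predictors[i] for i in selected + [j]], ys)
            for j in remaining
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] - best_r2 < min_gain:
            break   # no remaining variable improves R^2 enough
        selected.append(candidate)
        remaining.remove(candidate)
        best_r2 = scores[candidate]
    return selected, best_r2


# Synthetic data: five candidate variables, of which only X3 and X5 affect Y.
rng = random.Random(0)
n = 200
predictors = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(5)]
ys = [3 * predictors[2][i] + predictors[4][i] + rng.gauss(0, 0.5) for i in range(n)]

selected, r2 = forward_stepwise(predictors, ys)
print("selected:", [f"X{j + 1}" for j in selected], "R^2 =", round(r2, 3))
```

Variables that never clear the threshold stay out of the equation, mirroring the last sentence of slide 50.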
  • 51. Problems in Using Regression Analysis: (1) Inadequate sample size: the sample should be at least 2-3 times the number of variables. (2) The independent variables measured do not have a direct effect on the dependent variable. (3) The independent variables are highly correlated with one another, so their effect is the same as that of a single variable which has been used twice in the equation. (4) The relationship between the dependent and independent variables is not linear, i.e., it has an unusual shape, so it cannot be analyzed adequately by linear regression techniques.