A brief introduction on how to conduct growth curve statistical analyses using SPSS software, including some sample syntax. Originally presented at IWK Statistics Seminar Series at the IWK Health Center, Halifax, NS, May 1, 2013.
2. When to use a growth curve
Growth curves measure patterns of change over time
Specifically, mean-level changes over time
Patterns can be linear, quadratic, cubic, etc.
        Time 1   Time 2   Time 3
John      10        7        5
Mary       8        5        4
Zoe        7        9        9
Sarah      5        2        1
Bill       2        4        3
MEAN     6.4      5.4      4.4
Mean-Level Change: the row of means (6.4, 5.4, 4.4) declines by 1.0 at each time point, a linear mean-level change
3. Limitations of RM-ANOVA
Requires a balanced design (i.e., no missing data)
Requires equal spacing between time points
Requires independence of observations (not often possible in longitudinal data)
Requires homogeneity of variance
4. Growth Curves overcome these limitations
Accounts for missing data using a full information maximum likelihood (FIML) approach
Does not require equal spacing between time points (can specify unequal time points, e.g., 1, 2, 5, 7, 10)
Does not require independence of observations (can model different types of correlated error structures)
Is robust to violations of the homogeneity of variance assumptions required by RM-ANOVA
5. So… what are growth curves?
Growth curves are a type of mixed (or multilevel) model
Simply put, multilevel models are a way of dealing with clustered data
For example…
7. Growth Curves are Multilevel Models
All multilevel models (MLMs) partition variance into their appropriate levels
E.g., students nested within schools
Multilevel models also use maximum likelihood estimation, which handles missing data better and is more flexible when dealing with real data
Growth curves are a specific type of MLM where:
The lowest level of observation is repeated measures
The predictor variable is TIME
8. Application to a clinical context
The RCT is a common design
Growth curves can be used instead of ANOVA
The time*interv interaction is most important
(Leiter et al., 2012)
9. How do you do this in SPSS?
First, you need to convert your data from “WIDE” format to “LONG” format
Wide Format
10. Long Format
(Use the syntax provided in the handout to get this):
Long Format
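The restructuring itself is a single VARSTOCASES command (the full version is in the appendix); a minimal sketch, assuming the three repeated measurements are stored as ASItotal.0, ASItotal.1, and ASItotal.2:
*Stack the three ASItotal columns into one long-format variable indexed by time.
VARSTOCASES
/MAKE ASItotal FROM ASItotal.0 ASItotal.1 ASItotal.2
/INDEX=time(3)
/KEEP=id interv
/NULL=KEEP.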
11. Coding the Time Variable is Important
The choices you make for your time variable will influence your analyses!
If relationships are linear, the codes need to be equidistant: 1, 2, 3 OR -1, 0, 1, etc.
If you are expecting a quadratic relationship, you also need to calculate time-squared: 1, 4, 9 OR 1, 0, 1
Unequal time points (e.g., 1 month, 3 months, 12 months) can be coded to match: 1, 3, 12
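For the unequal-spacing case, one option is to recode the evenly numbered index into the actual assessment occasions; a sketch, assuming assessments at 1, 3, and 12 months (the variable name timemonths is made up for illustration):
*Recode the 1/2/3 index into the real months of assessment.
RECODE time (1=1) (2=3) (3=12) INTO timemonths.
EXECUTE.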
12. Decision 1: ML vs REML
Maximum Likelihood Estimation (ML)
vs
Restricted Maximum Likelihood Estimation (REML)
REML is generally preferred because it provides less biased estimates of the variance components
ML would be preferred if you need to compare nested models that differ in their fixed effects, as REML is not adequate for this
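In the MIXED syntax shown later (slide 19), this choice is just the /METHOD subcommand; a sketch of refitting the same model with ML when you need to compare nested models:
*Same growth curve as the REML version, refit with ML for nested-model comparisons.
MIXED ASItotal WITH time interv
/METHOD = ML
/FIXED = time interv time*interv | SSTYPE(3)
/RANDOM = INTERCEPT time interv | SUBJECT(id) COVTYPE(UN)
/PRINT = SOLUTION TESTCOV HISTORY.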
13. Decision 2: Fixed vs Random
Random vs. Fixed Slopes & Intercepts
Random (varying): Allow them to vary across people
Fixed (constant): Force them to be equal across people
Random vs. Fixed has no single, agreed-upon definition (Gelman, 2005); I’m presenting a practical conceptualization
Fixed (constant) intercepts and slopes are more parsimonious and less computationally intensive, but may not fit the data as well. Select the most parsimonious model that fits the data best.
14. Random (varying) Intercepts
Random (varying) Slopes
http://www.spss.ch/upload/1126184451_Linear%20Mixed%20Effects%20Modeling%20in%20SPSS.pdf
15. Random (varying) Intercepts
Fixed (constant) Slopes
http://www.spss.ch/upload/1126184451_Linear%20Mixed%20Effects%20Modeling%20in%20SPSS.pdf
16. Fixed (constant) Intercepts
Random (varying) Slopes
http://www.spss.ch/upload/1126184451_Linear%20Mixed%20Effects%20Modeling%20in%20SPSS.pdf
17. Decision 3: Linear, Quadratic, or Cubic?
If slopes are allowed to be random (varying), then you need at least:
3 time points for linear
4 time points for quadratic (add time*time as a predictor)
5 time points for cubic (add time*time and time*time*time as predictors)
One less time point is needed if using fixed slopes
Today, I’m focusing on LINEAR relationships
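Creating those polynomial terms takes one COMPUTE statement each; quadtime is the name used in the appendix, and cubtime is an illustrative name for the cubic term:
*Polynomial codes of the time variable for quadratic and cubic curves.
COMPUTE quadtime = time*time.
COMPUTE cubtime = time*time*time.
EXECUTE.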
18. Decision 4: Covariance Structure
Is there a predictable pattern to the errors?
If you are unsure, specify an “unstructured” matrix
Less parsimony because it lets things freely vary
AR(1) correlated error structure is also fairly common
Autoregressive correlated errors, getting smaller as time points get more distant
You can test multiple models with different plausible structures, and choose the one that fits the data best
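As a sketch of the AR(1) option: in SPSS MIXED, autocorrelated residuals are requested with the REPEATED subcommand. This variant assumes the random-intercept model from the appendix (slide 28); it presumes your time variable indexes the measurement occasions within each id:
*Random intercept with AR(1) autocorrelated residuals across time points.
MIXED ASItotal WITH time interv
/METHOD = REML
/FIXED = time interv time*interv | SSTYPE(3)
/RANDOM = INTERCEPT | SUBJECT(id) COVTYPE(UN)
/REPEATED = time | SUBJECT(id) COVTYPE(AR1)
/PRINT = SOLUTION TESTCOV HISTORY.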
19. Annotated Syntax
MIXED ASItotal WITH time interv
/METHOD = REML
/FIXED = time interv time*interv | SSTYPE(3)
/RANDOM = INTERCEPT time interv | SUBJECT(id) COVTYPE(UN)
/PRINT = SOLUTION TESTCOV HISTORY.
*Mixed model: the dependent variable is predicted by time and intervention.
*Restricted Maximum Likelihood Estimation (usually better than ML).
*Put all predictors after FIXED. Indicate interactions by Var1*Var2.
*The intercept and the slopes for time and interv are random. The slope for the interaction is fixed because I omitted it from this part.
*"UN" specifies an unstructured covariance matrix (other types are possible, but require thought).
20. Annotated Output: Model Comparison
Use the BIC values to compare nested models (e.g., random slopes vs fixed slopes)
Lower BIC values indicate better fit; a difference of more than about 4 (∆BIC > 4) suggests a meaningful improvement
21. Annotated Output: Covariance Parameters
UN(1,1) = Variance of the Intercept. Significant, so random intercepts are important to include.
UN(2,2) = Variance of the slope for time. Non-significant, which suggests that a more parsimonious model with fixed slopes for time would fit the data better.
22. Annotated Output
Interpret like ANOVA; parameters adjusted for clustering
Time -> Main effect for time (linear, in this case)
Interv -> Main effect for intervention
Time * interv -> 2-way Interaction
Graphing the interaction is usually important for understanding it
Dummy coding (0, 1) the intervention variable helps a LOT
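A sketch of that dummy coding, assuming interv is currently coded 1 = control and 2 = intervention (your original coding may differ, so adjust the RECODE accordingly):
*Create a 0/1 dummy: 0 = control, 1 = intervention (assumed original coding 1/2).
RECODE interv (1=0) (2=1) INTO interv01.
EXECUTE.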
23. Graphing the interaction
Can graph the interaction using tools meant for moderation in linear regression with this kind of model
The parameters in the output are interpreted the same way; they’re just adjusted so that you’re accounting for the clustering due to repeated measurement and missing data
http://www.jeremydawson.co.uk/slopes.htm
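Equivalently, you can plot model-implied means yourself from the Estimates of Fixed Effects table; a sketch with placeholder coefficients (the numbers below are made up and should be replaced with the estimates from your own output):
*Predicted values from the fixed effects: b0 + b1*time + b2*interv + b3*time*interv.
COMPUTE pred = 6.4 + (-1.0)*time + (0.5)*interv + (-0.8)*time*interv.
EXECUTE.
*Plot the predicted trajectories for each group.
GRAPH
/LINE(MULTIPLE)=MEAN(pred) BY time BY interv.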
24. A few closing points
Other software can implement this (e.g., SAS, Mplus, HLM)
Non-normal data may be better modeled with different distributional assumptions (e.g., Poisson)
Modeling of covariance structures may be important, but can be challenging to figure out
Some programs (e.g., Mplus) may use a latent variable approach
25. Questions? Comments?
Thank you!
P.S. The handout I provided contains some syntax and instructions that may be helpful!
Email me if you want an electronic copy of the presentation: mackinnon.sean@dal.ca
26. Appendix: Syntax
*Convert data from LONG to WIDE format.
SORT CASES BY id time.
CASESTOVARS
/ID=id
/INDEX=time
/GROUPBY=VARIABLE.
*Convert data from WIDE to LONG format.
VARSTOCASES
/MAKE ASItotal FROM ASItotal.0 ASItotal.1 ASItotal.2
/INDEX=time(3)
/KEEP=id interv
/NULL=KEEP.
27. Appendix: Syntax
*Linear Growth Curve with Intervention Group as Moderator (Random Intercept, Random Slopes).
MIXED ASItotal WITH time interv
/METHOD = REML
/FIXED = time interv time*interv | SSTYPE(3)
/RANDOM = INTERCEPT time interv time*interv | SUBJECT(id) COVTYPE(UN)
/PRINT = SOLUTION TESTCOV HISTORY.
28. Appendix: Syntax
*Linear Growth Curve with Intervention Group as Moderator (Random Intercept, Fixed Slopes).
MIXED ASItotal WITH time interv
/METHOD = REML
/FIXED = time interv time*interv | SSTYPE(3)
/RANDOM = INTERCEPT | SUBJECT(id) COVTYPE(UN)
/PRINT = SOLUTION TESTCOV HISTORY.
29. Appendix: Syntax
*Linear Growth Curve with Intervention Group as Moderator (Fixed Intercept, Random Slopes).
MIXED ASItotal WITH time interv
/METHOD = REML
/FIXED = time interv time*interv | SSTYPE(3)
/RANDOM = time interv time*interv | SUBJECT(id) COVTYPE(UN)
/PRINT = SOLUTION TESTCOV HISTORY.
30. Appendix: Syntax
*Quadratic Growth Curve with Intervention Group as Moderator (Random Intercept, Fixed Slopes).
COMPUTE quadtime = time*time.
EXECUTE.
MIXED ASItotal WITH time quadtime interv
/METHOD = REML
/FIXED = time quadtime interv time*interv quadtime*interv | SSTYPE(3)
/RANDOM = INTERCEPT | SUBJECT(id) COVTYPE(UN)
/PRINT = SOLUTION TESTCOV HISTORY.