The document describes applying multiple linear regression and logistic regression to predict life expectancy from various predictor variables. The multiple linear regression model explained 68.9% of the variance in life expectancy, with only pollution (pm25) and universal health coverage (uhc) statistically significant. The logistic regression model correctly predicted the binary life-expectancy outcome for 79.7% of cases, again with only uhc and pm25 as significant predictors. Model diagnostics and evaluations indicated that both models satisfied the required assumptions and fit the data well.
Multiple Linear Regression Analysis
Objective: The objective of this analysis is to apply multiple linear regression to our life expectancy dataset to predict life expectancy from predictors such as population, pollution, and alcohol consumption, and to run diagnostic tests to check whether each of these predictors contributes significantly to the prediction of life expectancy. We also need to check that our model satisfies the assumptions of a multiple linear regression model, such as linearity and homoscedasticity.
Background on Data:
For the multiple linear regression analysis, several datasets were sourced from the public health and environment section of the 'who.int' website¹ and then pre-processed and merged in R into a single file (a sketch of this merge appears after the list below).
- Data has been merged by country.
- 'life_exp' is the dependent variable we are trying to predict using the independent variables defined in the data dictionary below.
- The data has 4 independent variables and 1 dependent variable.
- After merging and cleaning the data, we are left with a sample of 182 unique observations.
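The original R scripts are not included in this report, so the following is only a minimal sketch of how such a merge could look in base R; the file names and the 'country' column name are assumptions made for illustration.

# Read the individual WHO extracts (file names are assumed for illustration).
life <- read.csv("who_life_expectancy_60.csv")   # country, life_exp
alc  <- read.csv("who_alcohol_consumption.csv")  # country, alc_consumption
pm   <- read.csv("who_pm25.csv")                 # country, pm25
pop  <- read.csv("who_population.csv")           # country, population
uhc  <- read.csv("who_uhc_index.csv")            # country, uhc

# Merge everything on the common 'country' key and drop incomplete rows.
merged <- Reduce(function(x, y) merge(x, y, by = "country"),
                 list(life, alc, pm, pop, uhc))
merged <- na.omit(merged)

write.csv(merged, "life_expectancy_merged.csv", row.names = FALSE)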
Data dictionary:

Variable        | Measure | Type                 | Description                                                      | URL
life_exp        | scale   | Dependent variable   | Life expectancy at age 60 years (the predicted variable)        | http://apps.who.int/gho/data/node.main.SDG2016LEX?lang=en
alc_consumption | scale   | Independent variable | Alcohol consumption per capita                                   | http://apps.who.int/gho/data/node.main.SDG35?lang=en
pm25            | scale   | Independent variable | Concentration of fine particulate matter (PM2.5) in the country | http://apps.who.int/gho/data/node.main.SDG116?lang=en
population      | scale   | Independent variable | Population of the country, in thousands                          | http://apps.who.int/gho/data/node.main.SDGPOP?lang=en
uhc             | scale   | Independent variable | Universal health coverage index of the country                   | http://apps.who.int/gho/data/node.main.SDG38?lang=en
Below is a sample of the data:

Country             | alc_consumption | life_exp | pm25 | population | uhc
Afghanistan         | 0.2             | 16.3     | 59.9 | 34,656     | 34
Albania             | 7.5             | 20.8     | 18.2 | 2,926      | 58
Algeria             | 0.9             | 21.9     | 34.5 | 40,606     | 76
Angola              | 6.4             | 17.3     | 28.4 | 28,813     | 38
Antigua and Barbuda | 7               | 19.7     | 18   | 101        | 73
Argentina           | 9.8             | 21.8     | 11.7 | 43,847     | 76
Armenia             | 5.5             | 19.6     | 32.9 | 2,925      | 66
Australia           | 10.6            | 25.6     | 7.3  | 24,126     | 86
¹ http://apps.who.int/gho/data/node.main.1?lang=en
Assumptions of Multiple Linear Regression Analysis:
1. Linearity: In a multiple linear regression analysis, we need to check whether the dependent variable has a linear relationship with the independent variables. We can do this by looking at scatterplots of the DV against each IV. Graph 1.1 shows that our outcome variable (life_exp) has a strong linear relationship with the predictors (uhc, pm25 and alc_consumption). We can also validate this from the residuals vs. predicted values graph (Graph 1.2), which shows no evidence of a systematic pattern. Hence, we can treat our model as linear.
Graph 1.1
Graph 1.2
2. Homoscedasticity: Homoscedasticity means that the errors have constant variance. This can be checked by plotting the residuals against the fitted values: if the residuals look like noise, i.e. show no obvious pattern, we can say we have homoscedasticity. In Graph 1.2 there is no obvious pattern between the residuals and the fitted values, so we conclude that our model satisfies homoscedasticity; an R sketch of how these diagnostic plots can be produced follows this item.
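The graphs in this report come from SPSS, but the same checks can be reproduced with a few lines of R; the sketch below assumes the merged file produced earlier and the variable names from the data dictionary.

# Fit the full model and inspect linearity / homoscedasticity visually.
data  <- read.csv("life_expectancy_merged.csv")
model <- lm(life_exp ~ alc_consumption + pm25 + population + uhc, data = data)

# Scatterplots of the outcome against each predictor (linearity check).
pairs(data[, c("life_exp", "alc_consumption", "pm25", "population", "uhc")])

# Residuals vs. fitted values (linearity and constant-variance check).
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. fitted")
abline(h = 0, lty = 2)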
3. Autocorrelation between errors: We can check for autocorrelation, i.e. lack of independence of the error terms, using the Durbin-Watson statistic. A Durbin-Watson value close to 2 indicates independence of errors. In Table 3.1 the Durbin-Watson statistic for our model is 1.954, so we can assume there is no autocorrelation between the errors in our model.
Table 3.1
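The Durbin-Watson statistic in Table 3.1 comes from SPSS; as a hedged alternative, the car package in R provides the same test for a fitted lm object.

# Durbin-Watson test for autocorrelation of the residuals (values near 2 are good).
library(car)
durbinWatsonTest(model)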
4. Normally distributed errors: Another assumption of a linear model is that the residuals are normally distributed with a mean of 0. To check this, we can plot a histogram of the residuals or examine their normal probability plot: the histogram should look approximately normal, and the points of the probability plot should lie on the straight 45-degree line. Graphs 4.1 and 4.2 below show that this holds for our model, confirming the assumption.
Graph 4.1
Graph 4.2
5. Multicollinearity (absence): Multicollinearity arises when two or more independent variables are strongly correlated with each other. In a linear regression model there should be no multicollinearity among the independent variables.
A Pearson correlation matrix gives a first indication of multicollinearity: any pair of predictors with a correlation of |0.8| or above is a warning sign. Table 5.1 shows that none of the independent variables has r > |0.8| with any other independent variable, so we can assume they are not strongly correlated.
Another check is the variance inflation factor (VIF): a VIF greater than 10 for a predictor indicates that it is collinear with the other predictors. Table 5.2 shows that none of the predictor variables has a VIF > 10, so we conclude there is no multicollinearity in our model; a small R sketch of both checks follows Table 5.2.
Table 5.1
Table 5.2
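Both checks can also be reproduced in R (a sketch only; the vif() call assumes the car package, while the report reads the values from the SPSS output in Tables 5.1 and 5.2).

# Pearson correlations between the predictors (flag anything at or above |0.8|).
round(cor(data[, c("alc_consumption", "pm25", "population", "uhc")]), 2)

# Variance inflation factors for the fitted model (flag anything above 10).
library(car)
vif(model)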
6. Influential data points: A data point may be an outlier without affecting the regression line, and it may have leverage without, on its own, influencing the regression line. A point that is both an outlier and has high leverage, however, becomes an influential data point. We check for influential points using Cook's distance: a Cook's distance of 1 or greater marks a point as influential.
For our model, the residual statistics in Table 6.2 below show a maximum Cook's distance of 0.205, so there are no influential points in the data that need to be removed. I have also individually checked the Cook's distances for each of the independent variables and found no point of significant influence (Table 6.1).
Table 6.1
Table 6.2
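A minimal R equivalent of this check, assuming the model fitted earlier:

# Cook's distance for every observation; values of 1 or more would be influential.
cd <- cooks.distance(model)
max(cd)          # reported maximum is about 0.205
which(cd >= 1)   # indices of influential points (expected to be empty here)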
Model Evaluation and Selection
To evaluate our regression model, we look at the summary output of the model.
Table 7
Here we look at the R square value of the model, which is 0.689. R square measures the proportion of variance in the predicted variable that is explained by the model, so our model explains 68.9% of the variance in the predicted values. Adjusted R square is a modified version of R square that penalizes the model for introducing additional independent variables. Both R square and adjusted R square lie between 0 and 1, and the closer the value is to 1, the better the model is at predicting the actual values.
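These fit statistics can be read directly off the fitted model object in R (a sketch; the report takes them from the SPSS summary in Table 7).

# Overall fit statistics for the full four-predictor model.
s <- summary(model)
s$r.squared        # R square, about 0.689
s$adj.r.squared    # adjusted R square, about 0.682
s$fstatistic       # F statistic tested in the ANOVA table below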
ANOVA
The F statistic tells us whether our model predicts values better than simply using the mean. The significance value tests the null hypothesis that all the coefficients are equal to zero. As the significance value is < 0.001, we can reject the null hypothesis that all coefficients are zero.
Table 8
Evaluating independent variables:
A regression coefficient (β) gives the amount by which the dependent variable changes for a one-unit change in one independent variable, with all other independent variables held constant. From Table 9 we can read the unstandardized coefficients (B) of all the independent variables, and the y-intercept, from the column 'Unstandardized B'.
The table also shows that only 'pm25' and 'uhc' are statistically significant at the 95% confidence level (Sig. < 0.05). Based on this, we can remove 'alc_consumption' and 'population' from our regression equation.
Our regression equation for predicting life expectancy (Y) is therefore:
Y = 11.176 - 0.024*pm25 + 0.147*uhc
Table 9
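As a worked example of applying this equation, the sketch below wraps the reported coefficients in a small R function; the pm25 and uhc values passed to it are hypothetical.

# Predicted life expectancy at age 60 from the reduced model.
predict_life_exp <- function(pm25, uhc) {
  11.176 - 0.024 * pm25 + 0.147 * uhc
}
predict_life_exp(pm25 = 20, uhc = 70)   # 11.176 - 0.48 + 10.29 = 20.986 years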
Summary: A multiple regression model was applied to our dataset of 182 records to predict life expectancy (life_exp) from the independent variables alcohol consumption (alc_consumption), pollution (pm25), population, and universal health coverage (uhc).
A preliminary analysis checking the assumptions of multiple linear regression, such as multicollinearity and homoscedasticity, found all of the assumptions to be satisfied.
The model produced an adjusted R squared value of 0.682, and at the 95% confidence level only two of the variables, uhc and pm25, were found to be significant, with coefficients of 0.147 and -0.024 respectively.
Logistic Regression Analysis
Objective:
The objective of this analysis is to apply binary logistic regression to predict the binary outcome variable 'life_exp_binary' (full information on the data below), to check whether our model satisfies the assumptions of the method, and to perform diagnostics if it does not.
Based on the results obtained, we will further evaluate the model using methods such as the Hosmer-Lemeshow test and the classification table.
Background on data:
For the logistic regression analysis we reuse the same dataset, sourced from the 'who.int' website, that was used for the multiple linear regression. The outcome variable 'life_exp' has been converted in R into a binary variable 'life_exp_binary' based on the median of 'life_exp', as below (see the sketch after this list):
- life_exp_binary = 1 if life_exp > median(life_exp) (indicates a high life expectancy)
- life_exp_binary = 0 if life_exp <= median(life_exp) (indicates a low life expectancy)
All of the other independent variables are again used to predict the outcome variable 'life_exp_binary'.
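A minimal R sketch of this conversion and of fitting the corresponding binomial model (the report itself builds the model in SPSS, so this is for illustration only):

# Binary outcome: 1 = above-median life expectancy, 0 = at or below the median.
data$life_exp_binary <- as.integer(data$life_exp > median(data$life_exp))

# Binomial logistic regression with the same four predictors.
logit_model <- glm(life_exp_binary ~ alc_consumption + pm25 + population + uhc,
                   data = data, family = binomial)
summary(logit_model)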
Assumptions of Logistic Regression Analysis:
1. Sample size: Logistic regression assumes a sample of at least 60 cases and at least 20 cases per predictor variable. With 4 predictor variables our model needs a minimum sample size of 80, which is met, since our sample size is 182.
2. Multicollinearity: As we are working with the same dataset used in the multiple linear regression analysis, we can say there is no multicollinearity in the data, based on the earlier analysis.
3. Outliers: For the same reason, we can say there are no influential outliers in the data, based on the checks conducted for the multiple regression model.
Model Evaluation:
To evaluate our logistic regression model, we look at the following:
1. Block 0: Block 0 is the null (baseline) model against which our final model is compared; it contains no independent variables. Table 10 below shows that this null model has a classification accuracy of 52.2% when no predictor variables are used.
Table 10
2. Omnibus test (Block 1): Block 1 is the model containing all the independent variables. The omnibus test indicates whether the full model improves on the null model. Here p < 0.001 (Sig.), which means the full model is an improvement over the null model, i.e. adding the predictors enhances the model.
Table 11
3. Model Summary: From the model summary we can estimate that the model explains between 50.5% and 67.4% of the variance in the predicted variable, using the Cox & Snell and Nagelkerke R square statistics, which are analogous to the R square statistic used in linear regression.
Table 12
4. Hosmer-Lemeshow test: This is an indicator of the goodness of fit of the model. For the Hosmer-Lemeshow test, non-significance indicates a good fit, so for our model p (Sig.) should be greater than 0.05, as can be seen in the table below. With a Sig. value of 0.3, our model proves to be a good fit.
Table 13
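An equivalent test is available in R, for example through the ResourceSelection package (an assumption on my part; the report reads the result from the SPSS output in the table above).

# Hosmer-Lemeshow goodness-of-fit test on the fitted probabilities (10 groups).
library(ResourceSelection)
hoslem.test(logit_model$y, fitted(logit_model), g = 10)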
5. Classification Table: From the classification table we can check the accuracy, specificity, sensitivity, etc. of the model. Table 13 below shows that our full model (Block 1) has an improved accuracy (proportion of correctly predicted values) of 79.7%, a considerable improvement over the null model (Block 0) value of 52.2%.
Table 13
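The same classification accuracy can be recomputed in R from the fitted probabilities, using the standard 0.5 cut-off:

# Classification table and overall accuracy at the 0.5 probability cut-off.
pred <- as.integer(fitted(logit_model) >= 0.5)
table(Observed = data$life_exp_binary, Predicted = pred)
mean(pred == data$life_exp_binary)   # about 0.797, as reported in the text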
6. Interpretation of variables in the model: Table 14 shows the influence and importance of each variable in the logistic regression model. Here we use the Wald statistic, which plays a role similar to the t-statistic in linear regression, to check the significance of the independent variables: if Sig. < 0.05, a predictor is significant at the 95% confidence level.
From Table 14 we can see that uhc and pm25 are the only significant predictors, so we can drop the other two variables, population and alc_consumption, from our model.
The column Exp(B) in the table gives the odds ratio of each predictor. If the odds ratio is greater than 1, the odds of the outcome occurring increase as the value of the predictor increases; if the odds ratio is less than 1, the odds of the outcome occurring decrease as the predictor increases. For example, the odds of having a high life expectancy are multiplied by 1.207 for each one-unit increase in the country's universal health coverage (uhc).
Table 14
Based on the above, we can form the following equation for our model, where Y is the predicted probability of a high life expectancy:
Y = e^(-10.86 + 0.188*uhc - 0.046*pm25) / (1 + e^(-10.86 + 0.188*uhc - 0.046*pm25))
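The odds ratios in Table 14 are simply the exponentiated coefficients, and the equation above can be evaluated directly; the sketch below does both in R, with hypothetical uhc and pm25 values for the worked prediction.

# Odds ratios: exponentiate the logistic regression coefficients.
exp(coef(logit_model))

# Predicted probability of a high life expectancy from the reported equation.
p_high <- function(uhc, pm25) {
  eta <- -10.86 + 0.188 * uhc - 0.046 * pm25
  exp(eta) / (1 + exp(eta))
}
p_high(uhc = 70, pm25 = 20)   # about 0.80 for this hypothetical country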
Summary: By applying a binary logistic regression model to our dataset to predict the life_exp_binary variable, we were able to correctly classify 79.7% of the cases. We were also able to conclude, with 95% confidence, that only the uhc and pm25 variables were significant predictors of the outcome variable.