Data Analytics-Linear Regression Report
By Jonathan Chauwa
Best Crime Predictor
In this report, I use linear regression and the R programming language to find out whether
crime has a relationship with population size, nonwhite communities and density. For example,
does a larger population in a given area mean more crime? Is it true that nonwhite communities
here in the USA have a higher crime rate? What about density in a given area? I will use the
Freedman dataset, which is distributed with the car package (carData) and loaded through RStudio,
and try to find relationships that might or might not be significant, given the response variable,
which is crime, and the predictors, which are population, nonwhite communities and density. I will
then come up with a model that can reasonably describe any significant relationship among these
variables. The model, if properly formed, can then be used to predict crime rate based on one or
more of these predictors. I will take snapshots of important information throughout this report to
give a clearer explanation of the process.
Before I do anything, I will begin by exploring the Freedman dataset to get a feel for the data. I
will focus on how many observations are available and whether there are missing
values. I will also compute summary statistics to learn the range, mean, median and other
information that gives a high-level insight into the data.
Below is a quick summary of the Freedman dataset
From the snapshot above, I can conclude that the dataset has four variables, namely
population, nonwhite, density and crime, with 110 observations, and that there are some missing
values. Crime is the response and the other three are candidate predictors. I will remove the rows
with missing values using the na.omit() function in R before proceeding.
Calling na.omit(Freedman) returns a copy of the data frame with every incomplete row dropped.
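As a minimal sketch of this cleaning step, the R snippet below applies na.omit() to a tiny data frame. The data frame name `cities` and all of its values are invented for illustration; only the column names mirror the dataset described above.

```r
# Hypothetical miniature data frame; the values are made up for illustration
# and are NOT taken from the Freedman dataset.
cities <- data.frame(
  population = c(675, 713, NA, 534, 1261),
  nonwhite   = c(7.3, 2.6, 3.3, 0.8, 1.4),
  density    = c(746, 322, NA, 491, 1612),
  crime      = c(2602, 1622, 3140, 3208, 2236)
)

colSums(is.na(cities))       # count of missing values in each column
complete <- na.omit(cities)  # drop every row that has at least one NA

nrow(cities)                 # 5 rows before cleaning
nrow(complete)               # 4 rows after cleaning
summary(complete)            # min, quartiles, median and mean per column
```

Note that na.omit() removes whole rows, so a single missing value in one column also discards that row's values in the other columns.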
Now that we have taken care of missing values, we can run summary statistics on the dataset to
understand it better. The following table gives summary statistics for each variable, including the
mean, median and range.
The table above is largely self-explanatory. For example, population ranges from 270
to about 11551 with a mean of 1136 and a median of 664. A mean well above the median indicates
that the distribution of population is right-skewed. The same table can be used to read off the
measures of central tendency and other information about the remaining variables in the
dataset.
Plotting the data
The next step is to plot the dataset so that I can see the relationships between the different
variables, since my goal is to find relationships between crime and the rest of the variables in
the dataset. I started with a scatterplot matrix of all variables.
Pairwise Plots Snapshot
Visually, I can see some relationships between variables such as crime and population in
the plot above, but it is easier to judge what kind of relationship exists by plotting two
variables at a time. I will also draw two lines through each plot to increase clarity. The first is
the least-squares line, obtained by fitting a linear model and passing it to the abline function in
R. The second comes from the lowess function, which uses locally weighted polynomial
regression to trace a smooth curve through the points. Because the lowess curve is not forced to be
straight, it can follow local bends in the data that a single straight line cannot. For purposes of
clarity, the lowess curve will be in blue and the linear regression line will be in red in each
plot.
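The two overlays described above can be sketched in base R as follows. The data here is synthetic and stands in for the real dataset; the variable names and the curved relationship built into it are assumptions made only so the example runs on its own.

```r
set.seed(42)

# Synthetic stand-in data (not the Freedman values): crime rises quickly
# with population at first and then levels off.
population <- runif(80, 300, 6000)
crime <- 1500 + 900 * log(population / 300) + rnorm(80, sd = 300)

fit    <- lm(crime ~ population)      # ordinary least-squares straight line
smooth <- lowess(population, crime)   # locally weighted smoother; returns sorted x and smoothed y

plot(population, crime, xlab = "Population", ylab = "Crime rate")
abline(fit, col = "red", lwd = 2)     # straight regression line in red
lines(smooth, col = "blue", lwd = 2)  # lowess curve in blue
```

When the blue curve bends away from the red line, as it should with this synthetic data, that is a visual hint that a straight line may not describe the relationship well.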
Plot of Population and Crime
From the graph above, I think population and crime might have a relationship. The linear fit
represented by the red line suggests that crime increases with population, while the blue
lowess curve suggests that the relationship is curved rather than strictly linear.
In either case, there appears to be some relationship between population and crime.
Crime and Density Plot
From the graph above, it looks like crime and density might have some form of linear
relationship, but the results are not convincing, as the blue fitted line has a slope of almost
zero. Owing to the ambiguity in this plot, I will not conclude anything about the relationship
between crime and density until I conduct further tests.
Crime and Non-White Populations Plot
I could not fit a useful linear model on this plot because of how the data is structured, but the
lowess function did produce a curve. Visually, the data seems to have a quadratic relationship. I
could draw a curve through the points, but many points would lie far from that curve and remain
unexplained. It's too early to conclude, but this does not show a good linear relationship
between crime and nonwhite population.
Correlation Analysis
To continue exploring the relationships between the variables in the dataset, I will compute
the correlation between each variable and crime. Correlation analysis measures the strength of
the linear relationship between two variables.
Correlation
Most relationships in these results are not strong. For example, the correlation between crime
and density is close to 0, so there is little evidence of a linear relationship between them.
Crime and nonwhite, and crime and population, show moderate correlations, even though neither is
very strong. From my Pearson product-moment correlation analysis, population has the strongest
correlation with crime at 0.3957592. So far, population and crime do seem to have a
relationship, judging from both the pairwise plots earlier and the correlation analysis.
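A correlation table like the one discussed above can be produced with cor(), which computes Pearson correlations by default. This is a sketch on synthetic data: the data frame `cities` and the coefficients used to generate it are assumptions, so the numbers it prints will not match the report.

```r
set.seed(7)
n <- 100

# Synthetic stand-in data: crime depends on population and nonwhite,
# and not on density.
population <- runif(n, 300, 6000)
nonwhite   <- runif(n, 1, 40)
density    <- runif(n, 50, 3000)
crime      <- 2000 + 0.25 * population + 25 * nonwhite + rnorm(n, sd = 600)
cities     <- data.frame(population, nonwhite, density, crime)

r <- cor(cities)            # Pearson correlation matrix of all pairs
round(r[, "crime"], 3)      # correlation of each variable with crime

# cor.test() adds a significance test and confidence interval for one pair
cor.test(cities$crime, cities$population)
```

Reading down the crime column of the matrix gives the same kind of comparison made in the text: the variable with the largest absolute correlation is the strongest single linear predictor.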
Linear Regression
In the next section, I will regress crime onto all the predictors and assess which predictors
have significant P-values and how much of the variance in crime those predictors explain. This
will further help me to decide which variables are strong predictors of crime.
The first linear fit, which includes all the variables, shows that density is insignificant. Its
coefficient has a high P-value, so its negative estimated relationship with crime cannot be
distinguished from zero. Together with the earlier checks, which showed only a weak relationship
between density and crime, this supports removing density as a predictor. However, the
P-values for nonwhite and population are significant, as they are below the conventional 0.05
threshold.
Further, from the linear regression results, only 22.8 percent of the variation in crime is
explained by the predictors. Removing an irrelevant predictor such as density cannot raise
R-squared itself, but it can raise adjusted R-squared, which penalizes extra terms, and it gives a
simpler model. Our R-squared and adjusted R-squared values are generally low, but a low R-squared
does not always mean we should reject the model. R-squared measures only the share of variance
explained. It does not tell us whether the coefficient estimates are biased or whether the model
is adequate, which is why a model can have significant P-values alongside a low R-squared, as it
does here. We will have to assess the residual plots to get more information. In any case, we
expect the model to become simpler and easier to interpret as we remove the irrelevant predictors.
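The quantities discussed in this section (coefficient P-values, R-squared and adjusted R-squared) can all be read from summary() of an lm fit. The sketch below uses synthetic data with assumed names and coefficients, and also refits without density to show how the two R-squared measures respond when a predictor is dropped.

```r
set.seed(7)
n <- 100

# Synthetic stand-in data: density has no real effect on crime.
population <- runif(n, 300, 6000)
nonwhite   <- runif(n, 1, 40)
density    <- runif(n, 50, 3000)
crime      <- 2000 + 0.25 * population + 25 * nonwhite + rnorm(n, sd = 600)
cities     <- data.frame(population, nonwhite, density, crime)

fit_all <- lm(crime ~ population + nonwhite + density, data = cities)
s_all   <- summary(fit_all)

s_all$coefficients                      # estimate, std. error, t value, Pr(>|t|)
pvals <- s_all$coefficients[, "Pr(>|t|)"]
names(pvals)[pvals < 0.05]              # terms significant at the 0.05 level

# Refit without density and compare the fit statistics
fit_two <- update(fit_all, . ~ . - density)
s_two   <- summary(fit_two)

c(r2_all = s_all$r.squared,     r2_two = s_two$r.squared)      # R-squared never rises after a drop
c(adj_all = s_all$adj.r.squared, adj_two = s_two$adj.r.squared) # adjusted R-squared may rise
```

Comparing the adjusted values is usually the fairer way to judge whether dropping a term helped, since plain R-squared can only stay the same or fall.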
More diagnostic tests for significant relationships.
In the next step, I compute and state the 95% confidence interval for each coefficient and
remove the irrelevant predictors.
Confidence Intervals Snap Shot
From the results above, density is insignificant, as its confidence interval includes 0. Crime and
density have shown a weak relationship in every check so far: the pairwise scatter plots, the
correlation analysis, the regression model and finally the confidence intervals. So I am removing
density as a predictor and hope this makes my model better. I will keep nonwhite and population
and run the final model with those two as predictors, as they show a reasonable relationship with
crime.
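The interval check used above can be sketched with confint(), which returns one row per coefficient. The data is synthetic and the names are assumptions; the point is only to show how "interval includes 0" lines up with "P-value above 0.05" for the same term.

```r
set.seed(7)
n <- 100

# Synthetic stand-in data: density has no real effect on crime.
population <- runif(n, 300, 6000)
nonwhite   <- runif(n, 1, 40)
density    <- runif(n, 50, 3000)
crime      <- 2000 + 0.25 * population + 25 * nonwhite + rnorm(n, sd = 600)
cities     <- data.frame(population, nonwhite, density, crime)

fit_all <- lm(crime ~ population + nonwhite + density, data = cities)

ci <- confint(fit_all, level = 0.95)   # columns: lower and upper 95% bounds
ci

# A coefficient is significant at the 5% level exactly when its
# 95% interval lies entirely on one side of zero.
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
pvals <- summary(fit_all)$coefficients[, "Pr(>|t|)"]
data.frame(excludes_zero, significant = pvals < 0.05)
```

The two columns of the final table agree row by row, because the interval and the t-test are built from the same estimate and standard error.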
Second Linear Regression Fit
Now that we have removed density from our predictors, we expect a simpler model
when we rerun the linear regression with the two remaining variables.
As we can see now, all the remaining variables have significant P-values, and the model is more
parsimonious than the first one because we removed an irrelevant predictor.
The regression equation for the model above is as follows:
Crime = 2.185e+03 + 2.376e-01 * Population + 2.611e+01 * Nonwhite. This means that, holding
nonwhite fixed, an increase of 1 in population raises the predicted crime rate by approximately
2.376e-01. Similarly, holding population fixed, an increase of 1 in nonwhite raises the predicted
crime rate by 2.611e+01. A one unit increase in both predictors at once raises the prediction by
the sum of the two slopes, 2.376e-01 + 2.611e+01 = 26.3476. The intercept is not part of that
increase. Substituting 1 for each variable, 2.185e+03 + 2.376e-01 * (1) + 2.611e+01 * (1), gives
the predicted crime rate at those values (about 2211.35), not the change. The intercept
tells us that the predicted crime rate is about 2.185e+03 when both predictors
equal 0.
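To make the arithmetic concrete, the sketch below wraps the reported equation in a small R function. The function name `predict_crime` is made up for this example, and the nonwhite value of 10 in the last call is a hypothetical input; the coefficients and the population value of 664 (the median) come from the report.

```r
# Coefficients as reported in the regression equation above
b0    <- 2.185e+03   # intercept
b_pop <- 2.376e-01   # slope for population
b_nw  <- 2.611e+01   # slope for nonwhite

# Hypothetical helper: predicted crime rate for given predictor values
predict_crime <- function(population, nonwhite) {
  b0 + b_pop * population + b_nw * nonwhite
}

predict_crime(0, 0)                        # the intercept: prediction when both predictors are 0
predict_crime(1, 1) - predict_crime(0, 0)  # change for a one unit increase in both (sum of slopes)
predict_crime(664, 10)                     # prediction at the median population and nonwhite = 10
```

The second call isolates the change in the prediction, which is why the intercept cancels out of it.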
Regression Diagnostics Test Plot
In the following section, I will diagnose the regression model by checking normality, linearity,
homoscedasticity and outliers. These checks matter because the P-values and confidence intervals
from our linear regression model can only be trusted if these assumptions hold.
Diagnostic Plots
Residuals versus Fitted-Linearity Test
The plot above shows the residuals against the fitted values. The dotted line at y=0 marks a zero
residual, and the red line is a smooth through the residuals that shows how they trend. Points on
the dotted line have zero residuals, while those above have positive and those below have negative
residuals. If the relationship were linear, the red line would stay close to the dotted line. I
like the fact that the points are fairly evenly distributed above and below the line, although
there are fewer points as the fitted values increase. However, the red line curves, which suggests
that the relationship may not be linear but perhaps quadratic. We will explore that further.
Normal QQ Plot-Normality Test
Our normal QQ plot tells us whether our residuals follow a normal distribution. We check this by
observing whether the points follow the dotted line closely. Most of our points follow the
line closely, apart from a few outliers at the beginning and the end of the line. Philadelphia and
Johnstown are examples of outliers. We can check this further by performing the Shapiro-Wilk
test, whose null hypothesis is that the sample comes from a normally distributed population. Below
are the results.
At the 5 percent significance level, the Shapiro-Wilk test indicates that our data does not follow
the normal distribution, as the p-values are less than 0.05. We reject the null hypothesis.
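A normality check of this kind can be sketched with shapiro.test() applied to the residuals of a fitted model. The example uses synthetic data with deliberately skewed errors so that the test has something to detect; all names and values are assumptions.

```r
set.seed(3)
x <- runif(100, 0, 10)

# Two synthetic responses: one with normal errors, one with right-skewed errors
y_normal <- 2 + 3 * x + rnorm(100)
y_skewed <- 2 + 3 * x + rexp(100, rate = 0.5)

res_normal <- residuals(lm(y_normal ~ x))
res_skewed <- residuals(lm(y_skewed ~ x))

shapiro.test(res_normal)          # null hypothesis: residuals come from a normal population
sw_skewed <- shapiro.test(res_skewed)
sw_skewed$p.value                 # a small p-value leads us to reject normality

# The matching visual check: points should hug the reference line
qqnorm(res_skewed)
qqline(res_skewed)
```

A p-value below 0.05 rejects normality, while a larger p-value only means the test found no evidence against it.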
Scale Location Test For Homoscedasticity and Heteroscedasticity
The scale-location plot gives us information about the spread of the residuals across
the range of fitted values. We are looking for a roughly horizontal red line with points scattered
evenly around it, which tells us that the variance is uniform.
From the graph above, the spread looks roughly constant (homoscedastic) up to fitted values of
about 3000 and uneven (heteroscedastic) beyond that. The red line is fairly horizontal, but the
points thin out as the fitted values increase, so the variance does not look constant across the
whole range. I therefore treat the model as failing this test.
In my next step, I will test the independence of the error terms (no autocorrelation) with the
Durbin-Watson test.
Because our p-value is greater than 0.05, we fail to reject the null hypothesis of no
autocorrelation, so the data give no evidence against the independence assumption for our model.
This is good news!
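The Durbin-Watson statistic itself is simple enough to compute by hand in base R, which may help in reading the test output. The sketch below uses synthetic data with independent errors and computes only the statistic, not the p-value; for the p-value one would use a packaged test such as dwtest() from lmtest or durbinWatsonTest() from car.

```r
set.seed(11)

# Synthetic data with independent errors (an assumption of this example)
x <- 1:100
y <- 5 + 0.5 * x + rnorm(100)

e <- residuals(lm(y ~ x))

# Durbin-Watson statistic: squared successive differences of the residuals
# divided by the residual sum of squares. It lies between 0 and 4.
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

Values near 2 indicate little autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.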
Residuals vs Leverage Plot
In the next section, we look at a plot of the residuals against leverage. We want to know how
much influence each observation has on the fit, as measured by Cook's distance, and which
observations have high leverage. In the diagram below, the dotted lines are Cook's distance
contours, and any observations that fall outside them are highly influential.
It seems there are no points outside our dotted lines. We could run an outlier test before
concluding anything, but I will proceed to the next step.
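Leverage and Cook's distance can also be inspected numerically with hatvalues() and cooks.distance(). In this sketch the data is synthetic, and one observation (number 50) is deliberately placed far from the others so that it stands out; the threshold of 1 for Cook's distance is a common rule of thumb, not a fixed rule.

```r
set.seed(5)

# 49 well-behaved synthetic points plus one planted unusual point (observation 50)
x <- c(runif(49, 0, 10), 30)                 # x = 30 is far from the rest: high leverage
y <- c(2 + 1.5 * x[1:49] + rnorm(49), 10)    # y = 10 is far below the trend: high influence

fit <- lm(y ~ x)

h <- hatvalues(fit)        # leverage of each observation
d <- cooks.distance(fit)   # influence of each observation on the fitted coefficients

which.max(h)               # observation with the highest leverage
which.max(d)               # observation with the largest Cook's distance
which(d > 1)               # observations above the rule-of-thumb threshold

plot(fit, which = 5)       # the residuals vs leverage plot with Cook's distance contours
```

A point can have high leverage without being influential if it sits on the trend, which is why the plot shows both quantities together.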
Since our model does not pass all the diagnostic tests, in the next section I begin to correct the
model by transforming a predictor. It could be that the relationship is not exactly linear but
quadratic or otherwise curved. I will begin by taking the log of one predictor in the regression
formula to see what patterns it gives.
Linear regression with a log-transformed predictor
From the table, the log of population has a significant P-value, even though the evidence is
not very strong. The R-squared has also improved: the model now explains 42.1 percent
of the variation in crime, compared with the initial model, which had no transformations. However,
my intercept is now negative. I will check the diagnostic plots for this model to see what they
show.
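A log transformation of a predictor can be written directly inside the lm formula. The sketch below compares a raw and a log-transformed fit on synthetic data that was generated with a logarithmic relationship, so the improvement it shows is built in by assumption and says nothing about the real dataset.

```r
set.seed(9)

# Synthetic stand-in data: crime depends on log(population), by construction
population <- exp(runif(100, log(300), log(10000)))
nonwhite   <- runif(100, 1, 40)
crime      <- -1000 + 600 * log(population) + 20 * nonwhite + rnorm(100, sd = 400)

fit_raw <- lm(crime ~ population + nonwhite)        # untransformed predictor
fit_log <- lm(crime ~ log(population) + nonwhite)   # log-transformed predictor

c(raw = summary(fit_raw)$r.squared,
  log = summary(fit_log)$r.squared)                 # share of variance explained by each fit

# Diagnostic plots for the transformed model, four panels on one page
par(mfrow = c(2, 2))
plot(fit_log)
par(mfrow = c(1, 1))
```

With a logged predictor, the slope is read as the change in predicted crime for a one unit increase in log(population), that is, for multiplying population by about 2.72.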
The following diagnostic plots might not be accurate, but they are interesting to look at.
The Residuals vs Fitted plot suggests that we have fixed the linearity problem, as the red line
is almost horizontal. The normal QQ plot also passes the test for normality in this
model. The scale-location plot passes the homoscedasticity test: even though there are fewer
points at lower fitted values, the points are spread evenly above and below the line and do
not show signs of heteroscedasticity as fitted values increase. The outlier check also passes. I am
not sure at this point whether there is an error in my modelling, but I will go ahead and remove
the insignificant predictors according to this model and take the log of each relevant predictor.
I am mostly interested in the diagnostic plots at this point.
Linear Regression Model 3 Diagnostic Plots
Based on these charts, we can accept the linearity assumption, as the red line is fairly straight
and the points are fairly evenly distributed above and below it in the residuals vs fitted plot.
Our normal QQ plot shows that most of the residuals follow a normal distribution, except for a few
points. The scale-location plot passes the homoscedasticity test and our residuals vs leverage
chart looks good. I would say that we have fixed the linearity problem and the heteroscedasticity
problem we had with our earlier models.
As we can see, we can now model the relationship between crime, population
and nonwhite communities with more confidence. We have not only improved the model from the first
version but have also diagnosed it using standard statistical tools to see if it is good enough.
The next phase would be to predict crime rate based on the relevant variables in the final model
and to figure out how to test the accuracy of the predictions.
That is a small insight into one of my class projects in my Data Analytics class. It may
not be the most accurate but I decided to share this because I am interested in the Data
Engineering program.
Thanks.
Jonathan Chauwa

More Related Content

What's hot

Predictive Policing on Gun Violence Using Open Data
Predictive Policing on Gun Violence Using Open DataPredictive Policing on Gun Violence Using Open Data
Predictive Policing on Gun Violence Using Open DataPredPol, Inc
 
Predictive Policing - How Emerging Technologies Are Helping Prevent Crimes?
Predictive Policing - How Emerging Technologies Are Helping Prevent Crimes?Predictive Policing - How Emerging Technologies Are Helping Prevent Crimes?
Predictive Policing - How Emerging Technologies Are Helping Prevent Crimes?Sunil Jagani
 
statistical learning theory
statistical learning theorystatistical learning theory
statistical learning theoryHarshKumar943076
 
Linear Regression
Linear Regression Linear Regression
Linear Regression Rupak Roy
 
Machine learning session5(logistic regression)
Machine learning   session5(logistic regression)Machine learning   session5(logistic regression)
Machine learning session5(logistic regression)Abhimanyu Dwivedi
 
Reconsidering baron and kenny
Reconsidering baron and kennyReconsidering baron and kenny
Reconsidering baron and kennyMinhwan Lee
 

What's hot (9)

Predictive Policing on Gun Violence Using Open Data
Predictive Policing on Gun Violence Using Open DataPredictive Policing on Gun Violence Using Open Data
Predictive Policing on Gun Violence Using Open Data
 
Predictive Policing - How Emerging Technologies Are Helping Prevent Crimes?
Predictive Policing - How Emerging Technologies Are Helping Prevent Crimes?Predictive Policing - How Emerging Technologies Are Helping Prevent Crimes?
Predictive Policing - How Emerging Technologies Are Helping Prevent Crimes?
 
Correlation
CorrelationCorrelation
Correlation
 
statistical learning theory
statistical learning theorystatistical learning theory
statistical learning theory
 
Linear Regression
Linear Regression Linear Regression
Linear Regression
 
Machine learning session5(logistic regression)
Machine learning   session5(logistic regression)Machine learning   session5(logistic regression)
Machine learning session5(logistic regression)
 
Movie Ticket Post
Movie Ticket PostMovie Ticket Post
Movie Ticket Post
 
Reconsidering baron and kenny
Reconsidering baron and kennyReconsidering baron and kenny
Reconsidering baron and kenny
 
QRM Assignment
QRM AssignmentQRM Assignment
QRM Assignment
 

Similar to Crime Prediction Using Linear Regression

MLR Project (Onion)
MLR Project (Onion)MLR Project (Onion)
MLR Project (Onion)Chawal Ukesh
 
Hph7310week2winter2009narr
Hph7310week2winter2009narrHph7310week2winter2009narr
Hph7310week2winter2009narrSarah
 
The future is uncertain. Some events do have a very small probabil.docx
The future is uncertain. Some events do have a very small probabil.docxThe future is uncertain. Some events do have a very small probabil.docx
The future is uncertain. Some events do have a very small probabil.docxoreo10
 
Classification via Logistic Regression
Classification via Logistic RegressionClassification via Logistic Regression
Classification via Logistic RegressionTaweh Beysolow II
 
Correlation Example
Correlation ExampleCorrelation Example
Correlation ExampleOUM SAOKOSAL
 
Partial correlation
Partial correlationPartial correlation
Partial correlationDwaitiRoy
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regressionAntony Raj
 
Predicting deaths from COVID-19 using Machine Learning
Predicting deaths from COVID-19 using Machine LearningPredicting deaths from COVID-19 using Machine Learning
Predicting deaths from COVID-19 using Machine LearningIdanGalShohet
 
For this assignment, use the aschooltest.sav dataset.The d
For this assignment, use the aschooltest.sav dataset.The dFor this assignment, use the aschooltest.sav dataset.The d
For this assignment, use the aschooltest.sav dataset.The dMerrileeDelvalle969
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regressionAntony Raj
 
Interpreting Regression Results - Machine Learning
Interpreting Regression Results - Machine LearningInterpreting Regression Results - Machine Learning
Interpreting Regression Results - Machine LearningKush Kulshrestha
 
8 Statistical SignificanceOK, measures of association are one .docx
8 Statistical SignificanceOK, measures of association are one .docx8 Statistical SignificanceOK, measures of association are one .docx
8 Statistical SignificanceOK, measures of association are one .docxevonnehoggarth79783
 
BUS 308 – Week 4 Lecture 2 Interpreting Relationships .docx
BUS 308 – Week 4 Lecture 2 Interpreting Relationships .docxBUS 308 – Week 4 Lecture 2 Interpreting Relationships .docx
BUS 308 – Week 4 Lecture 2 Interpreting Relationships .docxcurwenmichaela
 
BUS 308 – Week 4 Lecture 2 Interpreting Relationships .docx
BUS 308 – Week 4 Lecture 2 Interpreting Relationships .docxBUS 308 – Week 4 Lecture 2 Interpreting Relationships .docx
BUS 308 – Week 4 Lecture 2 Interpreting Relationships .docxjasoninnes20
 
Statistics for Data Analytics
Statistics for Data AnalyticsStatistics for Data Analytics
Statistics for Data AnalyticsTushar Dalvi
 
PPT Correlation.pptx
PPT Correlation.pptxPPT Correlation.pptx
PPT Correlation.pptxMahamZeeshan5
 

Similar to Crime Prediction Using Linear Regression (20)

MLR Project (Onion)
MLR Project (Onion)MLR Project (Onion)
MLR Project (Onion)
 
Hph7310week2winter2009narr
Hph7310week2winter2009narrHph7310week2winter2009narr
Hph7310week2winter2009narr
 
The future is uncertain. Some events do have a very small probabil.docx
The future is uncertain. Some events do have a very small probabil.docxThe future is uncertain. Some events do have a very small probabil.docx
The future is uncertain. Some events do have a very small probabil.docx
 
Classification via Logistic Regression
Classification via Logistic RegressionClassification via Logistic Regression
Classification via Logistic Regression
 
X18136931 statistics ca2_updated
X18136931 statistics ca2_updatedX18136931 statistics ca2_updated
X18136931 statistics ca2_updated
 
2-20-04.ppt
2-20-04.ppt2-20-04.ppt
2-20-04.ppt
 
Correlation Example
Correlation ExampleCorrelation Example
Correlation Example
 
Partial correlation
Partial correlationPartial correlation
Partial correlation
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regression
 
Predicting deaths from COVID-19 using Machine Learning
Predicting deaths from COVID-19 using Machine LearningPredicting deaths from COVID-19 using Machine Learning
Predicting deaths from COVID-19 using Machine Learning
 
For this assignment, use the aschooltest.sav dataset.The d
For this assignment, use the aschooltest.sav dataset.The dFor this assignment, use the aschooltest.sav dataset.The d
For this assignment, use the aschooltest.sav dataset.The d
 
Ch 7 correlation_and_linear_regression
Ch 7 correlation_and_linear_regressionCh 7 correlation_and_linear_regression
Ch 7 correlation_and_linear_regression
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regression
 
Interpreting Regression Results - Machine Learning
Interpreting Regression Results - Machine LearningInterpreting Regression Results - Machine Learning
Interpreting Regression Results - Machine Learning
 
8 Statistical SignificanceOK, measures of association are one .docx
8 Statistical SignificanceOK, measures of association are one .docx8 Statistical SignificanceOK, measures of association are one .docx
8 Statistical SignificanceOK, measures of association are one .docx
 
BUS 308 – Week 4 Lecture 2 Interpreting Relationships .docx
BUS 308 – Week 4 Lecture 2 Interpreting Relationships .docxBUS 308 – Week 4 Lecture 2 Interpreting Relationships .docx
BUS 308 – Week 4 Lecture 2 Interpreting Relationships .docx
 
BUS 308 – Week 4 Lecture 2 Interpreting Relationships .docx
BUS 308 – Week 4 Lecture 2 Interpreting Relationships .docxBUS 308 – Week 4 Lecture 2 Interpreting Relationships .docx
BUS 308 – Week 4 Lecture 2 Interpreting Relationships .docx
 
Statistics for Data Analytics
Statistics for Data AnalyticsStatistics for Data Analytics
Statistics for Data Analytics
 
9. parametric regression
9. parametric regression9. parametric regression
9. parametric regression
 
PPT Correlation.pptx
PPT Correlation.pptxPPT Correlation.pptx
PPT Correlation.pptx
 

Recently uploaded

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknowmakika9823
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 

Recently uploaded (20)

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call

Crime Prediction Using Linear Regression

  • 1. Data Analytics-Linear Regression Report By Jonathan Chauwa Best Crime Predictor In this report, I will use linear regression and the R programming language to analyze whether crime has a relationship with population size, non-white communities and density. For example, does a larger population in a given area mean more crime? Is it true that nonwhite communities here in the USA have a higher crime rate? What about density in a given area? I will use the Freedman dataset, which in R is provided by the carData package (loaded with the car package), through R-Studio and try to find relationships that might or might not be significant, with crime as the response variable and population, nonwhite communities and density as the predictors. I will then come up with a model that can reasonably describe any significant relationship among these variables. The model, if properly formed, can then be used to predict crime rate based on one or more of these predictors. I will take snapshots of important information throughout this report to give a clearer explanation of the process. Before I do anything, I will begin by exploring the Freedman dataset to get a feel for the data. I will focus on how many observations are available and whether there are missing values. I will also compute summary statistics to know the range, mean, median and other information that gives a high-level insight into the data. Below is a quick summary of the Freedman dataset
  • 2. From the snapshot above, I can conclude that the dataset has four variables, namely population, nonwhite, density and crime, with 110 observations, and that there are some missing values in the dataset. I will get rid of the missing values by using the na.omit() function in R before proceeding. Calling na.omit(Freedman) returns a copy of the data with the incomplete rows dropped, so the result has to be assigned to a variable. Now that we have taken care of missing values, we can run summary statistics on the dataset to understand it better. The following table gives summary statistics of each variable, stating the mean, median and range. The table is largely self-explanatory: for example, population ranges from 270 to about 11551 with a mean of 1136 and a median of 664. The same table can be used to read off the measures of central tendency and other information about the variables in the dataset. Plotting the data The next step is to plot the dataset so that I can see relationships between the different variables, since my goal is to find relationships between crime and the rest of the variables in the dataset. I started with a scatterplot for all variables. Pairwise Plots Snapshot
  • 3. Visually, I can see some relationships between some variables, such as crime and population, in the plot above, but I think it will be better to take two variables at a time and plot them. This will make it easier to see what kind of relationship may exist between two variables. I will go further and draw a line through the points to increase clarity. I will use the abline function in R to draw the fitted linear regression line, and I will also use the lowess function to draw a smoothed curve through the points. The lowess function uses locally weighted polynomial regression, so it can follow curvature in the data that a single straight line drawn with abline cannot. For clarity, the lowess fitted line will be in blue and the ordinary linear regression line will be in red in each plot. Plot of Population and Crime
  • 4. From the graph above, I think population and crime might have a relationship. The linear fit, represented by the red line, suggests that the relationship between population and crime is linear, while the blue line fitted by the lowess function suggests a polynomial relationship instead. In either case, there appears to be some relationship between population and crime. Crime and Density Plot From the graph above, it looks like crime and density might have some form of linear relationship, but the results are not convincing as the blue fitted line is almost at zero. Owing to the ambiguity in this plot, I will not conclude anything about the relationship between crime and density until I conduct further tests.
  • 5. Crime and Non-White Populations Plot I could not fit a linear model on this plot due to how the data is structured, but the lowess function did give me something. Visually, the data seems to have a quadratic relationship. I could draw a curve through the points, but that would leave the points away from the curve unexplained. It is too early to conclude, but this does not show a good linear relationship between crime and nonwhite population. Correlation Analysis To continue exploring the relationships between the variables in the dataset, I will find the correlation between each variable and crime. Correlation analysis will enable me to study the strength of the linear relationship between two variables. Correlation
  • 6. Most relationships in these results are not very strong. For example, there is no meaningful correlation between crime and density, as the correlation value is close to 0. Crime and nonwhite, and crime and population, have moderate correlations even though they are not very strong. From my Pearson product-moment correlation analysis, I think that population is the strongest predictor of crime at 0.3957592. So far, population and crime do seem to have a relationship, based on both the pairwise plots earlier and the correlation analysis. Linear Regression In the next section, I will regress crime onto all the predictors and assess which predictors have significant P-values and how much of the variance in crime is explained by those predictors. This will further help me decide which variables are strong predictors of crime. The first linear fit, which includes all the variables, shows that density is insignificant. Its coefficient is negative and, more importantly, it has a high P-value. In this case, we can confidently remove density, as the earlier tests also showed signs of a weak relationship between density and crime. However, the P-values for nonwhite and population are significant, as they are below the 0.05 threshold. Further, from the linear regression results, only 22.8 percent of the variation in crime is explained by the predictors. We can expect this to improve since we will remove irrelevant
  • 7. predictors such as density before rerunning the linear regression model. Further, our R-squared and adjusted R-squared values are generally low, but low R-squared values do not always mean we should reject the model. This is because R-squared has limitations: it does not tell us whether the coefficient estimates are biased. We will have to assess the residual plots to get more information. R-squared also does not tell us whether our model is adequate, as in this case, where some variables have significant P-values but the R-squared is low. In any case, we expect the model to improve as we remove the irrelevant predictors. More diagnostic tests for significant relationships. In the next step, I compute and state the 95% confidence interval for all predictors and remove the irrelevant ones. Confidence Intervals Snap Shot From the results above, density is insignificant as its interval includes 0. Crime and density have shown a weak relationship in all of the earlier checks: the pairwise scatter plots, the correlation test, the regression model and finally the confidence intervals. So I am confidently removing density as a predictor and hope this makes my model better. I will keep nonwhite and population and run the final model with those two as predictors, as they show a reasonable relationship with crime. Second Linear Regression Fit
  • 8. Now that we have removed density as one of our predictors, we expect our model to improve when we rerun linear regression with the two remaining variables. As we can see now, all the variables have significant P-values and the model is better than the first one because we removed an irrelevant predictor. The regression equation in this case is as follows: Crime = 2.185e+03 + 2.376e-01 * Population + 2.611e+01 * Nonwhite. This means that, holding nonwhite fixed, an increase of 1 in population raises the predicted crime rate by approximately 2.376e-01. Similarly, holding population fixed, an increase of 1 in nonwhite raises the predicted crime rate by 2.611e+01. Taken together, a one-unit increase in both predictors raises the predicted crime rate by the sum of the two coefficients, 2.376e-01 + 2.611e+01, which is approximately 26.35; the intercept cancels out of the difference. The intercept simply tells us that the predicted crime rate is around 2.185e+03 when both predictors are 0. Regression Diagnostics Test Plot In the following section, I will diagnose the regression model by testing for normality, linearity, homoscedasticity and outliers in the model. These tests help us check whether the assumptions behind the results of our linear regression model hold.
  • 9. Diagnostic Plots Residuals versus Fitted-Linearity Test The plot above shows residuals versus fitted values. The dotted line at y=0 marks a perfect fit, and the red line gives us an idea of the pattern in the residuals. All points on the dotted line have zero residuals, while those above have positive and those below have negative residuals. It looks like the relationship is not strictly linear. I like the fact that the points are fairly evenly distributed above and below the line, although the number of points above and below the line decreases as fitted values increase. However, the red line is curved, so the relationship may not be linear but perhaps quadratic. We will explore that further. Normal QQ Plot-Normality Test
  • 10. Our normal QQ plot tells us whether our residuals follow a normal distribution. We check this by observing whether our points follow the dotted line closely. Most of our points follow the line closely, apart from a few outliers at the beginning and the end of the line. Philadelphia and Johnstown are examples of outliers. We can check this further by performing the Shapiro-Wilk test. The null hypothesis of this test is that the sample comes from a normally distributed population. Below are the results. At the 5 percent significance level, the Shapiro-Wilk test tells us that our data does not follow the normal distribution, as the p-value is less than 0.05. We reject the null. Scale Location Test For Homoscedasticity and Heteroscedasticity
  • 11. The scale location plot gives us information about the spread of our points across the range of predicted values. We are looking for a horizontal line, which would tell us that the variance is uniform. From the graph, the spread looks roughly constant (homoscedastic) up to a fitted value of about 3000 and becomes heteroscedastic thereafter. Usually, we look for a horizontal red line in this plot, and the red line is fairly horizontal even though the number of points decreases as the fitted values increase. The model definitely fails this test. In my next step, I will test the independence of the error terms (autocorrelation) with the Durbin-Watson test. Because our p-value is greater than 0.05, we fail to reject the null hypothesis of no autocorrelation and conclude that the independence assumption holds for our model. This is good news! Residuals vs Leverage Plot
  • 12. In the next section, we look at a plot of the residuals vs the leverage. We want to know how influential our observations are using Cook's distance, and we also want to know the leverage of individual observations. In the diagram, we look for points outside the dotted Cook's distance lines; any observations outside those lines are highly influential. There seem to be no points outside our dotted lines. We could run an outlier test before concluding anything, but I will proceed to the next step. Since our model does not pass all the diagnostic tests, in the next section I begin to correct the model using transformations. It could be that our model is not exactly linear but maybe quadratic or polynomial. I will begin by taking the log of one predictor in the regression formula to see what patterns it gives me. Linear regression with a log transformation
  • 13. From the table, taking the log of the population gives me a significant P-value even though it is not very strong. The R-squared has also improved considerably, as the model now explains 42.1 percent of the variation in crime compared to the initial model, which had no transformations. However, my intercept is now negative. I will check the diagnostic plots for this model just to see what they show. The following diagnostic plots might not be accurate but are very interesting to look at.
  • 14. The graph shows that we have fixed the linearity problem, as the red line in the Residuals vs Fitted plot is almost horizontal. Our normal QQ plot also passes the test for normality in this model. The scale location plot passes the homoscedasticity test: even though there are fewer points at lower fitted values, the distribution above and below the line is fair and does not show signs of heteroscedasticity as fitted values increase. Our outlier test also passes. I am not sure at this point whether there is an error in my modelling, but I will go ahead and take away the insignificant predictors according to this model and take the log of each relevant predictor. I am mostly interested in the diagnostic plots at this point. Linear Regression Model 3 Diagnostic Plots
  • 15. Based on these charts, we can accept the linearity assumption, as the red line is fairly straight and the points are fairly evenly distributed above and below it in the residuals vs fitted plot. Our normal QQ plot shows that most of our residuals are normally distributed except for a few points. The scale location plot passes the homoscedasticity test and our residuals vs leverage chart looks good. I would say that we have fixed the linearity problem and the heteroscedasticity problem we had with our earlier models.
  • 16. As we can see, we can now confidently model the relationship between crime and population and nonwhite communities. We have not only improved the model significantly from the first but have also diagnosed it using acceptable statistical tools to see if it is good enough. The next phase would be to predict crime rate based on the relevant variables in the final model and to figure out how to test the accuracy of the predictions. That is a small insight into one of my class projects in my Data Analytics class. It may not be the most accurate but I decided to share this because I am interested in the Data Engineering program. Thanks. Jonathan Chauwa
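
The steps described in the slides can be sketched in R roughly as follows. This is a minimal sketch under stated assumptions, not the author's actual script: the function name fit_crime_models and the fields of its return value are hypothetical, the Freedman data is assumed to come from the carData package, and the log model is assumed to transform only population, since the slides do not show the exact formula. The Durbin-Watson statistic is computed by hand from the residuals; obtaining a p-value for it would need an add-on package.

```r
# Illustrative sketch only: fit_crime_models() and its return fields are
# hypothetical names, not taken from the original slides.
fit_crime_models <- function(df) {
  df <- na.omit(df)  # na.omit() returns a copy, so keep the result

  # Model 1: crime regressed on all three predictors.
  full <- lm(crime ~ population + nonwhite + density, data = df)

  # Model 2: density dropped, as in the second fit described in the slides.
  reduced <- lm(crime ~ population + nonwhite, data = df)

  # Assumed form of the log model: only population is log-transformed.
  logged <- lm(crime ~ log(population) + nonwhite, data = df)

  # Durbin-Watson statistic computed by hand from the residuals
  # (a p-value would need an add-on package).
  e <- residuals(reduced)
  dw <- sum(diff(e)^2) / sum(e^2)

  list(
    data = df,
    correlations = cor(df)[, "crime"],   # Pearson correlation with crime
    full = full,
    ci = confint(full, level = 0.95),    # 95% confidence intervals
    reduced = reduced,
    logged = logged,
    shapiro = shapiro.test(e),           # normality test on the residuals
    dw = dw
  )
}

# Usage, assuming the carData package (which provides Freedman) is installed.
if (requireNamespace("carData", quietly = TRUE)) {
  res <- fit_crime_models(carData::Freedman)
  print(summary(res$data))
  print(res$correlations)
  print(res$ci)
  print(summary(res$reduced))
  print(res$shapiro)
  pairs(res$data)        # pairwise scatterplots
  par(mfrow = c(2, 2))
  plot(res$reduced)      # the four standard diagnostic plots
}
```

Wrapping the fits in one function keeps the three models comparable on the same cleaned data; calling summary() on res$full, res$reduced and res$logged would show the coefficient tables and R-squared values discussed above.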