Crime Prediction Using Linear Regression
Data Analytics: Linear Regression Report
By Jonathan Chauwa
Best Crime Predictor
In this report, I use statistical concepts such as linear regression, together with the R programming language, to analyze data and find out whether crime is related to population size, nonwhite population share, and population density. For example, does a larger population in a given area mean more crime? Is it true that areas in the USA with larger nonwhite communities have higher crime rates? What about density? I will use the Freedman dataset, which ships with the carData package (loaded through the car library) in RStudio, and look for relationships that may or may not be significant between the response variable, crime, and the predictors: population, nonwhite and density. I will then build a model that reasonably describes any significant relationship among these variables. If properly formed, the model can be used to predict the crime rate from one or more of the predictors. I include snapshots of important output throughout this report to give a clearer explanation of the process.
Before doing anything else, I will explore the Freedman dataset to get a feel for the data. I will focus on how many observations are available and whether any values are missing. I will also compute summary statistics to learn the range, mean, median and other information that gives a high-level insight into the data.
Below is a quick summary of the Freedman dataset.
From the snapshot above, I can conclude that the dataset has four variables, namely population, nonwhite, density and the response, crime, with 110 observations, and that some values are missing. I will remove the missing values with the na.omit() function in R before proceeding.
The call na.omit(Freedman) drops the rows with missing values.
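As a concrete sketch (assuming the Freedman data comes from the carData package, where it ships in current R distributions):

```r
# Sketch: load the Freedman dataset and drop rows with missing values.
# Assumes the carData package is installed; the dataset ships there.
library(carData)

dim(Freedman)             # 110 rows, 4 columns
colSums(is.na(Freedman))  # count missing values per column

Freedman <- na.omit(Freedman)  # remove incomplete observations
summary(Freedman)              # mean, median, range for each variable
```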
Now that we have taken care of missing values, we can run summary statistics on the dataset to understand it better. The following table gives summary statistics for each variable, stating the mean, median and range.
The table above is largely self-explanatory; for example, population ranges from 270 to about 11,551 with a mean of 1,136 and a median of 664. The same table can be used to read off the measures of central tendency and other information about the variables in the dataset.
Plotting the data
My next step is to visualize the dataset so that I can see relationships between the different variables, since my goal is to find relationships between crime and the rest of the variables in the dataset. I started with a scatterplot of all variable pairs.
Pairwise Plots Snapshot
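In base R this pairwise grid can be drawn with pairs(); a minimal sketch, assuming the cleaned Freedman data from carData:

```r
# Pairwise scatterplots of all four variables.
# Assumes the Freedman data frame from the carData package.
library(carData)

Freedman <- na.omit(Freedman)
pairs(Freedman, main = "Freedman dataset: pairwise scatterplots")
```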
Visually, I can see some relationships between variables, such as crime and population, in the grid above, but I think it will be clearer to take two variables at a time and plot them. This makes it easier to see what kind of relationship may exist between a pair of variables. I will go further and draw a line through the points to increase clarity. I will use the abline function in R to fit a straight line, and I will also use the lowess function, which uses locally weighted polynomial regression to fit a smooth curve through the points and typically gives a much better fit than the straight line from abline. For clarity, the lowess line will be blue and the ordinary linear regression line red in each plot.
Plot of Population and Crime
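A sketch of one such plot, crime against population, with both lines overlaid (again assuming the cleaned Freedman data):

```r
# Scatterplot of crime vs. population with two fitted lines:
# red = ordinary least-squares line (abline), blue = lowess smoother.
library(carData)

Freedman <- na.omit(Freedman)
plot(Freedman$population, Freedman$crime,
     xlab = "Population", ylab = "Crime rate")
abline(lm(crime ~ population, data = Freedman), col = "red")
lines(lowess(Freedman$population, Freedman$crime), col = "blue")
```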
From the graph above, I think population and crime might have a relationship. The linear fit, represented by the red line, suggests the relationship between population and crime is linear, while the blue line fitted by the lowess function suggests a more polynomial relationship. In any case, I can see that there is definitely a relationship between population and crime.
Crime and Density Plot
From the graph above, it looks like crime and density might have some form of linear relationship, but the result is not convincing, as the blue fitted line is almost flat at zero. Owing to the ambiguity in this plot, I will not conclude anything about the relationship between crime and density until I conduct further tests.
Crime and Non-White Populations Plot
I could not fit a linear model on this plot because of how the data is structured, but the lowess function did give me something. Visually, the data seems to have a quadratic relationship. I could draw a curve through the points, but many points in between would remain unexplained. It is too early to conclude, but this does not suggest a good linear relationship between crime and the nonwhite variable.
Correlation Analysis
Continuing to explore the relationships between the variables in the dataset, I will find the correlation between each variable and crime. Correlation analysis lets me quantify the strength of the relationship between two variables.
Correlation
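The values behind this snapshot can be reproduced with cor(); a sketch:

```r
# Pearson correlation of crime with each candidate predictor.
# Assumes the Freedman data frame from the carData package.
library(carData)

Freedman <- na.omit(Freedman)
cor(Freedman$crime, Freedman$population)  # ~0.396 per the report
cor(Freedman$crime, Freedman$nonwhite)
cor(Freedman$crime, Freedman$density)
```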
Most of the relationships in these results are fairly weak. For example, there is no meaningful correlation between crime and density, as the correlation value is very close to 0. Crime and nonwhite, and crime and population, show moderate correlations, even though they are not very strong. From the Pearson product-moment correlation analysis, I conclude that population is the strongest predictor of crime, at about 0.396. So far, population and crime do seem to be related, both from the earlier pairwise plots and from the correlation analysis.
Linear Regression
In the next section, I will regress crime onto all the predictors and assess which predictors have significant p-values and how much of the variance in crime they explain. This will further help me decide which variables are strong predictors of crime.
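The full fit is a one-liner with lm(); a sketch, assuming the cleaned Freedman data:

```r
# First model: regress crime on all three candidate predictors.
library(carData)

Freedman <- na.omit(Freedman)
fit_all <- lm(crime ~ population + nonwhite + density, data = Freedman)
summary(fit_all)  # per-predictor p-values; R-squared ~0.228 per the report
```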
The first linear fit, which includes all the variables, shows that density is insignificant: it not only has a negative estimated relationship with crime but also a high p-value. We can confidently remove density, as it is not a good predictor of crime, consistent with the earlier tests, which showed signs of a weak relationship between density and crime. The p-values for nonwhite and population, however, are significant, as they fall below the usual 0.05 threshold.
Further, the regression results show that only 22.8 percent of the variation in crime is explained by the predictors. We can expect this to improve once we remove irrelevant predictors such as density and rerun the model. Our R-squared and adjusted R-squared values are also generally low, but low R-squared values do not always mean we should reject the model: R-squared does not tell us whether the coefficient estimates are biased, so we will have to assess the residual plots for more information. R-squared also cannot by itself tell us whether the model is adequate, as in this case, where some variables have significant p-values yet the R-squared value is low. In any case, we expect these results to improve as we remove the irrelevant predictors.
More diagnostic tests for significant relationships.
In the next step, I compute the 95% confidence interval for each predictor and remove the irrelevant ones.
Confidence Intervals Snap Shot
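The intervals can be computed with confint(); a sketch:

```r
# 95% confidence intervals for the coefficients of the full model.
library(carData)

Freedman <- na.omit(Freedman)
fit_all <- lm(crime ~ population + nonwhite + density, data = Freedman)
confint(fit_all, level = 0.95)  # density's interval straddles 0
```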
From the results above, density is insignificant, as its interval includes 0. Crime and density have now shown a weak relationship in every earlier test: the pairwise scatterplots, the regression model, the correlation analysis and finally the confidence intervals. So I am confidently removing density as a predictor and hope this makes the model better. I will keep nonwhite and population and run the final model with those two predictors, as they show a reasonable relationship with crime.
Second Linear Regression Fit
Now that we have removed density as a predictor, we expect the model to improve when we rerun the linear regression with the two remaining variables.
As we can see, both variables now have significant p-values, and the model is noticeably better than the first one because we removed an irrelevant predictor.
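A sketch of the reduced fit:

```r
# Second model: drop density, keep population and nonwhite.
library(carData)

Freedman <- na.omit(Freedman)
fit2 <- lm(crime ~ population + nonwhite, data = Freedman)
summary(fit2)  # both predictors significant per the report
```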
The regression equation for the model above is: Crime = 2.185e+03 + 2.376e-01 * Population + 2.611e+01 * Nonwhite. This means that for an increase of 1 in population, the model predicts the crime rate will increase by approximately 2.376e-01. Similarly, for an increase of 1 in nonwhite, the model predicts the crime rate will increase by 2.611e+01. Taken together, the effect of a one-unit increase in both predictors is found by substituting 1 for each variable and adding the results: 2.185e+03 + 2.376e-01 * (1) + 2.611e+01 * (1). The intercept simply tells us that the predicted crime rate is around 2.185e+03 when both predictors are 0.
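As a quick arithmetic check with the reported coefficients (the example area's values are hypothetical):

```r
# Prediction by hand from the reported coefficients of the second fit.
b0 <- 2.185e+03  # intercept
b1 <- 2.376e-01  # population coefficient
b2 <- 2.611e+01  # nonwhite coefficient

# Hypothetical area: population = 1000, nonwhite = 10 (illustrative values)
crime_hat <- b0 + b1 * 1000 + b2 * 10
crime_hat  # 2683.7
```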
Regression Diagnostics Test Plot
In the following section, I will diagnose the regression model by testing for normality, linearity, homoscedasticity and outliers. These tests help confirm some of the results from the linear regression model.
Diagnostic Plots
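In R, the four standard diagnostic plots come from plotting the fitted model object; a sketch:

```r
# Four standard diagnostic plots for the two-predictor model.
library(carData)

Freedman <- na.omit(Freedman)
fit2 <- lm(crime ~ population + nonwhite, data = Freedman)

par(mfrow = c(2, 2))  # arrange the four plots in a 2x2 grid
plot(fit2)            # residuals vs fitted, Q-Q, scale-location, leverage
```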
Residuals versus Fitted-Linearity Test
The plot above shows the error residuals versus the fitted values. Our reference line is the dotted line at y = 0, and the red line traces the overall pattern of the residuals. Points on the line have zero residuals, those above have positive residuals and those below have negative ones. I like that the points are fairly evenly distributed above and below the line, although the number of points thins out as the fitted values increase. However, the red line is curved, which suggests the relationship may not be linear but perhaps quadratic. We will explore that further.
Normal QQ Plot-Normality Test
The normal Q-Q plot tells us whether our residuals follow a normal distribution. We check this by seeing whether the points follow the dotted line closely. Most of our points do follow the line, despite a few outliers at the beginning and end; Philadelphia and Johnstown are examples. We can check this further with the Shapiro-Wilk test, whose null hypothesis is that the population is normally distributed. Below are the results.
At the 95 percent confidence level, the Shapiro-Wilk test says our residuals do not follow the normal distribution, as the p-value is below 0.05. We reject the null.
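The test itself is one line in base R; a sketch applied to the model residuals:

```r
# Shapiro-Wilk normality test on the residuals of the second fit.
library(carData)

Freedman <- na.omit(Freedman)
fit2 <- lm(crime ~ population + nonwhite, data = Freedman)
shapiro.test(residuals(fit2))  # p < 0.05 per the report: reject normality
```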
Scale Location Test For Homoscedasticity and Heteroscedasticity
The scale-location plot gives us information about the spread of the residuals across the range of predicted values. We are looking for a roughly horizontal red line with evenly spread points, which would indicate uniform variance.
From the graph above, the data looks homoscedastic until about the 3000 mark and heteroskedastic after that. The red line is fairly horizontal, but the spread of the points changes as the fitted values increase, so the model fails this test.
In my next step, I will test the independence of the error terms (autocorrelation) with the Durbin-Watson test.
Because our p value is greater than 0.05, we fail to reject the null hypothesis and conclude that
the independence assumption holds true for our model. This is good news!
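The test is available as durbinWatsonTest() in the car package; a sketch, assuming car is installed:

```r
# Durbin-Watson test for autocorrelation in the residuals.
# durbinWatsonTest() is provided by the car package.
library(car)  # also attaches carData, which holds Freedman

Freedman <- na.omit(Freedman)
fit2 <- lm(crime ~ population + nonwhite, data = Freedman)
durbinWatsonTest(fit2)  # p > 0.05 per the report: independence holds
```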
Residuals vs Leverage Plot
Next, we look at the plot of residuals versus leverage. We want to identify influential observations using Cook's distance and find observations with high leverage. In the plot below, we look for points outside the dotted lines; any such observations have high influence.
There seem to be no points outside the dotted lines. We could run an outlier test before concluding anything, but I will proceed to the next step.
Since the model does not pass all the diagnostic tests, in the next section I begin to correct it using variable transformations. It could be that the relationship is not exactly linear but quadratic or otherwise polynomial. I will begin by taking the log of one predictor in the regression formula to see what patterns it gives me.
Linear Regression With a Log Transformation
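One plausible reconstruction of this step (the exact formula is not shown in the snapshot, so the predictors included here are an assumption):

```r
# Third model sketch: log-transform population and refit.
# The exact formula used in the report is not shown; this is a guess
# that matches the described setup.
library(carData)

Freedman <- na.omit(Freedman)
fit_log <- lm(crime ~ log(population) + nonwhite + density, data = Freedman)
summary(fit_log)  # the report cites R-squared of about 0.42 at this stage
```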
From the table, taking the log of population gives a significant p-value, even though it is not very strong. The R-squared has also improved considerably: the model now explains 42.1 percent of the variation in crime, compared with the initial model, which had no transformations. However, my intercept is now negative. I will check the diagnostic plots for this model to see what they reveal.
The following diagnostic plots may not be definitive, but they are very interesting to look at.
The plots show that we have clearly fixed the linearity problem, as the red line in the Residuals vs Fitted plot is almost horizontal. The normal Q-Q plot also passes the normality test for this model, and the scale-location plot passes the homoscedasticity test: even though there are fewer points at lower fitted values, the distribution above and below the line is fair and shows no sign of heteroscedasticity as the fitted values increase. The outlier test also passes. I am not sure at this point whether there is an error in my modelling, but I will go ahead, remove the predictors this model finds insignificant, and take the log of each relevant predictor. I am mostly interested in the diagnostic plots at this point.
Linear Regression Model 3 Diagnostic Plots
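A sketch of the final refit and its diagnostics (again a reconstruction; which predictors are kept and logged is an assumption based on the text):

```r
# Final model sketch: keep the significant predictors and log-transform them.
# Assumes nonwhite is strictly positive so log() is defined.
library(carData)

Freedman <- na.omit(Freedman)
fit3 <- lm(crime ~ log(population) + log(nonwhite), data = Freedman)

par(mfrow = c(2, 2))
plot(fit3)  # re-check linearity, normality and homoscedasticity
```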
Based on these plots, we can accept the linearity test, as the line is fairly straight and the points are fairly distributed above and below the red line in the residuals vs fitted plot. The normal Q-Q plot shows that most of the data is normally distributed, except for a few points. The scale-location plot passes the homoscedasticity test, and the residuals vs leverage chart looks good. I would say we have fixed the linearity and heteroscedasticity problems we had with the earlier models.
As we can see, we can now confidently model the relationship between crime, population and the nonwhite variable. We have not only improved the model significantly over the first fit but have also diagnosed it with accepted statistical tools to see whether it is good enough. The next phase would be to predict the crime rate from the relevant variables in the final model and to figure out how to test the accuracy of those predictions.
That is a small insight into one of my class projects from my Data Analytics class. It may not be the most accurate analysis, but I decided to share it because I am interested in the Data Engineering program.
Thanks.
Jonathan Chauwa