Descriptive Analysis.docx

Coursework Details: 2nd Hand Car Market Case Study

EXECUTIVE SUMMARY
This report analyzes secondhand Ford Fiesta cars listed around postcode B7 4JP and describes how
their prices change with car age, mileage, and engine power.
The used car market is large, which can make buying or selling a car intimidating. We examined the
variables that affect the price of our car model: age, mileage, and engine power are the main
determining factors. We analyzed the central tendency and dispersion of price and of these other
characteristics in order to compare their averages and determine the links between them. After
performing various statistical tests, we identify the relationships among the predictors, their
relationships with price, and the effect of these interactions on the predicted price.
We also assess the size and strength of these relationships, and we built a regression model to
determine how well these variables together explain car price.

TABLE OF CONTENTS
Executive Summary
Section 1: Introduction
Section 2: Visualizations
Section 3: Descriptive statistics
Section 4: Hypothesis
Section 5: Correlation Analysis
Section 6: Regression analysis
Section 7: Conclusion
Section 8: References

Section 1
Introduction
We want to forecast the typical used car price for a Ford Fiesta around postcode B7 4JP, using a
sample drawn from the population of listings. We summarize the car data in an organized way and
explain the relationships between the different variables in the sample. To ascertain which factors
have the greatest influence on price and which have no impact at all, we examine various correlations.
The source of our data is https://heycar.co.uk/used-cars. After extracting the relevant data from our
source into an Excel file, the major limitation was missing values, and handling those values was
crucial to analyzing the data accurately and interpreting the output coherently. We cleaned the data
without changing its real context, by adding or deleting values or removing anomalies. We first
filtered for blank or missing values to count them; because they were few in number and had little
impact on the meaning of the dataset, we deleted all of them. We adopted simple random sampling
because it gives each member of the population an equal and fair chance of being chosen: a small
portion of members is selected at random from the total population (“Random Sampling - Overview,
Types, Importance, Example”). Simple random sampling is also required when finding the interval in
which the average of the whole population lies.
[2] https://corporatefinanceinstitute.com/resources/data-science/random-sampling/
We examine five years of car data because this not only tells us which cars were popular in that
period based on their performance but also helps us to predict more accurately. Consequently, it
will be simpler for us to decide which car to buy or sell.
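As a minimal sketch of simple random sampling, the snippet below draws listings without replacement so that every listing has the same chance of selection. The listing IDs, the sample size of 50, and the seed are assumptions made for illustration; they are not taken from our dataset.

```python
import random

# Hypothetical listing IDs standing in for the population of adverts.
population = list(range(1, 497))

# A fixed seed (an arbitrary choice) makes the draw reproducible.
rng = random.Random(42)

# Draw 50 listings without replacement; each listing is equally likely.
sample_ids = rng.sample(population, k=50)
```

Sampling without replacement ensures that no listing is counted twice in the sample.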

Section 2
Visualizations
The graph above shows the price distribution of secondhand Ford Fiestas around B7 4JP. Most cars
fall in the price interval of £12000 to £14000. It can also be observed that manual cars are more
common than semi-automatic cars, which suggests that people prefer to buy them. Manual cars fall
in the £13000 price range and semi-automatic cars in the £14000 range.
The relationship between two Ford Fiesta attributes and how prices vary with them can be seen in
the graph above. The attributes we consider are engine size and transmission type. Cars with a
1.0 L engine come in both manual and semi-automatic variants, whereas cars with 1.5 L, 1.25 L,
and 1.1 L engines have only manual variants. Moreover, manual cars with a 1.5 L engine have the
highest price point in our area.
The association between miles driven and Ford Fiesta price can be seen in the graph above. Since
there is a modest negative linear relationship between these factors, the price of the car tends to
decrease as mileage increases.
We have created visualizations whose elements are true representations of our data. The labels in
the graphs above clearly show the values, and the axes of our graphs start from zero. Moreover, in
the bar graphs and histograms the bars follow a symmetrical order. The graphs apply the principles
of similarity (consistent colors) and connection, and they also follow the enclosure principle, which
means every graph has a border or boundary.
Section 3
Descriptive statistics
After data extraction and cleaning of the sample, we divide the variables in the dataset into two
sets: one contains all the numeric values that have a mathematical meaning, and the other contains
the non-numeric values. The numeric variables are called continuous variables and the non-numeric
ones are called categorical variables.
Continuous variables                      Categorical variables
Car price                                 Transmission type
Miles driven by the car                   Color of the car
Number of previous owners of the car
Engine size
Engine power
Carbon emission
Fuel consumption
Car year
Age (calculated as 2022 - car year)
We can calculate descriptive statistics for the numeric data only.
Descriptive statistics include the average value of a dataset (mean), its middle value (median),
and the value that occurs most frequently (mode). We also learn how far individual values deviate
from the average, and from that we calculate the standard deviation, which helps us to understand
which values fall within the normal range and which are below or above it.

For example, the descriptive statistics for car price show that the average price for our car model
is approximately £13327.
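The measures described above can be sketched with Python's standard `statistics` module. The prices below are hypothetical and are not values from our dataset, and treating "normal range" as one standard deviation either side of the mean is an assumption made for this illustration.

```python
import statistics

# Hypothetical prices in pounds; NOT values from the report's dataset.
prices = [11500, 12500, 12500, 13000, 13500, 14000, 15000]

mean_price = statistics.mean(prices)      # average value
median_price = statistics.median(prices)  # middle value
mode_price = statistics.mode(prices)      # most frequent value
std_price = statistics.stdev(prices)      # sample standard deviation

# Assumed "normal range": within one standard deviation of the mean.
lower, upper = mean_price - std_price, mean_price + std_price
below_normal = [p for p in prices if p < lower]
above_normal = [p for p in prices if p > upper]
```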
Analysis of categorical and continuous variables:
Here we compare the prices of our car model by transmission type, that is, we want to know the
average price of manual cars and the average price of semi-automatic cars.
Average price for manual = £13240 approximately
Average price for semi-automatic = £15127 approximately
Hence, we can infer that the average price of semi-automatic cars is higher than that of manual
cars.
Here we compare the prices of our car model by color, that is, we want to know the average price
of red, black, blue, silver, and white cars.
Average price for red cars = 13165
Average price for black cars = 13304
Average price for blue cars = 13065
Average price for silver cars = 136673
Average price for white cars = 12855
Hence, we can infer that silver cars have a higher average price than cars of the other colors.

Here we compare the prices of our car model by engine size, that is, we want to know the average
price of cars with 1.0 L, 1.1 L, 1.25 L, and 1.5 L engines.
Average price for 1.0 L cars = 13517
Average price for 1.1 L cars = 11098
Average price for 1.25 L cars = 8900
Average price for 1.5 L cars = 15670
Hence, we can infer that the average price of cars with a 1.5 L engine is higher than that of the
other cars.
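The group averages in this section can be computed along the following lines. This is only a sketch: the rows are hypothetical, and the helper name `mean_price_by` is introduced here for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (category, price in pounds) rows; not the report's data.
cars = [
    ("manual", 12000), ("manual", 13000), ("manual", 14000),
    ("semi-automatic", 15000), ("semi-automatic", 16000),
]

def mean_price_by(rows):
    """Group prices by their category label and average each group."""
    groups = defaultdict(list)
    for category, price in rows:
        groups[category].append(price)
    return {category: mean(values) for category, values in groups.items()}

averages = mean_price_by(cars)
# Category with the highest average price.
highest = max(averages, key=averages.get)
```

The same helper applies unchanged to color or engine size, because it only depends on the category label in each row.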
Section 4
Hypothesis
The idea behind a confidence interval is that it gives a range in which the population mean is
expected to lie. We calculate that range from the sample mean at a chosen confidence level, which
expresses how sure we are that the range contains the population mean. The range can only be
calculated under the assumption that the data are distributed around a central value. In this way
we can estimate the population mean from the sample mean. We now apply this concept to find the
range of the population mean price from the sample mean price at a confidence level of 99%.
X = sample mean = 13327, U = population mean
The calculated range is 12912 < U < 13742.
To verify this range, we check our population dataset, i.e., 496 data values.
The mean of the population dataset is approximately 13445, which lies within our range.
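A confidence interval of this kind can be sketched as follows. The sample is hypothetical, and the sketch uses the large-sample normal approximation; the exact method behind our reported interval (for example a t-based interval) is not assumed here. The last lines only check that the reported range is centered on the reported sample mean.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.99):
    """Two-sided interval for the population mean (normal approximation)."""
    n = len(sample)
    x_bar = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    # z such that the central `confidence` share of N(0, 1) lies in [-z, z].
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * standard_error
    return x_bar - margin, x_bar + margin

# Hypothetical prices in pounds; not the report's sample.
sample = [12000, 12500, 13000, 13500, 14000, 14500, 13800, 12900, 13100, 13700]
low, high = mean_confidence_interval(sample)

# The range reported in the text is symmetric around the sample mean.
reported_low, reported_high = 12912, 13742
reported_center = (reported_low + reported_high) / 2
reported_margin = (reported_high - reported_low) / 2
```

For small samples a t critical value would give a slightly wider interval than the z value used in this sketch.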

One claim is that the average car price in our sample is similar to the average UK car price taken
from motors.co.uk. To check whether this statement is right or wrong, we use a method called
hypothesis testing. In hypothesis testing we have two statements: the null hypothesis and the
alternative hypothesis. For this claim, the null hypothesis is that the average price = 13773, and
the alternative hypothesis is that the average price is not equal to 13773. To test this claim we
use a one-sample t-test. After carrying out the test, we determined that the average price of the
sample is not in line with the average price taken from the other source.
Here the p-value = 0.0065, which is much less than 0.05, so we reject the null hypothesis.
Source: https://www.motors.co.uk/
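The test above can be sketched as follows. The sample is hypothetical, and the p-value uses a normal approximation to the t distribution, which is an assumption suited to large samples only; statistical software would use the exact t distribution.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def one_sample_test(sample, mu0):
    """Return the t statistic and an approximate two-sided p-value."""
    n = len(sample)
    t_stat = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    # Normal approximation to the t distribution (assumption: large n).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value

# Hypothetical prices in pounds; not the report's sample.
sample = [13000, 13200, 13400, 13100, 13300, 13500, 13250, 13350]

# Null hypothesis: the population mean price equals 13773.
t_stat, p_value = one_sample_test(sample, mu0=13773)
reject_null = p_value < 0.05
```

A negative t statistic indicates that the sample mean lies below the hypothesized value.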
Another claim is that there is a relationship between two categorical variables, namely number of
owners and transmission type. To check whether this statement is right or wrong, we again use
hypothesis testing. Here the null hypothesis is that no relationship exists (the two variables are
independent), and the alternative hypothesis is that a relationship exists.
We use a chi-square test to find out whether there is a statistically significant relationship
between number of owners and transmission type. We first create a contingency table that
cross-tabulates these two variables, grouping the categories by their counts. The p-value is
0.941163926818371, which is greater than 0.05, so we fail to reject the null hypothesis; this means
there is no evidence of dependency between number of owners and transmission type.
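A chi-square test of independence on a contingency table can be sketched as follows. The counts are hypothetical, and the p-value helper uses a closed form that is valid only when the degrees of freedom are even, which is an assumption that holds for this 2 x 3 example.

```python
from math import exp, factorial

def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables are independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

def chi_square_p_value_even_dof(statistic, dof):
    """Upper-tail probability; this closed form holds for even dof only."""
    if dof % 2 != 0:
        raise ValueError("this closed form needs an even number of dof")
    half = statistic / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(dof // 2))

# Hypothetical counts: rows = transmission (manual, semi-automatic),
# columns = number of previous owners (1, 2, 3 or more).
table = [[40, 30, 10], [18, 16, 6]]
statistic, dof = chi_square_independence(table)
p_value = chi_square_p_value_even_dof(statistic, dof)
reject_null = p_value < 0.05
```

A large p-value means the observed counts are close to what independence would predict, so the null hypothesis of independence is not rejected.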
Section 5
Correlation Analysis
The Pearson correlation coefficient (R) evaluates the strength and direction of the linear
relationship between two variables. R's value is always between -1 and +1.
If R = 0, it does not mean that no relationship exists; it means that there is no linear
relationship. If R = 1, the variables are perfectly positively correlated, meaning the dependent
variable increases linearly with the independent variable; if R = -1, they are perfectly negatively
correlated, meaning the dependent variable decreases linearly as the independent variable increases.

https://towardsdatascience.com/the-importance-of-r-in-data-science-6b394d48fa50#:~:text=and%20its%20use-,What%20is%20r%3F,1%20and%20a%20%2B1).
For our dataset, the dependent variable is car price, and using the correlation coefficient we
determine the strength of the relationship between car price and the other car characteristics.
For 0.1 < |r| < 0.3 (that is, -0.3 < r < -0.1 or 0.1 < r < 0.3) we say there is a small strength
of association; for 0.3 < |r| < 0.5 there is a medium strength of association; and for
0.5 < |r| < 1.0 there is a large strength of association.
From the correlation matrix, we see that car price has a positive correlation with car power, which
means that a car with more power tends to be more expensive than a car with less power. Moreover,
the r value for car price and silver color falls under a small strength of association. Note that
correlation does not state how large the impact of one variable on another will be; it only conveys
the strength and direction of the relationship between variables.
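The coefficient and the strength bands used in this section can be sketched as follows. The mileage and price values are hypothetical; the label "negligible" for |r| below 0.1 and the handling of values exactly on a band boundary are assumptions, since the text does not define them.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def strength(r):
    """Map r to the small / medium / large bands, ignoring its sign."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "negligible"  # assumed label for |r| below 0.1

# Hypothetical data: miles driven and price in pounds.
mileage = [10000, 20000, 30000, 40000, 50000]
price = [15000, 14200, 13900, 12800, 12500]
r = pearson_r(mileage, price)
label = strength(r)
```

The sign of `r` gives the direction of the relationship, while `strength` looks only at its size.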

Section 6
Regression analysis
A parsimonious model is one that uses no more "things" than are essential; in the case of
parsimonious models, those "things" are parameters. Models with optimal parsimony, that is, with
just the number of predictors required to describe the data well, are called parsimonious.
https://www.statisticshowto.com/parsimonious-model/
We check our model against the principle of normality, which states that the values are
approximately normally distributed if, on a normal probability plot, they fall nearly along a
straight line at a 45-degree angle. From the above graph, we can infer that our model follows the
assumption of normality and is parsimonious in nature.
To check the adequacy of our model, it needs to satisfy three principles:
1. Principle of linearity: if the plotted points appear to lie along a straight line, the two
variables have a linear relationship.
2. Principle of homoscedasticity: the plotted points on a scatter graph are randomly scattered and
form no shape.
3. Principle of independence of errors: if the residuals in the positive region and the negative
region are almost equal, then the errors are independent.

From the scatter graph below, we see that our model follows all three principles. Hence, it
satisfies the principle of adequacy.
From the model summary table, we infer that the adjusted R-squared is 0.693, and from the graphs
above we concluded that our model is parsimonious and adequate. The adjusted R-squared can be
improved, because certain independent variables have extremely high significance values (p-values)
and may not have a substantial impact on car price. To find out which variables affect car price
the most and the least, we remove the independent variables with high significance values in the
coefficient table one at a time. In the coefficient table, es15 (engine size of 1.5 L) has the
highest significance value, 0.851, which means that it has a negligible impact on car price in our
model and needs to be removed. We repeat this process until all remaining variables have
significance values close to 0.001 and no greater than 0.05. We refer to this process of removing
a variable and rechecking the adjusted R-squared as residual analysis; it is more commonly known
as backward elimination. To gauge the magnitude of each independent variable's impact on car
price, we refer to the standardized coefficients (beta) column of the coefficient table. After
analyzing the beta column, we verify that engine size has the lowest impact on car price, so we
remove this variable first.
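The removal loop described above can be sketched as follows. The function `fit_p_values` stands for refitting the regression on the remaining predictors and reading their significance values; here it is replaced by a stub with fixed numbers, which is an assumption for illustration (in a real refit the values change after each removal). Only the 0.851 for es15 comes from our coefficient table; the other values and variable names are hypothetical.

```python
def backward_elimination(predictors, fit_p_values, threshold=0.05):
    """Drop the least significant predictor until all p-values <= threshold."""
    remaining = list(predictors)
    while remaining:
        p_values = fit_p_values(remaining)  # refit on what is left
        worst = max(remaining, key=p_values.get)
        if p_values[worst] <= threshold:
            break  # every remaining predictor is significant
        remaining.remove(worst)
    return remaining

def stub_fit(names):
    """Stand-in for a regression refit; returns fixed illustrative p-values."""
    table = {"age": 0.001, "mileage": 0.001, "power": 0.002, "es15": 0.851}
    return {name: table[name] for name in names}

kept = backward_elimination(["age", "mileage", "power", "es15"], stub_fit)
```

Removing one variable per pass matters because the significance of the remaining variables can shift once a correlated predictor is dropped.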

We continue this process until the significance values in our coefficient table are close to 0.001
and no greater than 0.05.
After conducting residual analysis, we obtained an R-squared value of 0.705, and the significance
values for all our independent variables are close to 0.001 and less than 0.05.
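Because R-squared never decreases when predictors are added, the adjusted R-squared is the fairer figure for comparing models of different sizes. A minimal sketch of the standard formula follows; the sample size of 200 and the count of 4 predictors are assumptions for illustration, not figures from our model summary.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    penalty = (n_observations - 1) / (n_observations - n_predictors - 1)
    return 1 - (1 - r_squared) * penalty

# R-squared of 0.705 with hypothetical n = 200 and 4 predictors.
adjusted = adjusted_r_squared(0.705, n_observations=200, n_predictors=4)
```

The adjusted value is always at or below the plain R-squared, and the gap widens as predictors are added relative to the number of observations.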
Section 7
Conclusion
Thus, we can conclude that about 70% of the variation in the price of a semi-automatic Ford Fiesta
with an engine size of 1.1 liters can be explained by the car's age, miles driven, power, and fuel
consumption.
This model is adequate and parsimonious; hence we recommend it to buyers and sellers so that they
have a clear view of how car prices change with different choices of engine size, power, and age
of car.
Section 8
References
