- The document demonstrates how to derive insights from automobile data using predictive analytics in R.
- It shows exploring the data to understand relationships between attributes like price, mileage, cylinders and make. Dummy variables are added and the data is cleaned.
- A logarithmic regression model is fit with price as the target and attributes as predictors. The model shows some attributes like mileage and cylinders significantly impact price.
- The model is validated by comparing predicted vs actual prices, showing the model predicts price well. Additional tests check for issues like heteroscedasticity.
- In the end, the model can be used to predict new car prices based on their attributes.
Deriving insights from automobile data using R
Deriving insights from Data – The “R”ight way
Colin Shearer once remarked, “Find me something interesting in my data is a question from hell”. A lot of
literature is published today about Big Data and Predictive Analytics. In this mad rush to find
insights from data, organizations often forget the basic paradigm – “Analysis should be guided by
business goals”. Huh, enough preaching!! Let me walk the talk.
In this blog, I wish to demonstrate a simple exercise of deriving value from data for an automobile
business. The intended audience is anyone whose brain cells start to play soccer on hearing
words like “Big Data”, “R”, “Analytics”, “Insights”; in fact, anything that makes sense out of data. Nerds,
geeks and dodos like me can all benefit from this blog.
By the time you reach the end of this blog (and manage to stay awake), here is what you will have learnt:
- First things first: be clear about what you want to dig out of the data mine.
- Identify the (independent) attributes that are useful, and how they relate to the objective of your analysis.
- Once that is answered, get a sense of the data: the type of each attribute, its nature, sample values, etc.
- Is the data workable? Are there duplicate or missing values? If so, clean the data.
- Once the data is cleansed, try to fit a model between the independent variables (the “means”) and the dependent variable (the objective, or the “end”).
- Validate the model and tune it so that the predicted values are close to the actual data; run diagnostics to confirm that the model is well built.
- You may use a smaller test set (a subset of the actual data) to validate your model before applying it to the full data set.
The automobile industry in India is going through a downturn due to a sluggish economy. Pricing a new car in
today’s scenario is one of the most challenging problems for any car manufacturer. So, a car
manufacturer has collected some data and wishes to use it to predict the price of a new car model,
or perhaps perform a price correction on existing models whose lower sales are possibly due to
pricing issues.
Problem statement:
Given the available data, can predictive analytics be used to establish how the various features of a
car impact its price, and, more importantly, how strong that relationship is?
A snapshot of the automobile price and attribute data is shown below:
Data Dictionary of the Automobile data:
Price: Suggested retail price of the car
Mileage: Number of miles the car has been driven
Make: Manufacturer of the car such as Saturn, Pontiac, and Chevrolet
Model: Specific models for each car manufacturer such as Ion, Vibe, Cavalier
Trim (of car): Specific type of car model such as SE Sedan 4D, Quad Coupe 2D
Type: Body type such as sedan, coupe, etc.
Cylinder: Number of cylinders in the engine
Liter: A more specific measure of engine size
Doors: Number of doors
Cruise: Indicator variable representing whether the car has cruise control (1 = cruise)
Sound: Indicator variable representing whether the car has upgraded speakers (1 = upgraded)
Leather: Indicator variable representing whether the car has leather seats (1 = leather)
I have used the open-source language R to perform the analysis. I would be happy to share
the source code (mail me at shrivastav.gaurav@hotmail.com).
Data Exploration:
So the first step is to get a sense of what the data looks like. Here is a snapshot of the data:
There are 804 rows of data with various columns like Price, Mileage, Make
and so on. Make has sample values like Buick, etc.
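For readers following along, this first look can be reproduced with a few lines of R. The file name `cars.csv` below is an assumption; substitute your actual data file.

```r
# First look at the data. The file name "cars.csv" is an assumption;
# if it is absent, two made-up rows keep the sketch runnable.
if (file.exists("cars.csv")) {
  auto <- read.csv("cars.csv")
} else {
  auto <- data.frame(Price   = c(17314, 20998),
                     Mileage = c(8221, 19095),
                     Make    = c("Buick", "Chevrolet"))
}

dim(auto)       # rows and columns (804 rows in the full data set)
str(auto)       # type, nature and sample values of each attribute
head(auto)      # first few rows: Price, Mileage, Make, ...
```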
This does not give a clear picture of how Mileage governs the car price, but the dense region in the graph
indicates that several cars have a close Price~Mileage relationship. In fact, the scatter
suggests that apart from Mileage, there are other related variables/attributes (like the
number of Cylinders, the engine size in Liters, and so on) which also influence the price of the car.
The graph above indicates that price increases with the number of cylinders (with 4 cylinders
the price range is from 1000 to 4000, with 6 cylinders the price range is from 2000 to 4500, and so
on).
The graph below again suggests an increase in Price with an increase in engine size (in Liters).
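Plots like these come from a few lines of base R graphics. A sketch, with a handful of made-up rows standing in for the real 804-row data set:

```r
# The three exploration plots. The rows below are made up for
# illustration; the real data set has 804 rows.
auto <- data.frame(
  Price    = c(17314, 20998, 29142, 30731, 13270, 33359),
  Mileage  = c(8221, 19095, 13196, 22740, 18460, 10803),
  Cylinder = c(6, 6, 8, 8, 4, 8),
  Liter    = c(3.1, 3.1, 5.7, 5.7, 2.2, 5.7)
)

plot(auto$Mileage, auto$Price,
     xlab = "Mileage", ylab = "Price")       # scatter: Price vs Mileage

boxplot(Price ~ Cylinder, data = auto,
        xlab = "Cylinders", ylab = "Price")  # price rises with cylinders

plot(auto$Liter, auto$Price,
     xlab = "Engine size (Liter)", ylab = "Price")
```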
Now, connecting the dots, I have taken just a few attributes from the data set to see their relationship
with Price and how they are interrelated.
A data scientist needs to do similar data exploration to better understand the nuances of the data set
before getting into further analysis.
Here we add a few more dummy columns to the data set. These dummy columns are required because
a few attributes are Boolean in nature, or are discrete with numeric values (1, 2, 3, ...).
Using such attributes in further analysis requires transforming them into dummy variables, so that
values like 0, 1 and 2 are treated as distinct categories; not doing so would bias the results.
Mathematically 1 is greater than 0, and 2 is greater than 1, but for our analysis 0, 1 and 2 are
just discrete cases where “1” has no more significance than “0”; they are simply different values of
an attribute, like apples and oranges among fruits, not a small apple, a big apple and an even bigger
apple.
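In R, converting such columns to factors does this expansion for us. A minimal sketch on made-up rows (the values below are illustrative only):

```r
# Dummy-variable handling: discrete codes are categories, not
# magnitudes. The rows below are made up for illustration.
auto <- data.frame(
  Cylinder = c(4, 6, 6, 8),
  Cruise   = c(0, 1, 1, 1),
  Make     = c("Saturn", "Buick", "Pontiac", "Chevrolet")
)
auto$Cylinder <- factor(auto$Cylinder)  # 4/6/8 become levels, not numbers
auto$Cruise   <- factor(auto$Cruise)    # 0/1 flag becomes a category
auto$Make     <- factor(auto$Make)

# model.matrix() expands each factor into 0/1 dummy columns
dummies <- model.matrix(~ Cylinder + Cruise + Make, data = auto)
colnames(dummies)   # e.g. Cylinder6, Cylinder8, Cruise1, MakeChevrolet, ...
```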
Data Cleansing:
Fortunately there were no missing values in the data; otherwise the missing values would have to be
plugged in using one of many methods. The easiest is to ignore the data tuples that have missing values;
alternatives are using the average of the remaining values, or some more scientific method based on the need.
Any duplicate records also have to be removed.
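These checks are one-liners in R; a sketch on made-up rows:

```r
# Cleansing checks on made-up rows: drop missing values (the simplest
# of the methods above) and remove duplicate records.
auto <- data.frame(
  Price   = c(17314, 20998, 20998, NA),
  Mileage = c(8221, 19095, 19095, 22740)
)

sum(is.na(auto))                    # count missing values: 1 here
auto <- na.omit(auto)               # drop tuples with missing values
auto <- auto[!duplicated(auto), ]   # drop exact duplicate records
nrow(auto)                          # 2 rows survive the cleansing
```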
Fitting the Model:
Looking at the data, Price and Mileage are numeric values of high order, whereas the other
attributes like Model, Trim, Type etc. also have discrete numeric values, but not of the same order as Price
and Mileage. Hence, instead of a linear model, a logarithmic model will be a better fit.
Here is the sample R code to generate the model.
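The original code snapshot is not reproduced here, but the fitting call has roughly this shape: log(Price) as the target and the prepared attributes as predictors. Simulated rows stand in for the real data, and the coefficients used to simulate them are arbitrary.

```r
# Sketch of the logarithmic model fit on simulated data.
set.seed(42)
n <- 100
auto <- data.frame(
  Mileage  = runif(n, 1000, 40000),
  Cylinder = factor(sample(c(4, 6, 8), n, replace = TRUE)),
  Doors    = factor(sample(c(2, 4), n, replace = TRUE)),
  Cruise   = factor(sample(0:1, n, replace = TRUE)),
  Sound    = factor(sample(0:1, n, replace = TRUE)),
  Leather  = factor(sample(0:1, n, replace = TRUE))
)
auto$Price <- exp(9.5 - 1e-05 * auto$Mileage +
                  0.2 * (auto$Cylinder == "8") + rnorm(n, sd = 0.1))

fit <- lm(log(Price) ~ Mileage + Cylinder + Doors +
            Cruise + Sound + Leather, data = auto)
summary(fit)   # coefficients, significance stars and R-squared
```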
The result seems to show some correlation between the various attributes and the car price.
An R-squared value close to 1 is indicative of the accuracy of the model.
However, at the 95% significance level we see that only a few variables (attributes like Mileage,
Cylinder and so on) are significant. So we have to revisit the model.
So the model can be described as:
Price = Function(Mileage, Cylinder, Doors, and some specific Makes like Buick, Types like Convertible
and Hatchback, and so on)
Now the result indicates that we have a better model, with all the attributes having a significant
impact on the car price at the 95% significance level.
Model Validation:
To test how well our model can predict the price of a car, we plot the actual price of the car
(available in the data set) against the predicted price (derived from our model). The green line is the
actual price and the red line is the predicted price.
The model seems to be doing a good job of predicting the price!!
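The actual-vs-predicted plot can be sketched as follows; a toy model on simulated data stands in for the real one, with green for actual and red for predicted as described above.

```r
# Actual vs predicted prices; toy model on simulated data.
set.seed(1)
n <- 50
auto <- data.frame(Mileage = runif(n, 1000, 40000))
auto$Price <- exp(9.5 - 1e-05 * auto$Mileage + rnorm(n, sd = 0.1))
fit <- lm(log(Price) ~ Mileage, data = auto)

pred <- exp(predict(fit))   # back-transform from the log scale
plot(auto$Price, type = "l", col = "green",
     xlab = "Car index", ylab = "Price")
lines(pred, col = "red")
```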
There are some other tests that can be performed to check the soundness of the model. I am putting
just a few of them here for the reader’s benefit (to avoid headache ;) ).
As seen in the graph below, the distribution above and below the line is much the same, implying that the
model is free from heteroscedasticity (a tongue twister indeed!!), i.e. the error term variance is constant.
The graph below checks for Normality, Independence, Linearity and Homoscedasticity [again a tongue
twister!].
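In R, these diagnostic graphs come straight from calling `plot()` on the fitted model; a sketch with a toy model standing in for the real one:

```r
# The four standard lm() diagnostics: residuals vs fitted (checks for
# heteroscedasticity), normal Q-Q (normality), scale-location and
# leverage. A toy model stands in for the real one.
set.seed(2)
x <- runif(60, 0, 10)
y <- exp(1 + 0.3 * x + rnorm(60, sd = 0.2))
fit <- lm(log(y) ~ x)

par(mfrow = c(2, 2))   # four diagnostic panels on one page
plot(fit)
```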
I do not intend to go into the depths of these graphs, as the subject is quite dry and, honestly
speaking, I too am learning the tricks of this trade!!
Last but not least, if these tests run fine and give good results on the smaller test set, you may run
the model on the much bigger actual data set to realize its outcome.
Nevertheless, for those who have started to snooze, it’s time to say “Enjoy your Sleep”!