Deriving insights from Data – The “R”ight way

Colin Shearer once remarked, “‘Find me something interesting in my data’ is a question from hell.” A lot of
literature is being published today about Big Data and predictive analytics. In this mad rush to find
insights from data, organizations often forget the basic paradigm: “Analysis should be guided by
business goals.” Huh, enough of gyan!! Let me walk the talk.
In this blog, I wish to demonstrate a simple exercise in deriving value from data for an automobile
business. The intended audience is anyone whose brain cells start to play soccer on hearing
words like “Big Data”, “R”, “Analytics”, “Insights”, in fact anything to do with making sense of data. Nerds,
geeks and dodos like me can all benefit from this blog.
By the time you reach the end of the blog, if you manage to stay awake, here is what you will have learned:


• First things first: be clear about what you want to dig out of the data mine.
• Identify the (independent) attributes that are useful and how they relate to the objective of your analysis.
• Once that is answered, get a sense of the data: the types of the attributes, their nature, sample values and so on.
• Check whether the data is workable: are there duplicate values or missing values? If so, clean the data.
• Once the data is cleansed, try to fit a model between the independent variables (the “means”) and the dependent variable (the objective, or the “end”).
• Validate the model and make it accurate, so that the predicted values are close to the actual data; run diagnostics to confirm that the model is well built.
• You may use a smaller test data set (a subset of the actual data set) to validate your model before applying it to the actual data set.

The automobile industry in India is going through a downturn due to a sluggish economy. Pricing a new car in
today’s scenario is one of the most challenging problems for any car manufacturer. So, the car
manufacturer has collected some data and wishes to use it to predict the price of a new car model, or
maybe to correct the price of existing models whose sales are low, possibly because of a
pricing issue.
Problem statement:
Given the available data, can predictive analytics be used to establish how the various features of a car
impact its price and, more importantly, how strong that relationship is?
A snapshot of the automobile price data and its various attributes is shown below:

Data Dictionary of the Automobile data:

• Price: Suggested retail price of the car
• Mileage: Number of miles the car has been driven
• Make: Manufacturer of the car such as Saturn, Pontiac, and Chevrolet
• Model: Specific models for each car manufacturer such as Ion, Vibe, Cavalier
• Trim (of car): Specific type of car model such as SE Sedan 4D, Quad Coupe 2D
• Type: Body type such as sedan, coupe, etc.
• Cylinder: Number of cylinders in the engine
• Liter: A more specific measure of engine size
• Doors: Number of doors
• Cruise: Indicator variable representing whether the car has cruise control (1 = cruise)
• Sound: Indicator variable representing whether the car has upgraded speakers (1 = upgraded)
• Leather: Indicator variable representing whether the car has leather seats (1 = leather)

I have used the open source language R to perform the analysis. I would be happy to share
the source code (mail me at shrivastav.gaurav@hotmail.com).
Data Exploration:
So the first step is to get a sense of how the data looks. Here is a snapshot of the data:

This is how the data looks. There are 804 rows of data with columns such as Price, Mileage, Make
and so on. Make has sample values like Buick.

The graph of Price against Mileage does not give a clear picture of how mileage governs the car price, but the
dense region in the graph indicates that several cars have a similar Price~Mileage relationship. In fact, the
scatter suggests that apart from mileage there are other related attributes (like the
number of cylinders, the engine size in liters and so on) which also influence the price of the car.

The above graph indicates that price increases with the number of cylinders (with 4 cylinders
the price range is from 1000 to 4000, with 6 cylinders it is from 2000 to 4500, and so
on).
The graph below again suggests that price increases with engine size (in liters).

Now, connecting the dots: I have taken just a few attributes from the data set to see their relationship
with Price and also how they are interrelated.
A data scientist needs to do similar data exploration to better understand the nuances of the data set
before getting into further analysis.
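For readers who want to try this kind of exploration themselves, here is a minimal sketch in R. The data is simulated (the numbers are invented and are not the 804-row data set used in this blog); only the column names are borrowed from the data dictionary.

```r
# Simulated stand-in for the car data. Column names follow the data
# dictionary, but every value here is invented for illustration.
set.seed(1)
n <- 200
cylinder <- sample(c(4, 6, 8), n, replace = TRUE)
cars <- data.frame(
  Mileage  = round(runif(n, 500, 50000)),
  Cylinder = cylinder,
  Liter    = round(cylinder / 2 + rnorm(n, sd = 0.3), 1)
)
cars$Price <- 8000 + 3000 * cars$Cylinder - 0.15 * cars$Mileage +
  rnorm(n, sd = 1500)

str(cars)      # column types
summary(cars)  # ranges and quartiles
head(cars)     # first few rows

# Lowest and highest price for each cylinder count
price_by_cyl <- aggregate(Price ~ Cylinder, data = cars, FUN = range)
print(price_by_cyl)

# Pairwise correlations between the numeric attributes
cor_matrix <- cor(cars[, c("Price", "Mileage", "Cylinder", "Liter")])
print(round(cor_matrix, 2))

# Scatter-plot matrix, drawn only in an interactive session
if (interactive()) pairs(cars[, c("Price", "Mileage", "Cylinder", "Liter")])
```

The correlation matrix and the scatter-plot matrix answer the same two questions as the graphs above: how each attribute relates to Price, and how the attributes relate to each other.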
Here we added a few dummy columns to the data set. These are required because
a few attributes are Boolean in nature or are categories coded with numeric values (1, 2, 3, ...).
Using such attributes in further analysis requires transforming them into dummy variables so that
0, 1, 2 are treated as distinct categories; if this is not done, it will lead to bias in the results.
Mathematically 1 is greater than 0, and 2 is greater than 1, but for our analysis 0, 1 and 2 are
just discrete cases where “1” does not have more significance than “0”. They are simply different values of
an attribute, like apples and oranges among fruits, and not like a small apple, a big apple and an even bigger
apple.
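A small sketch of dummy coding in R, on a made-up six-row table (the rows are hypothetical; the column names come from the data dictionary). Base R's `factor()` and `model.matrix()` do the work:

```r
# Toy data: Doors and Type are categories, even though Doors is stored
# as a number. The rows are invented for illustration.
cars <- data.frame(
  Price = c(17000, 21000, 30000, 15000, 26000, 19000),
  Doors = c(2, 4, 4, 2, 4, 4),
  Type  = c("Coupe", "Sedan", "Convertible", "Hatchback", "Sedan", "Hatchback")
)

# Declaring the columns as factors tells R to treat the values as labels,
# not as quantities that can be ordered or averaged.
cars$Doors <- factor(cars$Doors)
cars$Type  <- factor(cars$Type)

# model.matrix() expands each factor into 0/1 dummy columns.
# One level per factor is dropped and acts as the reference level.
dummies <- model.matrix(~ Doors + Type, data = cars)
print(dummies)
```

Note that `lm()` performs the same expansion automatically for any column that is a factor, so in practice it is often enough to convert the columns with `factor()` before fitting.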
Data Cleansing:
Fortunately there were no missing values in the data; otherwise they would have to be
filled in using one of many methods. The easiest is to ignore the rows that have missing values, or
to use the average of the remaining values, or to use some more scientific method based on the need.
If there are any duplicate records, they have to be removed too.
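The cleansing options mentioned above could look roughly like this in R. The five rows below are invented, with a missing value and a duplicate planted on purpose:

```r
# Toy data with planted problems: two missing values and one duplicate row.
cars <- data.frame(
  Price   = c(17000, 21000, NA, 15000, 21000),
  Mileage = c(8000, 12000, 20000, NA, 12000),
  Make    = c("Buick", "Saturn", "Pontiac", "Chevrolet", "Saturn")
)

# Count the missing values in each column
print(colSums(is.na(cars)))

# Option 1: drop every row that has a missing value
complete_rows <- cars[complete.cases(cars), ]

# Option 2: replace missing mileage with the mean of the observed values
imputed <- cars
imputed$Mileage[is.na(imputed$Mileage)] <- mean(imputed$Mileage, na.rm = TRUE)

# Remove exact duplicate rows (row 5 repeats row 2)
deduplicated <- complete_rows[!duplicated(complete_rows), ]
print(deduplicated)
```

Which option is appropriate depends on how much data is missing and why; mean imputation is only the simplest choice.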
Fitting the Model:
Looking at the data, Price and Mileage are numeric values of a high order of magnitude, whereas the other
attributes like Model, Trim, Type etc. take discrete numeric values that are not of the same order as Price
and Mileage. Hence, instead of a plain linear model, a logarithmic model will be a better fit.

Here is the sample code in R Language to generate the model.
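The original listing is not reproduced in this text, so what follows is only a minimal sketch of what such a fit might look like, not the code actually used. It assumes that "logarithmic model" means a linear regression on log(Price), and it runs on simulated data with column names taken from the data dictionary.

```r
# Simulated data; the price process below is invented so that
# log(Price) is linear in the predictors.
set.seed(42)
n <- 300
cars <- data.frame(
  Mileage  = round(runif(n, 500, 50000)),
  Cylinder = sample(c(4, 6, 8), n, replace = TRUE),
  Type     = factor(sample(c("Sedan", "Coupe", "Convertible", "Hatchback"),
                           n, replace = TRUE))
)
cars$Price <- exp(9.3 + 0.12 * cars$Cylinder - 0.000008 * cars$Mileage +
                  0.25 * (cars$Type == "Convertible") + rnorm(n, sd = 0.1))

# Regress log(Price) on the attributes. Type is a factor, so lm()
# creates the dummy variables for it automatically.
fit <- lm(log(Price) ~ Mileage + Cylinder + Type, data = cars)

# Coefficients, standard errors, p-values and R-squared
print(summary(fit))
```

Because the response is log(Price), each coefficient is read as an approximate proportional change in price per unit change in the attribute, rather than a change in currency units.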
The result seems to show that several of the attributes are correlated with, and have some impact on,
the car price.

An R-squared value close to 1 indicates that the model explains most of the variation in price.
However, at the 95% confidence level (a significance level of 0.05), we see only a few variables (attributes like
Mileage, Cylinder and so on) to be significant. So we have to revisit the model.
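One simple way to revisit a model is to read the p-values off the fit and refit with only the significant terms. The sketch below does this on simulated data; the `Sound` column is deliberately generated with no effect on price, and the helper column `LogPrice` is introduced here only for convenience.

```r
# Simulated data: Mileage and Cylinder drive the price, Sound does not.
set.seed(7)
n <- 300
cars <- data.frame(
  Mileage  = round(runif(n, 500, 50000)),
  Cylinder = sample(c(4, 6, 8), n, replace = TRUE),
  Sound    = rbinom(n, 1, 0.5)
)
cars$Price <- exp(9.3 + 0.12 * cars$Cylinder - 0.000008 * cars$Mileage +
                  rnorm(n, sd = 0.1))
cars$LogPrice <- log(cars$Price)

full <- lm(LogPrice ~ Mileage + Cylinder + Sound, data = cars)

# Extract the p-value of each coefficient
p_values <- summary(full)$coefficients[, "Pr(>|t|)"]
print(round(p_values, 4))

# Keep the terms with p < 0.05 (Sound will usually, though not always, drop out)
significant <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")

# Refit using only those terms
reduced <- lm(reformulate(significant, response = "LogPrice"), data = cars)
print(summary(reduced))
```

Dropping terms purely on p-values is a simplification; with dummy-coded factors one would normally keep or drop the whole factor rather than individual levels.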

So the model can be described as:
Price = Function(Mileage, Cylinder, Doors, some specific makes like Buick, types like Convertible and
Hatchback, and so on)
Now the result indicates that we have a better model, with all the attributes having a significant
impact on the car price (at or beyond the 95% confidence level).
Model Validation:
To test how well our model can predict the price of a car, we plot the actual price of the car
(available in the data set) against the predicted price (derived from our model). The green line is the
actual price and the red line is the predicted price.

The model seems to be doing a good job of predicting the price!!
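A sketch of this comparison in R, again on simulated data and assuming the model was fitted on log(Price). The colours follow the description above; the sort order and the RMSE summary are my own additions.

```r
# Simulated data and a log-linear fit, as a stand-in for the real model.
set.seed(3)
n <- 200
cars <- data.frame(
  Mileage  = round(runif(n, 500, 50000)),
  Cylinder = sample(c(4, 6, 8), n, replace = TRUE)
)
cars$Price <- exp(9.3 + 0.12 * cars$Cylinder - 0.000008 * cars$Mileage +
                  rnorm(n, sd = 0.1))
fit <- lm(log(Price) ~ Mileage + Cylinder, data = cars)

# The model predicts log(Price), so exponentiate to get back to price units
predicted <- exp(fitted(fit))
actual    <- cars$Price

# Two numeric summaries of how close the predictions are
rmse <- sqrt(mean((actual - predicted)^2))
cat("RMSE:", round(rmse), "\n")
cat("Correlation:", round(cor(actual, predicted), 3), "\n")

# Actual (green) and predicted (red) prices, drawn only interactively
if (interactive()) {
  ord <- order(actual)
  plot(actual[ord], type = "l", col = "green",
       xlab = "Cars sorted by actual price", ylab = "Price")
  lines(predicted[ord], col = "red")
  legend("topleft", legend = c("Actual", "Predicted"),
         col = c("green", "red"), lty = 1)
}
```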
There are some other tests which are performed to check the soundness of the model. I am putting
just a few of them here for the reader’s benefit (to avoid a headache ;) )
As seen in the graph below, the distribution of points above and below the line is much the same, implying that the
model is free from heteroscedasticity (a tongue twister indeed!!), that is, the error-term variance is constant.

The graph below checks for normality, independence, linearity and homoscedasticity (again a tongue
twister).
I do not intend to go into the depth of these graphs, as the subject is quite dry and, honestly speaking, I
too am learning the tricks of this trade!!
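For the curious, here is a hedged sketch of how such diagnostics can be run in base R on a simulated fit. `plot()` on an `lm` object draws the standard diagnostic graphs; the constant-variance check is a hand-rolled Breusch-Pagan style test (n times the R-squared of a regression of the squared residuals on the predictors), which is an illustration and not necessarily the test behind the graphs above.

```r
# Simulated data and fit, standing in for the real model.
set.seed(5)
n <- 200
cars <- data.frame(
  Mileage  = round(runif(n, 500, 50000)),
  Cylinder = sample(c(4, 6, 8), n, replace = TRUE)
)
cars$Price <- exp(9.3 + 0.12 * cars$Cylinder - 0.000008 * cars$Mileage +
                  rnorm(n, sd = 0.1))
fit <- lm(log(Price) ~ Mileage + Cylinder, data = cars)
res <- residuals(fit)

# Normality of the residuals (a small p-value suggests non-normality)
print(shapiro.test(res))

# Constant variance: regress squared residuals on the predictors.
# n * R-squared is compared with a chi-square on 2 degrees of freedom
# (one per predictor); a small p-value suggests heteroscedasticity.
aux     <- lm(res^2 ~ Mileage + Cylinder, data = cars)
bp_stat <- nrow(cars) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 2, lower.tail = FALSE)
cat("BP statistic:", round(bp_stat, 3), " p-value:", round(bp_p, 3), "\n")

# The four standard diagnostic plots, drawn only interactively
if (interactive()) {
  par(mfrow = c(2, 2))
  plot(fit)
}
```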
Last but not least, if these tests run fine and give good results on the smaller test set, you may run the model
on the much bigger actual data set to realize its outcome.
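One common way to set up such a test set is to hold out a random slice of the data, fit on the rest, and measure the error on the held-out rows. A minimal sketch on simulated data follows; the 80/20 split is an arbitrary choice of mine.

```r
# Simulated data standing in for the full data set.
set.seed(11)
n <- 400
cars <- data.frame(
  Mileage  = round(runif(n, 500, 50000)),
  Cylinder = sample(c(4, 6, 8), n, replace = TRUE)
)
cars$Price <- exp(9.3 + 0.12 * cars$Cylinder - 0.000008 * cars$Mileage +
                  rnorm(n, sd = 0.1))

# Hold out a random 20% of the rows as the test set
test_idx <- sample(n, size = round(0.2 * n))
train <- cars[-test_idx, ]
test  <- cars[test_idx, ]

# Fit on the training rows only
fit <- lm(log(Price) ~ Mileage + Cylinder, data = train)

# Predict prices for rows the model has never seen
pred <- exp(predict(fit, newdata = test))

# Error on the held-out rows, in price units
rmse_test <- sqrt(mean((test$Price - pred)^2))
cat("Test RMSE:", round(rmse_test), "\n")
```

If the error on the held-out rows is close to the error on the training rows, the model is less likely to be overfitting.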
Nevertheless, for those who have started to snooze, it’s time to say “Enjoy your Sleep”!
