Data Visualization for Automobile Dataset

Santosh Kumaravel Sundaravadivelu S3729461
Practical Data Science
Assignment – 1
Task 1: Data Preparation:
Step 1.1 - Importing Libraries and Data
Libraries such as Pandas, NumPy and Matplotlib are imported to make data structures and data
analysis easier to work with in Python. The data set is imported into the Jupyter notebook so that
it can be analysed and yield useful insights. It is imported with the “read_csv” function, using “#”
as the separator and supplying the corresponding column names. There are 238 observations and 26
variables. The imported data is checked with the head function to confirm that the source file and
the imported dataset my_data_automobile are the same.
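
The import described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the three in-memory rows stand in for the real file, only four of the 26 column names are shown, and the assumption that the file has no header row is mine.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real "#"-separated file.
raw = io.StringIO(
    "3#alfa-romero#gas#13495\n"
    "1#audi#gas#13950\n"
    "0#bmw#diesel#16430\n"
)

# Illustrative subset of the 26 column names supplied at import time.
column_names = ["symboling", "make", "fuel-type", "price"]

# Assumption: the file carries no header row, so the names are passed in.
my_data_automobile = pd.read_csv(raw, sep="#", header=None, names=column_names)

print(my_data_automobile.shape)   # (rows, columns)
print(my_data_automobile.head())  # compare these rows against the source file
```

With the real file, the path would replace `raw`, and `shape` would be expected to report the 238 observations and 26 variables mentioned above.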
Step 1.2 – Removing White Space
White space is present in my_data_automobile. The remove_whitespace function checks every column
for leading and trailing white space and removes it using the strip function. All text is also
converted to lower case so that values differing only in capitalisation are treated as one. For
example, the “fuel-type” column contains gas, Gas, diesel and Diesel, which reduce to two
categories once everything is lower case.
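
One possible shape for remove_whitespace is sketched below. The function name comes from the text; its body, the sample values and the choice to touch only text columns are assumptions.

```python
import pandas as pd


def remove_whitespace(df):
    """Return a copy with every text column stripped and lower-cased."""
    cleaned = df.copy()
    for col in cleaned.columns:
        # Only text columns have the .str accessor; numeric columns are skipped.
        if pd.api.types.is_string_dtype(cleaned[col]):
            cleaned[col] = cleaned[col].str.strip().str.lower()
    return cleaned


# Hypothetical sample showing the fuel-type variants mentioned above.
sample = pd.DataFrame({
    "fuel-type": [" gas", "Gas ", "diesel", " Diesel "],
    "city-mpg": [21, 24, 30, 38],
})

cleaned = remove_whitespace(sample)
print(cleaned["fuel-type"].unique())
```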
Step 1.3 – Typo Errors
While analysing the “symboling” column, a value was found outside the range of -3 to 3; we treat
this as a typo (explained in Step 1.4). There are further typos in the make, aspiration and
num-of-doors columns, which are handled with the str.replace function by substituting the correct
spelling.
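
A small sketch of the str.replace approach follows. The report does not list the actual misspellings, so the ones below are hypothetical placeholders chosen only to show the mechanics.

```python
import pandas as pd

# Hypothetical rows; the misspelled values are invented for illustration.
sample = pd.DataFrame({
    "make": ["volvo", "vol00112ov", "audi"],
    "aspiration": ["std", "turrrrbo", "turbo"],
    "num-of-doors": ["four", "fourr", "two"],
})

# Per-column mapping from a misspelling to the intended spelling.
corrections = {
    "make": {"vol00112ov": "volvo"},
    "aspiration": {"turrrrbo": "turbo"},
    "num-of-doors": {"fourr": "four"},
}

for column, fixes in corrections.items():
    for wrong, right in fixes.items():
        # regex=False treats the pattern as literal text, not a regular expression.
        sample[column] = sample[column].str.replace(wrong, right, regex=False)

print(sample)
```

Because str.replace matches substrings, each misspelling should be specific enough that it cannot occur inside a correct value.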
Step 1.4 – Sanity checks for Impossible values
Impossible values (outliers) in this dataset are values that fall outside the given range. While
analysing “my_data_automobile”, the symboling column was found to contain a value that is not
between -3 and 3. We treat it as a typo and replace it with the nearest valid value, which is 3,
because symboling plays a vital role in the relationship with other data.
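
The range check and the nearest-value replacement can be sketched as below. The series is a hypothetical sample, and using clip to snap a value to the nearest bound is one way to do it, not necessarily the one used in the notebook.

```python
import pandas as pd

# Hypothetical symboling values; the 4 plays the role of the impossible entry.
symboling = pd.Series([3, 1, 0, -2, 4, 2], name="symboling")

# Flag anything outside the documented range of -3 to 3 (bounds inclusive).
out_of_range = ~symboling.between(-3, 3)
print(symboling[out_of_range])

# clip() moves each out-of-range value to the nearest valid bound.
fixed = symboling.clip(lower=-3, upper=3)
print(fixed.tolist())
```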
Step 1.5 – Checking Missing Values
Missing values are checked using isnull().sum(). The price column contains values of 0, and on
checking the dataset only Volvo has them, so the 0 prices for Volvo are replaced with the mean
price for that make. The mean is taken per make because the price of a BMW and a Holden cannot be
assumed to be the same, given the brand name. If a missing value is found for a different make of
car, the mean of the column is used instead. That mean is calculated over the entire dataset
“my_data_automobile” and applied using the fillna() function with axis=0. At this stage there are
still missing values in the “num-of-doors” column, so they are removed using the dropna function
with the how='all' argument.
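
The steps above can be sketched as follows. The rows are hypothetical, and computing the Volvo mean from the non-zero Volvo prices is my reading of the text rather than something it states outright.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one zero Volvo price and two genuine gaps.
sample = pd.DataFrame({
    "make": ["volvo", "volvo", "volvo", "bmw", "audi"],
    "price": [12940.0, 0.0, 16000.0, 16430.0, np.nan],
    "horsepower": [114.0, 114.0, np.nan, 101.0, 110.0],
})

# Count missing values per column before any repair.
missing_before = sample.isnull().sum()
print(missing_before)

# Assumption: zero Volvo prices take the mean of the non-zero Volvo prices.
is_volvo = sample["make"] == "volvo"
volvo_mean = sample.loc[is_volvo & (sample["price"] > 0), "price"].mean()
sample.loc[is_volvo & (sample["price"] == 0), "price"] = volvo_mean

# Remaining numeric gaps are filled with the mean of their column.
numeric_cols = sample.select_dtypes(include="number").columns
sample[numeric_cols] = sample[numeric_cols].fillna(sample[numeric_cols].mean())

# how="all" drops only rows in which every value is missing.
sample = sample.dropna(how="all")
print(sample)
```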

Task 2: Data Exploration:
Step 2.1.1 – Safety First (Considering one Ordinal Value)
Safety should be given high priority when a car is purchased by the end consumer. A histogram was
chosen for these ordinal values because it gives a clear picture of the frequency of each safety
level. There is no car at level -3, which suggests that it is very hard to build a perfectly safe
car. There are a few cars at safety level -2, which are the ones to consider since the dataset
contains no perfect cars. The frequency is highest at safety level 0, which implies that most
manufacturers avoid compromising on safety by keeping the value from rising above 0. There are a
few cars at level 3, which suggests they offer no safety whatsoever.
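
A histogram of this kind could be drawn as sketched below. The values are hypothetical, and the half-integer bin edges are my choice so that each bar covers exactly one safety level.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical symboling values after cleaning.
symboling = pd.Series([0, 0, 0, 1, 1, 2, 3, -1, -2, 0, 1, 2])

# Edges at -3.5, -2.5, ..., 3.5 give one bin per integer level from -3 to 3.
bins = [level - 0.5 for level in range(-3, 5)]

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(symboling, bins=bins, edgecolor="black")
ax.set_xticks(range(-3, 4))
ax.set_xlabel("symboling (safety level)")
ax.set_ylabel("number of cars")
ax.set_title("Frequency of each safety level")

print(dict(zip(range(-3, 4), counts.astype(int))))
```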
Step 2.1.2 – Type Matters (Considering one Nominal value)
Body shape plays a vital role in selecting the type of car for a specific audience. For example,
the sedan (52.97%) suits a family of four. The hatchback (29.66%) suits a family that needs extra
boot space. The wagon (11.44%) is for more than four people. The hardtop (3.39%) is for people who
enjoy camping. The convertible (2.54%) is for people who are very fond of the sun. Manufacturers
strive to satisfy this wide variety of audiences. According to the given dataset, sedans are
manufactured in greater numbers than any other body type. A pie chart was chosen because it is good
for comparing the percentage share of each body type.
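
The percentages and the pie chart can be reproduced along these lines. The counts below are illustrative values that I back-calculated so the shares match the figures quoted above; they are not taken from the dataset itself.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Assumed counts, chosen only so the shares match the reported percentages.
body_style = pd.Series(
    ["sedan"] * 125 + ["hatchback"] * 70 + ["wagon"] * 27
    + ["hardtop"] * 8 + ["convertible"] * 6
)

# normalize=True returns fractions; scale to percentages with two decimals.
shares = (body_style.value_counts(normalize=True) * 100).round(2)
print(shares)

fig, ax = plt.subplots()
ax.pie(shares.values, labels=list(shares.index), autopct="%1.2f%%")
ax.set_title("Share of each body style")
```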

Step 2.1.3 – Mileage in City (Considering one Numeric value)
Mileage is inversely proportional to the maintenance of the car, except for newer cars. There is a
threshold at which maximum mileage is achieved for certain types of cars. For example, a car built
for maximum performance will not deliver the expected mileage, because a compromise is made
according to the preferences of the individual buyer. Based on the dataset, the maximum city
mileage is 49 and the minimum is 13. The average mileage is around 24.6. According to the figure,
75% of cars fall at or below 28.7.
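
Summary figures like these are typically read off describe(). The sketch below uses a hypothetical city-mpg sample, so only its minimum and maximum match the values quoted above.

```python
import pandas as pd

# Hypothetical city mileage values spanning the reported minimum and maximum.
city_mpg = pd.Series([13, 19, 21, 24, 24, 27, 31, 38, 49], name="city-mpg")

# describe() returns count, mean, std, min, quartiles and max in one call.
summary = city_mpg.describe()
print(summary[["min", "mean", "75%", "max"]])
```

The "75%" entry is the third quartile: three quarters of the observations lie at or below it.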
Step 2.2.1 – Horsepower and Highway Mileage (Hypothesis 1)
The main aim of this hypothesis is to analyse the relationship between two columns in the dataset.
Comparing horsepower with highway mileage, a car with high horsepower has low highway mileage. The
observation is based on the pattern generated in the scatter plot, and a scatter plot was chosen
because it reveals such a pattern. In fig 2.2.1 the points run from high horsepower with low
mileage to low horsepower with high mileage. High horsepower is mostly found in high-cost cars. So,
for the best highway mileage, cars with lower horsepower should be considered.
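
A scatter plot of the two columns, with a correlation coefficient as a numeric check on the visual pattern, might look like this. The data points are hypothetical, and the Pearson coefficient is an addition of mine that the report does not mention.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical pairs following the high-horsepower, low-mileage pattern.
sample = pd.DataFrame({
    "horsepower": [48, 68, 88, 111, 154, 200, 262],
    "highway-mpg": [53, 38, 34, 27, 25, 23, 17],
})

fig, ax = plt.subplots()
ax.scatter(sample["horsepower"], sample["highway-mpg"])
ax.set_xlabel("horsepower")
ax.set_ylabel("highway-mpg")
ax.set_title("Horsepower vs highway mileage")

# A negative Pearson r supports the inverse relationship seen in the plot.
r = sample["horsepower"].corr(sample["highway-mpg"])
print(f"Pearson r = {r:.2f}")
```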

Step 2.2.2 – Price Wars (Hypothesis 2)
The hypothesis is that the make of the car determines its price. Manufacturers with a long history
and a strong brand name tend to lead the car market. According to the dataset, Mercedes-Benz is
priced higher than the other brands, and BMW has the second-highest price. Each range of cars has a
different audience, which influences what the manufacturers produce; budget cars, for instance,
are there for a specific audience. The price of the cars ranges from 5118 to 45400. The hypothesis
is concluded to be plausible based on the function: the price increases based on the make of the
car.
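
One way to compare prices across makes is a grouped mean, sketched below. The prices are hypothetical apart from the overall minimum and maximum quoted above, and the grouped mean is my choice of summary rather than the function the report refers to.

```python
import pandas as pd

# Hypothetical prices; only the 5118 and 45400 extremes come from the text.
sample = pd.DataFrame({
    "make": ["mercedes-benz", "mercedes-benz", "bmw", "bmw", "subaru", "subaru"],
    "price": [45400, 31600, 36880, 16430, 5118, 7603],
})

# Average price per make, most expensive first.
mean_price = sample.groupby("make")["price"].mean().sort_values(ascending=False)
print(mean_price)

print(sample["price"].min(), sample["price"].max())
```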
Step 2.2.3 – Mileage and Cylinders (Hypothesis 3)
The hypothesis is that a smaller number of cylinders in the engine is a reason for good city
mileage. With fewer cylinders, it is easier for the engine to process the petrol or gas, which in
turn helps achieve better mileage. A scatter plot was used because it makes the comparison clear
and easy to see. The hypothesis is plausible based on the dataset: for four cylinders the mileage
is higher, and for twelve cylinders it is very low. The pattern can be summarised as follows: the
lower the number of cylinders, the higher the city mileage, and vice versa.
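
The comparison can be checked numerically as sketched below. This assumes the cylinder count is stored as a word ("four", "twelve") and needs mapping to a number first; the column name num-of-cylinders and all values are hypothetical.

```python
import pandas as pd

# Assumed mapping from the spelled-out cylinder count to an integer.
word_to_number = {
    "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "eight": 8, "twelve": 12,
}

# Hypothetical rows following the pattern described above.
sample = pd.DataFrame({
    "num-of-cylinders": ["four", "four", "six", "eight", "twelve"],
    "city-mpg": [31, 27, 19, 16, 13],
})

sample["cylinders"] = sample["num-of-cylinders"].map(word_to_number)

# Mean city mileage per cylinder count, ordered by number of cylinders.
mean_mpg = sample.groupby("cylinders")["city-mpg"].mean()
print(mean_mpg)
```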

Step 2.3 – Scatter Matrix
The scatter matrix is plotted by grouping all the numeric values, with a figure size of 18.
Observations can be made from the plots arranged along the diagonal. The diagonal of the matrix
contains a bar chart (histogram) for each of the 15 numeric variables, and a scatter plot is drawn
for every pair of the 15 numeric variables.
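
A scatter matrix of this kind can be produced with pandas as sketched below. The sample has three numeric columns instead of 15, its values are hypothetical, and reading "figure size of 18" as an 18 by 18 inch figure is an assumption.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; omit this line inside a notebook
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical rows with one text column and three numeric columns.
sample = pd.DataFrame({
    "make": ["audi", "bmw", "volvo", "subaru", "honda", "mazda"],
    "horsepower": [110, 182, 114, 69, 76, 68],
    "city-mpg": [19, 16, 23, 31, 30, 31],
    "price": [17450, 30760, 12940, 5118, 6529, 6095],
})

# Keep only the numeric columns, since text cannot be plotted on these axes.
numeric = sample.select_dtypes(include="number")

# Histograms go on the diagonal; every other cell is a pairwise scatter plot.
axes = scatter_matrix(numeric, figsize=(18, 18), diagonal="hist")
print(axes.shape)
```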
