1. Data Preparation:
* First, we fix typos (mistakes in the text) in columns of the object datatype.
* Secondly, we remove whitespace (extra spaces) from object-datatype columns using the str.strip() function.
* Thirdly, we run sanity checks for impossible (out-of-range) values in int and float columns.
* Lastly, we handle missing values (NaN) in all datatypes, detecting them with the isna() function.
2. Data Exploration, using graphs
Practical Data Science : Data Cleaning and Summarising
1. Data Preparation:
First, import the pandas library in Jupyter Notebook with the command
“import pandas as pd”. Next, store the name of the Automobile data file (Automobile.csv) in a variable called
Automobile_p with the command “Automobile_p = 'Automobile.csv'”. Then read the data from the CSV file,
supplying the attribute names, with the command below.
Automobile_h = pd.read_csv(Automobile_p, sep='#', decimal='.', header=None,
    names=['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors',
    'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height',
    'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
    'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price'])
To get the data in a single column, use Automobile_h['attribute name']. For example,
Automobile_h['symboling'] returns the data in the symboling column.
value_counts() returns the number of occurrences of each distinct value. For example,
Automobile_h['symboling'].value_counts() returns the count of each value in the symboling column.
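The loading and counting steps above can be sketched as follows. This is a minimal illustration rather than the exact notebook code: the three sample rows and the reduced list of three column names are assumptions made so that the example runs without Automobile.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for Automobile.csv; only three of
# the 26 columns are included to keep the sketch short.
sample = io.StringIO(
    "3#alfa-romero#gas\n"
    "1#audi#gas\n"
    "3#audi#diesel\n"
)

# header=None says the file has no header row; names supplies the column labels.
df = pd.read_csv(sample, sep="#", decimal=".", header=None,
                 names=["symboling", "make", "fuel-type"])

symboling = df["symboling"]        # select one column as a Series
counts = symboling.value_counts()  # occurrences of each distinct value
print(counts)
```

With the real file, the StringIO object would be replaced by the file name and the full list of 26 attribute names.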
In our task,
• First, we fix typos (mistakes in the text) in object-datatype columns using the replace() function. For
example, I found the typo “vol00112ov” in the make attribute, so I used the command
Automobile_h['make'].replace(' vol00112ov','volvo',inplace = True). Afterwards,
Automobile_h['make'].value_counts() gives the updated counts in the make column. (referred to in RMIT
Tute/Lab 2)
• Secondly, we remove whitespace (extra spaces) from object-datatype columns using the str.strip() function.
For example, I found the value ‘volvo ‘ with trailing whitespace in the make column of the automobile.csv file,
so I used the command Automobile_h['make'] = Automobile_h['make'].str.strip() and then ran
Automobile_h['make'].value_counts() to get the new counts after removing the whitespace.
(referred to in RMIT Tute/Lab 2)
• Thirdly, we run sanity checks for impossible (out-of-range) values in int and float columns. First,
check each attribute against the range given in the file. For example,
print(Automobile_h['symboling'][(-3 > Automobile_h['symboling']) | (Automobile_h['symboling'] > 3)])
lists the symboling values that fall outside the range -3 to +3. Values inside the range are left unchanged;
values outside it are replaced with a value inside the range using the replace() function, and
value_counts() then gives the updated counts. (referred to on stackoverflow.com)
• Lastly, we handle missing values (NaN), which can occur in any datatype, detecting them with the isna()
function. For example, I found missing values in the num-of-doors attribute with the command
Automobile_h['num-of-doors'].isna().value_counts() (referred to on stackoverflow.com). I filled these missing
values with fillna(): Automobile_h['num-of-doors'] = Automobile_h['num-of-doors'].fillna('four') replaces
NaN with four, and value_counts() then gives the new counts in the num-of-doors column.
For int and float datatypes, NaN is instead replaced with the column mean:
Automobile_h['price'].fillna(Automobile_h['price'].mean(axis=0), inplace=True). (referred to in RMIT
Tute/Lab 2)
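The four cleaning steps above can be sketched on a small made-up table. All sample values here are hypothetical. One deliberate difference from the text: out-of-range symboling values are clipped to the nearest bound with clip() rather than replaced with replace(), because the text does not say which in-range value was chosen; that choice is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the values are made up for illustration.
df = pd.DataFrame({
    "make": ["volvo", "vol00112ov", "volvo ", " audi"],
    "symboling": [2, -1, 4, 0],
    "num-of-doors": ["four", np.nan, "two", "four"],
    "price": [13000.0, np.nan, 17000.0, 15000.0],
})

# 1. Fix a known typo by exact replacement.
df["make"] = df["make"].replace("vol00112ov", "volvo")

# 2. Strip leading and trailing whitespace from the text column.
df["make"] = df["make"].str.strip()

# 3. Sanity check: flag symboling values outside the range -3 to +3.
out_of_range = (df["symboling"] < -3) | (df["symboling"] > 3)
print(df.loc[out_of_range, "symboling"])
# Assumption: clip offending values to the nearest bound.
df["symboling"] = df["symboling"].clip(lower=-3, upper=3)

# 4. Missing values: count them, then fill text with a chosen category
#    and numbers with the column mean.
print(df["num-of-doors"].isna().value_counts())
df["num-of-doors"] = df["num-of-doors"].fillna("four")
df["price"] = df["price"].fillna(df["price"].mean())

print(df)
```

Assigning the result back to the column, instead of passing inplace=True on a selected column, avoids depending on whether the selection is a view or a copy.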
After the data preparation, use the unique() function, for example Automobile_h['make'].unique(), to list the
unique values in the make column and verify that the cleaning of make is complete.
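As a small illustration of this verification step, the sketch below lists the unique values of a hypothetical make column in which one value was deliberately left uncleaned, and shows how such a leftover could be spotted. The sample values and the whitespace check are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical column; "volvo " still carries a trailing space on purpose.
make = pd.Series(["volvo", "audi", "volvo ", "bmw"])

unique_makes = make.unique()  # distinct values in order of first appearance
print(unique_makes)

# Any value that differs from its stripped form still needs cleaning.
leftover = [m for m in unique_makes if m != m.strip()]
print(leftover)
```

An empty leftover list, together with a visual check of the unique values for misspellings, indicates that the column is clean.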
Task 2: Data Exploration
Step 1:
• Choosing drive-wheels as the nominal column, which contains the values 4wd, fwd and rwd
The above pie chart shows the drive-wheels column with the values 4wd, fwd and rwd. The largest slice
is “fwd” with 50.42% of the cars, and the second largest is “rwd” with 45.80%. The
smallest slice is “4wd” with 3.78%. The chart shows that fwd is the most common drive type, which may
suggest that fwd cars are more economical, although the chart alone does not establish this. (pie chart
syntax referred to in RMIT Tute/Lab 3)
• Choosing symboling as the ordinal column, with values ranging from -3 to +3
The above pie chart shows the symboling column with values from -3 to +3. A symboling value of -3 indicates
that the car is pretty safe and +3 indicates that the car is risky. The largest slice is 0 with 28.15%,
followed by 1 with 22.39%, -1 with 21.85%, 2 with 14.71% and 3, the riskiest rating, with 11.24%.
The smallest slice is -2 with 1.26%, which is on the safe side of the scale.
• Choosing stroke as the numerical column, with values ranging from 2.07 to 4.17
The above diagram is a histogram with bins=20, showing the frequency along the y-axis and
stroke values from 2.07 to 4.17 along the x-axis. The stroke value 2.07 has a negligible frequency
of roughly 1, while stroke values around 3.4 have the maximum frequency of about 54. The frequency
fluctuates considerably across the range from 2.07 to 4.17.
A higher stroke value is assumed to go with a more expensive car and higher repair costs, though the
histogram alone does not show this. (histogram syntax referred to in RMIT Tute/Lab 3)
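The pie chart and histogram described above could be produced along these lines. This is a sketch under stated assumptions: the sample values are hypothetical and are not taken from Automobile.csv, and the Agg backend is selected only so that the code also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample values, not taken from Automobile.csv.
df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "4wd", "fwd", "rwd", "fwd", "rwd"],
    "stroke": [2.07, 3.40, 3.40, 3.15, 3.40, 2.68, 3.47, 4.17],
})

# Share of each category in percent, which is what the pie labels display.
shares = df["drive-wheels"].value_counts(normalize=True) * 100
print(shares)

fig, (ax_pie, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Pie chart for the nominal column; autopct prints the percentage on each slice.
shares.plot(kind="pie", autopct="%.2f", ax=ax_pie)
ax_pie.set_ylabel("")
ax_pie.set_title("drive-wheels")

# Histogram for the numerical column with 20 bins.
df["stroke"].plot(kind="hist", bins=20, ax=ax_hist)
ax_hist.set_xlabel("stroke")
ax_hist.set_ylabel("frequency")
```

The same pie chart call applies to the symboling column, since value_counts() works on ordinal values as well.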
Step 2:
• Scatter plot for wheel-base and length
The above scatter plot shows wheel-base on the x-axis and length on the y-axis, with one dot per car. The
scatter plot requires the following import: import matplotlib.pyplot as plt. As wheel-base increases up to
120.9, length also increases up to 208.1, and cars with a smaller wheel-base are also shorter.
The two attributes are positively correlated. (syntax referred to in RMIT Tute/Lab 3)
• Box plot for price and fuel-type
The above box plot shows fuel-type on the x-axis and price on the y-axis, with outliers from 31000 to 45000.
The box plot requires the following import: import matplotlib.pyplot as plt. Gas cars are generally
cheaper and diesel cars are generally more expensive. For diesel, the minimum value is 8000, the maximum
value is 33000 and the median is 18000. Similarly, for gas the minimum value is 5000, the median is 12000
and the maximum value is 31000. This is consistent with the expectation that mileage decreases as the price
of the car increases, although the box plot does not show mileage directly.
(boxplot syntax referred to in RMIT Tute/Lab 3)
• Scatter plot for engine-size and curb-weight
The above scatter plot shows engine-size on the x-axis and curb-weight on the y-axis. The scatter plot
requires the following import: import matplotlib.pyplot as plt. As engine-size increases up to 326,
curb-weight also increases up to 4066, and cars with a smaller engine-size also have a lower curb-weight.
The two attributes are positively correlated. (referred to in RMIT Tute/Lab 3)
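The scatter plot and box plot from this step can be sketched as below, together with two numbers that back up what the plots suggest: the correlation coefficient for the scatter plot and the per-group median for the box plot. The five sample rows are hypothetical, and computing corr() and median() is an addition for illustration rather than something the report states was done.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample rows, made up for illustration.
df = pd.DataFrame({
    "wheel-base": [88.6, 94.5, 99.8, 105.8, 120.9],
    "length": [168.8, 171.2, 176.6, 192.7, 208.1],
    "fuel-type": ["gas", "gas", "diesel", "gas", "diesel"],
    "price": [5000.0, 12000.0, 18000.0, 31000.0, 33000.0],
})

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot of two numerical columns.
df.plot(kind="scatter", x="wheel-base", y="length", ax=ax_scatter)

# Box plot of a numerical column grouped by a categorical column.
df.boxplot(column="price", by="fuel-type", ax=ax_box)

# Pearson correlation: a value close to +1 means the two columns rise together.
r = df["wheel-base"].corr(df["length"])
print(r)

# Median price per fuel type, the line drawn inside each box.
medians = df.groupby("fuel-type")["price"].median()
print(medians)
```

The scatter plot for engine-size and curb-weight follows the same pattern with the x and y column names swapped in.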
Step 3:
• Scatter matrix for all numerical columns
The above scatter matrix shows the collection of scatter plots for all numeric columns of the automobile CSV
file, organized into a grid in which each scatter plot shows the relationship between one pair of columns.
The scatter matrix requires the following import: from pandas.tools.plotting import scatter_matrix. In current
pandas versions the function is located in pandas.plotting instead. (referred to in RMIT
Tute/Lab 3)
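A minimal sketch of the scatter matrix step, using the pandas.plotting import location of current pandas versions. The sample rows are hypothetical, and selecting the numeric columns with select_dtypes() is an assumption about how the numeric columns were chosen.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical sample rows, made up for illustration.
df = pd.DataFrame({
    "make": ["audi", "bmw", "volvo", "audi", "bmw"],
    "wheel-base": [88.6, 94.5, 99.8, 105.8, 120.9],
    "length": [168.8, 171.2, 176.6, 192.7, 208.1],
    "engine-size": [130, 109, 136, 209, 326],
})

# Keep only the numeric columns; text columns such as make are dropped.
numeric = df.select_dtypes(include="number")

# One scatter plot per pair of columns, with histograms on the diagonal.
axes = scatter_matrix(numeric, figsize=(8, 8), diagonal="hist")
print(axes.shape)
```

The returned array holds one axes object per cell of the grid, so its shape is the number of numeric columns in each dimension.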