Data Preparation:
First, import the pandas library in Jupyter Notebook with the command
"import pandas as pd". Next, store the path of the Automobile file (Automobile.csv) in a
variable called Automobile_p with the command "Automobile_p = 'Automobile.csv'". Then read the data
from the CSV file, supplying the attribute names, with the command below.
Automobile_h = pd.read_csv(Automobile_p, sep='#', decimal='.', header=None, names=['symboling',
'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 'drive-wheels',
'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower',
'peak-rpm', 'city-mpg', 'highway-mpg', 'price'])
A single column is selected with Automobile_h['attribute name']. For example,
Automobile_h['symboling'] returns the data in the symboling column.
value_counts() counts how often each distinct value occurs. For example,
Automobile_h['symboling'].value_counts() returns the number of occurrences of each value in the symboling column.
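The loading and counting steps above can be tried without the real file. The following is a minimal sketch, assuming a made-up three-row sample with the same '#' separator and only four of the 26 columns; the names sample_csv, columns and sample are hypothetical and not part of the original report.

```python
import io
import pandas as pd

# Hypothetical stand-in for Automobile.csv: three rows, '#' as separator,
# and only four of the 26 attributes so the sketch stays short.
sample_csv = io.StringIO(
    "3#alfa-romero#gas#13495\n"
    "1#audi#gas#13950\n"
    "3#volvo#diesel#18950\n"
)
columns = ['symboling', 'make', 'fuel-type', 'price']

# header=None because the data has no header row; names supplies the labels.
sample = pd.read_csv(sample_csv, sep='#', decimal='.', header=None, names=columns)

# Select one column, then count how often each distinct value occurs.
symboling = sample['symboling']
symboling_counts = symboling.value_counts()
print(symboling_counts)
```

With the real data, the same two calls would be applied to Automobile_h instead of sample.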
In our task,
• First, we fix typos (mistakes in the text) in the Object datatype columns with the replace() function.
For example, I found the typo "vol00112ov" in the make attribute, so I used
Automobile_h['make'] = Automobile_h['make'].replace('vol00112ov', 'volvo'). The first argument must match
the stored value exactly, including any surrounding spaces. Afterwards,
Automobile_h['make'].value_counts() shows the number of occurrences of each value in the make column.
(referred in RMIT Tute/Lab 2)
• Secondly, we remove whitespace (extra spaces) from the Object datatype columns with the str.strip()
function. For example, I found the value 'volvo ' with a trailing space in the make column of the
Automobile.csv file, so I used Automobile_h['make'] = Automobile_h['make'].str.strip() and then ran
Automobile_h['make'].value_counts() to get the new counts after removing the whitespace.
(referred in RMIT Tute/Lab 2)
• Thirdly, we run sanity checks for impossible values (values outside the permitted range) in the int and
float columns. First, check the range given in the file. For example,
print(Automobile_h['symboling'][(Automobile_h['symboling'] < -3) | (Automobile_h['symboling'] > 3)])
lists the Symboling values that fall outside the range -3 to +3. If a value lies within the range, leave it;
otherwise replace it with a value inside the range using the replace() function and check the counts again
with value_counts(). (referred in stackoverflow.com)
• Lastly, we handle missing values (NaN), which can occur in any data type, using the isna() function. For
example, I found missing values in the num-of-doors attribute with
Automobile_h['num-of-doors'].isna().value_counts() (referred in stackoverflow.com) and filled them with
fillna(): Automobile_h['num-of-doors'] = Automobile_h['num-of-doors'].fillna('four'). This replaces NaN
with four; value_counts() then gives the new counts in the num-of-doors column.
For int and float columns, NaN is instead replaced with the column mean:
Automobile_h['price'] = Automobile_h['price'].fillna(Automobile_h['price'].mean()). (referred in RMIT
Tute/Lab 2)
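The first three cleaning steps can be sketched on a small made-up table. The DataFrame cars and its values are hypothetical. Clipping out-of-range values to the nearest bound is an assumption made here for brevity; the report itself replaces such values by hand with replace().

```python
import pandas as pd

# Hypothetical toy data containing a typo, a trailing space and an
# out-of-range symboling value.
cars = pd.DataFrame({
    'make': ['volvo', 'vol00112ov', 'volvo ', 'audi'],
    'symboling': [-2, 1, 4, 0],
})

# 1. Typos: map the misspelled value to the intended one.
cars['make'] = cars['make'].replace('vol00112ov', 'volvo')

# 2. Whitespace: strip leading and trailing spaces from every string.
cars['make'] = cars['make'].str.strip()

# 3. Sanity check: collect symboling values outside the range -3..+3.
out_of_range = (cars['symboling'] < -3) | (cars['symboling'] > 3)
bad_rows = cars.loc[out_of_range, 'symboling']

# Assumption: pull out-of-range values back to the nearest valid bound.
cars['symboling'] = cars['symboling'].clip(lower=-3, upper=3)

make_counts = cars['make'].value_counts()
print(bad_rows)
print(make_counts)
```

Assigning the result back to the column, rather than passing inplace=True on a selected column, keeps the update explicit and behaves the same across pandas versions.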
After the data preparation, use the unique() function, for example Automobile_h['make'].unique(), to list the
unique values in the make column and verify that the data preparation of Make is complete.
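The missing-value step and the final check with unique() might look as follows. This is a sketch on hypothetical data: the DataFrame cars and its values are invented for illustration and do not come from Automobile.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in a text column
# and one in a numeric column.
cars = pd.DataFrame({
    'make': ['volvo', 'audi', 'volvo', 'bmw'],
    'num-of-doors': ['four', np.nan, 'two', 'four'],
    'price': [13000.0, np.nan, 15000.0, 17000.0],
})

# Count missing (True) versus present (False) entries.
missing_doors = cars['num-of-doors'].isna().value_counts()

# Text column: fill with a chosen category.
cars['num-of-doors'] = cars['num-of-doors'].fillna('four')

# Numeric column: fill with the column mean (NaN is ignored by mean()).
cars['price'] = cars['price'].fillna(cars['price'].mean())

# Verification: no NaN left, and the distinct makes look as expected.
remaining_nan = int(cars.isna().sum().sum())
unique_makes = cars['make'].unique()
print(missing_doors, remaining_nan, unique_makes)
```

Filling with the mean keeps the column average unchanged but reduces its spread, so it is best suited to columns with few missing entries.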
Task 2 : Data Exploration
Step 1 :
• Choosing drive-wheels as the nominal attribute, with the values 4wd, fwd and rwd
The above pie chart shows the drive-wheels attribute with the values 4wd, fwd and rwd. The largest slice
is "fwd" with 50.42% of the cars, and the second largest is "rwd" with 45.80%, whereas the
smallest slice is "4wd" with 3.78%. The chart shows that fwd is the most common drive type in this data
set, possibly because fwd cars are more economical. (pie chart syntax referred in RMIT Tute/Lab 3)
• Choosing the Symboling column as the ordinal attribute, with values ranging from -3 to +3
The above pie chart shows Symboling with values from -3 to +3. A Symboling value of -3 indicates that the
car is pretty safe and +3 indicates that the car is risky. The largest slice is 0 with 28.15%, followed
by 1 with 22.39%, -1 with 21.85%, 2 with 14.71% and 3, the riskiest rating, with 11.24%.
The smallest slice is -2 with 1.26%, which is on the safe side.
• Choosing the Stroke column as the numerical attribute, with values ranging from 2.07 to 4.17
The above diagram is a histogram with bins=20, showing the frequency along the Y-axis and the
stroke values from 2.07 to 4.17 along the X-axis. The stroke value 2.07 has a negligible frequency
of roughly 1, while the maximum frequency, about 54, occurs around the stroke value 3.4. In between, the
frequencies fluctuate considerably across the whole range from 2.07 to 4.17.
The higher the stroke value, the more expensive the car and the higher its repair costs. (syntax for
histogram referred in RMIT Tute/Lab 3)
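The pie chart percentages and the histogram from Step 1 can be reproduced along these lines. This is a sketch only: the DataFrame cars holds invented values, so the resulting shares differ from the 50.42 / 45.80 / 3.78 split reported above, and the Agg backend line is only there so that the code runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values, not the real Automobile.csv data.
cars = pd.DataFrame({
    'drive-wheels': ['fwd', 'fwd', 'rwd', 'fwd', 'rwd', '4wd', 'fwd', 'rwd'],
    'stroke': [2.07, 3.15, 3.40, 3.40, 3.39, 2.80, 3.47, 4.17],
})

# Share of each category in percent; these are the numbers a pie chart labels.
shares = cars['drive-wheels'].value_counts(normalize=True) * 100

fig, (ax_pie, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
ax_pie.pie(shares, labels=shares.index, autopct='%.2f%%')
ax_pie.set_title('drive-wheels')

# Histogram: stroke on the X-axis, frequency per bin on the Y-axis.
counts, bin_edges, _ = ax_hist.hist(cars['stroke'], bins=20)
ax_hist.set_xlabel('stroke')
ax_hist.set_ylabel('frequency')
print(shares)
```

Computing the shares with value_counts(normalize=True) makes it easy to quote the same percentages in the text that the chart displays.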
Task 2: Data Exploration
Step 2 :
• Scatter plot for wheel-base and length
The above scatter plot shows wheel-base on the X-axis and length on the Y-axis, with one dot per car. The plot
requires the package import matplotlib.pyplot as plt. As wheel-base increases up to 120.9, length also
increases up to 208.1, as the graph shows, and similarly, when wheel-base decreases, length also decreases.
The two attributes are positively correlated. (syntax referred in RMIT Tute/Lab 3)
• Box plot for price and fuel-type
The above box plot shows fuel-type on the X-axis and price on the Y-axis, with outliers from 31000 to 45000.
The plot requires the package import matplotlib.pyplot as plt. The price of the car is lower
for the gas fuel type and higher for the diesel fuel type. As the price of the car increases, mileage
decreases, so the two are inversely related. For diesel, the minimum value is 8000, the median is 18000 and
the maximum value is 33000. Similarly, for gas the minimum value is 5000, the median is 12000 and the maximum
value is 31000. (boxplot syntax referred in RMIT Tute/Lab 3)
• Scatter plot for engine-size and curb-weight
The above scatter plot shows engine-size on the X-axis and curb-weight on the Y-axis. The plot requires the
package import matplotlib.pyplot as plt. As engine-size increases up to 326, curb-weight also increases
up to 4066, as the graph shows, and similarly, when engine-size decreases, curb-weight also decreases.
Here the two attributes are positively correlated. (referred in RMIT Tute/Lab 3)
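The relationships read off the Step 2 plots can also be checked numerically. The sketch below uses a hypothetical five-row DataFrame (the values are invented, not taken from Automobile.csv) and adds two things the report does not compute: a Pearson correlation for the scatter plot and a per-group summary for the box plot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values for illustration only.
cars = pd.DataFrame({
    'wheel-base': [88.6, 94.5, 99.8, 105.8, 120.9],
    'length': [168.8, 171.2, 176.6, 192.7, 208.1],
    'fuel-type': ['gas', 'gas', 'diesel', 'gas', 'diesel'],
    'price': [13495.0, 16500.0, 17450.0, 23875.0, 31600.0],
})

# Scatter plot: one dot per car.
fig, ax = plt.subplots()
ax.scatter(cars['wheel-base'], cars['length'])
ax.set_xlabel('wheel-base')
ax.set_ylabel('length')

# Pearson correlation: a value close to +1 means a strong positive linear relationship.
corr = cars['wheel-base'].corr(cars['length'])

# Box plot of price grouped by fuel-type, plus the statistics it is drawn from
# (min, quartiles, median as '50%', max).
box_ax = cars.boxplot(column='price', by='fuel-type')
price_by_fuel = cars.groupby('fuel-type')['price'].describe()
print(corr)
print(price_by_fuel)
```

Reading the minimum, median and maximum from describe() avoids estimating them by eye from the plot.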
Task 2: Data Exploration
Step 3 :
• Scatter matrix for all numerical columns
The above scatter matrix shows a collection of scatter plots for all numeric columns of the Automobile CSV file,
organized into a grid, where each scatter plot shows the relationship between one pair of columns. The plot
requires the following import: from pandas.plotting import scatter_matrix. (referred in RMIT
Tute/Lab 3)
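A scatter matrix of this kind can be sketched as below. It uses pandas.plotting.scatter_matrix, which is the import path in current pandas releases, on a hypothetical DataFrame with three numeric columns; selecting the numeric columns with select_dtypes is an assumption about how the columns were chosen.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical values for illustration only.
cars = pd.DataFrame({
    'make': ['audi', 'bmw', 'volvo', 'audi', 'bmw'],
    'wheel-base': [88.6, 94.5, 99.8, 105.8, 120.9],
    'length': [168.8, 171.2, 176.6, 192.7, 208.1],
    'curb-weight': [2548, 2823, 2337, 3053, 4066],
})

# Keep only the numeric columns; text columns such as make cannot be plotted.
numeric = cars.select_dtypes(include='number')

# One scatter plot per pair of columns, with histograms on the diagonal.
axes = scatter_matrix(numeric, figsize=(8, 8), diagonal='hist')
print(axes.shape)
```

The function returns a grid of axes with one row and one column per numeric attribute, so the full data set produces a much larger grid than this three-column example.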

Practical Data Science : Data Cleaning and Summarising
