DATA SCIENCE PORTFOLIO
Jombaba.s7@gmail.com
I enjoy working with data, and this portfolio highlights one of the projects I have implemented using Python-based tools and programming platforms. The entire project is presented as five sub-projects, each capturing a specific part of the work.
Imagine that a financial institution / bank wishes to find a solution to a 'Customer Acquisition and Customer Retention' problem. As a data scientist, this is my attempt at providing a comprehensive solution, and the series of five projects illustrates a plausible approach to resolving the problem.
Data Cleaning Project
Purpose:
This is the first in a series of projects. The purpose of the data cleaning project is to clean up the 'Credit Card Application' dataset by handling missing values and other out-of-place characters. Initial examination of the dataset shows that there are 67 missing values to contend with.
The dataset comprises continuous and nominal attributes taking both small and large values. For reasons of privacy, the dataset was published with column labels A1–A16 replacing the actual descriptive labels.
Dataset:
The dataset is found at: https://archive.ics.uci.edu/ml/datasets/credit+approval
Number of instances (observations) = 690
Number of attributes = 15 (columns A1–A15)
There is one class attribute (column A16)
307 instances (44.5%) are classified "+" and 383 (55.5%) are classified "-".
Attribute Label    Value Type
A1                 Nominal
A2                 Continuous
A3                 Continuous
A4                 Nominal
A5                 Nominal
A6                 Nominal
A7                 Nominal
A8                 Continuous
A9                 Nominal
A10                Nominal
A11                Continuous (integer)
A12                Nominal
A13                Nominal
A14                Continuous (integer)
A15                Continuous (integer)
A16                Class attribute
Process:
Step 1:
The original dataset was in .txt format; below is an extract of it.
b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+
a,58.67,4.46,u,g,q,h,3.04,t,t,06,f,g,00043,560,+
The .txt file was imported into Excel using the 'Import from Text' menu to obtain a .xls file, which looks like the following:
A1  A2     A3    A4  A5  A6  A7  A8    A9  A10  A11  A12  A13  A14  A15  A16
b   30.83  0     u   g   w   v   1.25  t   t    1    f    g    202  0    +
a   58.67  4.46  u   g   q   h   3.04  t   t    6    f    g    43   560  +
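(As an aside, the Excel conversion step can be skipped entirely: pandas can read the raw comma-separated text file directly. The following is only a minimal sketch of that alternative; the file name follows the UCI 'crx.data' convention, the path is illustrative, and '?' is assumed to be the missing-value marker as in the original file.)
import pandas as pd
cols = ['A' + str(i) for i in range(1, 17)]          # column labels A1 .. A16
df = pd.read_csv('C:/Users/Owner/Desktop/DATA/CAD/crx.data',
                 header=None, names=cols, na_values='?')
print(df.shape)                                       # expected: (690, 16)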
Step 2:
To prepare the required environment with the Python pandas and NumPy libraries, we issue the following commands.
import pandas as pd
import numpy as np
Step 3:
Next, to detect the character and distribution of the missing values, we use the following code.
missing_values = ["?"]
df=pd.read_csv('C:/Users/Owner/Desktop/DATA/CAD/ABC.csv', na_values=missing_values)
df.isnull().sum()
Output
A1 12
A2 12
A3 0
A4 6
A5 6
A6 9
A7 9
A8 0
A9 0
A10 0
A11 0
A12 0
A13 0
A14 13
A15 0
A16 0
dtype: int64
The total number of missing values is derived using the following code
df.isnull().sum().sum()
Output: 67
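To put these counts in perspective relative to the 690 observations, the percentage of missing values per column can also be computed. This is a minimal sketch of one way to do so (not part of the original workflow); the figures follow directly from the counts above (e.g. 12/690 ≈ 1.7% for A1, 13/690 ≈ 1.9% for A14).
missing_pct = df.isnull().sum() / len(df) * 100       # percentage of missing values per column
print(missing_pct[missing_pct > 0].round(2))          # only columns that actually have missing values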
To confirm whether or not the dataset was imported correctly, examine the first five (5) and the
last five (5) records of the dataset.
df.head(5)
Output:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
0 b 30.83 0.000 u g w v 1.25 t t 1 f g 202 0 +
1 a 58.67 4.460 u g q h 3.04 t t 6 f g 43 560 +
2 a 24.5 0.500 u g q h 1.50 t f 0 f g 280 824 +
3 b 27.83 1.540 u g w v 3.75 t t 5 t g 100 3 +
4 b 20.17 5.625 u g w v 1.71 t f 0 f s 120 0 +
df.tail(5)
Output:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
685 b 21.08 10.085 y p e h 1.25 f f 0 f g 260 0 -
686 a 22.67 0.750 u g c v 2.00 f t 2 t g 200 394 -
687 a 25.25 13.500 y p ff ff 2.00 f t 1 t g 200 1 -
688 b 17.92 0.205 u g aa v 0.04 f f 0 f g 280 750 -
689 b 35 3.375 u g c h 8.29 f f 0 t g 0 0 -
Next, we confirm that the entire dataset was captured and take a glimpse at the basic statistics of the numerical features using the following code.
df.shape
Output: (690, 16)
df.describe()
Output:
              A2          A3          A8         A11          A14            A15
count  678.000000  690.000000  690.000000   690.00000   677.000000     690.000000
mean    31.568171    4.758725    2.223406     2.40000   184.014771    1017.385507
std     11.957862    4.978163    3.346513     4.86294   173.806768    5210.102598
min     13.750000    0.000000    0.000000     0.00000     0.000000       0.000000
25%     22.602500    1.000000    0.165000     0.00000    75.000000       0.000000
50%     28.460000    2.750000    1.000000     0.00000   160.000000       5.000000
75%     38.230000    7.207500    2.625000     3.00000   276.000000     395.500000
max     80.250000   28.000000   28.500000    67.00000  2000.000000  100000.000000
Furthermore, we detect the unique values and their counts for all variables using the following code. We do this in order to determine the imputation inputs for the nominal columns.
df.apply(pd.Series.value_counts)
Output:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11
0.0 NaN NaN 19.0 NaN NaN NaN NaN 70.0 NaN NaN 395.0
0.04 NaN NaN 5.0 NaN NaN NaN NaN 33.0 NaN NaN NaN
0.08 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
0.085 NaN NaN 1.0 NaN NaN NaN NaN 26.0 NaN NaN NaN
0.125 NaN NaN 5.0 NaN NaN NaN NaN 30.0 NaN NaN NaN
0.165 NaN NaN 8.0 NaN NaN NaN NaN 22.0 NaN NaN NaN
0.17 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
0.205 NaN NaN 3.0 NaN NaN NaN NaN NaN NaN NaN NaN
0.21 NaN NaN 3.0 NaN NaN NaN NaN 6.0 NaN NaN NaN
0.25 NaN NaN 6.0 NaN NaN NaN NaN 35.0 NaN NaN NaN
0.29 NaN NaN 6.0 NaN NaN NaN NaN 12.0 NaN NaN NaN
0.335 NaN NaN 6.0 NaN NaN NaN NaN 5.0 NaN NaN NaN
0.375 NaN NaN 9.0 NaN NaN NaN NaN 7.0 NaN NaN NaN
0.415 NaN NaN 4.0 NaN NaN NaN NaN 8.0 NaN NaN NaN
0.42 NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
0.455 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN
0.46 NaN NaN 4.0 NaN NaN NaN NaN 1.0 NaN NaN NaN
0.5 NaN NaN 15.0 NaN NaN NaN NaN 28.0 NaN NaN NaN
0.54 NaN NaN 8.0 NaN NaN NaN NaN 6.0 NaN NaN NaN
0.58 NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
0.585 NaN NaN 10.0 NaN NaN NaN NaN 6.0 NaN NaN NaN
0.625 NaN NaN 3.0 NaN NaN NaN NaN 2.0 NaN NaN NaN
0.665 NaN NaN 4.0 NaN NaN NaN NaN 6.0 NaN NaN NaN
0.67 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
0.705 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
0.71 NaN NaN 1.0 NaN NaN NaN NaN 1.0 NaN NaN NaN
0.75 NaN NaN 16.0 NaN NaN NaN NaN 12.0 NaN NaN NaN
0.79 NaN NaN 4.0 NaN NaN NaN NaN 1.0 NaN NaN NaN
0.795 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN
0.83 NaN NaN 3.0 NaN NaN NaN NaN NaN NaN NaN NaN
... .. ... ... ... ... ... ... ... ... ...
b 468.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
bb NaN NaN NaN NaN NaN NaN 59.0 NaN NaN NaN NaN
c NaN NaN NaN NaN NaN 137.0 NaN NaN NaN NaN NaN
cc NaN NaN NaN NaN NaN 41.0 NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN 30.0 NaN NaN NaN NaN NaN
dd NaN NaN NaN NaN NaN NaN 6.0 NaN NaN NaN NaN
e NaN NaN NaN NaN NaN 25.0 NaN NaN NaN NaN NaN
f NaN NaN NaN NaN NaN NaN NaN NaN 329.0 395.0 NaN
ff NaN NaN NaN NaN NaN 53.0 57.0 NaN NaN NaN NaN
g NaN NaN NaN NaN 519.0 NaN NaN NaN NaN NaN NaN
gg NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN
h NaN NaN NaN NaN NaN NaN 138.0 NaN NaN NaN NaN
i NaN NaN NaN NaN NaN 59.0 NaN NaN NaN NaN NaN
j NaN NaN NaN NaN NaN 10.0 8.0 NaN NaN NaN NaN
k NaN NaN NaN NaN NaN 51.0 NaN NaN NaN NaN NaN
l NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN
m NaN NaN NaN NaN NaN 38.0 NaN NaN NaN NaN NaN
n NaN NaN NaN NaN NaN NaN 4.0 NaN NaN NaN NaN
o NaN NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN
p NaN NaN NaN NaN 163.0 NaN NaN NaN NaN NaN NaN
q NaN NaN NaN NaN NaN 78.0 NaN NaN NaN NaN NaN
r NaN NaN NaN NaN NaN 3.0 NaN NaN NaN NaN NaN
s NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
t NaN NaN NaN NaN NaN NaN NaN NaN 361.0 295.0 NaN
u NaN NaN NaN 519.0 NaN NaN NaN NaN NaN NaN NaN
v NaN NaN NaN NaN NaN NaN 399.0 NaN NaN NaN NaN
w NaN NaN NaN NaN NaN 64.0 NaN NaN NaN NaN NaN
x NaN NaN NaN NaN NaN 38.0 NaN NaN NaN NaN NaN
y NaN NaN NaN 163.0 NaN NaN NaN NaN NaN NaN NaN
z NaN NaN NaN NaN NaN NaN 8.0 NaN NaN NaN NaN
A12 A13 A14 A15 A16
0.0 NaN NaN 132.0 295.0 NaN
0.04 NaN NaN NaN NaN NaN
0.08 NaN NaN NaN NaN NaN
0.085 NaN NaN NaN NaN NaN
0.125 NaN NaN NaN NaN NaN
0.165 NaN NaN NaN NaN NaN
0.17 NaN NaN NaN NaN NaN
0.205 NaN NaN NaN NaN NaN
0.21 NaN NaN NaN NaN NaN
0.25 NaN NaN NaN NaN NaN
0.29 NaN NaN NaN NaN NaN
0.335 NaN NaN NaN NaN NaN
0.375 NaN NaN NaN NaN NaN
0.415 NaN NaN NaN NaN NaN
0.42 NaN NaN NaN NaN NaN
0.455 NaN NaN NaN NaN NaN
0.46 NaN NaN NaN NaN NaN
0.5 NaN NaN NaN NaN NaN
0.54 NaN NaN NaN NaN NaN
0.58 NaN NaN NaN NaN NaN
0.585 NaN NaN NaN NaN NaN
0.625 NaN NaN NaN NaN NaN
0.665 NaN NaN NaN NaN NaN
0.67 NaN NaN NaN NaN NaN
0.705 NaN NaN NaN NaN NaN
0.71 NaN NaN NaN NaN NaN
0.75 NaN NaN NaN NaN NaN
0.79 NaN NaN NaN NaN NaN
0.795 NaN NaN NaN NaN NaN
0.83 NaN NaN NaN NaN NaN
... ... ... ... ...
b NaN NaN NaN NaN NaN
bb NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN
cc NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN
dd NaN NaN NaN NaN NaN
e NaN NaN NaN NaN NaN
f 374.0 NaN NaN NaN NaN
ff NaN NaN NaN NaN NaN
g NaN 625.0 NaN NaN NaN
gg NaN NaN NaN NaN NaN
h NaN NaN NaN NaN NaN
i NaN NaN NaN NaN NaN
j NaN NaN NaN NaN NaN
k NaN NaN NaN NaN NaN
l NaN NaN NaN NaN NaN
m NaN NaN NaN NaN NaN
n NaN NaN NaN NaN NaN
o NaN NaN NaN NaN NaN
p NaN 8.0 NaN NaN NaN
q NaN NaN NaN NaN NaN
r NaN NaN NaN NaN NaN
s NaN 57.0 NaN NaN NaN
t 316.0 NaN NaN NaN NaN
u NaN NaN NaN NaN NaN
v NaN NaN NaN NaN NaN
w NaN NaN NaN NaN NaN
x NaN NaN NaN NaN NaN
y NaN NaN NaN NaN NaN
z NaN NaN NaN NaN NaN
[944 rows x 16 columns]
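Rather than scanning the full frequency table above, the most frequent category of each nominal column can also be read off programmatically. A brief sketch of one way to do this (not part of the original steps), restricted to the nominal columns that actually contain missing values:
for col in ['A1', 'A4', 'A5', 'A6', 'A7']:            # nominal columns with missing values
    print(col, df[col].mode()[0])                     # most frequent (modal) category
# expected: A1 -> b, A4 -> u, A5 -> g, A6 -> c, A7 -> v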
Step 4:
Using the results of Step 3 above, the mean will serve as the imputed value for missing entries in the numerical attributes, while the mode serves as the imputed value for the nominal columns.
Given that the ratio of missing values to the total number of observations is small for each affected column, between 0.8% and 1.9%, using the mean and the mode as imputed values is not expected to distort the dataset in any meaningful way.
The following code effects these imputations; the seven (7) variables involved are treated individually.
df['A1'].fillna('b', inplace=True)        # mode of A1
df['A2'].fillna(31.568, inplace=True)     # mean of A2
df['A4'].fillna('u', inplace=True)        # mode of A4
df['A5'].fillna('g', inplace=True)        # mode of A5
df['A6'].fillna('c', inplace=True)        # mode of A6
df['A7'].fillna('v', inplace=True)        # mode of A7
df['A14'].fillna(184, inplace=True)       # mean of A14, rounded to the nearest integer
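Equivalently, the imputed values can be computed programmatically rather than typed in by hand, which avoids copying numbers from the df.describe() output. This is only a sketch of that alternative under the same assumptions (mean for continuous columns, mode for nominal columns); note that it would impute A14 with the exact mean (≈184.01) rather than the rounded 184 used above.
for col in ['A2', 'A14']:                             # continuous columns with missing values
    df[col].fillna(df[col].mean(), inplace=True)      # impute with the column mean
for col in ['A1', 'A4', 'A5', 'A6', 'A7']:            # nominal columns with missing values
    df[col].fillna(df[col].mode()[0], inplace=True)   # impute with the most frequent value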
Step 5:
Next, we save the new, clean dataset. It is this missing-value-free dataset that we shall use in implementing the project (all five sub-projects) discussed in this portfolio.
df.to_csv('C:/Users/Owner/Desktop/DATA/CAD/ABC-1.csv')
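Before comparing individual rows, a quick sanity check (a sketch, not part of the original write-up) is to confirm that no missing values remain in the cleaned frame:
assert df.isnull().sum().sum() == 0                   # expect 0 missing values after imputation
As an optional tweak (not done here), passing index=False to df.to_csv would avoid the extra 'Unnamed: 0' index column that appears when the saved file is re-read below.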
To assess the clean-up exercise, we compare the contents of the old and new files for a specific row by using the following code.
df=pd.read_csv('C:/Users/Owner/Desktop/DATA/CAD/ABC.csv', na_values=missing_values)
df.iloc[248,:]
Output:
A1 NaN
A2 24.5
A3 12.75
A4 u
A5 g
A6 c
A7 bb
A8 4.75
A9 t
A10 t
A11 2
A12 f
A13 g
A14 73
A15 444
A16 +
Name: 248, dtype: object
df=pd.read_csv('C:/Users/Owner/Desktop/DATA/CAD/ABC-1.csv')
df.iloc[248,:]
Output:
Unnamed: 0 248
A1 b
A2 24.5
A3 12.75
A4 u
A5 g
A6 c
A7 bb
A8 4.75
A9 t
A10 t
A11 2
A12 f
A13 g
A14 73
A15 444
A16 +
Name: 248, dtype: object
From the results above, we observe that for element (row) number 248, the value of the feature labeled A1, which is NaN (missing) in the first file, has been imputed with the value 'b' in the second file. This is the expected result.
Summary and Conclusion:
The comparison of the files before and after the clean-up illustrates that the exercise was successfully implemented and that all 67 missing values were appropriately replaced. Thus, using Excel and the Python pandas library as our wrangling tools, we have been able to prepare the data for analysis. The improved quality of our data boosts the reliability of downstream analytical tools, thereby promoting plausible insights from the dataset.
For our purpose, we assume that this dataset characterizes the activities of a specific institution (e.g. a bank) over a given period, say a quarter. In a nutshell, the basic statistics (df.describe()) used in this sub-project capture a full description of what happened. Again, for our purpose this is satisfactory coverage of descriptive analytics.