Explore the meticulous process of customer data cleansing in this detailed PowerPoint presentation. Dive into the techniques used to identify and rectify common data discrepancies in a large dataset using Excel, from correcting dates to standardizing gender entries. This presentation walks through practical examples of cleaning half a million records, showcasing the application of formulas and functions for data analysts seeking to enhance data accuracy and reliability.
2. Introduction
It is often said that data analysts and data scientists spend 80% of
their time cleaning the datasets given to them and only the remaining
20% on actual analysis. This project consisted of cleaning and wrangling
a customer CSV file with half a million records. This presentation goes
into detail on how it was done in Excel.
Customer Data Cleansing Project 2
4. An Overview
The dataset shown in the previous slide has
500,002 records and 10 columns, starting with
customer ID and ending with zip code. Even at
a glance, the data contains some obvious
errors. The column title for email is
incorrectly typed, the gender column has an
incorrect data category of F!, and the DOB and
Date of Joining columns have incorrect date types.
5. Excel Overview Continued
Even though there were glaring errors at the
beginning of the file, it is always good to
assume that there are more errors within the
dataset. Further investigation of the
customer CSV shows that there are many
occurrences of future dates in the Date of
Joining column, DOBs in the 1700s, and
gender options that are not comprehensible.
6. How Do We Fix It?
A data analyst can rarely fix something quickly by hand.
Because it is unknown how many times these errors occur
within a dataset of half a million records, one of the
quicker ways to fix and clean the data is to use formulas.
7. Text to Columns
The first step in fixing date errors in a large dataset is to split each
date into multiple columns. If this is not done, a formula written to fix
the date error cannot tell which number to fix. In Excel, the Text to
Columns button splits a column by a delimiter. As the image shows, the
user navigates to the Data tab and then finds the button.
8. Step One: Text to Columns Continued
1. The Data tab in Excel has a button called Text to
Columns. When pressed, it opens a wizard that
converts text to multiple columns by delimiter.
2. In this case the column is split using the
“Other” delimiter option shown in step two of the
wizard, with “/” typed into the “Other” box.
3. In the final step of the wizard the user
selects the destination for the new data columns. In
this case the first free column, “$O$1”, was picked.
4. The rightmost snapshot is the result of the Text to
Columns wizard. Date of birth is separated into
three columns.
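For readers who want to check the split outside Excel, the delimiter step can be sketched in Python. This is only an illustration of what the wizard does, not part of the original workflow. The function name `split_date` and the sample value are assumptions made for this sketch.

```python
def split_date(text, delimiter="/"):
    """Split a date string into its parts, like Text to Columns with a custom delimiter."""
    return text.split(delimiter)


# Hypothetical sample value in month/day/year order
month, day, year = split_date("9/4/1792")
print(month, day, year)
```

Each part comes back as text, which is also how the year is treated by the IF formula in the next step.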
9. IF Statement
After the columns are separated, the dates are ready for cleaning. To correct Date of Birth
years from the 1700s back to the 1900s, an IF formula is used:
=IF(LEFT(Q2,2)="17","19"&RIGHT(Q2,2),Q2)
This formula means that if the two leftmost digits in Q2 are “17”, they are replaced with “19”
and the last two digits are kept. Otherwise the value remains unchanged.
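The same rule can be written as a small Python function to make the logic explicit. This is a sketch that mirrors the IF formula above, assuming the year is held as a four-character text value. The name `fix_birth_year` is hypothetical.

```python
def fix_birth_year(year):
    """Mirror =IF(LEFT(Q2,2)="17","19"&RIGHT(Q2,2),Q2) for a year stored as text."""
    if year[:2] == "17":
        # Swap the century prefix and keep the last two digits
        return "19" + year[-2:]
    # Any other year is returned unchanged
    return year


print(fix_birth_year("1792"))  # 1992
print(fix_birth_year("1985"))  # 1985
```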
10. IF Statement Continued
Images: Date of Birth Temporary Column; Result of IF Formula
Explanation
1. The two images show the result of
the IF formula. No DOBs remain
in the 1700s.
2. After the formula has been filled down, copy
DOB Year Temp and then choose
Paste Values, which is the second
paste option.
3. This detaches the formula from
the column, leaving only the values.
11. One Last Step for DOB
• The Date of Birth column errors have been corrected, but the
date parts must be joined together again instead of sitting in
three separate columns. This is done with a concatenation
formula. The & operator, like the CONCAT function, joins values together.
• =N2&"/"&O2&"/"&P2
• As the previous slide mentioned, once the formula is completed
we want to detach it from the column, so the next step is
to copy DOB Temp and then choose Paste Values
on the next available column.
• The last thing to do is delete the separate DOB
columns.
The image shows the result of
the concatenation formula. The
three separate date columns
were returned to one column.
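As a rough equivalent of the concatenation formula, the three parts can be rejoined with a "/" separator in Python. The function name `join_date` is an assumption made for this sketch.

```python
def join_date(month, day, year):
    """Rejoin date parts with "/", like =N2&"/"&O2&"/"&P2."""
    # str() lets the parts be either numbers or text, as in Excel
    return "/".join(str(part) for part in (month, day, year))


print(join_date("9", "4", "1992"))  # 9/4/1992
```

As in Excel, the result of this kind of concatenation is text rather than a true date value.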
12. Doing It Again for Date of Joining
• The same steps (the Text to Columns wizard) are
performed to fix date errors in the Date of Joining
column; however, the formula is a little
different. Instead of changing years from “17”
to “19”, we want to change years from “21” to
“20”, so the formula looks like this:
• =IF(LEFT(M2,2)="21","20"&RIGHT(M2,2),M2)
• After the dates are corrected, DOJ Temp is
copied and then pasted as values.
• The last step is to concatenate and
join DOJ together again.
• After these two data cleaning processes are
completed, the dataset will look like the second
image.
13. Are We Ready to Move On?
The new DOB and Date of Joining
columns look as good as new, but are
they finished? A good habit to get
into is to check what was corrected
one more time. Doing so reveals
that the DOJ column still has some
future dates. There are only a
couple of instances of these dates, so
they can be fixed manually from “20”
to “19”, but if they were not caught,
the analysis would be incorrect.
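A re-check like this can also be automated. The sketch below flags any date that falls after a reference day, assuming dates are text in month/day/year form with four-digit years. The function name `find_future_dates`, the sample values, and the reference date are all hypothetical.

```python
from datetime import date, datetime


def find_future_dates(values, today=None):
    """Return (row_index, value) pairs whose date falls after `today`."""
    today = today or date.today()
    flagged = []
    for index, text in enumerate(values):
        parsed = datetime.strptime(text, "%m/%d/%Y").date()
        if parsed > today:
            flagged.append((index, text))
    return flagged


# Hypothetical Date of Joining values, checked against an assumed reference date
sample = ["9/9/2014", "3/15/2095", "1/2/1999"]
print(find_future_dates(sample, today=date(2024, 1, 1)))  # [(1, '3/15/2095')]
```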
14. Gender Column Correction
The gender column has a couple of
incomprehensible options. Since the
options are assumed to be only male
and female, a formula is used to
standardize the entries to “F” and “M”.
15. IF IS NUMBER
• =IF(ISNUMBER(SEARCH("F",K2)),"F","M")
• This formula first checks the text in K2 to
see whether it contains “F”. If it does, the
formula outputs “F”; if it does not,
the formula outputs “M”.
• The formula is a conditional function
whose condition is
ISNUMBER(SEARCH("F",K2)). SEARCH returns the
position of “F” in K2 when it is found and an
error otherwise, so ISNUMBER is TRUE only
when “F” is present.
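The behavior of the formula can be sketched in Python as follows. This is an illustration rather than the author's implementation, and the name `standardize_gender` is hypothetical. The lowercase comparison reflects the fact that Excel's SEARCH is not case-sensitive.

```python
def standardize_gender(value):
    """Mirror =IF(ISNUMBER(SEARCH("F",K2)),"F","M"); SEARCH ignores case."""
    return "F" if "f" in value.lower() else "M"


for raw in ["F!", "female", "M", ""]:
    print(repr(raw), "->", standardize_gender(raw))
```

Note that, exactly like the Excel formula, anything that does not contain an “F” is mapped to “M”, including blank cells, so this rule relies on the assumption that only two options exist.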
16. Gender Column Correction Last Step
• After the gender column is
corrected, it is copied and then
pasted as values to detach the
formulas from the column.
• After gender is pasted as
values, check that
everything looks okay, and
then that’s it!
• The two images show the
result of the IF ISNUMBER
formula.
17. Business Preferences and Insights
Sometimes businesses have preferences for how they want their data to
look. An example would be changing a 10-digit zip code to only 5 digits.
Also, for better business insights, a data analyst may derive additional
columns that give further information about the data. An example in this
dataset would be finding the ages of customers from the DOB column and
the years of membership from the Date of Joining column. The next
couple of slides show how to perform these functions.
18. 5 Digit Zip Code (LEFT)
• =LEFT(K2,5)
• This formula returns the first
5 characters of K2, counting
from the left.
• The result of the formula is
shown in the two images.
• A 5 Digit Zip Code column is
created.
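In Python, the same truncation is a simple slice. The sketch below assumes the zip code is available as text. The function name `five_digit_zip` and the sample value are hypothetical.

```python
def five_digit_zip(value):
    """Mirror =LEFT(K2,5): keep the first five characters."""
    return str(value)[:5]


print(five_digit_zip("12345-6789"))  # 12345
```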
19. Age Based on DOB Column
• =DATEDIF(I2,NOW(),"Y")
• This formula takes the date in cell I2
and converts it to an age in complete
years as of today. I2’s date is 9/4/1992,
so the DATEDIF formula should return 31 years.
• The second image shows the
conversion of date to years as of
today.
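The complete-years calculation that DATEDIF performs with the "Y" unit can be sketched in Python. The function name `complete_years` is hypothetical, and the reference date below is an assumption chosen so that the result matches the 31 years quoted above. In Excel, NOW() supplies the current date instead.

```python
from datetime import date


def complete_years(start, as_of):
    """Count whole years between two dates, like DATEDIF(start, as_of, "Y")."""
    years = as_of.year - start.year
    # Subtract one if the anniversary has not yet been reached in the as_of year
    if (as_of.month, as_of.day) < (start.month, start.day):
        years -= 1
    return years


as_of = date(2023, 12, 31)  # assumed reference date, not taken from the source
print(complete_years(date(1992, 9, 4), as_of))  # 31
```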
20. Membership Years Based on DOJ Column
• The membership years column is
created the same way the age
column was; the only difference
is the cell reference.
• =DATEDIF(M2,NOW(),"Y")
• The date in M2 is 9/9/14, so the
DATEDIF formula should convert the date
to 9 years.
• The second image shows the result
of the DATEDIF formula for the membership
years column.
21. Final Steps
Although the dataset has all the components
needed and all data is cleaned for
analysis, the dataset itself is
unorganized. DOB and Age should
be placed closer to the customer
information, and Date of Joining and
Membership Years should sit within the
dataset rather than spaced out from it.