Customer Data
Cleansing Project
Natalie Offenbacker
Introduction
It is often said that data analysts and data scientists spend about 80% of
their time cleaning the datasets given to them and only the remaining 20%
on actual analysis. This project involved cleaning and wrangling a
customer CSV file with about half a million records. This presentation goes
into detail on how it was done in Excel.
The Data
An Overview
The dataset shown in the previous slide has
500002 records and 10 columns, starting with
customer ID and ending with zip code. Even at
a glance, some errors are obvious. The column
title for email is mistyped, the gender column
contains an invalid category of F!, and the DOB
and Date of Joining columns have incorrect date types.
Excel Overview Continued
Even though there were glaring errors at the
beginning of the file, it is always good to
assume that there are more errors within the
dataset. Further investigation of the
customer CSV shows that there are many
occurrences of future dates in the Date of
Joining column, DOBs in the 1700s, and
gender options that are not comprehensible.
How Do We Fix It?
A data analyst can rarely fix errors quickly by hand.
Because it is unknown how many times these errors occur
within a dataset of half a million records, one of the
quicker ways to clean the data is to use formulas.
Text to Columns
The first step in fixing date errors in a large dataset is to split each
date into multiple columns. If this is not done, a formula written to fix
the date error has no way to tell which number in the date to fix.
To split a column by delimiter, use the Text to Columns button. As the
image shows, it is found on the Data tab.
Step One: Text to Columns
Continued
1. The Data tab in Excel has a button called Text to
Columns. When pressed, it opens a wizard that
converts text to multiple columns by delimiter.
2. In step two of the wizard, choose the “Other”
delimiter and type “/” in the “Other” box. The
preview shows how the column will be split.
3. In the final step of the wizard, select the
destination for the new data columns. In this
case the first free column, “$O$1”, was picked.
4. The rightmost snapshot is the result of the Text to
Columns wizard. Date of birth is separated into
three columns.
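For readers who want to see the same split outside Excel, here is a minimal Python sketch. It is only an analogue of the Text to Columns wizard, not part of the original workflow; the helper name split_date and the sample value are assumptions made for illustration.

```python
def split_date(text, delimiter="/"):
    """Split a date stored as text into its parts.

    Rough analogue of Text to Columns with "/" typed in the "Other" box.
    split_date is a hypothetical helper, not an Excel or library function.
    """
    return text.split(delimiter)


# Sample value chosen for illustration only.
parts = split_date("9/4/1992")
print(parts)  # ['9', '4', '1992']
```

Each element of the list corresponds to one of the three new columns the wizard produces.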
IF Statement
After the columns are separated, the dates are ready for cleaning. To correct Date of
Birth years from the 1700s back to the 1900s, an IF formula is used.
=IF(LEFT(Q2,2)="17","19"&RIGHT(Q2,2),Q2)
This formula means that if the two leftmost characters in Q2 are “17”, they are replaced
with “19” and the last two digits are kept. Otherwise the value remains unchanged.
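The logic of the IF formula can also be sketched in Python. This is an illustration under assumptions, not the method used in the project: the function name fix_century and its parameters are hypothetical, and unlike Excel it always returns text.

```python
def fix_century(year, wrong="17", right="19"):
    """Mirror =IF(LEFT(Q2,2)="17","19"&RIGHT(Q2,2),Q2) for a single year.

    fix_century is a hypothetical helper. The year is treated as text,
    so the result is always a string.
    """
    year = str(year)
    if year[:2] == wrong:          # LEFT(Q2,2)="17"
        return right + year[-2:]   # "19"&RIGHT(Q2,2)
    return year                    # otherwise leave the value unchanged


print(fix_century("1792"))  # 1992
print(fix_century("1985"))  # 1985
```

Passing wrong="21" and right="20" gives the same kind of correction for years that were typed one century too late.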
IF Statement Continued
Date of Birth Temporary Column
Result of IF Formula
1. The two images show the result of
the IF formula. No DOBs remain
in the 1700s.
2. After the formula is completed, copy
DOB Year Temp and then choose
Paste Values, which is the second
paste option.
3. This replaces the formulas in the
column with their results.
Explanation
One Last Step for DOB
• Date of Birth column errors have been corrected, but the
dates must be joined together again instead of being left in
three separate columns. This is done by concatenation. The &
operator, like the CONCAT function, joins values together.
• =N2&"/"&O2&"/"&P2
• As on the previous slide, once the formula is completed
we want to detach it from the column, so the next step is
to copy DOB Temp and then choose Paste Values into
the next available column.
• The last thing to do is delete the separate DOB
columns.
The image shows the result of
the concatenation formula. The
three separate date columns
were returned to one column.
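As a rough Python analogue of the concatenation step, the three parts can be joined back together with the "/" delimiter. The helper name join_date and the sample values are assumptions for illustration; as in Excel, the result is text rather than a true date value.

```python
def join_date(*parts, delimiter="/"):
    """Join date parts with a delimiter, like =N2&"/"&O2&"/"&P2.

    join_date is a hypothetical helper. The output is a string.
    """
    return delimiter.join(str(part) for part in parts)


print(join_date("9", "4", "1992"))  # 9/4/1992
```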
Doing It Again for Date of
Joining
• The same steps (starting with the Text to
Columns wizard) are performed to fix
date errors in the Date of Joining column,
but the formula is slightly different. Instead of
changing years from “17” to “19”, we want to
change years from “21” to “20”, so the formula
looks like this:
• =IF(LEFT(M2,2)="21","20"&RIGHT(M2,2),M2)
• After the dates are corrected, DOJ Temp is
copied and then pasted as values.
• The last step is to concatenate and
join DOJ together again.
• After these two data cleaning processes are
completed, the dataset will look like the second
image.
Are We Ready to Move On?
The new DOB and Date of Joining
columns look good as new, but are
they finished? A good habit to get
into is to check what was
corrected one more time. When
we do that, we notice
that the DOJ column still has some
future dates. There are only a
couple of instances of these dates, so
they can be fixed manually from “20”
to “19”, but if they had not been caught,
the analysis would be incorrect.
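A re-check like this can also be scripted. The Python sketch below flags any date later than a reference day. It is a sketch under stated assumptions: the function name find_future_dates is hypothetical, the dates are assumed to be text in month/day/four-digit-year order, and the sample values are made up.

```python
from datetime import date, datetime


def find_future_dates(values, today=None, fmt="%m/%d/%Y"):
    """Return (row, text) pairs for dates later than `today`.

    Assumes month/day/year text with a four-digit year. Row numbers
    start at 2 so they line up with a sheet that has a header in row 1.
    """
    today = today or date.today()
    flagged = []
    for row, text in enumerate(values, start=2):
        parsed = datetime.strptime(text, fmt).date()
        if parsed > today:
            flagged.append((row, text))
    return flagged


# Made-up sample values with a fixed reference date.
sample = ["9/9/2014", "3/15/2114", "1/2/1999"]
print(find_future_dates(sample, today=date(2024, 1, 1)))  # [(3, '3/15/2114')]
```

An empty list would suggest that no future dates remain in the column.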
Gender Column Correction
The gender column has a couple of
incomprehensible options. Since the
options are assumed to be only male
and female, a formula will be used to
correct them to “F” and “M”.
IF IS NUMBER
• =IF(ISNUMBER(SEARCH("F",K2)),"F","M")
• This formula first checks the text in K2 to
see if it contains “F”. If it does, the
formula outputs “F”; if it does not,
the formula outputs “M”.
• IF is a conditional function. Here the
condition is ISNUMBER(SEARCH("F",K2)):
SEARCH returns the position of “F” when it is
found and an error otherwise, so ISNUMBER
turns that result into TRUE or FALSE.
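The same rule can be sketched in Python to make its behavior easy to test. The function name normalize_gender is an assumption for illustration. Because Excel's SEARCH is not case-sensitive, the sketch upper-cases the value before looking for “F”.

```python
def normalize_gender(value):
    """Mirror =IF(ISNUMBER(SEARCH("F",K2)),"F","M").

    normalize_gender is a hypothetical helper. Any value containing
    "f" or "F" becomes "F"; everything else, including blanks,
    becomes "M".
    """
    return "F" if "F" in str(value).upper() else "M"


for raw in ["F!", "female", "M", ""]:
    print(repr(raw), "->", normalize_gender(raw))
```

Note that the rule sends every value without an “F” to “M”, blanks included, so it is only safe under the assumption stated earlier that the options are male and female.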
Gender Column Correction Last
Step
• After the gender column is
corrected, it is copied
and then pasted as
values to detach the
formulas from the column.
• After gender is pasted as
values, make sure that
everything looks okay, and
then that’s it!
• The two images show the
result of the IF IS NUMBER
formula.
Business Preferences and
Insights
Sometimes businesses have preferences for how they want their data to
look. An example would be to shorten a 10-digit zip code to only 5 digits.
Also, for better business insights, a data analyst may derive additional
columns that give further information about the data. An example in this dataset
would be to find the ages of customers based on the DOB column and
the years of membership based on the Date of Joining column. The next
couple of slides show how to perform these calculations.
5 Digit Zip Code (LEFT)
• =LEFT(K2,5)
• This formula takes the first 5
characters of K2, counting from
the left.
• The result of the formula is
shown in the two images.
• A 5 Digit Zip Code column is
created.
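A Python analogue of LEFT(K2,5) is a simple slice. The helper name five_digit_zip and the sample zip code are assumptions for illustration. The value is handled as text, which also keeps any leading zero intact.

```python
def five_digit_zip(value):
    """Keep the first five characters, like =LEFT(K2,5).

    five_digit_zip is a hypothetical helper. The result is text.
    """
    return str(value)[:5]


# Made-up sample value.
print(five_digit_zip("12345-6789"))  # 12345
```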
Age Based on DOB Column
• =DATEDIF(I2,NOW(),"Y")
• This formula takes the date in cell I2
and converts it to completed years as of today. I2’s
date is 9/4/1992, so at the time of this presentation
the DATEDIF formula should return 31 years.
• The second image shows the
conversion of date to years as of
today.
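To make the "completed years" idea concrete, here is a small Python sketch of the calculation. It is an approximation of DATEDIF with the "Y" unit, not a guaranteed match for every edge case such as February 29. The function name completed_years is hypothetical, the reference date is fixed so the output is reproducible, and 9/4/1992 is read as September 4, which is an assumption about the date order.

```python
from datetime import date


def completed_years(start, end):
    """Whole years elapsed between two dates, similar to DATEDIF(..., "Y")."""
    years = end.year - start.year
    # Subtract one year if the anniversary has not yet occurred in `end`'s year.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


reference = date(2024, 1, 1)  # fixed reference day chosen for illustration
print(completed_years(date(1992, 9, 4), reference))  # 31
```

The same helper works for any start date, such as a joining date, when the number of full years up to a given day is needed.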
Membership Years Based on DOJ
Column
• The membership years column is
derived the same way as the age
column. The only difference
is the cell reference.
• =DATEDIF(M2,NOW(),"Y")
• The date in M2 is 9/9/14, so the
DATEDIF formula should convert the date
to 9 years.
• The second image shows the result
of the DATEDIF formula for the membership
years column.
Final Steps
Although the dataset now has all the components
needed and all the data is cleaned for
analysis, the dataset itself is
unorganized. DOB and age should
be placed closer to the customer information,
and Date of Joining and Membership
Years should sit together within the
dataset rather than being spaced out.
Final Product
Thank you
Natalie Offenbacker
natkcn07@gmail.com

More Related Content

What's hot

Arena Model for Coffe Shop
Arena Model for Coffe ShopArena Model for Coffe Shop
Arena Model for Coffe Shop
Ebru Özmüş
 
R learning by examples
R learning by examplesR learning by examples
R learning by examples
Michelle Darling
 
MS excel - match function
MS excel - match functionMS excel - match function
MS excel - match function
Vincent Segondo
 
Excel Tutorials - Finding & Removing the Duplicate Values
Excel Tutorials - Finding & Removing the Duplicate ValuesExcel Tutorials - Finding & Removing the Duplicate Values
Excel Tutorials - Finding & Removing the Duplicate Values
Merve Nur Taş
 
Algoritma dan Struktur Data - Larik
Algoritma dan Struktur Data - LarikAlgoritma dan Struktur Data - Larik
Algoritma dan Struktur Data - Larik
Georgius Rinaldo
 
How to perform a Monte Carlo simulation
How to perform a Monte Carlo simulation How to perform a Monte Carlo simulation
How to perform a Monte Carlo simulation
Financial Modelling Handbook
 
Excel Database Function
Excel Database FunctionExcel Database Function
Excel Database Function
Anita Shah
 
Field properties
Field propertiesField properties
Field properties
Dastan Kamaran
 
Excel presentation data validation
Excel presentation   data validationExcel presentation   data validation
Excel presentation data validation
Nagamani Y R
 
Excel 2010 autosum
Excel 2010 autosumExcel 2010 autosum
Excel 2010 autosum
um5ashm
 
Data Analysis with MS Excel.pptx
Data Analysis with MS Excel.pptxData Analysis with MS Excel.pptx
Data Analysis with MS Excel.pptx
Kouros Goodarzi
 
6 sistem bilangan
6 sistem bilangan6 sistem bilangan
6 sistem bilangan
teddyhadia
 
Introduction to Excel - Excel 2013 Tutorial
Introduction to Excel - Excel 2013 TutorialIntroduction to Excel - Excel 2013 Tutorial
Introduction to Excel - Excel 2013 Tutorial
SpreadsheetTrainer
 
Zero is even or odd, Zero is positive or negative ?
Zero is even or odd, Zero is positive or negative ?Zero is even or odd, Zero is positive or negative ?
Zero is even or odd, Zero is positive or negative ?
Malik Ghulam Murtza
 
Ms excel 2016_function
Ms excel 2016_functionMs excel 2016_function
Ms excel 2016_function
Paktia University
 
Pivot Tables
Pivot TablesPivot Tables
Pivot Tables
gjonesnemeth
 
Pivot table
Pivot tablePivot table
Pivot table
Vijay Perepa
 
Ms excell
Ms excellMs excell
Ms excell
usmankhaliq6
 
Excel
ExcelExcel
Excel
arbiesani
 

What's hot (20)

Arena Model for Coffe Shop
Arena Model for Coffe ShopArena Model for Coffe Shop
Arena Model for Coffe Shop
 
R learning by examples
R learning by examplesR learning by examples
R learning by examples
 
MS excel - match function
MS excel - match functionMS excel - match function
MS excel - match function
 
Excel Tutorials - Finding & Removing the Duplicate Values
Excel Tutorials - Finding & Removing the Duplicate ValuesExcel Tutorials - Finding & Removing the Duplicate Values
Excel Tutorials - Finding & Removing the Duplicate Values
 
Algoritma dan Struktur Data - Larik
Algoritma dan Struktur Data - LarikAlgoritma dan Struktur Data - Larik
Algoritma dan Struktur Data - Larik
 
How to perform a Monte Carlo simulation
How to perform a Monte Carlo simulation How to perform a Monte Carlo simulation
How to perform a Monte Carlo simulation
 
Excel Database Function
Excel Database FunctionExcel Database Function
Excel Database Function
 
Field properties
Field propertiesField properties
Field properties
 
Excel presentation data validation
Excel presentation   data validationExcel presentation   data validation
Excel presentation data validation
 
Excel 2010 autosum
Excel 2010 autosumExcel 2010 autosum
Excel 2010 autosum
 
Data Analysis with MS Excel.pptx
Data Analysis with MS Excel.pptxData Analysis with MS Excel.pptx
Data Analysis with MS Excel.pptx
 
6 sistem bilangan
6 sistem bilangan6 sistem bilangan
6 sistem bilangan
 
Introduction to Excel - Excel 2013 Tutorial
Introduction to Excel - Excel 2013 TutorialIntroduction to Excel - Excel 2013 Tutorial
Introduction to Excel - Excel 2013 Tutorial
 
Zero is even or odd, Zero is positive or negative ?
Zero is even or odd, Zero is positive or negative ?Zero is even or odd, Zero is positive or negative ?
Zero is even or odd, Zero is positive or negative ?
 
Modul uml
Modul umlModul uml
Modul uml
 
Ms excel 2016_function
Ms excel 2016_functionMs excel 2016_function
Ms excel 2016_function
 
Pivot Tables
Pivot TablesPivot Tables
Pivot Tables
 
Pivot table
Pivot tablePivot table
Pivot table
 
Ms excell
Ms excellMs excell
Ms excell
 
Excel
ExcelExcel
Excel
 

Similar to Customer Data Cleansing Project.pptx

Fahri tugas cloud1
Fahri tugas cloud1Fahri tugas cloud1
Fahri tugas cloud1
FAHRIZAENURIPUTRA
 
EDA_Assignment_Sourabh S Hubballi.pdf
EDA_Assignment_Sourabh S Hubballi.pdfEDA_Assignment_Sourabh S Hubballi.pdf
EDA_Assignment_Sourabh S Hubballi.pdf
SourabhH1
 
Lecture 4-Prepare data-Clean, transform, and load data in Power BI.pptx
Lecture 4-Prepare data-Clean, transform, and load data in Power BI.pptxLecture 4-Prepare data-Clean, transform, and load data in Power BI.pptx
Lecture 4-Prepare data-Clean, transform, and load data in Power BI.pptx
edieali1
 
ODS Data Sleuth: Tracking Down Calculated Fields in Banner
ODS Data Sleuth: Tracking Down Calculated Fields in BannerODS Data Sleuth: Tracking Down Calculated Fields in Banner
ODS Data Sleuth: Tracking Down Calculated Fields in Banner
Bryan L. Mack
 
Part 1 of 1 -Question 1 of 205.0 PointsThe first step anyo.docx
Part 1 of 1 -Question 1 of 205.0 PointsThe first step anyo.docxPart 1 of 1 -Question 1 of 205.0 PointsThe first step anyo.docx
Part 1 of 1 -Question 1 of 205.0 PointsThe first step anyo.docx
danhaley45372
 
Cis336 i lab 1 of 7
Cis336 i lab 1 of 7Cis336 i lab 1 of 7
Cis336 i lab 1 of 7
helpido9
 
E-Book 25 Tips and Tricks MS Excel Functions & Formulaes
E-Book 25 Tips and Tricks MS Excel Functions & FormulaesE-Book 25 Tips and Tricks MS Excel Functions & Formulaes
E-Book 25 Tips and Tricks MS Excel Functions & Formulaes
BurCom Consulting Ltd.
 
USING MICROSOFT EXCEL 2016 Independent Project 6-5 (Mac 2016) 
USING MICROSOFT EXCEL 2016 Independent Project 6-5 (Mac 2016) USING MICROSOFT EXCEL 2016 Independent Project 6-5 (Mac 2016) 
USING MICROSOFT EXCEL 2016 Independent Project 6-5 (Mac 2016) 
heiditownend
 
Sql Lab 4 Essay
Sql Lab 4 EssaySql Lab 4 Essay
Sql Lab 4 Essay
Lorie Harris
 
How To Automate Part 2
How To Automate Part 2How To Automate Part 2
How To Automate Part 2
Sean Durocher
 
284566820 1 z0-061(1)
284566820 1 z0-061(1)284566820 1 z0-061(1)
284566820 1 z0-061(1)
panagara
 
Recipe 5 of Data Warehouse and Business Intelligence - The null values manage...
Recipe 5 of Data Warehouse and Business Intelligence - The null values manage...Recipe 5 of Data Warehouse and Business Intelligence - The null values manage...
Recipe 5 of Data Warehouse and Business Intelligence - The null values manage...
Massimo Cenci
 
LabsLab5Lab5_Excel_SH.htmlLab 5 SpreadsheetsLearning Outcomes.docx
LabsLab5Lab5_Excel_SH.htmlLab 5 SpreadsheetsLearning Outcomes.docxLabsLab5Lab5_Excel_SH.htmlLab 5 SpreadsheetsLearning Outcomes.docx
LabsLab5Lab5_Excel_SH.htmlLab 5 SpreadsheetsLearning Outcomes.docx
DIPESH30
 
1. What two conditions must be met before an entity can be classif.docx
1. What two conditions must be met before an entity can be classif.docx1. What two conditions must be met before an entity can be classif.docx
1. What two conditions must be met before an entity can be classif.docx
jackiewalcutt
 
In Section 1 on the Data page, complete each column of the spreads.docx
In Section 1 on the Data page, complete each column of the spreads.docxIn Section 1 on the Data page, complete each column of the spreads.docx
In Section 1 on the Data page, complete each column of the spreads.docx
sleeperharwell
 
Scoring documentation
Scoring documentationScoring documentation
Scoring documentation
Fatima Khalid
 
1Z0-061 Oracle Database 12c: SQL Fundamentals
1Z0-061 Oracle Database 12c: SQL Fundamentals1Z0-061 Oracle Database 12c: SQL Fundamentals
1Z0-061 Oracle Database 12c: SQL Fundamentals
Lydi00147
 
Question 1Which view does not display the data, but allows you t.docx
Question 1Which view does not display the data, but allows you t.docxQuestion 1Which view does not display the data, but allows you t.docx
Question 1Which view does not display the data, but allows you t.docx
JUST36
 
Grader - InstructionsExcel 2019 ProjectExcel_7G_Loan_Flowers_Staf.docx
Grader - InstructionsExcel 2019 ProjectExcel_7G_Loan_Flowers_Staf.docxGrader - InstructionsExcel 2019 ProjectExcel_7G_Loan_Flowers_Staf.docx
Grader - InstructionsExcel 2019 ProjectExcel_7G_Loan_Flowers_Staf.docx
greg1eden90113
 
50 MS Excel Tips and Tricks
50 MS Excel Tips and Tricks 50 MS Excel Tips and Tricks
50 MS Excel Tips and Tricks
BurCom Consulting Ltd.
 

Similar to Customer Data Cleansing Project.pptx (20)

Fahri tugas cloud1
Fahri tugas cloud1Fahri tugas cloud1
Fahri tugas cloud1
 
EDA_Assignment_Sourabh S Hubballi.pdf
EDA_Assignment_Sourabh S Hubballi.pdfEDA_Assignment_Sourabh S Hubballi.pdf
EDA_Assignment_Sourabh S Hubballi.pdf
 
Lecture 4-Prepare data-Clean, transform, and load data in Power BI.pptx
Lecture 4-Prepare data-Clean, transform, and load data in Power BI.pptxLecture 4-Prepare data-Clean, transform, and load data in Power BI.pptx
Lecture 4-Prepare data-Clean, transform, and load data in Power BI.pptx
 
ODS Data Sleuth: Tracking Down Calculated Fields in Banner
ODS Data Sleuth: Tracking Down Calculated Fields in BannerODS Data Sleuth: Tracking Down Calculated Fields in Banner
ODS Data Sleuth: Tracking Down Calculated Fields in Banner
 
Part 1 of 1 -Question 1 of 205.0 PointsThe first step anyo.docx
Part 1 of 1 -Question 1 of 205.0 PointsThe first step anyo.docxPart 1 of 1 -Question 1 of 205.0 PointsThe first step anyo.docx
Part 1 of 1 -Question 1 of 205.0 PointsThe first step anyo.docx
 
Cis336 i lab 1 of 7
Cis336 i lab 1 of 7Cis336 i lab 1 of 7
Cis336 i lab 1 of 7
 
E-Book 25 Tips and Tricks MS Excel Functions & Formulaes
E-Book 25 Tips and Tricks MS Excel Functions & FormulaesE-Book 25 Tips and Tricks MS Excel Functions & Formulaes
E-Book 25 Tips and Tricks MS Excel Functions & Formulaes
 
USING MICROSOFT EXCEL 2016 Independent Project 6-5 (Mac 2016) 
USING MICROSOFT EXCEL 2016 Independent Project 6-5 (Mac 2016) USING MICROSOFT EXCEL 2016 Independent Project 6-5 (Mac 2016) 
USING MICROSOFT EXCEL 2016 Independent Project 6-5 (Mac 2016) 
 
Sql Lab 4 Essay
Sql Lab 4 EssaySql Lab 4 Essay
Sql Lab 4 Essay
 
How To Automate Part 2
How To Automate Part 2How To Automate Part 2
How To Automate Part 2
 
284566820 1 z0-061(1)
284566820 1 z0-061(1)284566820 1 z0-061(1)
284566820 1 z0-061(1)
 
Recipe 5 of Data Warehouse and Business Intelligence - The null values manage...
Recipe 5 of Data Warehouse and Business Intelligence - The null values manage...Recipe 5 of Data Warehouse and Business Intelligence - The null values manage...
Recipe 5 of Data Warehouse and Business Intelligence - The null values manage...
 
LabsLab5Lab5_Excel_SH.htmlLab 5 SpreadsheetsLearning Outcomes.docx
LabsLab5Lab5_Excel_SH.htmlLab 5 SpreadsheetsLearning Outcomes.docxLabsLab5Lab5_Excel_SH.htmlLab 5 SpreadsheetsLearning Outcomes.docx
LabsLab5Lab5_Excel_SH.htmlLab 5 SpreadsheetsLearning Outcomes.docx
 
1. What two conditions must be met before an entity can be classif.docx
1. What two conditions must be met before an entity can be classif.docx1. What two conditions must be met before an entity can be classif.docx
1. What two conditions must be met before an entity can be classif.docx
 
In Section 1 on the Data page, complete each column of the spreads.docx
In Section 1 on the Data page, complete each column of the spreads.docxIn Section 1 on the Data page, complete each column of the spreads.docx
In Section 1 on the Data page, complete each column of the spreads.docx
 
Scoring documentation
Scoring documentationScoring documentation
Scoring documentation
 
1Z0-061 Oracle Database 12c: SQL Fundamentals
1Z0-061 Oracle Database 12c: SQL Fundamentals1Z0-061 Oracle Database 12c: SQL Fundamentals
1Z0-061 Oracle Database 12c: SQL Fundamentals
 
Question 1Which view does not display the data, but allows you t.docx
Question 1Which view does not display the data, but allows you t.docxQuestion 1Which view does not display the data, but allows you t.docx
Question 1Which view does not display the data, but allows you t.docx
 
Grader - InstructionsExcel 2019 ProjectExcel_7G_Loan_Flowers_Staf.docx
Grader - InstructionsExcel 2019 ProjectExcel_7G_Loan_Flowers_Staf.docxGrader - InstructionsExcel 2019 ProjectExcel_7G_Loan_Flowers_Staf.docx
Grader - InstructionsExcel 2019 ProjectExcel_7G_Loan_Flowers_Staf.docx
 
50 MS Excel Tips and Tricks
50 MS Excel Tips and Tricks 50 MS Excel Tips and Tricks
50 MS Excel Tips and Tricks
 

Recently uploaded

The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
wyddcwye1
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
SaffaIbrahim1
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 

Recently uploaded (20)

The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 

Customer Data Cleansing Project.pptx

  • 2. Introduction It is said that 80% of the time data analysts and data scientists are cleaning the datasets given to them and the remaining 20% is the actual analyzing. This project was spent cleaning and wrangling a customer csv file with a half a million records. This presentation will go into detail on how it was done in Excel. Customer Data Cleansing Project 2
  • 3. The Data Customer Data Cleansing Project 3
  • 4. An Overview The dataset shown in the previous slide has 500002 records and 10 columns starting with customer ID and ending with zip code. When taking a glance at the data, right off the bat there are some obvious errors. The column title for email is incorrectly typed, the gender column has an incorrect data category of F!, and DOB and Date of Joining has incorrect date types. Customer Data Cleansing Project 4
  • 5. Excel Overview Continued Even though there were glaring errors in the beginning of the file, it is always good to assume that there are more errors within the dataset. Further investigation within the customer csv shows that there are many occurrences of future dates in the Date of Joining column, DOB in the 1700s, and gender options that are not comprehendible. Customer Data Cleansing Project 5
  • 6. How Do We Fix It? Very rarely will there be an occurrence where a data analyst can fix something quickly and manually. Because it is unknown how many occurrences these errors are found within the dataset of a half a million records, one of the quicker ways to fix and clean the data is using formulas. Customer Data Cleansing Project 6
  • 7. Text to Columns The first step in the process of fixing date errors in a large dataset would be to split data into multiple columns. If this is not done, when formula is typed to fix date error it would not recognize which number to fix. In this situation the first step would be to find text to column button to split columns by delimiter. As the image shows, the user would navigate towards the data tab and then find the button. Customer Data Cleansing Project 7
  • 8. Step One: Text to Columns Continued 1. The data tab in excel has a button called text to columns. When pressed, it then guides the user to a wizard to convert text to multiple columns by delimiter. 2. In this case columns would be spaced by “other” delimiter previewed in step two and the delimiter “/” would be typed in “other” box 3. In the final step of the wizard the user would select column for the new data columns. In this case the first free column was picked “$O$1” 4. The right most snapshot is the result of text to column wizard. Date of birth is separated into three columns.
  • 9. IF Statement After the columns are separated, the dates are ready for cleaning. To correct Date of Birth years from the 1700s back to the 1900s, an IF formula is used: =IF(LEFT(Q2,2)="17","19"&RIGHT(Q2,2),Q2) This formula means that if the two leftmost characters in Q2 are “17”, they are changed to “19” and the last two characters are kept. Otherwise the value remains unchanged.
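
The IF formula on this slide can be mirrored outside Excel. The following is a minimal Python sketch, not the author's method: the function name `fix_century` and the sample year strings are hypothetical, and the year is assumed to be stored as text, as it is after Text to Columns.

```python
def fix_century(year: str, wrong: str = "17", right: str = "19") -> str:
    """Mimic =IF(LEFT(Q2,2)="17","19"&RIGHT(Q2,2),Q2) for one year string."""
    # LEFT(Q2,2) corresponds to the first two characters of the string.
    if year[:2] == wrong:
        # "19"&RIGHT(Q2,2): new century prefix plus the last two characters.
        return right + year[-2:]
    # Else branch: leave the value unchanged.
    return year


# Hypothetical sample values, not taken from the dataset.
years = ["1792", "1985", "1701", "2001"]
fixed = [fix_century(y) for y in years]
print(fixed)  # ['1992', '1985', '1901', '2001']
```

Passing `wrong="21"` and `right="20"` gives the equivalent of the Date of Joining formula used later in the presentation.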
  • 10. IF Statement Continued (Images: Date of Birth Temporary Column; Result of IF Formula) Explanation: 1. The two images show the result of the IF formula. There are no longer any DOBs with dates in the 1700s. 2. After the formula is completed, copy DOB Year Temp and then choose Paste Values, which is the second paste option. 3. This removes the formula from the column, leaving only the values.
  • 11. One Last Step for DOB • The Date of Birth column errors have been corrected, but the dates must be joined together again instead of sitting in three separate columns. This is done by concatenation, here with the & operator (the CONCAT function does the same job). • =N2&"/"&O2&"/"&P2 • As the previous slide mentioned, once the formula is completed we want to remove the formula from the column, so the next step is to copy DOB Temp and then choose Paste Values on the next available column. • The last thing to do is delete the separate DOB columns. The image shows the result of the concatenation formula: the three separate date columns are returned to one column.
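
The whole split, fix, and rejoin sequence for DOB can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the workflow used in the presentation: the function name `repair_dob` is hypothetical, and each date is assumed to be a text value with exactly three "/"-separated parts whose last part is the year.

```python
def repair_dob(text: str) -> str:
    """Split a date on "/", fix a 17xx year to 19xx, and join it back."""
    # Text to Columns with the "/" delimiter: three separate parts.
    first, second, year = text.split("/")
    # Same rule as the IF formula: replace a leading "17" with "19".
    if year[:2] == "17":
        year = "19" + year[-2:]
    # Equivalent of =N2&"/"&O2&"/"&P2: rejoin the parts with "/".
    return "/".join([first, second, year])


# Hypothetical sample values, not taken from the dataset.
print(repair_dob("9/4/1792"))
print(repair_dob("12/30/1985"))
```

Doing the split and the join inside one function avoids the temporary columns and the Paste Values step that the Excel approach needs.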
  • 12. Doing It Again for Date of Joining • The same steps (starting with the Text to Columns wizard) are performed to fix date errors in the Date of Joining column, but the formula is a little different. Instead of changing years from “17” to “19”, we want to change them from “21” to “20”, so the formula looks like this: • =IF(LEFT(M2,2)="21","20"&RIGHT(M2,2),M2) • After the dates are corrected, DOJ Temp is copied and then pasted as values. • The last step is to concatenate the DOJ columns back into one. • After these two data cleaning processes are completed, the dataset will look like the second image.
  • 13. Are We Ready to Move On? The new DOB and Date of Joining columns look good as new, but are they finished? It is a good habit to check over what was corrected one more time. Doing so shows that the DOJ column still has some future dates. There are only a couple of instances of these dates, so they can be fixed manually from “20” to “19”, but if they had not been caught, the analysis would be incorrect.
  • 14. Gender Column Correction The gender column has a couple of incomprehensible values. Since the valid options are assumed to be only male and female, a formula will be used to standardize them to “F” and “M”.
  • 15. IF IS NUMBER • =IF(ISNUMBER(SEARCH("F",K2)),"F","M") • This formula first checks the text in K2 to see whether it contains “F”. If it does, the formula outputs “F”; if it does not, the formula outputs “M”. • The condition of the IF is ISNUMBER(SEARCH("F",K2)): SEARCH returns the position of “F” when it is found and an error otherwise, so ISNUMBER turns that result into TRUE or FALSE.
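
The same rule can be written as a short Python sketch. The function name `normalize_gender` and the sample values are hypothetical. Because Excel's SEARCH is case-insensitive, the sketch lowercases the text before checking it. Note that this rule maps every value without an "F", including blanks, to "M", which is only safe under the slide's assumption that male and female are the only valid options.

```python
def normalize_gender(value: str) -> str:
    """Mimic =IF(ISNUMBER(SEARCH("F",K2)),"F","M") for one cell value."""
    # SEARCH ignores case, so compare against a lowercased copy.
    return "F" if "f" in value.lower() else "M"


# Hypothetical sample values; "F!" is the invalid category mentioned earlier.
raw = ["F", "F!", "female", "M", "Male", ""]
print([normalize_gender(v) for v in raw])  # ['F', 'F', 'F', 'M', 'M', 'M']
```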
  • 16. Gender Column Correction Last Step • After the gender column is corrected, it is copied and then pasted as values to detach the formulas from the column. • After gender is pasted as values, make sure that everything looks okay, and then that’s it! • The two images show the result of the IF IS NUMBER formula.
  • 17. Business Preferences and Insights Sometimes businesses have preferences for how they want their data to look. An example would be shortening a 10-digit zip code to only 5 digits. Also, for better business insights, a data analyst may derive additional columns that give further information about the data. An example in this dataset would be to find the ages of customers based on the DOB column and the years of membership based on the Date of Joining column. The next couple of slides show how to perform these functions.
  • 18. 5 Digit Zip Code (TRIM) • =LEFT(K2,5) • This formula takes the first 5 characters of K2, counting from the left. • The result of the formula is shown in the two images. • The 5 Digit Zip Code column is created.
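
For comparison, LEFT(K2,5) corresponds to taking the first five characters of a string. The Python sketch below assumes zip codes are kept as text so that any leading zeros survive. The function name `five_digit_zip` and the sample value are hypothetical.

```python
def five_digit_zip(zip_code: str) -> str:
    """Mimic =LEFT(K2,5): keep only the first five characters."""
    return zip_code[:5]


# Hypothetical sample value, not taken from the dataset.
print(five_digit_zip("02134-1000"))  # 02134
```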
  • 19. Age Based on DOB Column • =DATEDIF(I2,NOW(),"Y") • This formula takes the date in cell I2 and converts it to the number of complete years as of today. I2’s date is 9/4/1992, so at the time of this presentation DATEDIF should return 31 years. • The second image shows the conversion of date to years as of today.
  • 20. Membership Years Based on DOJ Column • The membership years column is calculated the same way as the age column; the only difference is the cell reference. • =DATEDIF(M2,NOW(),"Y") • The date in M2 is 9/9/14, so at the time of this presentation DATEDIF should return 9 years. • The second image shows the result of the DATEDIF formula for the membership years column.
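
The "Y" unit of DATEDIF counts complete years between two dates. A rough Python equivalent is sketched below. The function name `complete_years` is hypothetical, the dates 9/4/1992 and 9/9/14 are read as month/day/year (an assumption), and the reference date of 1 October 2023 is chosen only for illustration in place of NOW(). Edge cases such as 29 February may not match Excel exactly.

```python
from datetime import date


def complete_years(start: date, end: date) -> int:
    """Approximate DATEDIF(start, end, "Y"): whole years elapsed."""
    years = end.year - start.year
    # Subtract one if the anniversary has not yet occurred in the end year.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


# Assumed reference date standing in for NOW(); not from the source.
as_of = date(2023, 10, 1)
print(complete_years(date(1992, 9, 4), as_of))  # 31 (age)
print(complete_years(date(2014, 9, 9), as_of))  # 9 (membership years)
```

Using a fixed reference date instead of today's date keeps the result reproducible; in Excel, NOW() makes the value change every time the workbook recalculates.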
  • 21. Final Steps Although the dataset now has all the components it needs and all the data is cleaned for analysis, the dataset itself is unorganized. DOB and age should sit closer to the customer information, and Date of Joining and Membership Years should be placed within the dataset rather than spaced out.
  • 22. Final Product