Presented by:
 Kunal Jain (071309)
Under the guidance of
 Mr. Praveen Kumar Tripathi
 Dept of CSE & IT (JUIT)
 Introduction
 Steps in Data Cleansing
 Conclusion
 References
“A company’s most important asset is information. A
 corporation’s ability to compete, adapt, and grow in a
 business climate of rapid change is dependent in large
 measure on how well the company uses information
 to make decisions. Sharing information that isn’t
 clean and consolidated to the fullest extent can
 substantially reduce the effectiveness of a system of
 significant investment and considerable pay-off
 potential.”
   Data cleansing, or data scrubbing, is the act of
    detecting and correcting (or removing) corrupt or
    inaccurate records from a record set, table, or
    database. Used mainly in databases, the term
    refers to identifying incomplete, incorrect,
    inaccurate, or irrelevant parts of the data and
    then replacing, modifying, or deleting this
    dirty data.
•Data cleansing can occur within a single set of records, or
between multiple sets of data which need to be merged, or
which will work together.

•Typos and spelling errors are corrected, mislabeled data
is properly labeled and filed, and incomplete or missing
entries are completed.

•In more complex operations, data cleansing can be
performed by computer programs. These data cleansing
programs can check the data with a variety of rules and
procedures decided upon by the user.
•The goal of data cleansing is not just to clean up the data
in a database but also to bring consistency to different sets
of data that have been merged from separate databases.
Dummy Values,
Absence of Data,
Multipurpose Fields,
Cryptic Data,
Contradicting Data,
Inappropriate Use of Address Lines,
Violation of Business Rules,
Reused Primary Keys,
Non-Unique Identifiers, and
Data Integration Problems
Parsing
Correcting
Standardizing
Matching
Consolidating
Parsing locates and identifies individual data
elements in the source files and then isolates
these data elements in the target files.
Input Data from Source File      Parsed Data in Target File
Beth Christine Parker, SLS MGR   First Name:      Beth
Regional Port Authority          Middle Name:     Christine
Federal Building                 Last Name:       Parker
12800 Lake Calumet               Title:           SLS MGR
Hedgewisch, IL                   Firm:            Regional Port Authority
                                 Location:        Federal Building
                                 Number:          12800
                                 Street:          Lake Calumet
                                 City:            Hedgewisch
                                 State:           IL
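
The parsing step for the record above can be sketched in Python as follows. This is a minimal sketch, not the method of any particular tool: the function name `parse_record` and the fixed five-line layout (name and title, firm, location, number and street, city and state) are assumptions made for illustration, and real parsers must cope with far more variation.

```python
def parse_record(raw: str) -> dict:
    """Split a five-line contact block into named fields.

    Assumed layout (illustrative only):
      'First Middle Last, Title' / firm / location / 'Number Street' / 'City, ST'
    """
    lines = [line.strip() for line in raw.strip().splitlines()]
    # Line 1: the name comes before the first comma, the title after it.
    name_part, title = [part.strip() for part in lines[0].split(",", 1)]
    first, middle, last = name_part.split()
    # Line 4: the house number is the first token, the rest is the street.
    number, street = lines[3].split(" ", 1)
    # Line 5: city and state are separated by a comma.
    city, state = [part.strip() for part in lines[4].split(",", 1)]
    return {
        "first_name": first,
        "middle_name": middle,
        "last_name": last,
        "title": title,
        "firm": lines[1],
        "location": lines[2],
        "number": number,
        "street": street,
        "city": city,
        "state": state,
    }


SOURCE_RECORD = """\
Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL
"""

parsed = parse_record(SOURCE_RECORD)
```

Note that parsing only isolates the elements. The wrong street and city are carried over unchanged and are left for the correcting step.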
Correcting fixes parsed individual data components
using sophisticated data algorithms and
secondary data sources.
Parsed Data                              Corrected Data
First Name:    Beth                      First Name:      Beth
Middle Name:   Christine                 Middle Name:     Christine
Last Name:     Parker                    Last Name:       Parker
Title:         SLS MGR                   Title:           SLS MGR
Firm:          Regional Port Authority   Firm:            Regional Port Authority
Location:      Federal Building          Location:        Federal Building
Number:        12800                     Number:          12800
Street:        Lake Calumet              Street:          South Butler Drive
City:          Hedgewisch                City:            Chicago
State:         IL                        State:           IL
                                         Zip:             60633
                                         Zip+Four:        2398
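
A minimal sketch of the correcting step, assuming the secondary data source is a small in-memory table keyed on firm and state. The table `ADDRESS_REFERENCE`, its key, and the function `correct_address` are hypothetical; in practice the lookup would go to something like a postal or business directory, and the only entry here is taken from the slide example.

```python
# Hypothetical secondary data source: (firm in lower case, state) -> verified address.
ADDRESS_REFERENCE = {
    ("regional port authority", "IL"): {
        "street": "South Butler Drive",
        "city": "Chicago",
        "zip": "60633",
        "zip4": "2398",
    },
}


def correct_address(record: dict) -> dict:
    """Return a copy of the record with address fields overwritten by the reference data.

    Records with no entry in the reference table are returned unchanged.
    """
    key = (record["firm"].lower(), record["state"])
    corrected = dict(record)
    corrected.update(ADDRESS_REFERENCE.get(key, {}))
    return corrected


parsed_record = {
    "firm": "Regional Port Authority",
    "number": "12800",
    "street": "Lake Calumet",
    "city": "Hedgewisch",
    "state": "IL",
}
corrected_record = correct_address(parsed_record)
```

The input record is copied rather than modified, so the parsed values stay available for auditing what the correction changed.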
Standardizing applies conversion routines to
transform data into its preferred (and
consistent) format using both standard and
custom business rules.
Corrected Data                             Standardized Data
First Name:      Beth                      Pre-name:        Ms.
Middle Name:     Christine                 First Name:      Beth
Last Name:       Parker                    1st Name Match
Title:           SLS MGR                    Standards:      Elizabeth, Bethany, Bethel
Firm:            Regional Port Authority   Middle Name:     Christine
Location:        Federal Building          Last Name:       Parker
Number:          12800                     Title:           Sales Mgr.
Street:          South Butler Drive        Firm:            Regional Port Authority
City:            Chicago                   Location:        Federal Building
State:           IL                        Number:          12800
Zip:             60633                     Street:          S. Butler Dr.
Zip+Four:        2398                      City:            Chicago
                                           State:           IL
                                           Zip:             60633
                                           Zip+Four:        2398
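
The conversion routines can be sketched as simple lookup tables. This is an illustration under stated assumptions: the rule tables `TITLE_RULES`, `STREET_ABBREVIATIONS`, and `NAME_STANDARDS` and the function `standardize` are hypothetical and contain only the values visible in the example above. Deriving the pre-name ("Ms.") is left out because the slides do not say how it is obtained.

```python
# Hypothetical business rules, filled only with values from the slide example.
TITLE_RULES = {"SLS MGR": "Sales Mgr."}
STREET_ABBREVIATIONS = {"South": "S.", "Drive": "Dr."}
NAME_STANDARDS = {"Beth": ["Elizabeth", "Bethany", "Bethel"]}


def standardize(record: dict) -> dict:
    """Return a copy of the record with title and street in the preferred format."""
    out = dict(record)
    # Replace a known title abbreviation, otherwise keep the title as it is.
    out["title"] = TITLE_RULES.get(record["title"], record["title"])
    # Abbreviate street words one at a time.
    out["street"] = " ".join(
        STREET_ABBREVIATIONS.get(word, word) for word in record["street"].split()
    )
    # Attach the alternative first names used later for matching.
    out["name_standards"] = NAME_STANDARDS.get(record["first_name"], [])
    return out


standardized = standardize(
    {"first_name": "Beth", "title": "SLS MGR", "street": "South Butler Drive"}
)
```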
Matching searches for and matches records within
and across the parsed, corrected, and standardized
data, based on predefined business rules, to
eliminate duplicates.
Business   Street   Branch   Customer    City    Vendor   Pattern   Pattern
Name                Type     #/Tax ID            Code               I.D.

Exact      Exact    Exact    Exact       Exact   Exact    AAAAAA    P110
Exact      VClose   Exact    VClose      Exact   Blanks   ABAAA-    P115
Exact      VClose   Exact    Blanks      Exact   Exact    ABA-AA    P120
Exact      VClose   Close    Close       Exact   Exact    ABCCAA    S300
VClose     VClose   Exact    Close       Exact   Exact    BBACAA    S310
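
One way to read the table above is that each field comparison is graded and the grades are concatenated into a pattern string, which is then looked up to get a pattern I.D. The sketch below assumes the letter coding Exact = A, VClose = B, Close = C, Blanks = "-", which is inferred from the table rows rather than stated in the slides. The names `GRADE_CODES`, `PATTERN_IDS`, `match_pattern`, and `pattern_id` are hypothetical.

```python
# Assumed coding of per-field match grades, inferred from the table.
GRADE_CODES = {"Exact": "A", "VClose": "B", "Close": "C", "Blanks": "-"}

# Pattern strings and I.D.s copied from the table.
PATTERN_IDS = {
    "AAAAAA": "P110",
    "ABAAA-": "P115",
    "ABA-AA": "P120",
    "ABCCAA": "S300",
    "BBACAA": "S310",
}


def match_pattern(grades: list) -> str:
    """Concatenate the per-field grades into a pattern string.

    Field order: business name, street, branch type, customer #/tax ID,
    city, vendor code.
    """
    return "".join(GRADE_CODES[grade] for grade in grades)


def pattern_id(grades: list):
    """Return the pattern I.D. for a known pattern, or None if no rule covers it."""
    return PATTERN_IDS.get(match_pattern(grades))


example_grades = ["Exact", "VClose", "Close", "Close", "Exact", "Exact"]
example_pattern = match_pattern(example_grades)
example_id = pattern_id(example_grades)
```

Record pairs whose pattern is not in the rule table get no I.D. here; how such pairs are treated is a business decision the slides do not cover.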
Corrected Data (Data Source #1)                Corrected Data (Data Source #2)
Pre-name:        Ms.                           Pre-name:        Ms.
First Name:      Beth                          First Name:      Elizabeth
1st Name Match                                 1st Name Match
 Standards:      Elizabeth, Bethany, Bethel     Standards:      Beth, Bethany, Bethel
Middle Name:     Christine                     Middle Name:     Christine
Last Name:       Parker                        Last Name:       Parker-Lewis
Title:           Sales Mgr.                    Title:
Firm:            Regional Port Authority       Firm:            Regional Port Authority
Location:        Federal Building              Location:        Federal Building
Number:          12800                         Number:          12800
Street:          S. Butler Dr.                 Street:          S. Butler Dr., Suite 2
City:            Chicago                       City:            Chicago
State:           IL                            State:           IL
Zip:             60633                         Zip:             60633
Zip+Four:        2398                          Zip+Four:        2398
                                               Phone:           708-555-1234
                                               Fax:             708-555-5678
Consolidating analyzes and identifies relationships
between matched records and merges them into ONE
representation.
Consolidated Data (from Corrected Data, Data Sources #1 and #2)
Name:            Ms. Beth (Elizabeth) Christine Parker-Lewis
Title:           Sales Mgr.
Firm:            Regional Port Authority
Location:        Federal Building
Address:         12800 S. Butler Dr., Suite 2
                 Chicago, IL 60633-2398
Phone:           708-555-1234
Fax:             708-555-5678
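
A minimal sketch of how two matched records could be merged into one. The rule used here (keep the longer, more complete value for each field, and show both first names when they differ) is an assumption chosen to reproduce the example above, not a rule given in the slides. The function name `consolidate` and the field names are hypothetical.

```python
def consolidate(primary: dict, secondary: dict) -> dict:
    """Merge two matched records field by field.

    Assumed rule: keep the longer value, and keep the primary value on a tie.
    """
    fields = list(primary) + [key for key in secondary if key not in primary]
    merged = {}
    for field in fields:
        a = primary.get(field, "")
        b = secondary.get(field, "")
        merged[field] = a if len(a) >= len(b) else b
    # Show both first names when the sources disagree, as in "Beth (Elizabeth)".
    first_a = primary.get("first_name", "")
    first_b = secondary.get("first_name", "")
    if first_a and first_b and first_a != first_b:
        merged["first_name"] = f"{first_a} ({first_b})"
    return merged


source_1 = {
    "first_name": "Beth",
    "last_name": "Parker",
    "title": "Sales Mgr.",
    "street": "S. Butler Dr.",
}
source_2 = {
    "first_name": "Elizabeth",
    "last_name": "Parker-Lewis",
    "title": "",
    "street": "S. Butler Dr., Suite 2",
    "phone": "708-555-1234",
    "fax": "708-555-5678",
}
merged = consolidate(source_1, source_2)
```

A length-based rule is crude: a longer value is not always the more accurate one, so production systems usually rank sources by trust or recency instead.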
1. Use metadata to document rules.
2. Determine data cleansing schedule.
3. Build quality into new and existing systems.
Hence we conclude that data cleansing is
not only an effective tool for removing
unwanted, “dirty” data, but also the means to
make the data in our databases and systems
concise, selective, and appropriate, so that
we can serve our clients better and cater to
their demands.
Web:
 en.wikipedia.org/wiki/Data_cleansing
 www2.gbif.org/DataCleaning.pdf
 www.webopedia.com/TERM/D/data_cleansing.html
Books:
 Data Mining by Ian H. Witten and Eibe Frank
 Exploratory Data Mining and Data Quality by Dasu and Johnson (Wiley, 2004)