Presented by: Kunal Jain (071309)
Under the guidance of Mr. Praveen Kumar Tripathi
Dept of CSE & IT (JUIT)

Introduction
Steps in Data Cleansing
Conclusion
References
“A company’s most important asset is information. A corporation’s ability to compete, adapt, and grow in a business climate of rapid change is dependent in large measure on how well the company uses information to make decisions. Sharing information that isn’t clean and consolidated to the fullest extent can substantially reduce the effectiveness of a system of significant investment and considerable pay-off potential.”
Data cleansing, or data scrubbing, is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data.
• Data cleansing can occur within a single set of records, or between multiple sets of data which need to be merged or which will work together.
• Typos and spelling errors are corrected, mislabeled data is properly labeled and filed, and incomplete or missing entries are completed.
• In more complex operations, data cleansing can be performed by computer programs. These data cleansing programs can check the data with a variety of rules and procedures decided upon by the user.
• The goal of data cleansing is not just to clean up the data in a database but also to bring consistency to different sets of data that have been merged from separate databases.
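As a minimal sketch of user-defined rules like these, each rule can be a named check that a record must pass. The field names (`state`, `zip`, `last_name`) and the three rules are illustrative assumptions, not taken from any particular cleansing tool.

```python
from typing import Callable

Record = dict[str, str]
Rule = Callable[[Record], bool]

# Hypothetical rules chosen by the user; each returns True when the record passes.
RULES: dict[str, Rule] = {
    "state is a two-letter code": lambda r: len(r.get("state", "")) == 2
    and r["state"].isalpha(),
    "zip is five digits": lambda r: len(r.get("zip", "")) == 5
    and r["zip"].isdigit(),
    "last name is present": lambda r: bool(r.get("last_name", "").strip()),
}


def check_record(record: Record, rules: dict[str, Rule] = RULES) -> list[str]:
    """Return the names of all rules that the record violates."""
    return [name for name, rule in rules.items() if not rule(record)]
```

A record that returns an empty list passes every rule; anything else can be routed to correction or manual review.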
• Dummy Values
• Absence of Data
• Multipurpose Fields
• Cryptic Data
• Contradicting Data
• Inappropriate Use of Address Lines
• Violation of Business Rules
• Reused Primary Keys
• Non-Unique Identifiers
• Data Integration Problems
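Three of these problems (dummy values, absence of data, and non-unique identifiers) are simple enough to detect mechanically. The sketch below is one possible approach; the set of placeholder values and the `id` key are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical placeholder values that often stand in for real data.
DUMMY_VALUES = {"N/A", "NA", "UNKNOWN", "XXXXX", "99999", "999-99-9999"}


def flag_fields(record: dict[str, str]) -> dict[str, str]:
    """Map each suspicious field name to 'missing' or 'dummy'."""
    flags = {}
    for field, value in record.items():
        cleaned = value.strip().upper()
        if not cleaned:
            flags[field] = "missing"
        elif cleaned in DUMMY_VALUES:
            flags[field] = "dummy"
    return flags


def non_unique_ids(records: list[dict[str, str]], key: str = "id") -> list[str]:
    """Return identifier values that appear on more than one record."""
    counts = Counter(record[key] for record in records)
    return sorted(value for value, n in counts.items() if n > 1)
```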
Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
Input Data from Source File
Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL

Parsed Data in Target File
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL
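The parsing example can be reproduced with a small function. This is only a sketch: it assumes the input always has exactly five lines in the order shown and a three-part personal name, which real source files will not guarantee. The function name `parse_contact` is hypothetical.

```python
import re

SOURCE = """Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL"""


def parse_contact(block: str) -> dict[str, str]:
    """Split a five-line contact block into individual data elements."""
    lines = [line.strip() for line in block.strip().splitlines()]
    if len(lines) != 5:
        raise ValueError("expected name/title, firm, location, street, city/state")
    name_title, firm, location, street_line, city_state = lines

    # "Beth Christine Parker, SLS MGR" -> name part and title part
    name, title = (part.strip() for part in name_title.split(",", 1))
    first, middle, last = name.split()  # assumes exactly three name parts

    # "12800 Lake Calumet" -> house number and street name
    street_match = re.match(r"(\d+)\s+(.+)", street_line)
    if street_match is None:
        raise ValueError(f"cannot parse street line: {street_line!r}")
    number, street = street_match.groups()

    city, state = (part.strip() for part in city_state.split(",", 1))
    return {
        "First Name": first, "Middle Name": middle, "Last Name": last,
        "Title": title, "Firm": firm, "Location": location,
        "Number": number, "Street": street, "City": city, "State": state,
    }


parsed = parse_contact(SOURCE)
```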
Correcting fixes the parsed individual data components using sophisticated data algorithms and secondary data sources.
Parsed Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL

Corrected Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
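A secondary data source can be as simple as a lookup table of known addresses. The sketch below uses a one-entry dictionary as a stand-in for real postal reference data; the table `ADDRESS_REFERENCE` and the function `correct_address` are hypothetical names, and the single entry just mirrors the example above.

```python
# Hypothetical stand-in for a postal reference database, keyed by the
# upper-cased (number, street, city, state) as found in the parsed data.
ADDRESS_REFERENCE = {
    ("12800", "LAKE CALUMET", "HEDGEWISCH", "IL"): {
        "Street": "South Butler Drive",
        "City": "Chicago",
        "Zip": "60633",
        "Zip+Four": "2398",
    },
}


def correct_address(parsed: dict[str, str]) -> dict[str, str]:
    """Return a copy of the record with address fields corrected when known."""
    key = tuple(parsed[f].strip().upper() for f in ("Number", "Street", "City", "State"))
    fix = ADDRESS_REFERENCE.get(key)
    if fix is None:
        return dict(parsed)  # no reference entry: leave the record unchanged
    return {**parsed, **fix}
```

Records with no reference entry pass through unchanged, so they can be flagged for review rather than silently altered.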
Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
Corrected Data (before standardizing)
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Corrected Data (after standardizing)
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
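Conversion routines of this kind are often table-driven. The following sketch covers three of the changes in the example (title, street abbreviations, and first-name match standards); the lookup tables are illustrative assumptions holding only the values seen above, and deriving the pre-name is left out.

```python
# Hypothetical business-rule tables; a real system would hold many more entries.
TITLE_STANDARDS = {"SLS MGR": "Sales Mgr."}
STREET_ABBREVIATIONS = {"South": "S.", "Drive": "Dr."}
NAME_MATCH_STANDARDS = {"Beth": ["Elizabeth", "Bethany", "Bethel"]}


def standardize(record: dict) -> dict:
    """Return a copy of the record with fields converted to preferred formats."""
    out = dict(record)
    # Replace a known title code with its standard form; keep unknown titles.
    out["Title"] = TITLE_STANDARDS.get(record["Title"], record["Title"])
    # Abbreviate street words one at a time.
    out["Street"] = " ".join(
        STREET_ABBREVIATIONS.get(word, word) for word in record["Street"].split()
    )
    # Attach the list of names this first name should be allowed to match.
    out["1st Name Match Standards"] = NAME_MATCH_STANDARDS.get(record["First Name"], [])
    return out
```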
Matching searches for and matches records within and across the parsed, corrected, and standardized data, based on predefined business rules, to eliminate duplications.
Business Name  Street  Branch Type  Customer #/Tax ID  City   Vendor Code  Pattern  Pattern I.D.
Exact          Exact   Exact        Exact              Exact  Exact        AAAAAA   P110
Exact          VClose  Exact        VClose             Exact  Blanks       ABAAA-   P115
Exact          VClose  Exact        Blanks             Exact  Exact        ABA-AA   P120
Exact          VClose  Close        Close              Exact  Exact        ABCCAA   S300
VClose         VClose  Exact        Close              Exact  Exact        BBACAA   S310
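In the table, each field comparison is graded and the grades are encoded as one letter per field (A = Exact, B = VClose, C = Close, - = Blanks) to form the match pattern. A minimal sketch of that encoding follows. The similarity measure (`difflib.SequenceMatcher`), the 0.9 and 0.75 thresholds, and the extra "Differ" grade with code X are assumptions for illustration; the source does not say how closeness is measured.

```python
from difflib import SequenceMatcher

FIELDS = ["Business Name", "Street", "Branch Type",
          "Customer #/Tax ID", "City", "Vendor Code"]
GRADE_CODES = {"Exact": "A", "VClose": "B", "Close": "C", "Blanks": "-"}
# Pattern-to-ID pairs as listed in the table.
PATTERN_IDS = {"AAAAAA": "P110", "ABAAA-": "P115", "ABA-AA": "P120",
               "ABCCAA": "S300", "BBACAA": "S310"}


def grade(a: str, b: str) -> str:
    """Grade how closely two field values agree."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return "Blanks"
    if a == b:
        return "Exact"
    ratio = SequenceMatcher(None, a, b).ratio()
    if ratio >= 0.9:  # assumed threshold for "very close"
        return "VClose"
    if ratio >= 0.75:  # assumed threshold for "close"
        return "Close"
    return "Differ"  # not in the table; encoded as "X" below


def match_pattern(rec1: dict[str, str], rec2: dict[str, str]) -> str:
    """Build the six-character pattern for a pair of records."""
    return "".join(
        GRADE_CODES.get(grade(rec1.get(f, ""), rec2.get(f, "")), "X")
        for f in FIELDS
    )
```

A pair of records is then accepted as a match only when its pattern appears in `PATTERN_IDS`, which is where the predefined business rules come in.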
Corrected Data (Data Source #1)
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Corrected Data (Data Source #2)
Pre-name: Ms.
First Name: Elizabeth
1st Name Match Standards: Beth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker-Lewis
Title:
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr., Suite 2
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Phone: 708-555-1234
Fax: 708-555-5678
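The "1st Name Match Standards" lists are what allow "Beth" in one source to match "Elizabeth" in the other. One possible way to use them is sketched below; the function name `first_names_match` and the choice to compare names case-insensitively are assumptions.

```python
def first_names_match(rec_a: dict, rec_b: dict) -> bool:
    """True if the first names are equal or one is a listed variant of the other."""
    name_a = rec_a["First Name"].strip().lower()
    name_b = rec_b["First Name"].strip().lower()
    variants_a = {v.lower() for v in rec_a.get("1st Name Match Standards", [])}
    variants_b = {v.lower() for v in rec_b.get("1st Name Match Standards", [])}
    return name_a == name_b or name_a in variants_b or name_b in variants_a
```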
Consolidating analyzes and identifies relationships between matched records and merges them into ONE representation.
Inputs: Corrected Data (Data Source #1) and Corrected Data (Data Source #2)

Consolidated Data
Name: Ms. Beth (Elizabeth) Christine Parker-Lewis
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Address: 12800 S. Butler Dr., Suite 2, Chicago, IL 60633-2398
Phone: 708-555-1234
Fax: 708-555-5678
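Merging two matched records needs a survivorship rule that decides which value wins for each field. The sketch below uses a deliberately simple rule, assumed here for illustration: keep the longer (more complete) value, prefer the first source on ties, and keep both first names in the "Beth (Elizabeth)" style of the example. It handles plain string fields only.

```python
def consolidate(primary: dict[str, str], secondary: dict[str, str]) -> dict[str, str]:
    """Merge two matched records into one representation."""
    # Keep the primary record's field order, then add fields only the secondary has.
    fields = list(primary) + [f for f in secondary if f not in primary]
    merged = {}
    for field in fields:
        a, b = primary.get(field, ""), secondary.get(field, "")
        if field == "First Name" and a and b and a != b:
            merged[field] = f"{a} ({b})"  # keep both name variants
        else:
            # Longer value is assumed more complete; max() keeps `a` on ties.
            merged[field] = max(a, b, key=len)
    return merged
```

Applied to the two sources above, this keeps "Parker-Lewis", "Sales Mgr.", and "S. Butler Dr., Suite 2", and carries over the phone and fax numbers that only one source supplies.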
1. Use metadata to document rules.
2. Determine a data cleansing schedule.
3. Build quality into new and existing systems.
We conclude that data cleansing is not only an effective tool for removing unwanted, “dirty” data, but also a means of keeping the data in our databases and systems concise, selective, and appropriate, so that we can serve our clients better and meet their demands.
Web:
en.wikipedia.org/wiki/Data_cleansing
www2.gbif.org/DataCleaning.pdf
www.webopedia.com/TERM/D/data_cleansing.html

Books:
Data Mining by Ian H. Witten and Eibe Frank
Exploratory Data Mining and Data Cleaning by Dasu and Johnson (Wiley, 2003)