Data Warehousing
Lecture-20
Data Duplication Elimination & BSN Method
Virtual University of Pakistan
Ahsan Abdullah
Assoc. Prof. & Head
Center for Agro-Informatics Research
www.nu.edu.pk/cairindex.asp
National University of Computers & Emerging Sciences, Islamabad
Email: ahsan1010@yahoo.com

Why is data duplicated?
A data warehouse is created from heterogeneous sources, with
heterogeneous databases (different schema/representation) of the
same entity.
Data coming from outside the organization that owns the DWH can be
of even lower quality, i.e. different representations of the same
entity, and transcription or typographical errors.

Problems due to data duplication
Data duplication can result in costly errors, such as:
• False frequency distributions.
• Incorrect aggregates due to double counting.
• Difficulty for credit card companies in catching fabricated identities.

Data Duplication: Non-Unique PK
• Duplicate Identification Numbers
• Multiple Customer Numbers
• Multiple Employee Numbers

Unable to determine customer relationships (CRM)
Name               Phone Number   Cust. No.
M. Ismail Siddiqi  021.666.1244   780701
M. Ismail Siddiqi  021.666.1244   780203
M. Ismail Siddiqi  021.666.1244   780009

Unable to analyze employee benefits trends
Bonus Date  Name           Department  Emp. No.
Jan. 2000   Khan Muhammad  213 (MKT)   5353536
Dec. 2001   Khan Muhammad  567 (SLS)   4577833
Mar. 2002   Khan Muhammad  349 (HR)    3457642

Data Duplication: House Holding
• Group together all records that belong to the same household.
Why bother?
………  S. Ahad      440, Munir Road, Lahore
………  ………….…       ………………………………
………  Shiekh Ahad  No. 440, Munir Rd, Lhr
………  Shiekh Ahed  House # 440, Munir Road, Lahore
………  ………….…       ………………………………

Data Duplication: Individualization
• Identify multiple records in each household which represent the same individual.
Address field is standardized.
By coincidence?
………  M. Ahad   440, Munir Road, Lahore
………  ………….…    ………………………………
………  Maj Ahad  440, Munir Road, Lahore

Formal definition & Nomenclature
• Problem statement:
  • "Given two databases, identify the potentially matched records Efficiently and Effectively"
• Many names, such as:
  • Record linkage
  • Merge/purge
  • Entity reconciliation
  • List washing and data cleansing.
• Current market and tools are heavily centered on customer lists.

Need & Tool Support
• The logical solution to dirty data is to clean it in some way.
• Doing it manually is very slow and prone to errors.
• Tools are required to do it cost-effectively and achieve reasonable quality.
• Tools exist, some for specific fields, others for a specific cleaning phase.
• Because they are application-specific, they work very well, but need
  support from other tools for a broad spectrum of cleaning problems.

Overview of the Basic Concept
• In its simplest form, there is an identifying attribute (or combination of attributes) per record for identification.
• Records can be from a single source or from multiple sources sharing the same PK or other common unique attributes.
• Sorting is performed on the identifying attributes and neighboring records are checked.
• What if there are no common attributes, or the data is dirty?
  • The degree of similarity is measured numerically; different attributes may contribute differently.
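
A numerically measured, attribute-weighted similarity could be sketched as follows. This is only an illustration: the weights, the field names, and the use of Python's difflib.SequenceMatcher as the string similarity measure are assumptions, not part of the lecture.

```python
from difflib import SequenceMatcher

# Hypothetical weights: how much each attribute contributes to the score.
WEIGHTS = {"name": 0.6, "address": 0.4}

def field_similarity(a, b):
    """Similarity of two strings in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(r1, r2, weights=WEIGHTS):
    """Weighted average of the per-attribute similarities."""
    total = sum(weights.values())
    score = sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())
    return score / total

a = {"name": "Shiekh Ahad", "address": "440, Munir Road, Lahore"}
b = {"name": "Shiekh Ahed", "address": "440, Munir Road, Lahore"}
c = {"name": "Saiam Noor", "address": "Flat 5, Afshan Colony, Saidpur Road, Lahore"}

print(record_similarity(a, b))  # near-duplicate: high score
print(record_similarity(a, c))  # different person: lower score
```

A threshold on this score would then decide which pairs count as potential matches; choosing that threshold is application-specific.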

Basic Sorted Neighborhood (BSN) Method
• Concatenate data into one sequential list of N records.
• Step 1: Create Keys
  • Compute a key for each record in the list by extracting relevant fields or portions of fields.
  • Effectiveness of this method highly depends on a properly chosen key.
• Step 2: Sort Data
  • Sort the records in the data list using the key of step 1.
• Step 3: Merge
  • Move a fixed-size window through the sequential list of records, limiting the comparisons for matching records to those records in the window.
  • If the size of the window is w records, then every new record entering the window is compared with the previous w-1 records.
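
The three steps can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the lecture's implementation: the function name, the toy key (first three letters of the name) and the toy matching rule are all assumptions made for the example.

```python
def sorted_neighborhood(records, make_key, is_match, w):
    """Return index pairs (i, j) of records that is_match flags as duplicates."""
    if w < 2:
        raise ValueError("window size w must be at least 2")
    # Steps 1 and 2: compute a key per record and sort record indices by key.
    order = sorted(range(len(records)), key=lambda i: make_key(records[i]))
    pairs = []
    # Step 3: each record entering the window is compared with the
    # previous w-1 records in sorted order.
    for pos, i in enumerate(order):
        for prev in order[max(0, pos - (w - 1)):pos]:
            if is_match(records[prev], records[i]):
                pairs.append((min(prev, i), max(prev, i)))
    return pairs

# Toy data and rules (hypothetical, for illustration only).
records = ["Noman Abdullah", "Saiam Noor", "Noma Abdullah"]
make_key = lambda name: name[:3].upper()
is_match = lambda a, b: a[:4] == b[:4] and a.split()[-1] == b.split()[-1]

pairs = sorted_neighborhood(records, make_key, is_match, w=2)
print(pairs)  # [(0, 2)]
```

Only neighbors in the sorted order are ever compared, which is why the choice of key matters so much: two duplicates whose keys sort far apart are never seen together in a window.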

BSN Method: Sliding Window
[Figure: a window of w records sliding over the sorted list, showing the current window of records and the next window of records.]

BSN Method: Selection of Keys
• Selection of Keys
  • Effectiveness is highly dependent on the key selected to sort the records, e.g. middle name vs. family name.
• A key is a sequence of a subset of attributes, or sub-strings within the attributes, chosen from the record.
• The keys are used for sorting the entire dataset with the intention that matched candidates will appear close to each other.

First     Middle  Address           NID       Key
Muhammed  Ahmad   440 Munir Road    34535322  AHM440MUN345
Muhammad  Ahmad   440 Munir Road    34535322  AHM440MUN345
Muhammed  Ahmed   440 Munir Road    34535322  AHM440MUN345
Muhammad  Ahmar   440 Munawar Road  34535334  AHM440MUN345
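
The keys in the table appear to be built from the first three letters of the middle name, the house number, the first three letters of the street name, and the first three digits of the NID. A small sketch under that assumption (the field names and the "number street ..." address layout are also assumed):

```python
def make_key(record):
    """Build a sort key from sub-strings of selected attributes."""
    # Assumes the address starts with the house number followed by the street name.
    house_no, street = record["address"].split()[:2]
    key = record["middle"][:3] + house_no + street[:3] + record["nid"][:3]
    return key.upper()

rows = [
    {"first": "Muhammed", "middle": "Ahmad", "address": "440 Munir Road", "nid": "34535322"},
    {"first": "Muhammad", "middle": "Ahmad", "address": "440 Munir Road", "nid": "34535322"},
    {"first": "Muhammed", "middle": "Ahmed", "address": "440 Munir Road", "nid": "34535322"},
    {"first": "Muhammad", "middle": "Ahmar", "address": "440 Munawar Road", "nid": "34535334"},
]

keys = [make_key(r) for r in rows]
print(keys)  # all four rows yield AHM440MUN345
```

Because only prefixes are used, small spelling differences later in a field do not change the key, so these four records sort next to each other.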

BSN Method: Problem with keys
• Since the data is dirty, the keys WILL also be dirty, and matching records will not come together.
• Data becomes dirty due to data entry errors or use of abbreviations. Some real examples are as follows:
    Technology
    Tech.
    Techno.
    Tchnlgy
• Solution is to use external standard source files to validate the data and resolve any data conflicts.
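
Validation against a standard source could look roughly like this. The dictionary below is a hypothetical stand-in for an external standard source file; in practice the reference data would come from an authoritative list, and unknown values would be routed to manual review.

```python
import re

# Hypothetical stand-in for an external standard source file: maps known
# variants (lowercased, punctuation removed) to the standard form.
STANDARD_FORMS = {
    "technology": "Technology",
    "tech": "Technology",
    "techno": "Technology",
    "tchnlgy": "Technology",
}

def standardize(value, reference=STANDARD_FORMS):
    """Return (standard_value, known); unknown values come back unchanged."""
    lookup = re.sub(r"[^a-z0-9]", "", value.lower())
    if lookup in reference:
        return reference[lookup], True
    return value, False

for raw in ["Technology", "Tech.", "Techno.", "Tchnlgy", "Tecnology"]:
    print(raw, "->", standardize(raw))
```

Standardizing fields before key creation means the keys are built from clean values, so matching records have a better chance of sorting together.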

BSN Method: Problem with keys (e.g.)
If the contents of fields are not properly ordered, similar records will NOT
fall in the same window.

Original records:
No  Name             Address                                      Gender
1   N. Jaffri, Syed  No. 420, Street 15, Chaklala 4, Rawalpindi   M
2   S. Noman         420, Scheme 4, Rwp                           M
3   Saiam Noor       Flat 5, Afshan Colony, Saidpur Road, Lahore  F

Example: Records 1 and 2 are similar but will occur far apart.

Solution is to TOKENize the fields, i.e. break them further. Use the
tokens in different fields for sorting to fix the error.

Tokenized records:
No  Name           Address                                  Gender
1   Syed N Jaffri  420 15 4 Chaklala No Rawalpindi Street   M
2   Syed Noman     420 4 Rwp Scheme                         M
3   Saiam Noor     5 Afshan Colony Flat Lahore Road Saidpur F

Example: Using either the name or the address field, records 1 and 2 will
fall close.
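
Tokenizing the address field can be sketched as below. The ordering rule (numeric tokens first in their original order, then the remaining tokens sorted alphabetically) is inferred from the tokenized addresses in the example and is an assumption, not a rule stated in the lecture.

```python
import re

def tokenize_address(address):
    """Break an address into tokens and put them in a canonical order."""
    tokens = re.findall(r"[A-Za-z0-9]+", address)
    # Assumed ordering: numbers first (original order), then words A to Z.
    numbers = [t for t in tokens if t.isdigit()]
    words = sorted((t for t in tokens if not t.isdigit()), key=str.lower)
    return " ".join(numbers + words)

for addr in [
    "No. 420, Street 15, Chaklala 4, Rawalpindi",
    "420, Scheme 4, Rwp",
    "Flat 5, Afshan Colony, Saidpur Road, Lahore",
]:
    print(tokenize_address(addr))
```

With the house number moved to the front, the addresses of records 1 and 2 both start with 420 and therefore sort next to each other.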

BSN Method: Matching Candidates
Merging of records is a complex inferential process.
Example-1: Two persons with names spelled nearly but not identically
have the exact same address. We infer they are the same person, e.g.
Noma Abdullah and Noman Abdullah.
Example-2: Two persons have the same National ID number but their names
and addresses are completely different. We infer either that this is the
same person who changed his name and moved, or that the records represent
different persons and the NID is incorrect for one of them.
Use of further information such as age, gender etc. can alter the
decision.
Example-3: Noma-F and Noman-M: we could perhaps infer that Noma and
Noman are siblings, i.e. sister and brother. Noma-30 and Noman-5:
mother and son.
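
Example-1 and Example-3 can be written as a small matching rule. This is a sketch only: the similarity measure (difflib.SequenceMatcher), the 0.85 threshold, and the field names are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

NAME_THRESHOLD = 0.85  # hypothetical cut-off for "spelled nearly identically"

def similar(a, b, threshold=NAME_THRESHOLD):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def same_person(r1, r2):
    """Nearly identical names at the exact same address imply the same person,
    unless further information (here: gender) contradicts it."""
    g1, g2 = r1.get("gender"), r2.get("gender")
    if g1 and g2 and g1 != g2:
        return False
    return similar(r1["name"], r2["name"]) and r1["address"] == r2["address"]

addr = "440, Munir Road, Lahore"
print(same_person({"name": "Noma Abdullah", "address": addr},
                  {"name": "Noman Abdullah", "address": addr}))
print(same_person({"name": "Noma Abdullah", "address": addr, "gender": "F"},
                  {"name": "Noman Abdullah", "address": addr, "gender": "M"}))
```

The first call treats the two records as the same person; once gender is known and differs, the second call rejects the match, which is the effect of "further information" described above.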

Complexity Analysis of BSN Method
• Time Complexity: O(n log n) as long as w stays below about log n; otherwise the O(wn) matching step dominates.
  • O(n) for key creation
  • O(n log n) for sorting
  • O(wn) for matching, where 2 ≤ w ≤ n
  • Constants vary a lot
• At least three passes are required on the dataset.
• Complexity of the rules and window size are detrimental.
• For large sets, disk I/O is detrimental.
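
To see what the O(wn) matching term means in practice, the number of record comparisons in the merge step can be counted and set against comparing every pair. The function names are illustrative only.

```python
def window_comparisons(n, w):
    """Comparisons in the merge step: the record at sorted position p is
    compared with min(p, w - 1) predecessors."""
    return sum(min(p, w - 1) for p in range(n))

def all_pairs_comparisons(n):
    """Comparisons needed when every record is compared with every other."""
    return n * (n - 1) // 2

n, w = 1000, 10
print(window_comparisons(n, w))   # 8955
print(all_pairs_comparisons(n))   # 499500
```

When w equals n the window covers the whole list and the two counts coincide, which is why the window size has such a direct effect on cost.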

BSN Method: Equational Theory
To specify the inferences we need an equational theory.
• Logic is NOT based on string equivalence.
• Logic is based on domain equivalence.
• Requires a declarative rule language.
More Related Content

What's hot

Data Architecture (i.e., normalization / relational algebra) and Database Sec...
Data Architecture (i.e., normalization / relational algebra) and Database Sec...Data Architecture (i.e., normalization / relational algebra) and Database Sec...
Data Architecture (i.e., normalization / relational algebra) and Database Sec...
IDEAS - Int'l Data Engineering and Science Association
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...
Trey Grainger
 
Textmining Introduction
Textmining IntroductionTextmining Introduction
Textmining Introduction
guest0edcaf
 
Linked Data Tutorial
Linked Data TutorialLinked Data Tutorial
Linked Data Tutorial
Sören Auer
 
Presentation dual inversion-index
Presentation dual inversion-indexPresentation dual inversion-index
Presentation dual inversion-indexmahi_uta
 
"RDFa - what, why and how?" by Mike Hewett and Shamod Lacoul
"RDFa - what, why and how?" by Mike Hewett and Shamod Lacoul"RDFa - what, why and how?" by Mike Hewett and Shamod Lacoul
"RDFa - what, why and how?" by Mike Hewett and Shamod Lacoul
Shamod Lacoul
 
RDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use itRDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use it
Jose Luis Lopez Pino
 

What's hot (8)

Data Architecture (i.e., normalization / relational algebra) and Database Sec...
Data Architecture (i.e., normalization / relational algebra) and Database Sec...Data Architecture (i.e., normalization / relational algebra) and Database Sec...
Data Architecture (i.e., normalization / relational algebra) and Database Sec...
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...
 
Textmining Introduction
Textmining IntroductionTextmining Introduction
Textmining Introduction
 
Linked Data Tutorial
Linked Data TutorialLinked Data Tutorial
Linked Data Tutorial
 
Presentation dual inversion-index
Presentation dual inversion-indexPresentation dual inversion-index
Presentation dual inversion-index
 
"RDFa - what, why and how?" by Mike Hewett and Shamod Lacoul
"RDFa - what, why and how?" by Mike Hewett and Shamod Lacoul"RDFa - what, why and how?" by Mike Hewett and Shamod Lacoul
"RDFa - what, why and how?" by Mike Hewett and Shamod Lacoul
 
Boolean Training
Boolean TrainingBoolean Training
Boolean Training
 
RDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use itRDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use it
 

Viewers also liked

Lecture 7
Lecture 7Lecture 7
Lecture 7
Shani729
 
Lecture 18
Lecture 18Lecture 18
Lecture 18
Shani729
 
Lecture 23
Lecture 23Lecture 23
Lecture 23
Shani729
 
Lecture 27
Lecture 27Lecture 27
Lecture 27
Shani729
 
Lecture 31
Lecture 31Lecture 31
Lecture 31
Shani729
 
Lecture 21
Lecture 21Lecture 21
Lecture 21
Shani729
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
Shani729
 
Lecture 16
Lecture 16Lecture 16
Lecture 16
Shani729
 
Lecture 19
Lecture 19Lecture 19
Lecture 19
Shani729
 
Lecture 38
Lecture 38Lecture 38
Lecture 38
Shani729
 
Lecture 34
Lecture 34Lecture 34
Lecture 34
Shani729
 
Lecture 5
Lecture 5Lecture 5
Lecture 5
Shani729
 
Lecture 33
Lecture 33Lecture 33
Lecture 33
Shani729
 
Lecture 30
Lecture 30Lecture 30
Lecture 30
Shani729
 
Lecture 4
Lecture 4Lecture 4
Lecture 4
Shani729
 
Lecture 35
Lecture 35Lecture 35
Lecture 35
Shani729
 
Lecture 40
Lecture 40Lecture 40
Lecture 40
Shani729
 
Lecture 32
Lecture 32Lecture 32
Lecture 32
Shani729
 
Lecture 3
Lecture 3Lecture 3
Lecture 3
Shani729
 
Lecture 39
Lecture 39Lecture 39
Lecture 39
Shani729
 

Viewers also liked (20)

Lecture 7
Lecture 7Lecture 7
Lecture 7
 
Lecture 18
Lecture 18Lecture 18
Lecture 18
 
Lecture 23
Lecture 23Lecture 23
Lecture 23
 
Lecture 27
Lecture 27Lecture 27
Lecture 27
 
Lecture 31
Lecture 31Lecture 31
Lecture 31
 
Lecture 21
Lecture 21Lecture 21
Lecture 21
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
 
Lecture 16
Lecture 16Lecture 16
Lecture 16
 
Lecture 19
Lecture 19Lecture 19
Lecture 19
 
Lecture 38
Lecture 38Lecture 38
Lecture 38
 
Lecture 34
Lecture 34Lecture 34
Lecture 34
 
Lecture 5
Lecture 5Lecture 5
Lecture 5
 
Lecture 33
Lecture 33Lecture 33
Lecture 33
 
Lecture 30
Lecture 30Lecture 30
Lecture 30
 
Lecture 4
Lecture 4Lecture 4
Lecture 4
 
Lecture 35
Lecture 35Lecture 35
Lecture 35
 
Lecture 40
Lecture 40Lecture 40
Lecture 40
 
Lecture 32
Lecture 32Lecture 32
Lecture 32
 
Lecture 3
Lecture 3Lecture 3
Lecture 3
 
Lecture 39
Lecture 39Lecture 39
Lecture 39
 

Similar to Lecture 20

Database fundamentals
Database fundamentalsDatabase fundamentals
Database fundamentalscrystalpullen
 
FSDN conversations
FSDN conversationsFSDN conversations
FSDN conversations
vhepworth
 
Week12
Week12Week12
Week12
Esha Meher
 
Blast gp assignment
Blast  gp assignmentBlast  gp assignment
Blast gp assignment
barathvaj
 
Vivo Search
Vivo SearchVivo Search
Vivo Search
Anup Sawant
 
BLAST (Basic local alignment search Tool)
BLAST (Basic local alignment search Tool)BLAST (Basic local alignment search Tool)
BLAST (Basic local alignment search Tool)
Ariful Islam Sagar
 
Improving VIVO search through semantic ranking.
Improving VIVO search through semantic ranking.Improving VIVO search through semantic ranking.
Improving VIVO search through semantic ranking.Deepak K
 
Tackling Hidden Risks in AML Sanctions Screening Programs
Tackling Hidden Risks in AML Sanctions Screening ProgramsTackling Hidden Risks in AML Sanctions Screening Programs
Tackling Hidden Risks in AML Sanctions Screening Programs
Alessa
 
DataMeet 4: Data cleaning & census data
DataMeet 4: Data cleaning & census dataDataMeet 4: Data cleaning & census data
DataMeet 4: Data cleaning & census data
Ritvvij Parrikh
 
PostgreSQL Tutorial for Beginners | Edureka
PostgreSQL Tutorial for Beginners | EdurekaPostgreSQL Tutorial for Beginners | Edureka
PostgreSQL Tutorial for Beginners | Edureka
Edureka!
 
bm25 demystified
bm25 demystifiedbm25 demystified
bm25 demystified
Fan Robbin
 

Similar to Lecture 20 (11)

Database fundamentals
Database fundamentalsDatabase fundamentals
Database fundamentals
 
FSDN conversations
FSDN conversationsFSDN conversations
FSDN conversations
 
Week12
Week12Week12
Week12
 
Blast gp assignment
Blast  gp assignmentBlast  gp assignment
Blast gp assignment
 
Vivo Search
Vivo SearchVivo Search
Vivo Search
 
BLAST (Basic local alignment search Tool)
BLAST (Basic local alignment search Tool)BLAST (Basic local alignment search Tool)
BLAST (Basic local alignment search Tool)
 
Improving VIVO search through semantic ranking.
Improving VIVO search through semantic ranking.Improving VIVO search through semantic ranking.
Improving VIVO search through semantic ranking.
 
Tackling Hidden Risks in AML Sanctions Screening Programs
Tackling Hidden Risks in AML Sanctions Screening ProgramsTackling Hidden Risks in AML Sanctions Screening Programs
Tackling Hidden Risks in AML Sanctions Screening Programs
 
DataMeet 4: Data cleaning & census data
DataMeet 4: Data cleaning & census dataDataMeet 4: Data cleaning & census data
DataMeet 4: Data cleaning & census data
 
PostgreSQL Tutorial for Beginners | Edureka
PostgreSQL Tutorial for Beginners | EdurekaPostgreSQL Tutorial for Beginners | Edureka
PostgreSQL Tutorial for Beginners | Edureka
 
bm25 demystified
bm25 demystifiedbm25 demystified
bm25 demystified
 

More from Shani729

Python tutorialfeb152012
Python tutorialfeb152012Python tutorialfeb152012
Python tutorialfeb152012
Shani729
 
Python tutorial
Python tutorialPython tutorial
Python tutorial
Shani729
 
Interaction design _beyond_human_computer_interaction
Interaction design _beyond_human_computer_interactionInteraction design _beyond_human_computer_interaction
Interaction design _beyond_human_computer_interaction
Shani729
 
Fm lecturer 13(final)
Fm lecturer 13(final)Fm lecturer 13(final)
Fm lecturer 13(final)
Shani729
 
Lecture slides week14-15
Lecture slides week14-15Lecture slides week14-15
Lecture slides week14-15
Shani729
 
Frequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodFrequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth method
Shani729
 
Dwh lecture slides-week15
Dwh lecture slides-week15Dwh lecture slides-week15
Dwh lecture slides-week15
Shani729
 
Dwh lecture slides-week10
Dwh lecture slides-week10Dwh lecture slides-week10
Dwh lecture slides-week10
Shani729
 
Dwh lecture slidesweek7&8
Dwh lecture slidesweek7&8Dwh lecture slidesweek7&8
Dwh lecture slidesweek7&8
Shani729
 
Dwh lecture slides-week5&6
Dwh lecture slides-week5&6Dwh lecture slides-week5&6
Dwh lecture slides-week5&6
Shani729
 
Dwh lecture slides-week3&4
Dwh lecture slides-week3&4Dwh lecture slides-week3&4
Dwh lecture slides-week3&4
Shani729
 
Dwh lecture slides-week2
Dwh lecture slides-week2Dwh lecture slides-week2
Dwh lecture slides-week2
Shani729
 
Dwh lecture slides-week1
Dwh lecture slides-week1Dwh lecture slides-week1
Dwh lecture slides-week1
Shani729
 
Dwh lecture slides-week 13
Dwh lecture slides-week 13Dwh lecture slides-week 13
Dwh lecture slides-week 13
Shani729
 
Dwh lecture slides-week 12&13
Dwh lecture slides-week 12&13Dwh lecture slides-week 12&13
Dwh lecture slides-week 12&13
Shani729
 
Data warehousing and mining furc
Data warehousing and mining furcData warehousing and mining furc
Data warehousing and mining furc
Shani729
 
Lecture 37
Lecture 37Lecture 37
Lecture 37
Shani729
 
Lecture 36
Lecture 36Lecture 36
Lecture 36
Shani729
 
Lecture 29
Lecture 29Lecture 29
Lecture 29
Shani729
 
Lecture 28
Lecture 28Lecture 28
Lecture 28
Shani729
 

More from Shani729 (20)

Python tutorialfeb152012
Python tutorialfeb152012Python tutorialfeb152012
Python tutorialfeb152012
 
Python tutorial
Python tutorialPython tutorial
Python tutorial
 
Interaction design _beyond_human_computer_interaction
Interaction design _beyond_human_computer_interactionInteraction design _beyond_human_computer_interaction
Interaction design _beyond_human_computer_interaction
 
Fm lecturer 13(final)
Fm lecturer 13(final)Fm lecturer 13(final)
Fm lecturer 13(final)
 
Lecture slides week14-15
Lecture slides week14-15Lecture slides week14-15
Lecture slides week14-15
 
Frequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodFrequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth method
 
Dwh lecture slides-week15
Dwh lecture slides-week15Dwh lecture slides-week15
Dwh lecture slides-week15
 
Dwh lecture slides-week10
Dwh lecture slides-week10Dwh lecture slides-week10
Dwh lecture slides-week10
 
Dwh lecture slidesweek7&8
Dwh lecture slidesweek7&8Dwh lecture slidesweek7&8
Dwh lecture slidesweek7&8
 
Dwh lecture slides-week5&6
Dwh lecture slides-week5&6Dwh lecture slides-week5&6
Dwh lecture slides-week5&6
 
Dwh lecture slides-week3&4
Dwh lecture slides-week3&4Dwh lecture slides-week3&4
Dwh lecture slides-week3&4
 
Dwh lecture slides-week2
Dwh lecture slides-week2Dwh lecture slides-week2
Dwh lecture slides-week2
 
Dwh lecture slides-week1
Dwh lecture slides-week1Dwh lecture slides-week1
Dwh lecture slides-week1
 
Dwh lecture slides-week 13
Dwh lecture slides-week 13Dwh lecture slides-week 13
Dwh lecture slides-week 13
 
Dwh lecture slides-week 12&13
Dwh lecture slides-week 12&13Dwh lecture slides-week 12&13
Dwh lecture slides-week 12&13
 
Data warehousing and mining furc
Data warehousing and mining furcData warehousing and mining furc
Data warehousing and mining furc
 
Lecture 37
Lecture 37Lecture 37
Lecture 37
 
Lecture 36
Lecture 36Lecture 36
Lecture 36
 
Lecture 29
Lecture 29Lecture 29
Lecture 29
 
Lecture 28
Lecture 28Lecture 28
Lecture 28
 

Recently uploaded

Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
manasideore6
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
AhmedHussein950959
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
BrazilAccount1
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
Jayaprasanna4
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
SamSarthak3
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
seandesed
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
Kamal Acharya
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
FluxPrime1
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
fxintegritypublishin
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
Robbie Edward Sayers
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 

Recently uploaded (20)

Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 

Lecture 20

  • 1. Ahsan AbdullahAhsan Abdullah 11 Data WarehousingData Warehousing Lecture-20Lecture-20 Data Duplication Elimination & BSN MethodData Duplication Elimination & BSN Method Virtual University of PakistanVirtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics Research www.nu.edu.pk/cairindex.asp National University of Computers & Emerging Sciences, Islamabad Email: ahsan1010@yahoo.com
  • 2. Why is data duplicated? A data warehouse is created from heterogeneous sources, with heterogeneous databases (different schemas/representations) of the same entity. Data coming from outside the organization that owns the DWH can be of even lower quality, i.e. different representations of the same entity, or transcription and typographical errors.
  • 3. Problems due to data duplication. Data duplication can result in costly errors, such as:
    - False frequency distributions.
    - Incorrect aggregates due to double counting.
    - Difficulty for credit card companies in catching fabricated identities.
  • 4. Data Duplication: Non-Unique PK
    - Duplicate Identification Numbers
    - Multiple Customer Numbers
    - Multiple Employee Numbers
    Unable to determine customer relationships (CRM):
      Name | Phone Number | Cust. No.
      M. Ismail Siddiqi | 021.666.1244 | 780701
      M. Ismail Siddiqi | 021.666.1244 | 780203
      M. Ismail Siddiqi | 021.666.1244 | 780009
    Unable to analyze employee benefits trends:
      Bonus Date | Name | Department | Emp. No.
      Jan. 2000 | Khan Muhammad | 213 (MKT) | 5353536
      Dec. 2001 | Khan Muhammad | 567 (SLS) | 4577833
      Mar. 2002 | Khan Muhammad | 349 (HR) | 3457642
  • 5. Data Duplication: House Holding. Group together all records that belong to the same household. Why bother?
      … | S. Ahad | 440, Munir Road, Lahore | …
      … | Shiekh Ahad | No. 440, Munir Rd, Lhr | …
      … | Shiekh Ahed | House # 440, Munir Road, Lahore | …
  • 6. Data Duplication: Individualization. Identify multiple records in each household which represent the same individual. Address field is standardized. By coincidence?
      … | M. Ahad | 440, Munir Road, Lahore | …
      … | Maj Ahad | 440, Munir Road, Lahore | …
  • 7. Formal definition & Nomenclature
    - Problem statement: "Given two databases, identify the potentially matched records efficiently and effectively."
    - Many names, such as:
      - Record linkage
      - Merge/purge
      - Entity reconciliation
      - List washing and data cleansing
    - The current market and tools are heavily centered on customer lists.
  • 8. Need & Tool Support
    - The logical solution to dirty data is to clean it in some way.
    - Doing it manually is very slow and prone to errors.
    - Tools are required to do it cost-effectively and achieve reasonable quality.
    - Tools exist, some for specific fields, others for a specific cleaning phase.
    - Because they are application specific, they work very well, but they need support from other tools to cover a broad spectrum of cleaning problems.
  • 9. Overview of the Basic Concept
    - In its simplest form, each record has an identifying attribute (or combination of attributes).
    - Records can come from a single source or from multiple sources sharing the same PK or other common unique attributes.
    - Sorting is performed on the identifying attributes and neighboring records are checked.
    - What if there are no common attributes, or the data is dirty?
    - The degree of similarity is measured numerically; different attributes may contribute differently.
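The last point of slide 9 (numeric similarity with attributes contributing differently) can be sketched as a weighted sum of per-field similarities. This is only an illustration: the field names, the weights, and the use of `difflib.SequenceMatcher` as the string similarity are assumptions, not something the lecture prescribes.

```python
from difflib import SequenceMatcher

# Hypothetical per-attribute weights (not from the lecture); they sum to 1.
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the per-field similarities of two records."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())


a = {"name": "Shiekh Ahad", "address": "440, Munir Road, Lahore", "phone": "021.666.1244"}
b = {"name": "Shiekh Ahed", "address": "440, Munir Road, Lahore", "phone": "021.666.1244"}
c = {"name": "Saiam Noor", "address": "Flat 5, Afshan Colony, Lahore", "phone": "042.111.0000"}

# The near-duplicate pair (a, b) scores higher than the unrelated pair (a, c).
print(record_similarity(a, b), record_similarity(a, c))
```

A threshold on this score would then decide whether two records are treated as candidate duplicates; choosing the weights and the threshold is data dependent.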
  • 10. Basic Sorted Neighborhood (BSN) Method
    - Concatenate the data into one sequential list of N records.
    - Step 1: Create keys
      - Compute a key for each record in the list by extracting relevant fields or portions of fields.
      - The effectiveness of this method depends heavily on a properly chosen key.
    - Step 2: Sort data
      - Sort the records in the data list using the key from Step 1.
    - Step 3: Merge
      - Move a fixed-size window through the sequential list of records, limiting the comparisons for matching records to those records in the window.
      - If the size of the window is w records, then every new record entering the window is compared with the previous w-1 records.
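The three steps of slide 10 can be sketched as follows. This is a minimal in-memory sketch, not the lecture's implementation: the function name `sorted_neighborhood`, the callback parameters `make_key` and `is_match`, and the sample records are all hypothetical.

```python
from typing import Callable, Iterable


def sorted_neighborhood(
    records: Iterable[dict],
    make_key: Callable[[dict], str],
    is_match: Callable[[dict, dict], bool],
    w: int,
) -> list:
    """Return pairs of records judged to match within a window of w records."""
    if w < 2:
        raise ValueError("window size w must be at least 2")
    # Steps 1 and 2: compute a key per record and sort on it.
    ordered = sorted(records, key=make_key)
    pairs = []
    # Step 3: each record is compared only with the previous w-1 records.
    for i, current in enumerate(ordered):
        for previous in ordered[max(0, i - (w - 1)):i]:
            if is_match(previous, current):
                pairs.append((previous, current))
    return pairs


people = [
    {"name": "Noman Abdullah", "address": "440 Munir Road"},
    {"name": "Saiam Noor", "address": "5 Afshan Colony"},
    {"name": "Noma Abdullah", "address": "440 Munir Road"},
]
matches = sorted_neighborhood(
    people,
    make_key=lambda r: r["name"][:3].upper(),  # assumed key: first 3 letters of name
    is_match=lambda a, b: a["address"] == b["address"],  # assumed matching rule
    w=2,
)
print(matches)
```

Here the key brings the two "Nom..." records next to each other, so even the smallest window (w = 2) is enough to compare them.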
  • 11. BSN Method: Sliding Window. [Figure: sorted list of records showing the current window of w records and the next window of w records.]
  • 12. BSN Method: Selection of Keys
    - Effectiveness depends heavily on the key selected to sort the records, e.g. middle name vs. family name.
    - A key is a sequence of a subset of attributes, or of sub-strings within the attributes, chosen from the record.
    - The keys are used for sorting the entire dataset with the intention that matched candidates will appear close to each other.
      First | Middle | Address | NID | Key
      Muhammed | Ahmad | 440 Munir Road | 34535322 | AHM440MUN345
      Muhammad | Ahmad | 440 Munir Road | 34535322 | AHM440MUN345
      Muhammed | Ahmed | 440 Munir Road | 34535322 | AHM440MUN345
      Muhammad | Ahmar | 440 Munawar Road | 34535334 | AHM440MUN345
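The key column on slide 12 is consistent with concatenating the first three letters of the middle name, the house number, the first three letters of the street name, and the first three digits of the NID. That reading is inferred from the table rather than stated on the slide, and the function name `make_key` and the dictionary field names below are hypothetical.

```python
def make_key(record: dict) -> str:
    """Build a sort key from sub-strings of several attributes."""
    # Assumes the address starts with "<house number> <street name> ...".
    number, street = record["address"].split()[:2]
    return (record["middle"][:3] + number + street[:3] + record["nid"][:3]).upper()


rows = [
    {"first": "Muhammed", "middle": "Ahmad", "address": "440 Munir Road", "nid": "34535322"},
    {"first": "Muhammad", "middle": "Ahmad", "address": "440 Munir Road", "nid": "34535322"},
    {"first": "Muhammed", "middle": "Ahmed", "address": "440 Munir Road", "nid": "34535322"},
    {"first": "Muhammad", "middle": "Ahmar", "address": "440 Munawar Road", "nid": "34535334"},
]
keys = [make_key(r) for r in rows]
print(keys)
```

All four rows receive the same key, so they sort next to each other. The fourth row differs in street and NID, which shows that the key only brings candidates together; deciding whether they really match is left to the comparison step.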
  • 13. BSN Method: Problem with keys
    - Since the data is dirty, the keys WILL also be dirty, and matching records will not come together.
    - Data becomes dirty due to data entry errors or the use of abbreviations. Some real examples: Technology, Tech., Techno., Tchnlgy.
    - The solution is to use external standard source files to validate the data and resolve any data conflicts.
  • 14. BSN Method: Problem with keys (e.g.)
      No | Name | Address | Gender
      1 | Syed N Jaffri | 420 15 4 Chaklala No Rawalpindi Street | M
      2 | Syed Noman | 420 4 Rwp Scheme | M
      3 | Saiam Noor | 5 Afshan Colony Flat Lahore Road Saidpur | F

      No | Name | Address | Gender
      1 | N. Jaffri, Syed | No. 420, Street 15, Chaklala 4, Rawalpindi | M
      2 | S. Noman | 420, Scheme 4, Rwp | M
      3 | Saiam Noor | Flat 5, Afshan Colony, Saidpur Road, Lahore | F
    - If the contents of fields are not properly ordered, similar records will NOT fall in the same window. Example: records 1 and 2 are similar but will occur far apart.
    - The solution is to TOKENize the fields, i.e. break them down further, and use the tokens in different fields for sorting to fix the error. Example: using either the name or the address field, records 1 and 2 will fall close together.
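The tokenization idea on slide 14 can be sketched on the address field. The ordering rule used here (numeric tokens first in descending order, then words alphabetically) is an assumption inferred from the two tables on the slide; the function name `tokenize_and_order` is hypothetical.

```python
import re


def tokenize_and_order(field: str) -> str:
    """Split a field into alphanumeric tokens and put them in a canonical order."""
    tokens = re.findall(r"[A-Za-z0-9]+", field)
    # Assumed rule: numbers first (largest first), then words alphabetically.
    numbers = sorted((t for t in tokens if t.isdigit()), key=int, reverse=True)
    words = sorted((t for t in tokens if not t.isdigit()), key=str.lower)
    return " ".join(numbers + words)


addresses = {
    1: "No. 420, Street 15, Chaklala 4, Rawalpindi",
    2: "420, Scheme 4, Rwp",
    3: "Flat 5, Afshan Colony, Saidpur Road, Lahore",
}

# Record numbers in sort order, first on the raw field, then on the tokenized field.
raw_order = sorted(addresses, key=lambda n: addresses[n])
token_order = sorted(addresses, key=lambda n: tokenize_and_order(addresses[n]))
print(raw_order, token_order)
```

Sorting on the raw address places record 3 between records 1 and 2, whereas sorting on the tokenized address makes records 1 and 2 neighbors, so a small window can compare them.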
  • 15. BSN Method: Matching Candidates. Merging of records is a complex inferential process.
    - Example-1: Two persons with names spelled nearly, but not exactly, identically have the exact same address. We infer they are the same person, e.g. Noma Abdullah and Noman Abdullah.
    - Example-2: Two persons have the same National ID number but their names and addresses are completely different. We infer either that it is the same person, who changed his name and moved, or that the records represent different persons and the NID is incorrect for one of them.
    - Use of further information such as age, gender etc. can alter the decision.
    - Example-3: With Noma-F and Noman-M we could perhaps infer that Noma and Noman are siblings, i.e. sister and brother. With Noma-30 and Noman-5, mother and son.
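The inferences in Example-1 and Example-3 can be written as a small rule. This is a sketch under stated assumptions: the similarity threshold of 0.8, the age tolerance of 2 years, the field names, and the use of `difflib.SequenceMatcher` are all illustrative choices, not taken from the lecture.

```python
from difflib import SequenceMatcher


def names_close(a: str, b: str, threshold: float = 0.8) -> bool:
    """True if two names are spelled nearly identically (assumed threshold)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def same_person(r1: dict, r2: dict) -> bool:
    """Example-1 rule, overridden by further information as in Example-3."""
    if not (names_close(r1["name"], r2["name"]) and r1["address"] == r2["address"]):
        return False
    # Differing gender suggests two people (e.g. siblings), not one.
    if "gender" in r1 and "gender" in r2 and r1["gender"] != r2["gender"]:
        return False
    # A large age gap suggests two people (e.g. parent and child).
    if "age" in r1 and "age" in r2 and abs(r1["age"] - r2["age"]) > 2:
        return False
    return True


addr = "440 Munir Road"
print(same_person({"name": "Noma Abdullah", "address": addr},
                  {"name": "Noman Abdullah", "address": addr}))
```

With only name and address available the two records are merged; adding gender (F vs. M) or age (30 vs. 5) to the same pair makes the rule keep them apart.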
  • 16. Complexity Analysis of BSN Method
    - Time complexity: O(n log n)
      - O(n) for key creation
      - O(n log n) for sorting
      - O(wn) for matching, where 2 ≤ w ≤ n
    - Constants vary a lot.
    - At least three passes are required over the dataset.
    - The complexity of the rules and the window size are detrimental to running time.
    - For large sets, disk I/O is detrimental.
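The O(wn) matching cost on slide 16 can be made concrete by counting comparisons. The sketch below counts the pairs compared by a sliding window and by an exhaustive all-pairs comparison; the function names and the sample sizes are hypothetical.

```python
def window_comparisons(n: int, w: int) -> int:
    """Pairs compared when each of n sorted records meets its previous w-1 records."""
    # Record i has only i predecessors available, so the first few see fewer than w-1.
    return sum(min(i, w - 1) for i in range(n))


def all_pairs(n: int) -> int:
    """Pairs compared when every record is compared with every other record."""
    return n * (n - 1) // 2


print(window_comparisons(1000, 10), all_pairs(1000))
```

For n = 1000 and w = 10 the window needs 8955 comparisons against 499500 for all pairs; with w = n the two counts coincide, which is why the window size matters so much in practice.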
  • 17. BSN Method: Equational Theory. To specify the inferences we need an equational theory.
    - The logic is NOT based on string equivalence.
    - The logic is based on domain equivalence.
    - It requires a declarative rule language.
